Big Data & Data Analytics

Amazon EMR - Elastic Map Reduce

  • Helps creating Hadoop clusters (Big Data) to analyze and process vast amount of data.
  • Map Reduce
    • Massive Parallel Processing technique
    • Steps
      1. Map with user defined mappers: Get data as key value
      2. Sort handled by framework
      3. Reduce with user defined reducers: Aggregates data
    • Example
      • Documents have words, map counts animal words, framework sort sorts, and reduce sums total occurrences of animal words.
      • Map => cat:10, dog:3 & cat:24, dog:5
      • Reduce => cat: 34 , dog: 8
  • The clusters can be made of hundreds of EC2 instances.
  • EMR takes care of all the provisioning and configuration
  • Also supports Apache Spark, HBase, Presto, Flink…
  • Auto-scaling and integrated with Spot instances for lower price.
  • Nodes
  • Master node: One node that manages cluster e.g. track status & monitor health
    • 💡 You can SSH into master node and from there to task nodes.
  • Core nodes: Data (HDFS), compute.
    • Input splits: Smaller data chunks of input data split by Hadoop
      • Controls control number of Mapper in Map/Reduce
    • Blocks: Physical splitting of data
  • Task nodes: Optional, only compute
    • Usually runs on spot instances
    • Amazon handles if they get executed through running master operations in core nodes.
  • How it works:
    1. Upload your data and processing application to S3
    2. Configure and create your cluster by specifying data inputs, outputs, cluster size, security settings, etc.
      • Launch modes:
        • Cluster: Long running
        • Step execution: Do analysis & shut-down cluster
      • Instance size & number of instances
      • Select EC2 key pair & IAM roles to EC2 instances
    3. Monitor the health and progress of your cluster. Retrieve the output in S3
  • 💡 Use cases: data processing, machine learning, web indexing, big data

Query Engines

  • Comparison

    Attribute S3 / Glacier Select Athena Redshift Spectrum
    Functionality SQL SELECT on S3 objects Query & extract data from S3 (ETL) SQL queries on S3 using your Redshift cluster
    S3 Glacier Support
    Read compressed files  
    Integrations S3, Glacier AWS Glue, QuickSight, VPC Flow logs, ELB logs, CloudFront & CloudTrail logs S3, Redshift cluster
    Infrastructure Serverless Serverless Serverless
    Pricing Per query: Data Scanned + Data Returned + S3 (data transfer + requests) Per query: Data Scanned + (data transfer + requests) Data scanned + S3 transfer & request + Redshift cluster
    When to use Simple SELECT ETL, Logs, Complex queries (ANSI SQL Compliant e.g. group by, having, window and geo functions) When using Redshift cluster
  • Athena vs Spectrum

    • Athena for ad hoc data discovery and SQL querying
    • Redshift Spectrum for
      • More complex queries and scenarios
      • A large number of data lake users want to run concurrent BI and reporting workloads

AWS Glue

  • Fully managed 📝ETL (Extract, Transform & Load) service
    • Automates time consuming steps of data preparation for analytics.
    • E.g. sorting the data by client timestamp, compressing and storing in a read optimized format.
  • Serverless, pay as you go, fully managed, provisions Apache Spark
  • Crawls data sources and identifies data formats (schema inference)
  • Discover, categorize, save in a catalog from connected sources
    • Sources: Amazon Aurora, RDS, Redshift & S3
  • Automated Code Generation
    • It’s a customizable Apache Spark code
  • Sinks: S3, Redshift, etc..
  • Glue Data Catalog
    • Metadata (definition & schema) of the Source Tables
    • Can be used by EMR
    • Crawlers: Programs that run through data to infer schemas and partitions from e.g. S3, Redshift, RDS.
  • E.g. scenario =>
    • Kinesis (ingest & process) -> S3 (through Kinesis Data Firehose) -> Spark job on AWS glue (sort, optimize, crawl and catalog) -> Batch process catalog using Amazon MR (Elastic Map Reduce), query with Athena.

AWS Data Pipeline

  • Move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
  • Output to Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
  • A pipeline consists of
    • Data nodes for source or destination storages (s3, mysql, dynamodb…)
    • E.g. import, copy, export.
  • Differences from ETL
    • ETL pipelines are a subset of data pipelines.
    • ETL may modify data, data pipeline won’t.
    • ETLs end in warehouses, built for warehousing.

AWS Batch

  • Managed HPC (high performance computing)
  • Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 and Spot Instances.

Licenses and Attributions


Speak Your Mind