Databases - Redshift

  • Redshift = Data warehouse = Analytics / BI
  • Columnar database e.g. stores columns of data instead of rows.
    • Good for OLAP (online analytical processing), see RDBMS for OLTP (online transaction processing)
    • Makes compression very easier
  • For analytics and data warehousing
  • Used to pull in very large and complex data sets.
  • Has a SQL interface for performing the queries
  • Heavily modified version of PostgreSQL
  • Scaling
    • Can scale to PBs of data
    • From 1 node to 128 nodes, up to 160 GB of space per node
      • Leader node: for query planning, results aggregation
      • Compute node: for performing the queries, send results to leader
  • It’s a Massively Parallel Query Execution (MPP) database.
    • Coordinated processing of a single task by multiple processors, each processor using its own OS and memory and communicating with each other using some form of messaging interface.
    • Data and query is automatically distributed and balanced.
  • Pay as you go based on the instances provisioned (not serverless)
  • 💡 Choose right cluster size & use as much as you can
  • 💡 Great to use with BI tools on top of it such as AWS Quicksight, SQL Server Reporting Services,Tableau integrate with it
  • Data is loaded from S3, DynamoDB, DMS (Database Migration Services), other DBs…
  • Good practice -> Create read replica from RDS and pull the data from the read replica and load it into Redshift
  • Copy between regions: Take snapshot => Copy snapshot to new region => Create cluster from snapshot
  • Redshift Spectrum: serverless service to perform queries directly against S3 (no need to load)
  • Security: VPC / IAM / KMS & CloudWatch Monitoring
    • Redshift Enhanced VPC Routing: COPY / UNLOAD goes through VPC
    • Encrypted in transit using SSL
  • Pricing: Pay per Compute Node hours (not leader node), backup and data transfer (only within a VPC)
  • Cluster subnet group allows you to specify set of subnets in your VPC for your instances.

Availability

  • No multi AZ, only available in 1 AZ
    • Snapshots can be deployed to new AZ for e.g. DR.
  • Backups are enabled by default, max 35 days.
  • Attempts to maintain at least three copies of your data.
    • Original, replica on the compute nodes, and a backup in S3.
  • Have functionality for Automated Cross-Region Snapshot.
  • Always keeps three copies of your data

Licenses and Attributions


Speak Your Mind