Queues - Amazon Kinesis

  • Managed alternative to Apache Kafka
  • Use-cases
    • “Real-time” big data e.g. application logs, metrics, IoT, clickstreams.
    • Stream-processing frameworks (Spark, NiFi, etc…)
  • High availability through data replication across 3 AZs.
  • Has 4 different products:
    • Kinesis Streams (data stream): Ingest & process streaming data
    • Kinesis Firehose (delivery stream): load streams into S3, Redshift, Elasticsearch
    • Kinesis Analytics: perform real-time analytics on streams using SQL
    • Kinesis Video Streams: Process & analyze streaming media in real-time.
  • 💡 Common integration: Feed data into Kinesis Streams -> Analyze it in real time using Kinesis Analytics -> Load results into Kinesis Firehose -> S3/Redshift for long-term retention.
  • Security
    • Control access / authorization using IAM policies
    • Encryption in flight using HTTPS endpoints
    • Encryption at rest using KMS (see the sketch below)
    • Possibility to encrypt / decrypt data client side (harder)
    • VPC Endpoints available for Kinesis to access within VPC
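
As a minimal sketch of the KMS encryption-at-rest option above (Python with boto3; the stream name is hypothetical):

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption at rest on an existing stream.
# "my-stream" is a hypothetical name; the key here is the
# AWS-managed KMS key, but a customer-managed CMK also works.
kinesis.start_stream_encryption(
    StreamName="my-stream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
```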

Comparison

| Attribute | Data Streams | Firehose | Analytics | Video Streams |
|---|---|---|---|---|
| Description | Blob (max 1 MB) ingestion | Convert & load blob | Query engine | Video ingestion |
| Throughput / limits | Per shard: message size & amount per read/write, max 1 MB blob | None: source data (max 1 MB blob) | None: source stream | Stream-level (5 TPS/H), connection-level (1/s), bandwidth (MB/s), fragment-level |
| Latency | Real time | Near real time (60 secs latency) | Real time | Rendering + encryption (if stored) |
| Data retention | 1 day (default), up to 7 days | Retries for 24 hours, then fails | Can create streams based on queries | Min 1 hour, max 10 years (in S3) |
| Scaling | Add shards | Auto-scales | Auto-scales | Auto-scales |
| Concepts | Shards, Producers, Consumers | Delivery stream, Record, Destination | Input, Code, Application | Video stream, Fragment, Producer, Consumer, Chunk |
| Use-cases | Streams | Convert & load streaming data into e.g. Redshift, S3, Elasticsearch, Splunk | Real-time analytics using SQL | Machine learning & analytics |

Kinesis Data Streams

  • Low latency streaming ingest at scale like Kafka.
  • Streams are divided into ordered Shards / Partitions
  • Data goes to any shard & consumers consume from any shard
    • Producers -> Stream (Shard 1 / Shard 2 / Shard 3) -> Consumers
    • Pub/sub through streams (topic)
    • Multiple applications can consume the same stream
    • You get a sequence number back for each record added (a checkpoint / offset that should be saved)
  • To scale: Add shards -> Increased throughput
  • Pricing is per provisioned shard; you can have as many shards as you want
  • Data retention is 1 day by default, can go up to 7 days
  • Data
    • The message is a base64-encoded blob
    • ❗ Data (before encoding) cannot exceed 1 MB
    • Ability to reprocess / replay data, as data is not removed after a message is handled.
    • Immutable: Once data is inserted in Kinesis, it cannot be deleted
    • Batching available or per message calls
      • 💡 Use batching with PutRecords API to reduce costs and increase throughput (see the producer sketch after this list)
  • Security: You can enable server-side encryption with an AWS KMS master key
  • Monitoring
    • You can capture shard level metrics with CloudWatch at additional cost
        • E.g. IncomingBytes, IncomingRecords, OutgoingBytes, WriteProvisionedThroughputExceeded, ReadProvisionedThroughputExceeded.
  • Shards
    • One stream is made of many different shards
    • The number of shards can evolve over time (re-shard: split / merge)
    • Records are ordered per shard
    • Partition key
      • Partition key gets hashed to determine the shard ID
      • Ensures ordering within a shard; the same key always goes to the same shard.
        • 💡 Choose a partition key that’s highly distributed
          • 📝 Helps prevent a hot partition or hot shard
    • Throughput
      • ❗ 1 MB/s or 1000 messages/s at write PER SHARD
      • ❗ 2 MB/s at read PER SHARD
      • You get a ProvisionedThroughputExceeded exception if you go over the limits
      • 💡 Solution
        • Use retries with exponential back-off
        • Increase shards (scaling)
        • Ensure your partition key is well distributed, e.g. so you don’t have a hot shard
  • SDKs
    • Normal consumer (CLI, SDK, etc…)
    • Kinesis Client Library (in Java, Node, Python, Ruby, .NET)
      • Uses DynamoDB to checkpoint offsets
      • KCL uses DynamoDB to track other workers and share the work amongst shards
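
Below, a minimal producer sketch (Python with boto3; stream name, payload, and partition key are hypothetical) combining the batching and exponential back-off tips above:

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")

def put_batch(records, stream="my-stream", max_retries=5):
    """Batch-write records, retrying throttled entries with exponential back-off."""
    entries = [
        # The partition key is hashed to pick a shard; a well-distributed key
        # (here a hypothetical device ID) helps avoid hot shards.
        {"Data": json.dumps(r).encode(), "PartitionKey": r["device_id"]}
        for r in records
    ]
    for attempt in range(max_retries):
        resp = kinesis.put_records(StreamName=stream, Records=entries)
        if resp["FailedRecordCount"] == 0:
            # Each successful entry carries a SequenceNumber
            # (the checkpoint / offset mentioned above).
            return
        # Keep only the failed entries (e.g. ProvisionedThroughputExceeded)
        # and retry them after an exponentially growing pause.
        entries = [e for e, r in zip(entries, resp["Records"]) if "ErrorCode" in r]
        time.sleep(0.1 * 2 ** attempt)
    raise RuntimeError(f"{len(entries)} records still failing after retries")

put_batch([{"device_id": "sensor-1", "temp": 21.4}])
```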

Kinesis Firehose

  • Fully managed service, no administration, automatic scaling
  • Near real time (60 seconds latency)
  • Load data into Redshift / Amazon S3 / Elasticsearch / Splunk
  • Supports many data formats (pay for conversion)
  • Pay for the amount of data going through Firehose
  • Concepts
    • Delivery stream: create it & send data into it (see the sketch after this list)
    • Record: Data of interest your data producer sends to a delivery stream
      • ❗ Max size before base64 encoding is 1,000 KB
    • Destination: the data store, e.g. S3, Amazon Redshift, Amazon Elasticsearch Service, Splunk.
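
A minimal sketch of sending one record into a delivery stream (Python with boto3; the delivery stream name and payload are hypothetical):

```python
import boto3

firehose = boto3.client("firehose")

# "my-delivery-stream" is a hypothetical delivery stream whose destination
# (e.g. S3) was configured at creation time. The SDK base64-encodes the
# record; the payload must stay under the 1,000 KB pre-encoding limit.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": b'{"event": "click", "page": "/home"}\n'},
)
```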

Kinesis Data Analytics

  • Perform real-time analytics on Kinesis Streams using SQL
  • Fully managed with auto scaling to match source stream throughput.
  • Pay for actual consumption rate
  • Can create new streams out of the real-time queries (see the sketch below)
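
A sketch of that idea using the SQL-based Kinesis Data Analytics API via boto3; all names, ARNs, and the schema are hypothetical, and the SQL is only an illustrative continuous query that emits results into a new in-application stream:

```python
import boto3

analytics = boto3.client("kinesisanalytics")

# The SQL runs continuously against the source stream; DESTINATION_STREAM
# can then be wired to an output (e.g. a Firehose delivery stream).
sql = """
CREATE OR REPLACE STREAM "DESTINATION_STREAM" (page VARCHAR(64), hits INTEGER);
CREATE OR REPLACE PUMP "HITS_PUMP" AS
  INSERT INTO "DESTINATION_STREAM"
  SELECT STREAM "page", COUNT(*) AS hits
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "page",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);
"""

analytics.create_application(
    ApplicationName="clickstream-analytics",  # hypothetical
    ApplicationCode=sql,
    Inputs=[{
        "NamePrefix": "SOURCE_SQL_STREAM",
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/my-stream",
            "RoleARN": "arn:aws:iam::123456789012:role/analytics-role",
        },
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Name": "page", "SqlType": "VARCHAR(64)", "Mapping": "$.page"},
            ],
        },
    }],
)
```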

Kinesis Video Streams

  • Ingest video streams for e.g. analytics and machine learning.
  • HTTP Live Streaming (HLS) enables you to playback video for live and on-demand viewing (see the sketch after this list)
  • Uses Amazon S3 as the underlying data store.
  • Data at rest is encrypted using KMS, access is controlled using IAM roles, and data in transit is encrypted using TLS.
  • Concepts
    • Video stream: AWS resource that encrypts, time-stamps & stores video data.
    • Fragment: self-contained sequence of frames
    • Source: a video-generating device, e.g. a security camera or a body-worn camera
    • Consumer: consume and process data in Kinesis video streams e.g. EC2 / AWS AI services / 3rd party.
    • Chunk: stored data consisting of the actual media fragment
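
A sketch of fetching an HLS playback URL for the bullet above (Python with boto3; the stream name is hypothetical):

```python
import boto3

kvs = boto3.client("kinesisvideo")

# Each Kinesis Video Streams data API is served from a stream-specific
# endpoint, so fetch it first. "my-camera-stream" is a hypothetical name.
endpoint = kvs.get_data_endpoint(
    StreamName="my-camera-stream",
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

archived = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)

# Returns a short-lived HLS URL playable in e.g. a browser or video player.
url = archived.get_hls_streaming_session_url(
    StreamName="my-camera-stream",
    PlaybackMode="LIVE",
)["HLSStreamingSessionURL"]
print(url)
```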
