Queues - Amazon Kinesis
- Managed alternative to Apache Kafka
- “Real-time” big data e.g. application logs, metrics, IoT, clickstreams.
- Stream-processing frameworks (Spark, NiFi, etc…)
- High availability through data replication across 3 AZs.
- Has 4 different products:
- Kinesis Streams (data stream): Ingest & process streaming data
- Kinesis Firehose (delivery stream): load streams into S3, Redshift, ElasticSearch
- Kinesis Analytics: perform real-time analytics on streams using SQL
- Kinesis Video Streams: Process & analyze streaming media in real-time.
- 💡 Common integration: Feed Kinesis Streams -> Analyze data in real time using Kinesis Analytics -> Load streams into Kinesis Firehose -> S3/Redshift for long-term retention.
- Control access / authorization using IAM policies
- Encryption in flight using HTTPS endpoints
- Encryption at rest using KMS
- Possibility to encrypt / decrypt data client side (harder)
- VPC endpoints available for accessing Kinesis from within a VPC
| | Kinesis Streams | Kinesis Firehose | Kinesis Analytics | Kinesis Video Streams |
| --- | --- | --- | --- | --- |
| **Data** | Blob (max 1 MB) ingestion | Convert & load blob | Source stream | Streaming media |
| **Throttling** | Per shard and message size/amount per read/write, max 1 MB blob | None: source data (max 1 MB blob) | None: source stream | Stream level (5 TPS/H), connection-level (1/s), bandwidth (MB/s), fragment-level |
| **Latency** | Real-time | Near real time (60 secs latency) | Real-time | Rendering + encryption (if stored) |
| **Retention** | 1 day (default), up to 7 days | Retries during 24 hours, then fails | Can create streams based on queries | Min 1 hour, max 10 years (in S3) |
| **Concepts** | Shards, Producers, Consumers | Delivery stream, Record, Destination | Input, Code, Application | Video stream, Fragment, Producer, Consumer, Chunk |
| **Use case** | Ingest & process streaming data | Convert & load streaming data into e.g. Redshift, S3, ElasticSearch, Splunk | Real-time analytics using SQL | Machine learning & analytics |
Kinesis Data Streams
- Low latency streaming ingest at scale like Kafka.
- Streams are divided into ordered Shards / Partitions
- Data goes to any shard & consumers consume from any shard
- Producers -> Stream (Shard 1 / Shard 2 / Shard 3) -> Consumers
- Pub/sub through streams (topic)
- Multiple applications can consume the same stream
- You get a sequence number back for each record added (a checkpoint / offset that should be saved)
- To scale: Add shards -> Increased throughput
- Pricing is per shard provisioned; you can have as many shards as you want
- Data retention is 1 day by default, can go up to 7 days
- The message is a base64-encoded blob
- ❗ Data (before encoding) cannot exceed 1 MB
- Ability to reprocess / replay data, as data is not removed after a message is handled.
- Immutable: Once data is inserted in Kinesis, it cannot be deleted
- Batching available, or per-message calls
- 💡 Use batching with the `PutRecords` API to reduce costs and increase throughput (see the sketch after this list)
- Security: You can enable server-side encryption with an AWS KMS master key
- You can capture shard level metrics with CloudWatch at additional cost
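A minimal boto3 sketch of producing to a data stream; the stream name and payloads are hypothetical. It shows batched writes with `PutRecords`, the per-record partition key, and the sequence numbers returned:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical clickstream events; each blob must stay under 1 MB.
events = [{"user": "u1", "action": "click"}, {"user": "u2", "action": "view"}]

response = kinesis.put_records(
    StreamName="my-stream",  # assumed to exist
    Records=[
        {
            "Data": json.dumps(event).encode(),
            # Records with the same partition key always land on the same shard.
            "PartitionKey": event["user"],
        }
        for event in events
    ],
)

# Successful records come back with a sequence number (a checkpoint /
# offset within the shard); failed ones carry an ErrorCode instead.
for record in response["Records"]:
    print(record.get("ShardId"), record.get("SequenceNumber", record.get("ErrorCode")))
```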
- Shards
- One stream is made of many different shards
- The number of shards can evolve over time (re-shard / merge)
- Records are ordered per shard
- Partition key
- Partition key gets hashed to determine the shard ID
- Ensures ordering within a shard; the same key always goes to the same shard.
- 💡 Choose a partition key that’s highly distributed
- 📝 Helps prevent a hot partition or hot shard (see the hashing sketch below)
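As an illustration of the routing: Kinesis computes an MD5 hash of the partition key and sends the record to the shard whose hash-key range contains that value. The sketch below reproduces that lookup client-side; the stream name and the `shard_for_key` helper are hypothetical:

```python
import hashlib

import boto3

kinesis = boto3.client("kinesis")

def shard_for_key(stream_name: str, partition_key: str) -> str:
    # Kinesis maps partition keys to 128-bit integers via MD5...
    hash_value = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    shards = kinesis.describe_stream(StreamName=stream_name)[
        "StreamDescription"]["Shards"]
    # ...and each shard owns a contiguous range of that hash space.
    for shard in shards:
        hash_range = shard["HashKeyRange"]
        if int(hash_range["StartingHashKey"]) <= hash_value <= int(hash_range["EndingHashKey"]):
            return shard["ShardId"]
    raise LookupError("no shard covers this hash value")

print(shard_for_key("my-stream", "user-42"))
```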
- ❗ 1 MB/s or 1000 messages/s at write PER SHARD
- ❗ 2 MB/s at read PER SHARD
- ❗ `ProvisionedThroughputExceededException` if you go over the limits
- 💡 Solution
- Use retries with exponential back-off (see the sketch after this list)
- Increase shards (scaling)
- Ensure your partition key is a good one e.g. you don’t have a hot shard
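A sketch of the back-off approach, under assumed names: writes that hit the per-shard limit are retried with exponentially growing, jittered delays, and resharding is the scaling escape hatch:

```python
import random
import time

import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(stream_name: str, data: bytes, partition_key: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName=stream_name, Data=data, PartitionKey=partition_key
            )
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Wait 2^attempt * 100 ms plus jitter before retrying.
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("still throttled after retries")

# If throttling persists, scale out by adding shards, e.g.:
# kinesis.update_shard_count(StreamName="my-stream", TargetShardCount=4,
#                            ScalingType="UNIFORM_SCALING")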
- Consumers
- Normal consumer (CLI, SDK, etc…)
- Kinesis Client Library (in Java, Node, Python, Ruby, .NET)
- Uses DynamoDB to checkpoint offsets
- KCL uses DynamoDB to track other workers and share the work amongst shards
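A bare-bones consumer sketch using the plain SDK against a hypothetical single-shard stream; in production the KCL runs this loop for you, checkpointing offsets in DynamoDB and sharing shards amongst workers:

```python
import time

import boto3

kinesis = boto3.client("kinesis")

# Read from the first (and assumed only) shard of the stream.
shard_id = kinesis.describe_stream(StreamName="my-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
)["ShardIterator"]

while iterator:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        # The SDK hands back the raw bytes (base64 decoding already done).
        print(record["SequenceNumber"], record["Data"])
    iterator = batch.get("NextShardIterator")
    time.sleep(1)  # polling pause to respect per-shard read limits
```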
Kinesis Data Firehose
- Fully managed service, no administration, automatic scaling
- Near real time (60 seconds latency)
- Load data into Redshift / Amazon S3 / ElasticSearch / Splunk
- Supports many data formats (pay for conversion)
- Pay for the amount of data going through Firehose
- Delivery stream: Create & send data into it
- Record: Data of interest your data producer sends to a delivery stream
- ❗ Max record size before base64 encoding is 1,000 KB
- Destination: data storage, S3, Amazon Redshift, Amazon Elasticsearch Service, Splunk.
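A minimal sketch of writing into a delivery stream; the stream name is hypothetical and assumed to already be configured with a destination (e.g. S3):

```python
import json

import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={
        # Firehose buffers records and delivers them to the configured
        # destination in near real time (~60 s); a trailing newline keeps
        # the delivered objects line-delimited in S3.
        "Data": (json.dumps({"metric": "cpu", "value": 87}) + "\n").encode(),
    },
)
```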
Kinesis Data Analytics
- Perform real-time analytics on Kinesis Streams using SQL
- Fully managed with auto scaling to match source stream throughput.
- Pay for actual consumption rate
- Can create streams out of the real-time queries
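A hedged sketch of creating a SQL application over an existing data stream; the ARNs, names, and input schema below are hypothetical. The SQL counts events per minute and pumps the result into an in-application output stream:

```python
import boto3

analytics = boto3.client("kinesisanalytics")

sql_code = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (event_count INTEGER);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
  INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM COUNT(*)
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);
"""

analytics.create_application(
    ApplicationName="clickstream-analytics",
    ApplicationCode=sql_code,
    Inputs=[{
        # In-application streams get numbered suffixes: SOURCE_SQL_STREAM_001.
        "NamePrefix": "SOURCE_SQL_STREAM",
        "KinesisStreamsInput": {
            "ResourceARN": "arn:aws:kinesis:eu-west-1:123456789012:stream/my-stream",
            "RoleARN": "arn:aws:iam::123456789012:role/analytics-read-role",
        },
        "InputSchema": {
            "RecordFormat": {
                "RecordFormatType": "JSON",
                "MappingParameters": {"JSONMappingParameters": {"RecordRowPath": "$"}},
            },
            "RecordColumns": [
                {"Name": "event", "SqlType": "VARCHAR(32)", "Mapping": "$.event"},
            ],
        },
    }],
)
```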
Kinesis Video Streams
- Ingest video streams for e.g. analytics and machine learning.
- HTTP Live Streaming (HLS) enables you to playback video for live and on-demand viewing
- Uses Amazon S3 as the underlying data store.
- Encrypts data at rest using KMS, controls access using IAM roles, and encrypts data in transit using TLS.
- Video stream: AWS resource that encrypts, time-stamps & stores media data.
- Fragment: self-contained sequence of frames
- Source: e.g. video-generating device, such as a security camera, a body-worn camera
- Consumer: consume and process data in Kinesis video streams e.g. EC2 / AWS AI services / 3rd party.
- Chunk: Stored data, consists of the actual media fragment
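A sketch of getting an HLS playback URL for a hypothetical stream: first resolve the stream's data endpoint, then request a streaming session from the archived-media API:

```python
import boto3

kvs = boto3.client("kinesisvideo")

# Each stream exposes a dedicated data endpoint per API operation.
endpoint = kvs.get_data_endpoint(
    StreamName="my-camera-stream",
    APIName="GET_HLS_STREAMING_SESSION_URL",
)["DataEndpoint"]

media = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)

url = media.get_hls_streaming_session_url(
    StreamName="my-camera-stream",
    PlaybackMode="LIVE",  # use "ON_DEMAND" for stored fragments
)["HLSStreamingSessionURL"]

print(url)  # open with any HLS-capable player
```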