Streaming Data Analytics with Amazon Kinesis Firehose and Amazon Redshift
Amy Li (李君), Senior Technical Trainer, Amazon Web Services
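Today's agenda
- Overview of the big data architecture on AWS
- Amazon Kinesis Firehose & Amazon Redshift
- Hands-on: building a streaming solution for log analytics
  Step 1: Create the Redshift cluster and table
  Step 2: Create the Firehose delivery stream and configure data transformation
  Step 3: Send data to the Firehose delivery stream
  Step 4: Query and analyze the data in Redshift
  Step 5: Monitor the streaming data processing
- Q & A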
Characteristics of big data
- Volume: massive amounts of data
- Velocity: produced at high speed
- Variety: many different forms
- Value
The big data pipeline
Data -> Collect -> Store -> Process -> Analyze -> Present -> Insight
Balance time to answer (latency), throughput, and cost.
The AWS big data portfolio
Pipeline stages: Collect, Store, Process, Analyze, Present
- Near real-time: Amazon Kinesis Firehose
- Data import: AWS Import/Export Snowball
- Message queuing: Amazon SQS
- Web/app servers: Amazon EC2
- Object storage: Amazon S3, Amazon Glacier
- Near real-time: Amazon Kinesis Streams
- RDBMS: Amazon RDS
- NoSQL: Amazon DynamoDB
- Hadoop ecosystem: Amazon EMR
- Near real-time: AWS Lambda, Amazon Kinesis Analytics
- Data warehousing: Amazon Redshift
- Machine learning: Amazon Machine Learning
- Business intelligence and data visualization: Amazon QuickSight
- Elasticsearch analytics: Amazon Elasticsearch Service
- Search: Amazon CloudSearch
- Internet of Things (IoT): AWS IoT
- Process and move data: AWS Data Pipeline
- Ad hoc analytics: Amazon Athena
What challenges does data processing face?
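Most data is produced continuously
Example log entry:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Typical sources:
- Mobile apps
- Web clickstreams
- Application logs
- Metering records
- IoT sensors
- Smart buildings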
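The diminishing value of data over time
- Recent data is highly valuable, if you act on it in time: perishable insights (M. Gualtieri, Forrester)
- Old data plus recent data is even more valuable, if you have a way to combine them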
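Processing speed is the key
Batch processing -> Stream processing
- Hourly log collection -> Real-time application metrics (what is going wrong right now?)
- Weekly or monthly bills -> Real-time spending alerts and caps (prevent overspending)
- Daily user visit data -> Real-time clickstream analysis (what can we do for the user right now?)
- Daily financial fraud reports -> Real-time detection and blocking of potentially fraudulent usage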
Amazon Kinesis
- Ingests streaming data
- Processes data in real time
- Stores terabytes of data per hour
Amazon Kinesis Streams
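Amazon Kinesis Streams
- Easy to manage: create a stream and set the initial number of shards, then scale the shard count up or down dynamically to match your data throughput
- Build real-time applications: write consumer applications with the Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, and others
- Low cost: cost-effective for workloads of any size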
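Choosing the initial shard count is mostly arithmetic. The sketch below is a rough estimate only, assuming the commonly documented per-shard limits (about 1 MB/s or 1,000 records/s for writes and 2 MB/s for reads); the function name and its parameters are hypothetical, and the limits should be checked against the current service quotas.

```python
import math

# Assumed per-shard limits (verify against the current Kinesis Streams quotas).
MB = 1024 * 1024
SHARD_WRITE_BYTES_PER_SEC = 1 * MB
SHARD_WRITE_RECORDS_PER_SEC = 1000
SHARD_READ_BYTES_PER_SEC = 2 * MB


def estimate_shards(records_per_sec, avg_record_bytes, consumers=1):
    """Return the smallest shard count that satisfies every assumed limit."""
    write_bytes = records_per_sec * avg_record_bytes
    by_write_bytes = math.ceil(write_bytes / SHARD_WRITE_BYTES_PER_SEC)
    by_write_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    # Every consumer application reads the full stream once.
    by_read_bytes = math.ceil(write_bytes * consumers / SHARD_READ_BYTES_PER_SEC)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # 5,000 small records per second are limited by the record rate.
    print(estimate_shards(records_per_sec=5000, avg_record_bytes=512))
```

The estimate is a starting point; the shard count can be adjusted later as the observed throughput changes.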
Amazon Kinesis: basic stream processing architecture
[Diagram: data sources send records to the AWS endpoint. The stream is made up of Shard 1, Shard 2, ... Shard N and spans three Availability Zones. Consumer applications read from the stream: App.1 (Aggregate & De-Duplicate), App.2 (Metric Extraction), App.3 (Sliding Window Analysis), App.4 (Machine Learning).]
Amazon Kinesis Firehose
Amazon Kinesis Firehose
[Diagram: data sources send records to the AWS endpoint; Amazon Kinesis Firehose delivers them to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service.]
- No partition keys
- No provisioning
- End-to-end elastic
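Kinesis Firehose: key concepts
- Delivery stream: the underlying entity of Kinesis Firehose. You use Kinesis Firehose by creating a delivery stream and sending data to it.
- Record: the data a producer sends to a delivery stream. A record can be as large as 1,000 KB.
- Buffer size and buffer interval: Kinesis Firehose buffers incoming data until it reaches a certain size or a certain amount of time has passed, then delivers it to the destination. Buffer size is in MB; buffer interval is in seconds.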
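The "size or interval, whichever comes first" rule can be illustrated with a small local model. This is only a sketch of the idea, not how the service is implemented; the class name, the default values, and the injectable clock are assumptions made for illustration.

```python
import time

MAX_RECORD_BYTES = 1000 * 1024  # the 1,000 KB record limit from the slide


class DeliveryBuffer:
    """Toy model: flush when the size OR the interval threshold is reached."""

    def __init__(self, size_mb=5, interval_seconds=300, clock=time.monotonic):
        self.max_bytes = size_mb * 1024 * 1024
        self.interval = interval_seconds
        self.clock = clock          # injectable so the behavior can be tested
        self.records = []
        self.buffered_bytes = 0
        self.started = None         # time the first buffered record arrived

    def add(self, record):
        """Buffer one record; return the flushed batch if the size limit is hit."""
        if len(record) > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the 1,000 KB limit")
        if self.started is None:
            self.started = self.clock()
        self.records.append(record)
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.max_bytes:
            return self.flush()
        return None

    def poll(self):
        """Return the batch if the buffer interval has elapsed, else None."""
        if self.records and self.clock() - self.started >= self.interval:
            return self.flush()
        return None

    def flush(self):
        batch, self.records = self.records, []
        self.buffered_bytes = 0
        self.started = None
        return batch
```

A small buffer size or short interval lowers delivery latency; larger values produce fewer, bigger objects at the destination.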
Firehose data flow to Amazon S3
Firehose data flow to Amazon Redshift
Firehose data flow to Amazon Elasticsearch Service
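Kinesis Firehose: data input
AWS SDK
- PutRecord()
- PutRecordBatch()
Kinesis Agent
- Continuously monitors files and sends new data to the Firehose delivery stream
- Handles file rotation
- Checkpoints its progress and retries on failure
- Can preprocess data, for example format conversion and log parsing
- Emits Amazon CloudWatch metrics so you can monitor the streaming process and troubleshoot failures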
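As an illustration of the SDK path, the sketch below batches log lines and sends them with `put_record_batch` from boto3. The helper names and the stream name in the usage note are hypothetical; the code assumes the limit of 500 records per PutRecordBatch call and appends a newline to each record, because Firehose does not add delimiters itself.

```python
MAX_BATCH_RECORDS = 500  # assumed PutRecordBatch limit per call


def to_batches(lines, batch_size=MAX_BATCH_RECORDS):
    """Group lines into lists of Firehose records of at most batch_size."""
    batch = []
    for line in lines:
        # Newline-terminate each record so they stay separable downstream.
        batch.append({"Data": (line.rstrip("\n") + "\n").encode("utf-8")})
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_lines(client, stream_name, lines):
    """Send lines to a delivery stream; return the records that were rejected."""
    failed = []
    for batch in to_batches(lines):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        # A batch call can partially fail, so inspect each per-record result.
        if response.get("FailedPutCount", 0):
            for record, result in zip(batch, response["RequestResponses"]):
                if "ErrorCode" in result:
                    failed.append(record)
    return failed
```

With AWS credentials configured, it could be called as `send_lines(boto3.client("firehose"), "my-delivery-stream", open("access.log"))`; the returned records are candidates for a retry with backoff.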
Kinesis Firehose: data transformation
Kinesis Firehose -> AWS Lambda
Firehose invokes the Lambda function asynchronously
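A transformation function receives a batch of base64-encoded records and must return each one with its `recordId`, a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, and the re-encoded `data`. The sketch below parses an Apache access log line into JSON under that contract. The regular expression and the output field names are assumptions for illustration, not the function used in the talk.

```python
import base64
import json
import re
from datetime import datetime, timezone

# Apache combined-log prefix; tolerates a missing space before the UTC offset.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<ts>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\s?(?P<tz>[+-]\d{4})\] '
    r'"(?P<verb>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    local = datetime.strptime(fields["ts"] + fields["tz"], "%d/%b/%Y:%H:%M:%S%z")
    return {
        "host": fields["host"],
        "ident": fields["ident"],
        "authuser": fields["authuser"],
        "request": "{verb} {path} {protocol}".format(**fields),
        "response": int(fields["response"]),
        "bytes": 0 if fields["bytes"] == "-" else int(fields["bytes"]),
        "verb": fields["verb"],
        "timestamp_utc": local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def lambda_handler(event, context):
    """Firehose transformation entry point: one output record per input record."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        parsed = parse_line(line)
        if parsed is None:
            # Hand the original payload back so Firehose can treat it as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        payload = (json.dumps(parsed) + "\n").encode("utf-8")
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": base64.b64encode(payload).decode("utf-8")})
    return {"records": output}
```

Every input `recordId` must appear in the response, which is why unparseable lines are returned as `ProcessingFailed` instead of being skipped.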
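Amazon Redshift
- Petabyte-scale data warehouse
- MPP (massively parallel processing) architecture
- Fully managed, provisioned within minutes
- Built-in security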
Hands-on: building a streaming solution for log analytics
Step 1: Create the Redshift cluster and table
Connect to the Redshift database
Create the table
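Connecting and creating the table can be sketched with any DB-API driver. The table name `apache_logs` and its columns are hypothetical, chosen to mirror the fields of a parsed access log entry; the talk's actual schema is not shown on the slides. The function takes an already opened connection so the driver choice stays outside the sketch.

```python
# Hypothetical schema for parsed Apache access log entries.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS apache_logs (
    host           VARCHAR(40),
    ident          VARCHAR(25),
    authuser       VARCHAR(25),
    request        VARCHAR(1000),
    response       INTEGER,
    bytes          INTEGER,
    verb           VARCHAR(10),
    timestamp_utc  TIMESTAMP
)
"""


def create_log_table(conn):
    """Create the log table on an open DB-API connection (idempotent)."""
    cursor = conn.cursor()
    cursor.execute(CREATE_TABLE_SQL)
    conn.commit()
    cursor.close()


# Example connection with psycopg2 (an assumption; 5439 is Redshift's default port):
#   import psycopg2
#   conn = psycopg2.connect(host="<cluster-endpoint>", port=5439,
#                           dbname="<database>", user="<user>", password="<password>")
#   create_log_table(conn)
```

For a JSON load with the `json 'auto'` copy option, the column names need to match the keys the transformation emits.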
Step 2: Create the Firehose delivery stream and configure data transformation
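The same setup can be expressed programmatically. The sketch below only builds the request for boto3's `create_delivery_stream`, with a Redshift destination, an intermediate S3 bucket (Firehose stages data in S3 before issuing a Redshift COPY), buffering hints, and a Lambda processor. The function name is hypothetical and every ARN, URL, and credential is a placeholder supplied by the caller; the console flow in the demo may use different settings.

```python
def build_redshift_stream_request(stream_name, role_arn, bucket_arn, jdbc_url,
                                  table, username, password, lambda_arn,
                                  buffer_mb=5, buffer_seconds=300):
    """Build keyword arguments for firehose.create_delivery_stream()."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # producers call PutRecord directly
        "RedshiftDestinationConfiguration": {
            "RoleARN": role_arn,
            "ClusterJDBCURL": jdbc_url,
            "CopyCommand": {
                "DataTableName": table,
                # Assumes the transformed records are JSON objects.
                "CopyOptions": "json 'auto'",
            },
            "Username": username,
            "Password": password,
            # Data is staged in S3 first, then loaded with COPY.
            "S3Configuration": {
                "RoleARN": role_arn,
                "BucketARN": bucket_arn,
                "BufferingHints": {
                    "SizeInMBs": buffer_mb,
                    "IntervalInSeconds": buffer_seconds,
                },
            },
            # Run each buffered batch through the transformation function.
            "ProcessingConfiguration": {
                "Enabled": True,
                "Processors": [{
                    "Type": "Lambda",
                    "Parameters": [{
                        "ParameterName": "LambdaArn",
                        "ParameterValue": lambda_arn,
                    }],
                }],
            },
        },
    }
```

With credentials in place, the request could be submitted as `boto3.client("firehose").create_delivery_stream(**request)`.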
Step 3: Send data to the Firehose delivery stream
Sample Data
219.134.32.117 - - [16/Feb/2017:09:38:20-0800] "GET /wp-content HTTP/1.1" 200 4521 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.1;.NET CLR 3.8.23015.5)"
95.169.41.62 - - [16/Feb/2017:09:38:20-0800] "PUT /app/main/posts HTTP/1.1" 200 3883 "-" "Mozilla/5.0 (Windows NT 6.2; Trident/7.0; rv:11.0) like Gecko"
221.147.191.247 - - [16/Feb/2017:09:38:20-0800] "GET /explore HTTP/1.1" 200 6579 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1) AppleWebKit/538.0.1 (KHTML, like Gecko) Chrome/38.0.895.0 Safari/538.0.1"
179.96.123.130 - - [16/Feb/2017:09:38:20-0800] "GET /list HTTP/1.1" 200 560 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:5.4) Gecko/20100101 Firefox/5.4.6"
132.119.12.76 - - [16/Feb/2017:09:38:20-0800] "PUT /explore HTTP/1.1" 200 3131 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_0 rv:5.0; AZ) AppleWebKit/535.1.0 (KHTML, like Gecko) Version/4.0.3 Safari/535.1.0"
74.113.56.92 - - [16/Feb/2017:09:38:20-0800] "DELETE /app/main/posts HTTP/1.1" 200 7069 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_9) AppleWebKit/532.1.0 (KHTML, like Gecko) Chrome/15.0.877.0 Safari/532.1.0"
After Data Transformation
{"host":"26.56.11.130","ident":"-","authuser":"-","request":"get /wp-content HTTP/1.1","response":200,"bytes":4582,"verb":"GET","@timestamp":"2017-04-04T11:32:29.000Z","timezone":"-0700","@timestamp_utc":"2017-04-04T18:32:29.000Z"}
{"host":"180.153.215.216","ident":"-","authuser":"-","request":"put /search/tag/list HTTP/1.1","response":200,"bytes":1461,"verb":"PUT","@timestamp":"2017-04-04T11:32:29.000Z","timezone":"-0700","@timestamp_utc":"2017-04-04T18:32:29.000Z"}
{"host":"155.233.163.37","ident":"-","authuser":"-","request":"get /explore HTTP/1.1","response":500,"bytes":326,"verb":"GET","@timestamp":"2017-04-04T11:32:29.000Z","timezone":"-0700","@timestamp_utc":"2017-04-04T18:32:29.000Z"}
{"host":"189.176.106.5","ident":"-","authuser":"-","request":"post /search/tag/list HTTP/1.1","response":200,"bytes":3059,"verb":"POST","@timestamp":"2017-04-04T11:32:29.000Z","timezone":"-0700","@timestamp_utc":"2017-04-04T18:32:29.000Z"}
Step 4: Query and analyze the data in Redshift
Step 5: Monitor the streaming data processing
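Both steps can be scripted. The sketch below shows one possible query (hits per HTTP status code) against a hypothetical `apache_logs` table with a `response` column, and one way to pull a Firehose delivery metric from CloudWatch. `AWS/Firehose`, `DeliveryToRedshift.Success`, and the `DeliveryStreamName` dimension are the service's published metric names; the function names, the time window, and the table are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table and column names.
STATUS_COUNT_SQL = (
    "SELECT response, COUNT(*) AS hits "
    "FROM apache_logs GROUP BY response ORDER BY hits DESC, response"
)


def status_counts(conn):
    """Return (status_code, hits) pairs from an open DB-API connection."""
    cursor = conn.cursor()
    cursor.execute(STATUS_COUNT_SQL)
    rows = [(row[0], row[1]) for row in cursor.fetchall()]
    cursor.close()
    return rows


def delivery_metric(cloudwatch, stream_name,
                    metric_name="DeliveryToRedshift.Success",
                    minutes=60, now=None):
    """Fetch 5-minute averages of one Firehose metric, oldest first."""
    end = now or datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Firehose",
        MetricName=metric_name,
        Dimensions=[{"Name": "DeliveryStreamName", "Value": stream_name}],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

`delivery_metric` expects a client such as `boto3.client("cloudwatch")`; a success average below 1 over a period suggests that some COPY attempts into Redshift failed and the delivery logs are worth checking.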
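Summary
- Overview of the big data architecture on AWS
- Challenges of big data processing
- Amazon Kinesis Firehose & Amazon Redshift
- Demo: quickly building a solution for real-time processing and analysis of Apache logs
- What's next?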
AWS Training and Certification roadmap
Thank you. Questions are welcome.
aws.amazon.com/training
aws.amazon.com/certification
Amy Li (李君), Senior Technical Trainer, AWS, amyli@amazon.com