Big Data Analytics as Augmented by Artificial Intelligence
Dr. Alex Liu (刘永川), Chief Data Scientist at IBM Analytics
Nov 5, 2017, 2017 Annual Meeting of the Association of Chinese-American Professors and Professionals (美国华裔教授专家协会)
RMDS-02-12-2015-1
Self Introduction: Alex Liu (刘永川)
One of IBM's experts in big data analytics and a chief data scientist for IBM analytics services.
Before joining IBM, served as chief data scientist for several corporations, including TRG, Yapstone, and Retention Science.
Taught advanced analytics at USC and UC Irvine as an adjunct professor.
Holds a Ph.D. in Quantitative Sociology and an M.S. in Statistical Computing from Stanford University.
Everyone asks: Will researchers be replaced by machines? Will data analytics be replaced by machines?
1. Data analytics definition: defining the data analytics process
2. Big data analytics needs AI: big data analytics cannot do without AI assistance
3. AI augmentation examples
4. Integrated platform: the integrated-platform solution
Data Analytics Process I: Definition of the Data Analytics Process
Data Sources, Data Storage, Data Cleaning, Feature Extraction
Models: Regression, Decision Tree, Bayesian & Causality, Time Series
Algorithms & Computing: MLE, RMS, Iterative (MapReduce & Spark), R, SPSS
Statistics & Visualization: RMSE, Confusion Matrix, ROC Curve
Business Acumen, Subject Knowledge, Communication
RM4Es: Data, Equation, Estimation, Evaluation, Explanation/Execution
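The evaluation measures named on this slide (RMSE, confusion matrix, ROC curve) can be illustrated with a minimal sketch. The function names and the binary 0/1 label convention below are assumptions made for illustration; they do not come from the slides.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def confusion_matrix(actual, predicted):
    """Count outcomes for a binary problem (assumed labels: 0 and 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts

def roc_point(counts):
    """One (false positive rate, true positive rate) point of an ROC curve."""
    positives = counts["tp"] + counts["fn"]
    negatives = counts["fp"] + counts["tn"]
    fpr = counts["fp"] / negatives if negatives else 0.0
    tpr = counts["tp"] / positives if positives else 0.0
    return fpr, tpr
```

A full ROC curve would repeat roc_point over many score thresholds; a single point is shown here to keep the sketch short.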
Data Scientist Workflow: Detailed Definition of the Data Analytics Process
Stages: Ingestion, Selection, Preparation, Generation, Transform, Model, Execution
Tasks across the stages: Retrieval, Storage, Formatting, Data Source Selection, Data Composition, Data Linkage, Concept Extraction, Filtering, Missing Values, Smoothing, Normalization, Aggregation, Construction, Labelling, Data Augmentation, Feature Selection, Feature Space Transformation, Regression, Classification, (Re-)Deployment, Re-Training, Monitoring, Explanations, Written Report, Best-Worst Case Scenarios
Example: Oil rig monitoring (e.g. ConocoPhillips): noisy sensor streams, cleaned sensor streams, model
IBM Research
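The preparation tasks listed on this slide (missing values, smoothing, normalization) can be sketched for a noisy sensor stream as follows. This is a minimal illustration under stated assumptions: forward fill, a trailing moving average, and min-max scaling are choices made here, not methods named in the slides, and all function names are hypothetical.

```python
def fill_missing(values):
    """Forward-fill None readings; leading gaps take the first observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("stream contains no observed values")
    last = observed[0]
    filled = []
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def moving_average(values, window=3):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def min_max_normalize(values):
    """Rescale values to the range [0, 1]; a constant stream maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def clean_stream(raw, window=3):
    """Chain the three preparation steps in the order listed on the slide."""
    return min_max_normalize(moving_average(fill_missing(raw), window))

# Illustrative readings (made up): one dropped reading and one spike.
cleaned = clean_stream([10.0, None, 12.0, 50.0, 11.0])
```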
Too Many Choices at the Model Building Stage
More than 50 different models, among them SVM, neural nets, decision trees/forests, Naïve Bayes, regression, SMO, k-nearest neighbor, clustering, and rules.
A combinatorially explosive number of parameter choices per algorithm, across dozens of algorithms or more: kernel type, pruning strategy, number of trees in a forest, learning rate.
Wide variation in performance across algorithm implementations in different software systems (e.g., SPSS vs. Python vs. WEKA vs. Spark).
User-defined algorithms, with many different ways for the user to interact with the machine.
Substantial cost in user and compute time.
The user spends time trying new combinations and parameters.
The computational cost of training a single SVM can exceed 24 hours.
Selection is commonly based on the data scientist's bias.
Each additional pipeline stage increases complexity dramatically!
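To make the combinatorial growth concrete, the sketch below counts the configurations in a small search space. The models, parameter names, and value grids are illustrative assumptions, far smaller than the 50-plus models the slide mentions.

```python
from itertools import product
from math import prod

# Hypothetical parameter grids per model (illustrative values only).
search_space = {
    "svm": {"kernel": ["linear", "rbf", "poly"], "C": [0.1, 1, 10, 100]},
    "random_forest": {
        "n_trees": [50, 100, 500],
        "max_depth": [4, 8, 16, None],
        "pruning": ["none", "cost_complexity"],
    },
    "neural_net": {"learning_rate": [0.001, 0.01, 0.1], "hidden_layers": [1, 2, 3]},
}

# Hypothetical choices for one extra pipeline stage placed before the model.
preprocessing = {
    "scaling": ["none", "standard", "min_max"],
    "feature_selection": ["none", "top_k"],
}

def count_grid(grid):
    """Number of parameter combinations in one grid."""
    return prod(len(options) for options in grid.values())

def count_pipelines(models, stage):
    """Each model configuration can be paired with each stage configuration."""
    return sum(count_grid(g) for g in models.values()) * count_grid(stage)

def enumerate_grid(grid):
    """Yield every parameter combination of one grid as a dict."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))

model_only = sum(count_grid(g) for g in search_space.values())  # 12 + 24 + 9 = 45
with_stage = count_pipelines(search_space, preprocessing)        # 45 * 6 = 270
```

Even in this toy space, one extra stage with six configurations multiplies the count from 45 to 270, which is the effect the last bullet describes.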
Challenges for Researchers: Summary of Challenges for Researchers (Data Analysts)
Too much data to import and manage
Too much data cleaning to complete
Too many analytical methods to select from
Too many algorithms to select from
Too many computing tools to select from
Too many IT systems to select from
Need AI to Automate and Augment: AI assistance is indispensable, and automated data analysis systems keep emerging
AI to automate some research flows
AI to augment all researchers
MIT's automated data-analysis system outperforms 615 of 906 human teams.
Model Selection via Data Allocation using Upper Bounds (DAUB): An Example of AI-Assisted Model Selection
[Selecting Near-Optimal Learners via Incremental Data Allocation, AAAI, 2016]
Figure: training data, built model, and prediction accuracy versus number of data points, with learning curves for Logistic Regression, Random Forest, and SVM.
Pipelines are ranked by an upper-bound estimate of each pipeline's performance (slope of the learning curve).
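The ranking idea on this slide can be sketched roughly as follows: give each learner a little data, extrapolate its learning curve linearly to the full data size to get an optimistic upper bound, and keep allocating more data to whichever learner currently has the highest bound. This is a simplified illustration, not the algorithm as specified in the cited paper. The learners here are simulated learning curves (hypothetical functions of the form ceiling minus c over the square root of n), standing in for real training runs.

```python
import math

def daub_sketch(learners, total_points, start=100, growth=2.0):
    """learners maps a name to a function n -> validation accuracy after
    training on n points. Returns the selected name and the allocation history."""
    history = {}
    # Bootstrap: two small allocations per learner give a first slope estimate.
    for name, train_eval in learners.items():
        n1 = min(start, total_points)
        n2 = min(int(start * growth), total_points)
        history[name] = [(n1, train_eval(n1)), (n2, train_eval(n2))]

    def upper_bound(points):
        # Linear extrapolation of the last learning-curve segment to all data.
        (n_prev, acc_prev), (n_last, acc_last) = points[-2], points[-1]
        if n_last == n_prev:
            return acc_last
        slope = max(0.0, (acc_last - acc_prev) / (n_last - n_prev))
        return min(1.0, acc_last + slope * (total_points - n_last))

    while True:
        best = max(history, key=lambda name: upper_bound(history[name]))
        n_last = history[best][-1][0]
        if n_last >= total_points:
            # The most promising learner has already seen all the data.
            return best, history
        n_next = min(int(n_last * growth), total_points)
        history[best].append((n_next, learners[best](n_next)))

# Simulated learning curves (assumed shapes, for illustration only).
toy_learners = {
    "logistic_regression": lambda n: 0.80 - 0.5 / math.sqrt(n),
    "random_forest": lambda n: 0.90 - 2.0 / math.sqrt(n),
    "svm": lambda n: 0.85 - 1.0 / math.sqrt(n),
}

selected, allocations = daub_sketch(toy_learners, total_points=10_000)
```

In this toy run only the selected learner is trained on all 10,000 points; the weaker learners stop early, which is where the saving in compute time comes from.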
Data Scientist Workflow: Automating Data Generation (An Example of AI-Assisted Feature Generation)
Stages: Ingestion, Selection, Preparation, Generation, Transform, Model, Execution
Tasks across the stages: Retrieval, Storage, Formatting, Data Source Selection, Data Composition, Data Linkage, Concept Extraction, Filtering, Missing Values, Smoothing, Normalization, Aggregation, Construction, Labelling, Data Augmentation, Feature Selection, Feature Space Transformation, Regression, Classification, Explanations, Written Report, Best-Worst Case Scenarios
Diagram annotations: Largely Automated; Feature Generation; Automated Feedback
Often very time-consuming (e.g., 70% of end-to-end completion time)
Requires domain knowledge
Depends on the data scientist's bias
© 2015 International Business Machines Corporation
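One simple way to picture automated feature generation with feedback is to construct candidate features mechanically and score each against the target. The sketch below is an assumption-laden illustration: the transforms (square, log, pairwise product and difference) and the correlation-based ranking are choices made here, not the method used in the IBM system described on the slide.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def generate_features(columns):
    """columns maps a name to a list of numbers; returns base plus derived columns."""
    features = dict(columns)
    for name, values in columns.items():
        features[f"{name}^2"] = [v * v for v in values]
        features[f"log1p(|{name}|)"] = [math.log1p(abs(v)) for v in values]
    for a, b in combinations(columns, 2):
        features[f"{a}*{b}"] = [x * y for x, y in zip(columns[a], columns[b])]
        features[f"{a}-{b}"] = [x - y for x, y in zip(columns[a], columns[b])]
    return features

def rank_features(features, target, top_k=3):
    """Feedback step: keep the candidates most correlated with the target."""
    scored = [(abs(pearson(values, target)), name) for name, values in features.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Made-up data: the target is an area, so the product feature should rank first.
base = {"width": [1.0, 2.0, 3.0, 4.0], "height": [2.0, 1.0, 4.0, 3.0]}
area = [w * h for w, h in zip(base["width"], base["height"])]
candidates = generate_features(base)
top = rank_features(candidates, area)
```

A domain expert would have proposed the product of width and height directly; the sketch recovers it by search, which hints at how automation can reduce the dependence on domain knowledge noted above.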
A workstation approach is needed for new research methods: integrated platforms are the trend.
Integration into Data Science Experience (DSX) / WDP: Introduction to the IBM Platform I
IBM Data Science Experience (DSX)
Core Attributes of the Data Science Experience: Introduction to the IBM Platform II
IBM Data Science Experience
Community: find tutorials and datasets; connect with data scientists; ask questions; read articles and papers; fork and share projects.
Open Source: code in Scala/Python/R/SQL; Jupyter and Zeppelin* notebooks; RStudio IDE and Shiny apps; Apache Spark; your favorite libraries.
IBM Added Value: data shaping/pipeline UI*; auto-data preparation and modeling*; advanced visualizations*; model management and deployment*; documented model APIs*.
Powered by IBM Next Generation Platform in the Cloud
Spark as a Service
* DSX product roadmap items