Architecture and Practical Applications of a Large-Scale Machine Learning Platform

Cartel

Published 2017/10/18 in the Technology category

QCon  QCon2017 

Text content
1. Designing Machine Learning Platform
6. Agenda ●  How ML works ●  ML platform ●  ML Case Study ●  Key Challenges: Deep dive ●  System Architectures
7. The Life Cycle of a Model ●  Data (GET DATA) ○  Data sources, batch or streaming, data aggregation, signal selection ... ●  Training in various environments (TRAIN MODELS) ○  Fast iteration, traditional ML vs DL, training environments ... ●  Model evaluations (EVALUATE MODELS) ○  Standard evaluation vs. customized evaluation ●  Model deployment (DEPLOY MODELS) ○  Versioning, production ●  Inference (MAKE PREDICTIONS) ○  Batch predictions vs online predictions, scaling out, SLA ... ●  Monitoring (MONITOR PREDICTIONS)
8. ML PLATFORM MISSION Enable engineers and data scientists across the company to easily build and deploy machine learning solutions at scale
9. Design of ML Platform ●  ML as a Service ●  Scalable infrastructure for training and serving ●  Workflow tools for prototyping, iteration, and productionization ●  Model and data serving with full monitoring for batch and real-time ●  Scope ○  Traditional ML & Deep Learning ○  Supervised, Unsupervised and Semi-supervised ○  Online learning

11. Example Use Case: UberEATS
12. MEAL DELIVERY TIME
13. UberEATS Delivery Time Models ●  Features ○  Curated features ○  Request Level Features – user’s current location ●  Models ○  Several models for different stages of an order ○  GBDT Regression ○  Different versions of each for experimentation
14. Key Challenges ●  Guarantee same data for batch training and online scoring ●  Train and deploy separate model per city ●  One-click deploy & easy scale out ●  Live monitoring of model performance 

15. Challenge 1: Same data for train & predict
16. Data Sources: Problems ●  Request Level Features ○  Training: Batch ○  Online scoring: Given by user ●  Aggregated Feature ○  Training: Batch ○  Online scoring: Curated in batch, consumed by query ●  Near Realtime Aggregated Feature ○  Training: Batch ○  Online scoring: Curated in streaming, consumed by query ●  Generation Pattern: Batch and streaming ●  Consuming Pattern: Batch and query
17. Data Sources (Solutions) ●  Data Storage ○  Spark for batch jobs ○  Cassandra for online jobs ○  Streaming jobs ●  Data Accessors ○  Own DSL ○  Access basis features, curated features, and column stats ●  Data Transformation ○  DSL ○  Standard transformation functions + UDFs ●  Diagram ○  Batch Training Job (Spark): DSL, Training Algo, Trained Model ○  Online Prediction Job (Spark): DSL, Trained Model
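The idea on slides 15 to 17, one transformation definition shared by the batch training job and the online prediction job, can be sketched in a few lines. This is only an illustration under stated assumptions: the talk's own DSL is not shown in the slides, so plain Python callables stand in for it, and every feature name and function name below (`FEATURE_SPEC`, `transform`, `build_training_matrix`, `features_for_request`) is hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical feature specification: output name -> function of a raw record.
# The platform in the talk uses its own DSL here; Python callables are a stand-in.
FeatureSpec = Dict[str, Callable[[dict], float]]

FEATURE_SPEC: FeatureSpec = {
    "prep_time_min": lambda r: r["avg_prep_time_sec"] / 60.0,
    "is_dinner_hour": lambda r: 1.0 if 17 <= r["hour_of_day"] <= 21 else 0.0,
    "distance_km": lambda r: r["distance_m"] / 1000.0,
}


def transform(record: dict, spec: FeatureSpec = FEATURE_SPEC) -> List[float]:
    """Turn one raw record into a feature vector with a fixed column order."""
    return [spec[name](record) for name in sorted(spec)]


def build_training_matrix(records: Iterable[dict]) -> List[List[float]]:
    """Batch path: apply the shared transform to every historical record."""
    return [transform(r) for r in records]


def features_for_request(request: dict, stored: dict) -> List[float]:
    """Online path: merge request-level fields with stored (curated) features,
    then apply the same transform that was used for training."""
    return transform({**stored, **request})
```

Because both paths call the same `transform`, a change to a feature definition reaches training and online scoring together, which is the guarantee Challenge 1 asks for.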
18. Challenge 2: Partition Model
19. PROBLEM ●  Often you want to train a model per city ●  Hard to train and deploy a few hundred individual models SOLUTION ●  Let users define a hierarchical partitioning scheme ●  Automatically train a model per partition ●  Manage and deploy as a single logical model

20. Step 1: Define partition scheme (diagram: GLOBAL / COUNTRY / CITY hierarchy)
21. Step 2: Make train / test split (diagram: GLOBAL / COUNTRY / CITY hierarchy)
22. Step 3: Keep same split and partition for each level (diagram: GLOBAL / COUNTRY / CITY hierarchy)
23. Step 4: Train model for every node (diagram: a model M at every GLOBAL, COUNTRY, and CITY node)
24. Step 5: Prune bad models (diagram: some models M removed from the hierarchy)
25. Step 6: At serving time, route to best model for each node (diagram: remaining models M in the GLOBAL / COUNTRY / CITY hierarchy)
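The six steps on slides 20 to 25 can be sketched as follows. The slides do not say how a model is judged "bad" or what kind of model sits at each node, so this sketch makes two loudly labeled assumptions: each node's model is a trivial mean predictor, and a node is pruned when it has too few training rows or does not beat its nearest surviving ancestor on the shared test split. All names (`train_tree`, `best_model`, `min_rows`) are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

# A partition path, e.g. ("GLOBAL",), ("GLOBAL", "US"), ("GLOBAL", "US", "SF").
Path = Tuple[str, ...]
Row = Tuple[Path, float]  # (city-level path, target value)


def mae(model: float, targets: List[float]) -> float:
    """Mean absolute error of a constant prediction against the targets."""
    return mean([abs(model - t) for t in targets])


def best_model(models: Dict[Path, float], path: Path) -> Optional[float]:
    """Step 6: route to the deepest surviving model on the path."""
    for depth in range(len(path), 0, -1):
        if path[:depth] in models:
            return models[path[:depth]]
    return None


def train_tree(train: List[Row], test: List[Row], min_rows: int = 2) -> Dict[Path, float]:
    """Steps 3 to 5: train one model per node on the same split, then prune.

    Assumed pruning rule (not from the slides): keep a non-root node only if
    it has at least `min_rows` training rows and a lower test MAE than its
    nearest surviving ancestor.
    """
    nodes = {p[:d] for p, _ in train for d in range(1, len(p) + 1)}
    models: Dict[Path, float] = {}
    for node in sorted(nodes, key=len):  # parents are handled before children
        tr = [y for p, y in train if p[:len(node)] == node]
        te = [y for p, y in test if p[:len(node)] == node]
        candidate = mean(tr)  # stand-in "model": predict the training mean
        parent = best_model(models, node[:-1])
        if parent is None:  # root node is always kept as the fallback
            models[node] = candidate
        elif len(tr) >= min_rows and te and mae(candidate, te) < mae(parent, te):
            models[node] = candidate
    return models
```

A city whose own model was pruned, or that never appeared in training, is served by its country model, and failing that by the global model, so callers still see a single logical model.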
26. Challenge 3: One-click deploy and scale out
27. REALTIME PREDICT SERVICE ●  Predict service ○  RPC service container for one or more models ○  Scale out in Docker on Mesos ○  Single- or multi-tenant deployments ○  Connection management and batched/parallelized queries to Cassandra ○  Monitoring & alerting 
 ●  Deployment ○  Each model is packed individually ○  One click deploy across DCs via standard deployment infrastructure ○  Health checks and rollback
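A minimal sketch of the deployment behavior listed on slide 27: one service holding several individually packaged models, with a health check on deploy and a rollback to the previous version. The slides do not describe the actual service API, so the class `ModelRegistry`, its methods, and the rule "a failing health check leaves the current version in place" are all assumptions made for illustration; the real service is an RPC container running in Docker on Mesos.

```python
from typing import Callable, Dict, List, Sequence

Model = Callable[[Sequence[float]], float]


class ModelRegistry:
    """Hypothetical in-process stand-in for a multi-model predict service."""

    def __init__(self) -> None:
        # model name -> deployed versions, newest last
        self._versions: Dict[str, List[Model]] = {}

    def deploy(self, name: str, model: Model, health_check: Callable[[Model], bool]) -> bool:
        """Activate a new version only if it passes the health check."""
        if not health_check(model):
            return False  # the current version keeps serving
        self._versions.setdefault(name, []).append(model)
        return True

    def rollback(self, name: str) -> bool:
        """Return to the previous version, if one exists."""
        versions = self._versions.get(name, [])
        if len(versions) < 2:
            return False
        versions.pop()
        return True

    def predict(self, name: str, features: Sequence[float]) -> float:
        """Score with the newest deployed version of the named model."""
        return self._versions[name][-1](features)
```

Keeping older versions in the registry is what makes rollback a constant-time switch rather than a redeploy.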
28. Challenge 4: Live model performance monitoring
29. LIVE PREDICTION MONITORING Problem •  Ensure deployed model is making good predictions Solution •  Log predictions •  Join logged predictions to actual outcomes •  Publish metrics for monitoring and alerting •  Optionally hold back logged predictions
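The monitoring loop on slide 29 (log predictions, join them to actual outcomes, publish metrics, alert) can be sketched as below. In the talk this runs as a Spark job; the sketch uses plain Python, and the join key, the choice of MAE and RMSE as metrics, and the threshold-based alert are assumptions rather than details from the slides.

```python
from math import sqrt
from typing import Dict, Iterable, List, Tuple

Joined = Tuple[str, float, float]  # (prediction id, predicted, actual)


def join_predictions_to_outcomes(
    predictions: Iterable[Tuple[str, float]], outcomes: Dict[str, float]
) -> List[Joined]:
    """Inner join on a shared id; predictions without an outcome yet are skipped."""
    return [(pid, pred, outcomes[pid]) for pid, pred in predictions if pid in outcomes]


def compute_metrics(joined: List[Joined]) -> Dict[str, float]:
    """Error metrics over the joined rows (assumed metrics: MAE and RMSE)."""
    errors = [pred - actual for _, pred, actual in joined]
    n = len(errors)
    if n == 0:
        return {"count": 0}
    return {
        "count": n,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(sum(e * e for e in errors) / n),
    }


def should_alert(metrics: Dict[str, float], mae_threshold: float) -> bool:
    """Assumed alert rule: fire when MAE exceeds a fixed threshold."""
    return metrics.get("count", 0) > 0 and metrics["mae"] > mae_threshold
```

Outcomes for a delivery-time model only arrive once the order completes, which is why the join skips predictions that have no outcome yet instead of treating them as errors.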
30. System Architecture
31. Architecture diagram, build 1 ●  Stages: GET DATA, TRAIN MODELS, EVAL MODELS, DEPLOY, PREDICT & MONITOR ●  HADOOP / YARN (Batch) ○  Data Prep Job (Spark / SQL): Features, Outcomes (Training Set) ○  Batch Training Job (Spark): Training Algo, Trained Models
32. Architecture diagram, build 2 ●  Adds: Deploy of Trained Models ●  Adds: Batch Predict Job (Spark) with Trained Model and Features ●  Adds: Predictions to Hive & Kafka
33. Architecture diagram, build 3 ●  Adds: ETL ●  Adds: Hive Feature Store
34. Architecture diagram, build 4 ●  Adds: MESOS / DOCKER (Realtime) ○  Realtime Predict Service with Trained Model ○  Client Service sends Features, receives Prediction
35. Architecture diagram, build 5 ●  Adds: Stream Engine ●  Adds: Cassandra Feature Store
36. Architecture diagram, build 6 ●  Adds: Sampled Predictions
37. Architecture diagram, build 7 ●  Adds: Performance Monitor Job (Spark) ●  Adds: Metrics to Monitor System
38. Architecture diagram, build 8 ●  Same components as slide 37
39. Management ●  Monitor ●  API ●  Python / Java