ML pipeline architecture that powers apps like Spotify, Airbnb, and Twitter
Tasks and tools to move your research experiments in Jupyter notebooks to production pipelines using TFX.
What drives the machine learning at Twitter that surfaces the most relevant tweets at the top of my timeline?
How did Spotify scale their ML platform to improve their user-favorite features like recommendations and Discover Weekly?
How does Airbus detect anomalies in a real-time telemetry data stream using an LSTM autoencoder model trained on over 5 trillion data points?
These questions are intriguing because all of these organizations (and others like them) have done an outstanding job of building state-of-the-art ML systems. What's common across all of these cases is how they optimized their ML infrastructure by adopting TensorFlow Extended (TFX) as the core of their ML pipelines.
Building an ML pipeline is a daunting undertaking that requires many different components to be integrated in a seamless manner.
TL;DR: if you're already developing models at scale and want to learn how to build such production-ready pipelines, scroll down to the announcement section.
A brief introduction to ML pipelines
Every data-driven organization that has ML integrated into its product or platform uses ML pipelines to streamline the development and deployment of evolving models in conjunction with incoming new data.
Simply put, an ML pipeline is a sequence of tasks performed to move ML model(s) from an experimental Jupyter notebook (or Google Colab) to a robust application in production.
The bigger the project, the more teams are involved, and the harder it gets to set up the whole process to handle the scale. This challenge has given rise to a new engineering discipline called MLOps.
Data Ingestion
There are a number of ways to ingest data into the machine learning pipeline: you can consume data from your local disk or from any database. TFX converts the ingested data records to tf.Example protos stored in TFRecord files (a format for binary records) for consumption by the downstream components; these binary files make working with huge datasets easy and fast.
Main Tasks:
Connect to a data source, files, or cloud service to retrieve data efficiently.
Split the dataset into training and testing subsets.
Span and version datasets using tools like DVC.
On top of this, you'd need dedicated ingestion methods for structured, text, and image datasets.
Tools & technologies used:
Using TensorFlow Extended
tf.Example and TFRecord; can connect to Cloud BigQuery, Cloud SQL, or S3
DVC for versioning datasets.
Can ingest any type of data: CSV, images, text, etc.
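To make this concrete, here is a minimal sketch of the ingestion step using TFX's ExampleGen component. The `data/` directory and the 2:1 train/eval split are illustrative assumptions, and the exact API surface can differ slightly between TFX versions:

```python
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2

# Hash-bucket split: roughly 2/3 of the records go to the training
# set and 1/3 to the eval set.
output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=1),
    ]))

# Reads every CSV file under data/, converts each row to a
# tf.Example, and writes the splits out as TFRecord files.
example_gen = CsvExampleGen(input_base='data/', output_config=output_config)
```

Other ExampleGen flavors, such as ImportExampleGen for pre-existing TFRecord files, follow the same pattern.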
Data Validation
An early advantage of using TFRecord/tf.Example was the easy support of Tensorflow Data Validation (TFDV) — one of the first components open-sourced by Google from their TFX paper. TFDV allowed our ML engineers to understand their data better during model development, and easily detect common problems like skew, erroneous values, or too many nulls in production pipelines and services. — Spotify Team
The data validation step in the pipeline checks for anomalies and highlights any failures. You can curate new datasets by running them through TFDV and then addressing the flagged issues separately.
TFX offers a library called TFDV that assists you in data validation. TFDV takes in TFRecords (or CSV files) and lets you perform data slicing, data comparison, skew detection, and other types of analysis.
You can also visualize the validation results with Facets, a Google PAIR project.
Main Tasks:
Check datasets for anomalies.
Check for any changes in the data schema.
Highlight changes in the statistics of new data in comparison with the training data.
Compare multiple datasets against one another.
Tools used:
TensorFlow Data Validation (TFDV)
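As a rough sketch (the file paths here are hypothetical), the core TFDV workflow fits in a handful of calls: compute statistics, infer a schema, and validate new data against it:

```python
import tensorflow_data_validation as tfdv

# Compute summary statistics from a TFRecord file of tf.Examples
# (generate_statistics_from_csv does the same for CSV inputs).
train_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='data/train.tfrecord')

# Infer a schema (feature types, domains, expected presence)
# from the training statistics.
schema = tfdv.infer_schema(statistics=train_stats)

# Validate a new dataset against the schema and list anomalies
# such as missing features or out-of-domain values.
eval_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='data/eval.tfrecord')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)  # renders a summary table in a notebook
```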
Feature Transformation
This step adds transformations like one-hot encoding, normalizing quantitative features, renaming features, batch preprocessing, and many others.
TFX offers libraries like TFT (TensorFlow Transform) to preprocess data within the TF ecosystem.
TFT processes the data and returns two artifacts:
Transformed training and testing datasets in the TFRecord format.
An exported transformation graph.
Main Tasks:
Processing feature names, data types, scaling, encoding, PCA, bucketizing, TF-IDF, etc.
Processing data using tf.Transform.
Writing preprocessing functions.
Integrating steps into TFX pipeline.
Tools used:
tf.Transform
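To give a feel for tf.Transform, here is a minimal preprocessing function; the feature names (fare, city, label) are invented for illustration:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Receives raw feature tensors, returns transformed features.

    TFT records every operation here in the transform graph, so the
    identical preprocessing is replayed at serving time.
    """
    outputs = {}
    # Scale a numeric feature to zero mean and unit variance.
    outputs['fare_scaled'] = tft.scale_to_z_score(inputs['fare'])
    # Replace a string feature with integer ids from a learned vocabulary.
    outputs['city_id'] = tft.compute_and_apply_vocabulary(inputs['city'])
    # Pass the label through unchanged.
    outputs['label'] = inputs['label']
    return outputs
```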
Model Training
Training a model in a pipeline has an important upside: exporting all the transformation steps and the model training as one graph eliminates a whole class of errors, such as preprocessing that differs between training and serving.
Main Tasks:
Track your entire model development and experimentation process, and automate it using the TFX pipeline.
Tune hyperparameters in a pipeline.
Saving not only the trained model weights but also the data processing steps and maintaining coherence.
Tools used:
scikit-learn / tf.keras / XGBoost
TFX pipeline
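Hooked into the pipeline, the training step might look like the sketch below. It assumes transform and schema_gen components defined earlier, a hypothetical model.py that provides the training entry point your TFX version expects, and illustrative step counts:

```python
from tfx.components import Trainer
from tfx.proto import trainer_pb2

# The Trainer consumes the transformed examples *and* the transform
# graph, so preprocessing and model weights are exported together.
trainer = Trainer(
    module_file='model.py',  # defines the training entry point
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=10000),
    eval_args=trainer_pb2.EvalArgs(num_steps=500))
```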
Model Evaluation — Analysis and Validation
TensorFlow Model Analysis (TFMA) helps you visualize the model's performance, check fairness (with the What-If Tool), get metrics for distinct groups in the data, compare the model with previously deployed models, and tune hyperparameters within the pipeline itself.
Main Tasks:
Define a number of metrics derived from the KPIs set in the beginning.
Get detailed performance metrics using TensorFlow Model Analysis (TFMA).
Check the model fairness indicators.
Tools used:
TensorFlow Model Analysis (TFMA) — the tensorflow_model_analysis package
What-If Tool
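A sketch of what a TFMA evaluation config looks like, wired into the pipeline's Evaluator component; the label key, the city slicing feature, and the metric choices are placeholders:

```python
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

# Slice metrics overall and per city so performance (and fairness)
# gaps between groups become visible.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    slicing_specs=[
        tfma.SlicingSpec(),                       # overall metrics
        tfma.SlicingSpec(feature_keys=['city']),  # per-group metrics
    ],
    metrics_specs=[tfma.MetricsSpec(metrics=[
        tfma.MetricConfig(class_name='BinaryAccuracy'),
        tfma.MetricConfig(class_name='AUC'),
    ])])

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    eval_config=eval_config)
```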
TensorFlow Serving and the Google Cloud AI Platform
TensorFlow Serving offers a simple and consistent way of deploying models through a model server. On top of that, you can use a web UI to configure your model endpoints on the AI Platform.
Main Tasks:
Choose among three ways of deploying a model: on a model server, in the user's browser, or on an edge device. Identify the best option for your application.
Set up TensorFlow Serving for consistent deployment of your models.
Settle on the communication option that serves your purpose: REST vs. gRPC.
Choose the cloud provider.
Deployment using the TFX pipeline.
Tools & technologies used:
TensorFlow Serving
REST
gRPC
GCP/ AWS
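Once a model is served, querying it over REST takes only a few lines. This sketch assumes a locally running TensorFlow Serving container and a made-up model name and input shape:

```python
import json
import requests

# Assumes a server started along the lines of:
#   docker run -p 8501:8501 \
#     -v "$PWD/serving_model:/models/my_model" \
#     -e MODEL_NAME=my_model tensorflow/serving
# Port 8501 is TensorFlow Serving's default REST port.
url = 'http://localhost:8501/v1/models/my_model:predict'
payload = {'instances': [[1.0, 2.0, 5.0]]}  # one example, three features

response = requests.post(url, data=json.dumps(payload))
print(response.json()['predictions'])
```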
Pipeline Orchestration
Pipeline orchestrators underpin the aforementioned components. The orchestration tool detects when one task/component has finished, knows when to trigger the next task in the workflow, schedules pipeline runs, and more.
Main Tasks:
Automate the ML pipeline by setting up the pipeline orchestrator that underpins all the components above.
Select the tool that is going to run the pipeline.
Orchestrate the pipeline by writing Python configuration code, then set it up and execute it.
Tools & technologies used:
Apache Beam > Apache Airflow > Kubeflow, listed in increasing order of complexity and access to advanced features.
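To tie it all together, the components above are listed in a single Pipeline object and handed to a runner. This sketch uses the local runner with hypothetical names and paths (exact import paths vary between TFX versions); swapping in the Airflow or Kubeflow runner changes the orchestrator without touching the component code:

```python
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.local.local_dag_runner import LocalDagRunner

ml_pipeline = pipeline.Pipeline(
    pipeline_name='my_pipeline',
    pipeline_root='pipeline_root/',  # where artifacts are stored
    components=[example_gen, transform, trainer, evaluator],
    metadata_connection_config=metadata.sqlite_metadata_connection_config(
        'metadata.db'))              # tracks runs and artifact lineage

# Execute the whole DAG locally, in dependency order.
LocalDagRunner().run(ml_pipeline)
```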
✨ Announcement — Cohort Based Course on Building production-ready ML Pipelines✨
Hello! I'm thinking of launching a 3-week-long cohort-based course on building ML production pipelines. I'll start with a beta cohort so I can perfect the material. The beta will be $800, which is a significant discount over the final price.
I (along with my team) will teach live classes, sharing everything we've learned about building robust ML pipelines using Google's TensorFlow Extended and tools like Apache Airflow, Kubeflow, and the Google Cloud Platform: the tech stack powering applications like Spotify, Airbnb, and Twitter, to name a few.
Please fill out this form if you’d be interested in joining:
Goal: The goal is to accelerate your early ML Engineering career.
Value:
You’ll leave this course a more confident and resilient ML Engineer. This is the course I wish I had when I was diving into ML Engineering.
What you’ll learn:
Together, we will unpack every individual component (as shown in the infographic) of the ML pipeline that is required to move a notebook model to a production environment.
You’ll be learning by doing projects.
Material & Pedagogy:
Workshops facilitating active learning and hands-on sessions rather than passive lectures.
Learning with a Cohort of Peers — Zoom breakout groups, an engaged slack community, and group projects.
An applied course with access to study guides, flashcards, and AMAs.
Who should sign up
The course is meant for people who are already training ML models at scale and now wish to learn the art of building complete ML pipelines: data scientists transitioning into a more hands-on engineering role, or new ML engineers less than two years into their career.
If you're interested in joining, fill out this form👇.
Hit me up!
My DMs are open for queries. If you found this useful and would love to see more of it, connect with me on Twitter or LinkedIn. Also, subscribe to my Channel for more content on Data Science.