Learning to Build Robust ML Systems - Resources I am finding useful!
Documentation, courses and inspiration for your weekend.
I am going to keep this one short (but valuable 🤞) without any detailed theme, since I got my first jab of the vaccine today (Thursday, May 13) and my body isn’t letting me sit at my desk for long hours.
So, here are a few things that have kept me occupied this week outside of my work.
#1 Feature Engineering with TensorFlow
I am learning a lot about engineering practices in ML as I work towards finalizing the curriculum for one of my ML Engineering courses. A significant amount of time goes into curating data and making it compatible with the model, and feature engineering plays a critical role in building robust ML systems.
This documentation from the Google Cloud team provides a brief introduction to almost all the important feature engineering techniques and how you can implement them using TensorFlow.
I was familiar with almost all of them except feature crossing. I’ll be creating a detailed tutorial on it, as feature crossing (with TF) and embeddings are two important techniques that hugely impact the performance of ML models in practice when you have a considerable amount of data to train on.
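To give you a taste before that tutorial: here is a minimal, framework-free sketch of the idea behind a hashed feature cross — combine two categorical features into one composite feature, then hash it into a fixed number of buckets so the crossed vocabulary doesn’t explode. (The function name and separator below are my own; in TF itself this pattern is provided by a crossing preprocessing layer.)

```python
# Sketch of a hashed feature cross: two categorical values are joined
# into one composite token, which is then hashed into a fixed number
# of buckets. The model can learn an embedding/weight per bucket.
import hashlib

def hashed_cross(feature_a: str, feature_b: str, num_bins: int) -> int:
    """Cross two categorical values and hash the pair into num_bins buckets."""
    combined = f"{feature_a}_X_{feature_b}"
    # A stable hash (unlike Python's per-process salted hash()) keeps
    # bucket assignments consistent across runs.
    digest = hashlib.md5(combined.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

# Example: crossing 'city' with 'device' lets a linear model capture
# interactions like "mobile users in Berlin behave differently".
bucket = hashed_cross("berlin", "mobile", num_bins=1000)
```

The point of the hashing step is the memory/collision trade-off: a smaller `num_bins` saves parameters but makes unrelated crosses share a bucket.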
#2 Full Stack Deep Learning course
If you’re looking for a resource on building complete ML (especially Deep Learning) systems, this is a freely available course with lecture videos, useful articles, and labs to practice what you learn. The course is designed for people who are already familiar with Deep Learning and now want to learn how to build production-ready systems.
I am currently going through the material and have found it useful so far. I am not sure how comprehensive the course is, but it will definitely help you understand what goes into developing a complete ML pipeline, with a special focus on testing, setting up CI/CD, deployment, and monitoring.
#3 How Spotify Leverages TFX and Kubeflow to build Scalable ML Systems
This blog is listed as one of the case studies on the TensorFlow Extended page. It is a high-level technical article in which Spotify’s Research and Engineering team explains why and how they moved their infrastructure to TFX and Kubeflow.
It is an insightful read, diving into the problems encountered while platformizing the ML experience. They wanted their ML Engineers to spend more time running experiments instead of maintaining data and backend code to support their deployed models.
Apparently, leveraging TFX for pipeline creation and Kubeflow for orchestration has helped their infrastructure evolve and enabled faster iterations, with a primary focus on feature creation and model experimentation.
You can read about the architectural changes that they have made in this article.
Interesting Read of the Week!
Running experiments is a key responsibility of a data scientist, and the more experiments you can run, the better.
I came across this article by a former Data Scientist at Airbnb, in which she delineates four principles that helped their team scale from running 100 experiments a week to over 700.
Key highlights include adding sanity metrics to experiments to ensure proper exposure to the target population and understanding base rates of the hypothesis you’re testing.
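As one concrete illustration of a sanity metric (my own sketch, not code from the article): a sample-ratio-mismatch check verifies that the observed control/treatment split actually matches the intended exposure before you trust any downstream metric. Here is a minimal version using a two-sided z-test on the treatment proportion:

```python
# Sketch of a sample-ratio-mismatch (SRM) sanity check: compare the
# observed treatment share against the designed allocation with a
# two-sided z-test. A failure usually means broken assignment or
# logging, so the experiment's results shouldn't be trusted.
import math

def srm_check(n_control: int, n_treatment: int,
              expected_treatment_ratio: float = 0.5,
              alpha: float = 0.001) -> bool:
    """Return True if the observed split is consistent with the design."""
    n = n_control + n_treatment
    observed = n_treatment / n
    p = expected_treatment_ratio
    se = math.sqrt(p * (1 - p) / n)
    z = (observed - p) / se
    # Two-sided p-value from the normal CDF, written via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_value > alpha

# 5050 vs 4950 on a 50/50 design is a plausible random wobble and passes;
# 6000 vs 4000 on the same design fails loudly.
```

A very small `alpha` is deliberate here: with large sample sizes, even tiny systematic imbalances produce huge z-scores, so the check only fires on genuinely broken exposure.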