Your Guide to Avoiding Critical Errors with Machine Learning in Production
I’ll never forget the first time I got a PagerDuty alert telling me that model scores weren’t being returned properly in production.
Panic set in — I had just done a deploy, and my mind started racing with questions:
- Did my code cause a bug?
- Is the error causing an outage downstream?
- What part of the code could be throwing errors?
Debugging live systems is stressful, and I learned a critical lesson: writing production-ready code is a completely different beast from writing code that works in a Jupyter Notebook.
In 2020, I made the leap from data analyst to machine learning engineer (MLE). While I was already proficient in SQL and Python, working with production systems forced me to level up my skills.
As an analyst, I mostly cared that my code ran and produced the correct output. That mindset didn't carry over: as an MLE, I quickly realized I had to write efficient, clean, and maintainable code that could live in a shared codebase.