MLOps is not just about tools or technology
MLOps has gained tremendous popularity over the past few years as a practice that promises to apply the learnings of DevOps to machine learning. It strives to streamline the arduous process of building robust, reliable and scalable machine learning systems that are ready to face end-users.
Yet, despite the promises of MLOps, as of 2022 it’s estimated that less than 20% of machine learning models developed by data scientists actually make it into production.
Why is it that so few companies achieve ML in production, and even fewer can do so reliably and efficiently?
The answer lies in the split between business and IT, between Dev and Ops.
Data science teams are great at staying close to the business. They have learned that quickly building simple models and iterating upon them is the fastest way to provide results to their business stakeholders. Exploration, ideation and quick iteration are the core traits that have rewarded data science teams, and rightfully so. The world of IT, in contrast to the world of data science, has been much more mature. Stability, reliability and robustness have been the name of the game.
Given the difference between these two worlds, what happens when data scientists have produced something that is important to the business, but is not a robust and reliable solution that can scale to production? These worlds clash. Data science teams are not used to running their software in production, nor do they have the expertise to do so. When this is the case, they need an IT/Ops department to productionize their solutions. However, IT/Ops teams don't want to deal with all the hacky solutions that data scientists have built. What ends up happening is often one of three things:
The data science model is never moved to production, and its true value is never realized.
The data science model is handed over to IT as a black box, with many quality issues hidden inside it and often no clear path to solving them. It should be no surprise that IT is unable to monitor the quality of predictions for a model it was not involved in developing.
The data science model is completely rebuilt by an IT team, with tremendous effort and time investments. This results in the business waiting months, if not years, before they are able to use their high-value data science model. In addition, these IT teams often do not have the right data science skillset to improve upon the model itself once it has been productionized.
What's the cause of this gap?
To understand the root cause of these problems, we will look at three key areas: People, Process and Technology.
People: Data scientists are not engineers
Scientists have different interests, skillsets and backgrounds than engineers, and this does not have to be a problem. Having unicorn hybrid data scientists/engineers who are able to do everything is a utopia, especially in a job market where both scientists and engineers are already highly sought after. However, keeping scientists and engineers far apart from each other, in different teams or even in different departments of an organization (data scientists in the business, engineers in IT), is a bad idea. The further they are kept apart, the more difficult it is to create a shared language and understanding, and the harder it is to foster good collaboration between the two.
Process: Data science teams should understand that the true value lies in production.
Data science and IT teams have different goals and processes in place. Data science teams are focused on exploring new ideas and quickly iterating towards PoCs. IT, on the other hand, is often more focused on building and maintaining slow-moving, reliable and scalable solutions. A hybrid between the two mindsets is needed: one that understands the need for quick iteration and exploration, yet is able to build reliable and scalable solutions.
Technology: Data science teams don't use the same tools as IT.
The core stack of data scientists revolves around being able to quickly iterate and explore. The tools associated with this are often Python, Pandas, scikit-learn and Jupyter notebooks. The engineering stack, in contrast, is built around technologies focused on robustness and scalability: Spark, Scala and other JVM-based languages. Because the tool stack that data scientists use differs from that of the rest of the IT organization, there is often no clear place for them to run their applications on the platform: the platform does not support or enable data science teams to productionize their applications.
You need end-to-end teams to do MLOps.
To prevent the aforementioned gaps, an obvious solution is to put all the people involved in building a data science product in a single team. This has several benefits: there are short communication lines between the data scientists and engineers, it is easy to establish common ground between all experts, and it is trivial to take end-to-end ownership of the ML product. On the downside, though, such a team might take on a lot of responsibilities and require a substantial number of team members covering many different areas of expertise. Not only are ML engineers and data scientists needed; data and platform engineers are also needed to provide them with the necessary infrastructure and raw data.
In an organization where multiple data science use-case teams are operating, such a setup might be inefficient. The organization could benefit from reusability between multiple teams. A good way to split up a team whose scope is growing too large is to separate a platform team from a use-case team. In such a setup, each team (platform and use-case) is still responsible for the end-to-end life cycle of its product (the platform or the data science use-case, respectively).
Thus, rather than a horizontal split between teams over the life cycle of a product (see Figure 1), it is better to have a team structure where the split is vertical: platform and use-case (see Figure 2).
Figure 1: Team structure with split Dev and Ops.
Figure 2: Dev and Ops merged into a single end-to-end team.
What makes a good use-case team?
In a good data science use-case team, at least two roles are present: data scientists and machine learning engineers. Together, they are responsible for each stage of the machine learning life cycle: development, productionization and operations. These roles are distinct, but each of them bears responsibility during the entire life cycle.
During development, the data scientist is focused on exploration and building models to solve business problems. The machine learning engineer supports the data scientist by providing software engineering guidance and best practices, and by setting up the continuous integration (CI) pipelines.
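As an illustration, here is a minimal sketch of the kind of check such a CI pipeline could run on every commit. The `train_model` function and the iris dataset are hypothetical stand-ins for the team's actual training code and data:

```python
# A minimal sketch of a CI-checked test for model code; names and data are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_model(X, y):
    """Hypothetical training function owned by the data scientist."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model


def test_model_beats_baseline():
    # Runs on every commit, so regressions are caught long before production.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = train_model(X_train, y_train)
    assert model.score(X_test, y_test) > 0.8
```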
During productionization, the data scientists focus on refactoring their PoC model into a production-ready Python package, whilst the machine learning engineer focuses on setting up the orchestration pipelines and monitoring around the model, transforming the machine learning model into a machine learning system. Data scientists might not currently have all the skills necessary to write production-quality code; in that case, it's great that there are engineers around who can coach them to do so.
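To make this concrete, here is a rough sketch of what that refactoring can look like: notebook cells become importable, testable functions in a package module. The module layout, function names and model choice below are illustrative, not a prescribed structure:

```python
# my_project/train.py - a sketch of a packaged training step (names are hypothetical).
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


def build_pipeline() -> Pipeline:
    """All preprocessing lives in the pipeline, not in ad-hoc notebook cells."""
    return Pipeline(
        [
            ("impute", SimpleImputer(strategy="median")),
            ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
        ]
    )


def train(features: pd.DataFrame, target: pd.Series, model_dir: Path) -> Path:
    """Train and persist the model so an orchestration pipeline can call this as one step."""
    pipeline = build_pipeline().fit(features, target)
    model_path = model_dir / "model.joblib"
    joblib.dump(pipeline, model_path)
    return model_path
```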
Finally, during operations, the data scientist focuses on monitoring the machine learning model, while the machine learning engineer focuses on monitoring the rest of the machine learning system.
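A minimal sketch of what such model-level monitoring could look like, assuming the team compares the live prediction distribution against a reference window; the alerting step is a hypothetical placeholder for whatever the platform provides:

```python
# Sketch of a drift check on prediction scores; the data and alert hook are illustrative.
import numpy as np


def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between two score distributions; values above 0.2 are a common drift warning level."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + 1e-6
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + 1e-6
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference_scores = rng.beta(2, 5, size=10_000)  # scores observed at validation time
    live_scores = rng.beta(3, 4, size=10_000)       # scores observed in production
    psi = population_stability_index(reference_scores, live_scores)
    if psi > 0.2:
        print(f"Prediction drift detected (PSI={psi:.2f}) - alert the team")
```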
What makes a good platform?
The use-case team relies on a platform team that should enable them to build out and productionize their use-case quickly. The challenge for such a platform team often comes down to the following apparent trade-off: how do you ensure that the platform provides high-enough-level abstractions to make it easy to quickly develop and productionize machine learning applications, whilst at the same time keeping it flexible enough that the individual needs of each use-case can be accommodated?
The [golden path](https://engineering.atspotify.com/2020/08/how-we-use-golden-paths-to-solve-fragmentation-in-our-software-ecosystem/) and paved path models, as popularized by Spotify, are frameworks that can be of great use here. According to Spotify, "The idea behind having Golden Paths is not to limit or stifle engineers, or set standards for the sake of it. With Golden Paths in place, teams don’t have to reinvent the wheel, have fewer decisions to make, and can use their productivity and creativity for higher objectives." These golden and paved paths rely heavily on the creation of templates and tutorials to quickly enable teams, whilst still allowing teams to adjust them when a specific use-case requires it.
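As a toy illustration of the template idea, a platform team might ship a scaffolding script that generates a standard project layout, so use-case teams start from a working baseline instead of a blank repository. The layout and file names below are entirely hypothetical:

```python
# Sketch of a golden-path scaffolding script; the template contents are hypothetical.
from pathlib import Path

TEMPLATE = {
    "src/my_model/train.py": "# training entry point, wired into the standard orchestration\n",
    "tests/test_train.py": "# tests run by the standard CI pipeline\n",
    "deployment/pipeline.yaml": "# deployment config understood by the platform\n",
    "README.md": "# New use-case, generated from the golden-path template\n",
}


def scaffold(target: Path) -> None:
    """Write the template files; teams adjust them where their use-case differs."""
    for relative_path, content in TEMPLATE.items():
        file_path = target / relative_path
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text(content)


if __name__ == "__main__":
    scaffold(Path("new-use-case"))
```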
How do they collaborate?
Both the platform and the use-case team are responsible for building production-grade products and then operating them. As such, both teams should focus on serving their customers. In the case of the use-case team, these customers are the business stakeholders who need a data science product. In the case of the platform team, the use-case teams are the customers. Either way, close and quick feedback cycles with the customers help to iteratively build customer-focused products that provide real value to the business.
A holistic view is the key to MLOps
To conclude: when reading about MLOps, a lot of terms are often immediately introduced that relate to specific sets of tooling and platforms: feature stores, experiment tracking and model registries, to name a few. In turn, many companies focus on these components when attempting to improve their ability to move data science to production. However, you should not rely on tooling and technology alone to solve your problems. MLOps revolves around people and processes. You should focus on creating teams that have the capabilities to build and operate machine learning solutions, and are able to do so end-to-end.
Do you want to learn more about the specifics of setting up data science use-case teams and platform teams? Stay tuned for the next blogs that will explore this in more detail.