Machine learning (ML) projects often fail despite the strong focus companies place on implementing MLOps [1]. In my experience, this usually comes down to three common issues:
- Misaligned business and technical goals
- Insufficient risk mitigation
- Lack of maturity
Misaligned business and technical goals hinder effective collaboration. This results in ML solutions built in search of a problem rather than ML solutions focussed on delivering value. Insufficient risk mitigation leads to unforeseen challenges. Ethical and legal constraints, for example, may be identified late in the project, requiring significant rework that derails progress. Finally, a lack of maturity increases the chance of making the wrong engineering and architectural decisions. These lead to ML solutions that are hard to deploy in production and difficult to maintain.
Failing to address these issues proactively results in projects that do not deliver a return on their investment. Organisations are adopting the ML Solutions Architect role to address this. The upcoming sections discuss how an ML Solutions Architect does this through:
- Clear objectives: Through collaboration with business stakeholders, technical and business objectives can be defined. This ensures ML projects have a clear scope.
- Feasibility: Understanding the constraints within an ML project makes it possible to analyse feasibility and surface all knowable risks. This allows setting a realistically achievable scope.
- Technical standards and processes: Improving the technical excellence of teams increases maturity. It enables organisations to tackle increasingly complex projects and teams to apply the right technology correctly.
[1] MLOps stands for machine learning operations. MLOps focusses on expertise and tooling to autonomously build and maintain ML solutions. Read this whitepaper on MLOps for more information.
Clear Objectives
Bridging the gap between technical teams and business stakeholders is crucial. Organisations struggle to communicate the value of ML effectively. The ML Solutions Architect collaborates with the Product Owner or Product Manager to translate business problems into ML system requirements and quality attributes. To deliver your ML model to your customers you need more than just a model: you need to build an ML system.
An ML system is a software system with a machine learning component. ML system requirements describe how the ML system adds value. For example, the desired prediction error of a demand forecasting model is quantified through a metric (e.g. Mean Absolute Error) that a model can be optimised for.
Example ML system from Chip Huyen’s blog
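To make the demand forecasting requirement concrete, here is a minimal sketch of how a Mean Absolute Error target could be expressed as a testable check. The function names, the sample numbers, and the `max_mae` threshold are all illustrative assumptions, not values from a real project.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted demand."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def meets_requirement(
    actual: Sequence[float], predicted: Sequence[float], max_mae: float
) -> bool:
    """Check a forecast against an agreed error threshold (hypothetical requirement)."""
    return mean_absolute_error(actual, predicted) <= max_mae


# Illustrative daily demand (units sold) and the corresponding forecasts.
actual_demand = [100.0, 120.0, 90.0, 110.0]
forecast = [110.0, 115.0, 100.0, 105.0]

print(mean_absolute_error(actual_demand, forecast))
print(meets_requirement(actual_demand, forecast, max_mae=8.0))
```

Writing the requirement down as a threshold on a metric gives business and technical stakeholders a shared, verifiable definition of "good enough".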
Quality attributes describe requirements for a system that do not always directly add business value. The maximum duration of a model training or prediction pipeline is an example of a quality attribute. Say we forecast demand on a daily basis. Yesterday’s sales results may become available only at 8AM, and the business needs the predictions by 9AM to place orders with suppliers. This defines a quality attribute for our ML system: it must compute predictions and make them available to the end user within an hour. Using complex features or models that take long to compute may exceed this timeframe despite the potential added value.
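The one-hour window in this example can be sketched as a simple budget check. The stage names, durations, and safety margin below are hypothetical; only the 8AM and 9AM times come from the example above.

```python
from datetime import datetime, timedelta
from typing import Mapping


def time_budget(data_available: datetime, deadline: datetime) -> timedelta:
    """Time between the data arriving and the predictions being due."""
    if deadline <= data_available:
        raise ValueError("deadline must be after the data becomes available")
    return deadline - data_available


def fits_budget(
    stage_durations: Mapping[str, timedelta],
    budget: timedelta,
    safety_margin: timedelta = timedelta(0),
) -> bool:
    """True if all pipeline stages plus a safety margin fit within the budget."""
    total = sum(stage_durations.values(), timedelta(0))
    return total + safety_margin <= budget


# Sales data arrives at 8AM; predictions are needed by 9AM.
budget = time_budget(datetime(2024, 1, 15, 8, 0), datetime(2024, 1, 15, 9, 0))

# Hypothetical stage estimates for a simple and a more complex pipeline.
simple_pipeline = {
    "feature_engineering": timedelta(minutes=20),
    "inference": timedelta(minutes=25),
    "publish": timedelta(minutes=5),
}
complex_pipeline = {**simple_pipeline, "feature_engineering": timedelta(minutes=40)}

margin = timedelta(minutes=5)
print(fits_budget(simple_pipeline, budget, margin))
print(fits_budget(complex_pipeline, budget, margin))
```

A check like this makes the trade-off explicit: the more complex features may improve accuracy, but they push the pipeline past the window the business can work with.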
Focussing too much on short-term business value instead of quality attributes leads to fragile ML solutions. Fragile solutions are prone to failures, causing downtime and frustrated customers. Too much focus on quality attributes leads to over-engineered proof-of-concepts that never reach production or exceed budgets. It is the responsibility of an ML Solutions Architect to address technical project risk. The ML Solutions Architect balances technical excellence with business value. This ensures teams build ML solutions aimed at long-term success.
Feasibility
ML systems are built within the context of an organisation. This imposes constraints on the feasibility of solutions. Other constraints are time, budget, and the limits of the team's skillset. The ML system must operate within business processes and connect to other non-ML systems to expose your model to end users. The ML Solutions Architect analyses the existing IT landscape to design the solution that integrates best. Without this expertise to look at the bigger picture, ML models cannot be consumed by your customers.
Cross-cutting concerns are a special type of constraint. Cross-cutting concerns cut across an organisation, meaning they apply to all software systems in the organisation. Examples are ethical, regulatory, and security constraints. Insufficient coverage of cross-cutting concerns early on may postpone your rollout. It can also result in legal penalties or reputational damage when a system already runs in production. Organisations will often have guidelines or strict rules for cross-cutting concerns that you must follow. They are not unique to the ML project but dictated by the organisation, so we address them separately.
Most constraints are negotiable; cross-cutting concerns are not. An ML Solutions Architect negotiates the constraints, quality attributes, and features to define the solution space. The ML Solutions Architect will then find a solution within this space that is technically feasible within the organisation.
Technical standards and processes
A lack of technical standards leads to the application of the wrong technology or a poor implementation of the right technology. Both introduce inefficiencies that make solutions too costly and prevent them from evolving to the required scale. As a result, ML projects cannot be rolled out to all users. Projects will not deliver their estimated value and thus will not deliver a return on investment.
Absence of standards leads teams to reinvent the wheel with each new project, increasing time-to-value. This overhead of working in isolation slows down the iteration speed of teams. Most ML models need frequent updates as the environment they operate in changes. Being unable to keep up with this frequency causes the model to deteriorate. Such a model will stop delivering business value or even produce negative value.
Image from EvidentlyAI
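One simple way to notice that a model is deteriorating is to compare its recent prediction errors against the errors observed when it was first deployed. The sketch below assumes absolute errors are already being logged. The function name, the sample values, and the 20% tolerance are illustrative assumptions, not a recommended standard.

```python
from statistics import mean
from typing import Sequence


def needs_retraining(
    baseline_errors: Sequence[float],
    recent_errors: Sequence[float],
    tolerance: float = 1.2,
) -> bool:
    """Flag a model whose recent mean absolute error exceeds the baseline
    mean absolute error by more than the given factor (hypothetical rule)."""
    if not baseline_errors or not recent_errors:
        raise ValueError("both error samples must be non-empty")
    return mean(recent_errors) > mean(baseline_errors) * tolerance


# Absolute errors logged at deployment time versus two later periods.
baseline = [5.0, 6.0, 4.0, 5.0]
stable_period = [5.0, 5.0, 6.0, 5.0]
degraded_period = [8.0, 9.0, 7.0, 8.0]

print(needs_retraining(baseline, stable_period))
print(needs_retraining(baseline, degraded_period))
```

A shared check of this kind is an example of a standard that teams can reuse instead of each building their own monitoring from scratch.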
Without standards, each project will use different frameworks, programming languages, or libraries. Having these different technologies prevents teams from sharing knowledge on how best to apply them. This lowers the quality of ML products and can make them less secure, as each team needs to stay informed about vulnerabilities in its own technology stack. The disparate tech stack means an employee needs to learn many new things when switching between projects. Transferring ownership of ML projects to another team or rotating employees across projects becomes time-consuming. The latter will affect employee happiness, as employees will feel stuck on a project.
How does the ML Solutions Architect operate?
Thus far we discussed the common issues organisations face and the areas an ML Solutions Architect focusses on to address these. In this section we will discuss how the ML Solutions Architect will operate within your organisation and how to effectively position the role. We’ll discuss the skills the ML Solutions Architect brings to the table, their activities, and why these are important.
An ML Solutions Architect needs a broad set of engineering skills. This broad skillset enables them to solve the wide range of problems encountered in building ML systems, which is essential for designing ML systems from scratch. Designing an ML system involves collecting requirements and creating an architecture that addresses these requirements. The ML Solutions Architect collaborates with business stakeholders and effectively negotiates requirements. The architecture must meet current needs, yet be flexible enough to evolve with future needs. It must be clearly communicated to all stakeholders what the requirements are and how the architecture addresses them. The focus in this communication must be on the trade-offs. Trade-offs show why the proposed architecture is the most suitable and how it compares to other solutions, so that consensus can be reached.
Photo from Pexels by Christina Morillo
Written communication is the most effective way to reach consensus with larger audiences. Therefore, good writing skills are paramount. RFCs or design docs are tools to reach consensus on important decisions like the architecture of an ML system. Decisions are documented using tools like Architectural Decision Records. It is important for an ML Solutions Architect to stay in touch with technology. This ensures the proposed decisions are realistic and up-to-date with the latest advancements in the field. The decisions must be detailed enough to provide guidance and structure for an autonomous team to implement. The ML Solutions Architect contributes and collaborates on code within the project to stay in touch with technology.
It is important for an ML Solutions Architect to stay in touch with the project and the team as well. It is recommended to work on one specific project or at most a few. Projects with a high level of risk or complexity require more architectural guidance and design. Making wrong decisions on these types of projects can get very costly or kill a project. In these cases, the ML Solutions Architect should serve one project. Companies with a lower ML maturity will work on less complex ML projects, meaning one architect may serve multiple projects. At these companies there is also more need for setting standards that should be implemented across multiple teams.
At the start of a project an ML Solutions Architect will be involved more actively. At this stage important decisions need to be made on the architecture of the ML system. An ML system needs fewer changes from an architectural perspective once it runs in production and does not require further scaling. The system still needs active maintenance and new features will be added to the model, but the most significant engineering decisions have already been made. The ML Solutions Architect will have less involvement in projects at this stage.
Conclusion
Through the use of examples I have shown that the ML Solutions Architect can:
- Reduce cost
- Increase robustness of models running in production
- Increase your return on investment in ML
If your company is struggling to deliver value from ML projects and you recognise the common issues, it may be time to consider introducing the role of ML Solutions Architect. In this article I showed how the role delivers ML projects in a way that aligns with the technical landscape of the organisation and its business needs. Over-engineered proof-of-concepts and fragile solutions in production will be replaced with stable and scalable solutions. The maturity of your ML teams will increase and you will gain confidence in the complexity of ML projects you can deliver. See how we did this by scoping a demand forecasting project that still delivered value despite a very tight deadline and limited resources.