Why your machine learning innovation may be stuck halfway at the pilot stage – an engineer’s introduction to MLOps
Christoph Netsch
Co-Founder & Managing Director of Alpamayo
You may have just listened to the third consecutive master’s thesis presentation claiming to have discovered “promising results” that move the performance needle “beyond state-of-the-art”. Possibly, your company hired its first staff data scientist a year ago to draw value from the process data, service records, and quality measurements that your digital team has worked hard to make accessible over the past three years. After all, an AI strategy is what you need as a machine manufacturer in 2024, right?
The data scientist’s results, too, seemed “promising”: accuracy, precision, recall, and AUC-ROC – her bread-and-butter metrics – were laid out colorfully across a slide deck that left little doubt that your AI can create true customer value.
But where is that value?
How has AI affected the way your customers operate your equipment? Reduced downtime, reduced waste, increased productivity? Chances are that your AI-enhanced condition or process monitoring service is currently being validated at the pilot stage. Most of the manufacturing industry’s digitization pioneers I talk to are there. Chances are that those pilots have shown that your research project’s promising results do not transfer to the shop floor as expected. False alarms and missed incidents not only erode trust in your AI’s capabilities; below a certain threshold, its use is no longer cost-effective.
A leading equipment manufacturer determined that in its average AI project that makes it to production, algorithm development accounts for only 11% of the effort.
In writing this article, I don’t intend to discuss algorithms or concrete use cases, but to shed light on a key technological challenge of the industrialization step, where I have seen many projects get stuck, and to share a few aspects you should consider early in your innovation project’s design to produce a solution that truly qualifies as “industrial-grade AI”. Effectively addressing this challenge demands the technological capability to automatically deploy, monitor, adapt, and continuously update AI models in a production environment. The concept behind this is referred to as machine learning operations (MLOps), and this article intends to provide a high-level introduction to it from an engineer’s point of view.
Traditionally, the process that takes you from a problem to its AI-enhanced solution encompasses four general steps:
Dataset Curation: The first step is about making available the data that enables you to apply any data-driven modeling technique. You will determine which data is needed to create a model that is useful for your use case. It may come from distinct sources. In data-driven process monitoring applications, for example, where sensor measurements and process parameters are correlated with higher-level quality assurance and production data, this step will likely involve creating data pipelines – software modules that link these data sources to your solution. And since data from real-world processes tends to be noisy, sensors tend to malfunction, and network connections can be interrupted, you will likely invest substantial effort in ensuring that your data is of high quality. Another aspect lies in curating enough data. When we are talking about “enough”, we are not referring to gigabytes or terabytes. We are talking about a dataset that includes measurements of all the circumstances in which we intend to deploy the AI solution. If the solution is to provide robust predictions for multiple machines operating under a range of process parameters and distinct recipes, the data needs to be representative of them.
Algorithm Development: The core intelligence within your application is a data-driven model. Modeling approaches differ both in complexity, ranging from classical statistics to deep learning, and in type, such as forecasting, anomaly detection, or classification algorithms. The data scientists you work with – and increasingly performant automated machine learning tools – can determine which modeling approach best suits your use case, given the problem and the available data. When you are trying to predict failures, the data you have at your disposal may only describe the healthy process. After all, many failures occur extremely rarely in a production environment. For this reason, with few exceptions, the algorithm type should fall into the category of anomaly detection, where the algorithm is designed to create a robust model of the healthy process and problems are detected as deviations (a minimal sketch of this idea follows the list below).
Data-driven Application Development: The model itself is of limited utility unless it is continuously supplied with the data it needs to run predictions, and unless problem-specific logic postprocesses those predictions so that they are both interpretable and actionable for the operator. The task of packaging a model into a software application is generally performed by software engineers and data engineers who specialize in data-intensive applications.
Deployment: The software is deployed to the production environment. Considerations include the choice of the right IT infrastructure (complex machine learning models like neural networks may only meet your latency requirements if you parallelize the computation using GPUs), IT security, and data-sharing requirements, which may determine which parts of your solution run on the edge, on-premises, or in the cloud.
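To make the anomaly detection idea from the algorithm development step more concrete, here is a minimal Python sketch of a model trained only on healthy process data, wrapped in the kind of postprocessing logic the application development step adds on top. The sensor channels, thresholds, and recommended actions are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch (illustrative assumptions): an anomaly detector trained only on
# healthy process data, plus postprocessing that turns raw scores into an
# actionable recommendation. Channel names and thresholds are made up.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=42)

# Stand-in for curated sensor data recorded during healthy operation
# (e.g. spindle current, vibration RMS, process temperature).
X_healthy = rng.normal(loc=[10.0, 0.5, 70.0], scale=[0.5, 0.05, 1.5], size=(5000, 3))

# The "healthy process" model: it only ever sees normal data during training.
detector = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
detector.fit(X_healthy)

def assess(sample: np.ndarray) -> dict:
    """Turn a raw anomaly score into something an operator can act on."""
    score = float(detector.decision_function(sample.reshape(1, -1))[0])
    if score >= 0.0:                 # in line with the healthy baseline
        status, action = "normal", "no action required"
    elif score >= -0.1:              # mild deviation: worth watching
        status, action = "warning", "observe during the next shift"
    else:                            # strong deviation from the healthy baseline
        status, action = "alarm", "schedule an inspection"
    return {"anomaly_score": round(score, 3), "status": status, "recommended_action": action}

# A reading far outside the healthy operating range should raise an alarm.
print(assess(np.array([14.0, 1.2, 85.0])))
```

In a real application, the thresholds would be derived from validation data, and the recommendation texts would map to the failure modes your service engineers actually act on.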
The linear workflow behind classical AI development leads to static AI models. And yet, our intuition tells us: AI learns, doesn't it?
Isn't the ability to adapt AI’s main strength? In manufacturing, adaptability is inevitable. While the equipment you sell surely meets consistent quality standards, the conditions under which each piece of equipment that you supply your customers with could be entirely different. In the case of natural products (i.e. wood, textiles) or recycled materials, the inputs to a process may be extremely heterogenous, production targets differ, and every operator guards his own recipes. The production conditions at a single site can change seasonally and in the longer term. In addition, the equipment itself may be modified after several years in operation. All these phenomena are reflected in the data.
As a result, the AI model that emerged from this linear development process may have worked well in a pilot project. At the next site, however, you experience a drastic loss of performance, and after some time many outputs are plain nonsense. The reason is that the model cannot learn in this setup. With gradual changes in the data or under different operating conditions, it processes data that it can no longer interpret correctly, because it has never seen similar data before. Statistical models (and at their core, AI models are no more than that) generalize very poorly into unseen regions of the data space. Each false alarm goes hand in hand with a loss of trust. And sooner rather than later, even your most enthusiastic early adopters will begin to question the utility of your solution and churn.
No development project can collect so much data that the resulting AI model never has to relearn – and yet your model is static.
If this is enough of a contradiction to lead you to the conclusion that AI is not the right tool for your problem, then you can stop reading at this point. However, if you are convinced that AI is the most (or only) effective way to solve your problem, then let us look at how to apply some techniques from the young discipline of MLOps to build a data architecture that is capable of managing, monitoring, retraining, and adapting models in production.
While the classical development workflow treats only the operation of the ML model as continuous, we will now consider the entire process as a continuum. This enables the model – and even the way its outputs are translated into concrete recommendations – to be updated whenever recent data indicates a shift in operating conditions. Copies of the same model can be fine-tuned to the specific operating conditions found at each site, whenever it is apparent that they deviate.
Essentially, your initial algorithm development project no longer produces a static model, but a recipe that is cooked again and again – if necessary, even on your customers’ own IT infrastructure, without their data ever leaving the house.
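What such a “recipe” might look like in code: a minimal sketch, assuming a simple scikit-learn setup, in which the deliverable is a parameterized training function that can be re-run on each customer’s own data rather than a fixed model. The class and parameter names are purely illustrative.

```python
# Minimal sketch (illustrative assumptions): the deliverable is not a fixed model
# but a parameterized training "recipe" that can be cooked again on any site's data.
from dataclasses import dataclass
import numpy as np
from sklearn.ensemble import IsolationForest

@dataclass
class TrainingRecipe:
    n_estimators: int = 200
    contamination: float = 0.01
    min_healthy_samples: int = 1000   # refuse to train on too little data

    def cook(self, X_healthy: np.ndarray) -> IsolationForest:
        """Re-train the same model architecture on whatever healthy data a site provides."""
        if len(X_healthy) < self.min_healthy_samples:
            raise ValueError("not enough healthy data to train a trustworthy model")
        model = IsolationForest(n_estimators=self.n_estimators,
                                contamination=self.contamination,
                                random_state=0)
        return model.fit(X_healthy)

# The same recipe, cooked on data from two different sites, yields two
# site-specific model instances – without any data leaving either site.
recipe = TrainingRecipe()
model_site_a = recipe.cook(np.random.default_rng(1).normal(size=(2000, 3)))
model_site_b = recipe.cook(np.random.default_rng(2).normal(loc=5.0, size=(2000, 3)))
```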
Naturally, the process must be largely automated so that it doesn’t overwhelm your available resources. At first glance, the solution sounds straightforward: we could just integrate the training step into our data-driven application (it’s written in code already), trigger it at regular intervals, and call it a day. Unfortunately, automating a system raises the requirements for controlling it.
To control a complex system, we must create feedback loops within the system. And to establish such feedback loops for predictive models, we need the ability to compare actuals to targets. That means we start by monitoring every aspect of our AI system we control, specifically:
Data feedback: We subject all data that is fed into our model to an initial check to see if the model should even process it. To do so, a data health check module analyzes all data prior to its processing. The data feedback loop is closed via a set of rules that determine under what circumstances a model is invalidated, automatic retraining is triggered, or a review by an expert is requested.
ML-model feedback: Whenever a copy of a model is tuned to the operating context of one site or the process parameters of a specific recipe, it undergoes an automated benchmark. Defining (and updating) representative cases that lead to meaningful benchmarks is a major challenge in itself that demands problem- and domain-specific expertise. The ML-model feedback loop can be thought of as a quality assurance step, ensuring that only trustworthy models are brought into production. If the algorithm regularly produces models that fail the benchmarks despite being trained on fresh data, the algorithm itself may have become unsuitable for the task. In this case, the developer team needs to be notified so it can optimize how the data is processed and used to train the model.
Performance feedback: Wherever possible, we compare a model’s outputs to the targets (note that the actuals observed in the production process are a predictive model’s targets, also referred to as the “ground truth”). The performance feedback loop is closed by triggering the retraining of a model. While realizing this feedback loop may sound like the least complex of tasks, the principal challenge lies in obtaining that ground truth. In some cases, ground truth follows predictions after a certain dead time, for example when a model is designed to anticipate a failure that leads to an error logged at a later point in time. In most cases, expert-in-the-loop feedback is unavoidable, for instance when a specific failure mode is predicted and only a maintenance professional’s observation can confirm whether that prediction was correct.
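A minimal sketch of how these three feedback loops could be closed with explicit, automatable rules. The inputs, thresholds, and resulting actions are illustrative assumptions; in practice they would be defined together with your domain experts.

```python
# Minimal sketch (illustrative assumptions): closing the data, ML-model, and
# performance feedback loops with a small set of explicit rules.
from enum import Enum

class Action(Enum):
    KEEP = "keep model in production"
    RETRAIN = "trigger automated retraining"
    INVALIDATE = "invalidate model and stop predictions"
    ESCALATE = "notify a developer or experienced service engineer"

def close_feedback_loops(data_health_ok: bool,
                         drift_detected: bool,
                         benchmark_passed: bool,
                         f1_vs_ground_truth: float | None) -> Action:
    # Data feedback: never let a model score data it cannot interpret.
    if not data_health_ok:
        return Action.INVALIDATE
    # ML-model feedback: a freshly tuned model only goes live if it passes its benchmark.
    if not benchmark_passed:
        return Action.ESCALATE
    # Performance feedback: compare predictions to ground truth whenever it is available.
    if f1_vs_ground_truth is not None and f1_vs_ground_truth < 0.7:
        return Action.RETRAIN
    # Gradual drift without an outright failure also warrants retraining.
    if drift_detected:
        return Action.RETRAIN
    return Action.KEEP

print(close_feedback_loops(True, False, True, 0.65))  # -> Action.RETRAIN
```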
The technology that enables this system is accompanied by additional challenges in most aspects of the solution:
Data Management: This is no longer a one-time dataset curation step. Since we don't discard all data after feeding it into the model, but need to store some of it in a digital shadow of the monitored assets, the data engineers building your AI system must put considerably more effort into how data is stored and structured. Additionally, they must implement rules that determine which data is persisted in each asset’s digital shadow and how that data is used either as fresh data for updating models or as part of the test cases that benchmark a model’s quality.
Health Check Module: Multiple checks are applied to all data prior to its being processed by a model, ranging from simple criteria (Is any sensor measuring values outside of a technically plausible value range, indicating a malfunction?) to more advanced criteria (Is a gradual shift in the data distribution leading to model inputs that the model cannot accurately process?). A minimal sketch of such a module follows this list.
Automated Retraining & Evaluation Module: Your system needs to be capable of automatically executing the training and evaluation steps a data scientist “manually” undertook during the initial algorithm development. This also imposes additional requirements on how the algorithm is developed: not only must the model itself be evaluated, but also its ability to be adapted to new operating conditions, and rules must be defined that determine when such an adaptation is necessary.
Model Registry & Management: Since your system may rapidly scale to managing dozens, hundreds, or even thousands of models, it must register every model alongside information about what data it is based on and how well it performs. And it needs a set of rules that orchestrate which model is applied under what circumstances (see the registry sketch after this list).
Operations: Beyond the pilot stage, manually setting up your application for each new user will strain your resources. Backed by a well-designed data architecture, large parts of the deployment process can be automated, and intuitive user interfaces can shift large parts of the configuration process to the user, permitting your team to provide support where their expertise creates the biggest impact.
Expert-in-the-loop: Any system requires fail-safes, and despite your engineers’ best efforts to ensure a robustly automated system, you need to establish mechanisms that determine how problems are escalated and reported to either developers or experienced service engineers, who have the knowledge to handle them adequately, without overburdening anyone.
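To illustrate the health check module, here is a minimal sketch combining a simple plausibility check with a basic drift test against the data a model was trained on. The sensor ranges, channel names, and the choice of a two-sample Kolmogorov–Smirnov test are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions): a data health check module with a
# simple range check per sensor channel and a basic drift test against training data.
import numpy as np
from scipy.stats import ks_2samp

PLAUSIBLE_RANGES = {              # technically plausible value range per channel
    "temperature_C": (-20.0, 400.0),
    "vibration_rms": (0.0, 50.0),
}

def health_check(batch: dict[str, np.ndarray],
                 reference: dict[str, np.ndarray]) -> list[str]:
    """Return a list of findings; an empty list means the batch may be scored."""
    findings = []
    for channel, values in batch.items():
        lo, hi = PLAUSIBLE_RANGES[channel]
        # Simple criterion: values outside the physically plausible range usually
        # indicate a sensor malfunction rather than a process problem.
        if np.any(values < lo) or np.any(values > hi):
            findings.append(f"{channel}: implausible values, suspected sensor fault")
        # Advanced criterion: has the distribution drifted so far from the training
        # data that the model's inputs can no longer be trusted?
        p_value = ks_2samp(reference[channel], values).pvalue
        if p_value < 0.01:
            findings.append(f"{channel}: distribution shift relative to training data")
    return findings
```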
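And a corresponding sketch of the model registry: every trained model is recorded together with the data it was trained on and its benchmark result, and a simple rule resolves which model serves which asset. The schema and acceptance threshold are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions): a model registry that stores each
# model's data basis and benchmark result and decides which model serves an asset.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelRecord:
    model_id: str
    asset_id: str                 # the machine or site this model was tuned for
    training_data_ref: str        # pointer to the dataset snapshot it was trained on
    benchmark_score: float        # result of the automated evaluation step
    registered_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class ModelRegistry:
    def __init__(self, min_benchmark_score: float = 0.8):
        self._records: list[ModelRecord] = []
        self.min_benchmark_score = min_benchmark_score

    def register(self, record: ModelRecord) -> bool:
        """Only models that pass their benchmark become eligible for production."""
        if record.benchmark_score < self.min_benchmark_score:
            return False          # reject and keep the previously accepted model
        self._records.append(record)
        return True

    def resolve(self, asset_id: str) -> ModelRecord | None:
        """Rule: serve the newest accepted model that was tuned for this asset."""
        candidates = [r for r in self._records if r.asset_id == asset_id]
        return max(candidates, key=lambda r: r.registered_at, default=None)
```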
The system I laid out may sound complex, and building it truly extends your development checklist. However, this is the type of system it takes to deploy industrial-grade AI. It should not keep you from validating a solution’s feasibility and value via lean proofs of concept: the concepts of MLOps can be applied incrementally.
Like many digital pioneers in the manufacturing space, your company is likely in the process of building a data management platform specific to the requirements of your products and how they are operated. If data-driven applications are on your roadmap, it makes sense to review how well the platform’s underlying architecture will support MLOps, to avoid expensive redesigns and stagnation later.
Are you interested in learning more about how MLOps can be applied to your use case?
Make sure to book your free consultation call to discuss your current situation and challenges with one of our predictive analytics experts. No cost. No sales. 100% focus on the value creation and feasibility of your digital innovations.