
Today we live in a world of unprecedented open-source code. Companies such as Google and Facebook have open-sourced their internal AI solutions, a previously unheard-of step. There are also plenty of resources today on how to quickly and easily build AI solutions. Despite this, it remains a massive amount of work to develop a real-world AI application to the level of quality and reliability usually required for production deployments, and the amount of work required is frequently underestimated, even by experienced developers and managers. There’s a saying, ‘the last 20% of the work takes 80% of the time’, and nowhere is that more true than in AI systems.

In this post I’ll go over some of the key reasons why AI systems take so much effort to build and why it’s frequently a better choice to buy an existing system. This is best highlighted with an example, for which I will refer to a Traffic Sign Recognition (TSR) system I once worked on for an automaker. TSR sounds simple, right? It is also frequently used in ‘Build Your Own Classifier in 5 Minutes’ posts. Well, let’s dive in!

Data

Edge Cases. Edge cases everywhere.

You probably guessed it, but data is usually the number one consumer of time and money. This stems from our consistent underestimation of the complexity of the real world, and of how many edge cases there are in even the simplest tasks.

On my TSR project, we encountered all sorts of things. For example, many highways use LED signs, which in addition to looking completely different from normal signs, are difficult to capture with a camera (try filming a computer screen). To solve this, we ended up building a module that would overlay multiple frames to capture a complete image of a sign. The system also had to work in a variety of conditions. At night, the highly reflective signs are much brighter than the surrounding environment, whilst signs are barely visible when driving through fog, rain, or into the sun.
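One simple way to picture the frame-overlay idea is a per-pixel maximum across several frames of the same sign. This is only a sketch and not the module described above: it assumes the frames are already cropped to the sign and aligned with each other, and the function name `overlay_frames` and the toy frames are made up for illustration.

```python
from typing import List

# A grayscale image as rows of 0-255 intensities (hypothetical toy format).
Frame = List[List[int]]


def overlay_frames(frames: List[Frame]) -> Frame:
    """Combine aligned frames by keeping the brightest value seen at each pixel."""
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must have the same shape")
    return [
        [max(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


# Two toy frames, each of which caught only half of the lit LEDs.
frame_a = [[255, 0], [255, 0]]
frame_b = [[0, 255], [0, 255]]
combined = overlay_frames([frame_a, frame_b])
print(combined)  # [[255, 255], [255, 255]]
```

A real system would also have to track the sign between frames and cope with motion blur, which is where most of the effort goes.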

European signs frequently have a red border, but on old signs this can be almost completely faded away. Additionally, in many European cities, it’s also become trendy to put stickers on signs. Signs were also frequently obscured by plants or other roadside obstacles.

Traffic signs with various lighting conditions, stickers and age. Images by author

Even when signs are perfectly visible in good conditions, they can be tricky to identify amongst all the noise. For example, trucks in Europe have speed limit stickers on the back that are identical to roadside signs but that indicate how fast they are allowed to drive. Things also get tricky at highway intersections, as exit speed limit signs can be perfectly visible from the highway itself. And what if the sign is covered in snow, which just so happens to be the same colour as most traffic signs?

The complexity of the real world isn’t limited to Computer Vision. I recently wrote a complementary article on regexes in the real world.

A good dataset is hard to find

Lots of models are published and open-sourced, but the datasets they are trained on for production applications are usually kept under lock & key. Some data (like credit card numbers) are especially hard to obtain. In fact, a ‘data moat’ is the main competitive advantage of many AI companies.

But what about all of those juicy datasets researchers use, you might wonder? Unfortunately, production applications don’t match up neatly with research tasks. And even if they did, research datasets usually don’t allow for commercial use (e.g. ImageNet). It’s also common to have a lot of labelling errors in research datasets, preventing the development of high quality models. A good example is Google’s OpenImages object detection dataset. Consisting of 1.7 million images with 600 different classes labelled, it could be useful for training object detection models. Unfortunately, the training split has less than half the labels per image that the validation split does, which would imply that a significant number of examples aren’t labelled.
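A labels-per-image comparison like the one just described is cheap to run on any dataset before committing to it. The sketch below uses made-up image IDs and annotations, not OpenImages data, and `mean_labels_per_image` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def mean_labels_per_image(
    image_ids: List[str], annotations: Iterable[Tuple[str, str]]
) -> float:
    """Average number of labels per image, counting unlabelled images as zero."""
    counts = Counter(image_id for image_id, _label in annotations)
    return sum(counts[image_id] for image_id in image_ids) / len(image_ids)


# Toy splits (hypothetical): the training split is more sparsely labelled.
train_ids = ["t1", "t2", "t3", "t4"]
train_ann = [("t1", "car"), ("t2", "person"), ("t3", "car"), ("t3", "tree")]
val_ids = ["v1", "v2"]
val_ann = [("v1", "car"), ("v1", "person"), ("v1", "tree"),
           ("v2", "car"), ("v2", "dog")]

train_mean = mean_labels_per_image(train_ids, train_ann)  # 1.0
val_mean = mean_labels_per_image(val_ids, val_ann)        # 2.5
print(f"train/val label density ratio: {train_mean / val_mean:.2f}")
```

A ratio well below 1 does not prove labels are missing, but it is a strong hint that the training split deserves a manual audit.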

Datasets for TSR also fall prey to these issues. Freely available TSR datasets don’t allow for commercial use, contain too few examples to be of any real use, and are marred by significant labelling errors. Additionally, they only use examples captured in good lighting conditions in one country. And cars have a pesky habit of travelling into new jurisdictions with different traffic laws and different traffic sign designs.

Creating a custom dataset is expensive and time-consuming

Why not create your own dataset, you say? Well, let’s have a look at that. The first step is to decide on labels/outputs and collect data, making sure every single edge case is captured. Then it’s important to make sure you have good validation and test sets that provide a reliable, balanced snapshot of your performance.
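One common way to keep a validation set representative is a stratified split, where each label contributes its share of examples. The following is a minimal sketch under that assumption; the function name, label names and fraction are hypothetical, and a real project would more likely lean on an existing library.

```python
import random
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[str, str]  # (item, label)


def stratified_split(
    samples: List[Sample], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (train, val) with every label represented in val."""
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # At least one validation example per label, even for rare classes.
        # Note: a label with a single example ends up in val only.
        n_val = max(1, round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


samples = (
    [(f"img_{i}", "speed_50") for i in range(10)]
    + [(f"img_{i}", "stop") for i in range(10, 15)]
    + [(f"img_{i}", "led_speed_80") for i in range(15, 17)]
)
train, val = stratified_split(samples)
print(len(train), len(val))  # 13 4
```

Forcing at least one validation example per label matters most for exactly the rare edge cases that are hardest to collect.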

Next comes the data hygiene and formatting, which can take a lot of time. It’s very important to get this step right. Transformer models, for example, suffer a surprisingly large drop in performance when this step isn’t done correctly.

For most tasks, the data then needs to be labelled. For the projects I’ve worked on, we’ve always built our own labelling tool or modified open source tools, as existing out-of-the-box tools never quite suit the task at hand. You’ll also need data infrastructure to manage, version and serve your new dataset.

Next, you’ll need to involve some humans to annotate your dataset. If you’re lucky and can share the data outside your organization and your task doesn’t require too much domain knowledge, you might outsource annotation tasks. If not, it takes a ton of work to hire and manage your new team of annotators. In either case, annotator training can also be some work, as most tasks require some domain knowledge and are typically more complicated than clicking on objects in an image. And since turnover in this type of role is high, you can expect to find yourself on that hamster wheel more often than you’d like. One of the best ways to support your annotators is to give them an annotation guide to read before you jump into the annotation and feedback training cycle. Creating the annotation guide itself is a lot of work, as many labels are ambiguous if not defined precisely. Often an exhaustive list of examples must be included, along with a living FAQ section that grows as you discover that more and more clarifications are needed to account for the variety of understandings that humans can have of a single concept.

Finally, it’s important to verify your process to ensure it maintains a high quality of output. Annotators also need to label edge cases consistently for the model to work well. At Private AI, for instance, we’re frequently confronted with thousands of tiny questions on what constitutes sensitive information: “I like Game of Thrones” probably isn’t going to identify someone, but “I like David Lynch’s 1984 rendition of Dune” narrows things down a bit.
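Labelling consistency can be checked by having two annotators label the same items and measuring their agreement. The sketch below computes Cohen’s kappa, a standard agreement statistic that corrects for chance. It is an illustration only, not a description of any particular company’s process, and the labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both labelled at random with their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same four snippets.
annotator_1 = ["sensitive", "sensitive", "ok", "ok"]
annotator_2 = ["sensitive", "ok", "ok", "ok"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

A low score on a batch is usually a sign that the annotation guide needs another clarification rather than that the annotators are careless.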

In summary, whilst data annotators can be found quite cheaply, a large amount of valuable dev/management time is required to construct a dataset. As an alternative, you can go to services like Amazon’s Mechanical Turk to outsource part of the process. In my experience, however, these services are quite expensive and don’t deliver high quality labels. On top of this, in real projects, the requirements/specifications usually change. This means going over the data multiple times as internal and external requirements (like data protection regulations) change.

The process of building a dataset has also gotten harder over the last 5 years. The TSR project I worked on was pre-GDPR, and nowadays privacy is a must when collecting data.

Model Stuff

You’ve got your data. Now what?

Now we’ve arrived at the most visible part of the process: building the model. We can use the plethora of open-source solutions out there, but there’s typically a lot of work to be done fixing small bugs that impact accuracy, accounting for the large variety of possible real-world input types, ensuring the code works as well as it can given the new data and labels you’ve added, etc. A while back I wrote my own MobileNet V3 implementation, as none of the implementations I could find matched the paper — not even the keras-applications implementation. Similarly at Private AI, getting state-of-the-art models to run at 100% of their capacity has been a lot of work. You also need to make sure that the code allows for commercial use — this typically knocks out a lot of research paper implementations.

A production system frequently relies on a combination of domain-specific techniques to improve performance, which requires integrating a bunch of different codebases. Finally, everything should be tested, something that open-source code is usually light on. After all, who likes writing tests?

Deployment

So you’ve gotten the data and you’ve built your model, and now it’s time to put it into production. This is another area open-source code is usually light on, even though things have gotten significantly better in the past few years. If your application is to run in the cloud, this can be quite simple (just put your PyTorch model into a Docker container), but that comes with a caveat: running ML in the cloud can get really expensive. Just a few GPU-equipped instances easily cost tens of thousands per year to run. And you’ll typically run in a few different zones to reduce latency.
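The arithmetic behind that cost claim is easy to sketch. The hourly rate below is a hypothetical placeholder, not a quoted price from any provider, and the instance and zone counts are equally made up. The point is only how quickly always-on instances add up.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_gpu_cost(
    hourly_rate: float, instances_per_zone: int, zones: int
) -> float:
    """Yearly cost of always-on instances at a flat hourly rate."""
    return hourly_rate * instances_per_zone * zones * HOURS_PER_YEAR


# Assumed figures: 1.50 per hour, 2 instances in each of 2 zones.
cost = annual_gpu_cost(hourly_rate=1.50, instances_per_zone=2, zones=2)
print(f"{cost:,.0f} per year")  # 52,560 per year
```

Autoscaling or reserved pricing would change the numbers, but the multiplication by 8,760 hours is what tends to surprise people.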

Things get significantly more complicated when integrating into mobile apps or embedded systems. In these situations you’re usually forced to run on CPU due to hardware fragmentation (I’m looking at you, Android) or compatibility issues. That TSR project I worked on required all code to be written according to a 30-year-old C standard and had to fit in just a few megabytes! The use of external libraries was also precluded due to issues surrounding safety certification.

In any case, model optimization is usually necessary. The trouble is that Deep Learning inference packages are at a much lower state of readiness and much harder to use than training tools such as TensorFlow or PyTorch. Recently I converted a transformer model to Intel’s OpenVINO package. Unfortunately, Intel’s demo example no longer worked with the latest version of PyTorch, so I had to go into OpenVINO’s source code and make some fixes myself.

Real-world applications also involve more than just running an AI model. There’s normally a lot of pre- and post-processing required, all of which also needs to be productionized. In particular, integration in an application may require porting to the application language (like C++ or Java). On that TSR project, a large amount of code was required to match the detected signs together with the navigation map.

Finally it’s worth noting that people with expertise in this area are REALLY hard to find.

Ongoing Tasks

So, we’re at the finish line! Your application is now in production, doing its thing sorting/identifying/talking with widgets. Now comes the ongoing maintenance.

Like any piece of software, there will be bugs and model prediction failures. In particular (and despite your best efforts), there will be plenty of work to do in collecting the data needed to fill in the edge cases that were missed during the initial data collection phase. The world we live in isn’t static, so data needs to be continually collected and put through the system. A good example is Covid-19. Try asking any pre-2019 chatbot what that is.

Finally, whilst not strictly necessary, it’s good practice to periodically evaluate and integrate the latest research advancements.

Summary

So that’s what it really takes to build a production ML application. As you can see, it typically takes a team with diverse specialities such as data science, model deployment, and application domain expertise to build a complete system. There remains enormous demand for these skills in 2021, meaning that building up a team can be a very costly exercise. Complicating the matter further is staff turnover, which could mean that the system your company just spent a large amount of time & money building is suddenly unmaintainable, presenting a very real business risk.

So hopefully this helps you approach your ‘buy vs build’ decision armed with more info. It’s considerably more complicated than ‘oh, let’s get model X and switch it on’. I’ve seen firsthand, and heard many accounts of, companies not batting an eyelid at giving hundreds of thousands per year to Amazon/Microsoft/Google for cloud computing, despite 3rd party solutions offering a fraction of the total cost of ownership. If you decide to build yourself, make sure you have a lot of contingency! And consider all the costs like cloud compute, hiring & management.

And that TSR application? I can say I was quite proud of how well our system worked, but it required many, many decades of developer time to achieve.