Super Charge Your ML Systems In 4 Simple Steps | by Donal Byrne | Oct, 2023

Welcome to the rollercoaster of ML optimization! This post will take you through my process for optimizing any ML system for lightning-fast training and inference in 4 simple steps.

Imagine this: You finally get put on a cool new ML project where you are training your agent to count how many hot dogs are in a photo, the success of which could possibly make your company tens of dollars!

You grab the latest hotshot object detection model, implemented in your favourite framework with lots of GitHub stars, run some toy examples, and after an hour or so it's picking out hotdogs like a broke student in their 3rd repeat year of college. Life is good.

The next steps are obvious: we want to scale up to some harder problems. That means more data, a larger model and, of course, longer training time. Now you are looking at days of training instead of hours. That's fine, though; you have been ignoring the rest of your team for 3 weeks now and should probably spend a day getting through the backlog of code reviews and passive-aggressive emails that has built up.

You come back a day later, feeling good about the insightful and absolutely necessary nitpicks you left on your colleagues' MRs, only to find that your run tanked and crashed 15 hours into training (karma works fast).

The ensuing days morph into a whirlwind of trials, tests and experiments, with each potential idea taking more than a day to run. These quickly start racking up hundreds of dollars in compute costs, all leading to the big question: How can we make this faster and cheaper?

Welcome to the emotional rollercoaster of ML optimization! Here’s a straightforward 4-step process to turn the tides in your favour:

  1. Benchmark
  2. Simplify
  3. Optimize
  4. Repeat

This is an iterative process, and there will be many times when you repeat some steps before moving on to the next, so it's less of a 4-step system and more of a toolbox, but 4 steps sounds better.

Benchmark

“Measure twice, cut once” — Someone wise.

The first (and probably second) thing you should always do is profile your system. This can be as simple as timing how long a specific block of code takes to run, or as complex as doing a full profile trace. What matters is that you have enough information to identify the bottlenecks in your system. I run multiple benchmarks depending on where we are in the process and typically break them down into 2 types: high-level and low-level benchmarking.
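
As a first pass, a simple block timer gets you surprisingly far. Here is a minimal standard-library sketch (the `timer` name and the toy workload are illustrative, not from the original post):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Minimal block timer: prints how long the wrapped code took."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f}s")

# Toy workload standing in for a training step.
with timer("toy batch"):
    total = sum(i * i for i in range(100_000))
```

Wrap progressively smaller blocks with this and you already have a crude, but effective, profiler.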

High Level

This is the sort of stuff you will be showing your boss at the weekly "How f**ked are we?" meeting, and you will want these metrics as part of every run. They give you a high-level sense of how performant your system is.

Batches Per Second — how quickly are we getting through each of our batches? This should be as high as possible.

Steps Per Second — (RL specific) how quickly are we stepping through our environment to generate our data? This should be as high as possible. There are some complicated interplays between step time and train batches that I won't get into here.

GPU Util — how much of your GPU is being utilised during training? This should be consistently as close to 100% as possible; if not, you have idle time that can be optimized.

CPU Util — how much of your CPUs are being utilised during training? Again, this should be as close to 100% as possible.

FLOPS — floating point operations per second. This gives you a view of how effectively you are using your total hardware.

Low Level

Using the metrics above, you can start to narrow down where your bottleneck might be; from there, you want to look at more fine-grained metrics and profiling.

Time Profiling — This is the simplest, and often most useful, experiment to run. Profiling tools like cProfile can be used to get a bird's-eye view of the timing of your system as a whole, or to look at the timing of specific components.
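
A sketch of the cProfile workflow (the `train_step` function here is a hypothetical stand-in for one training iteration):

```python
import cProfile
import io
import pstats

def train_step():
    # Hypothetical stand-in for one training iteration.
    return sum(i * i for i in range(50_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    train_step()
profiler.disable()

# Sort by cumulative time and dump the top entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

The cumulative-time view is usually the fastest way to see which component is dominating your wall-clock time.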

Memory Profiling — Another staple of the optimization toolbox. Big systems require a lot of memory, so we have to make sure we are not wasting any of it! Tools like memory-profiler will help you narrow down where your system is eating up your RAM.
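
If you want to stay inside the standard library, `tracemalloc` gives a similar picture; a minimal sketch (the list-building workload is just a stand-in for your real allocation hotspot):

```python
import tracemalloc

tracemalloc.start()

# Hypothetical workload: build a needlessly large intermediate structure.
data = [list(range(1_000)) for _ in range(100)]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

`tracemalloc.take_snapshot()` can additionally attribute allocations to individual source lines when you need to dig deeper.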

Model Profiling — Tools like Tensorboard come with excellent profiling tools for looking at what is eating up your performance within your model.

Network Profiling — Network load is a common culprit for bottlenecking your system. There are tools like Wireshark to help you profile this but, to be honest, I never use it. Instead, I prefer to time-profile my components, measure the total time taken within each component, and then isolate how much of that comes from the network I/O itself.

Make sure to check out this great article on profiling in Python from RealPython for more info!

Simplify

Once your profiling has identified an area that needs to be optimized, simplify it. Cut out everything else except that part, and keep reducing the system into smaller pieces until you reach the bottleneck. Don't be afraid to re-profile as you simplify; this ensures that you are going in the right direction as you iterate.

Some good ways to do this:
  • Replace other components with stubs and mock functions that just provide expected data.
  • Simulate heavy functions with sleep functions or dummy calculations.
  • Use dummy data to remove the overhead of the data generation and processing.
  • Start with local, single-process versions of your system before moving to distributed.
  • Simulate multiple nodes and actors on a single machine to remove the network overhead.
  • Find the theoretical max performance for each part of the system. If all of the other bottlenecks in the system were gone except for this component, what is our expected performance?
  • Profile again! Each time you simplify the system, re-run your profiling.
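
For example, a dummy data source that mimics the shape of real environment output without the simulation cost might look like this (all names and shapes here are illustrative):

```python
import random

def dummy_env_step(obs_size=8):
    """Stub environment step: returns random data shaped like the real thing,
    at near-zero cost, so only the training loop itself is measured."""
    obs = [random.random() for _ in range(obs_size)]
    reward, done = random.random(), False
    return obs, reward, done

def dummy_batch_iter(batch_size=32):
    """Infinite iterator of fake batches, standing in for the real data pipeline."""
    while True:
        yield [dummy_env_step() for _ in range(batch_size)]

batch = next(dummy_batch_iter())
print(len(batch))
```

Swap this in for the real pipeline: if throughput jumps, your bottleneck was data generation; if it doesn't, the problem is downstream.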


Once we have zoned in on the bottleneck, there are some key questions we want to answer:

What is the theoretical max performance of this component?

If we have sufficiently isolated the bottlenecked component then we should be able to answer this.

How far away are we from the max?

This optimality gap will inform us on how optimized our system is. Now, it could be the case that there are other hard constraints once we introduce the component back into the system and that’s fine, but it is crucial to at least be aware of what the gap is.

Is there a deeper bottleneck?

Always ask yourself this. Maybe the problem is deeper than you initially thought, in which case we repeat the process of benchmarking and simplifying.

Optimize

Okay, so let's say we have identified the biggest bottleneck. Now we get to the fun part: how do we improve things? There are usually 3 areas we should look at for possible improvements:

  1. Compute
  2. Communication
  3. Memory


Compute

In order to reduce computation bottlenecks, we need to be as efficient as possible with the data and algorithms we are working with. This is obviously project-specific and there are a huge number of things that can be done, but let's look at some good rules of thumb.

Parallelising — make sure that you carry out as much work as possible in parallel. This is the first big win in designing your system that can massively impact performance. Look at methods like vectorisation, batching, multi-threading and multi-processing.
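
As a small illustration, threads can hide I/O latency; for CPU-bound Python work you would reach for processes or vectorisation instead, since the GIL serialises threads. A toy sketch with a pretend I/O call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(0.05)  # pretend I/O (network / disk)
    return i * i

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # All 8 waits overlap instead of running back to back.
    results = list(pool.map(fetch, range(8)))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # roughly one sleep, not 8 * 0.05s
```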

Caching — pre-compute and reuse calculations where you can. Many algorithms can take advantage of reusing pre-computed values and save critical compute for each of your training steps.

Offloading — we all know that Python is not known for its speed. Luckily we can offload critical computations to lower level languages like C/C++.
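
You often get this offloading for free by leaning on built-ins and libraries whose inner loops already run in C. A quick sanity check you can run yourself:

```python
import time

nums = list(range(1_000_000))

start = time.perf_counter()
total_loop = 0
for n in nums:  # pure-Python loop: every add goes through the interpreter
    total_loop += n
loop_time = time.perf_counter() - start

start = time.perf_counter()
total_builtin = sum(nums)  # the summation loop runs in C inside CPython
builtin_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s, builtin: {builtin_time:.3f}s")
```

The same principle is why NumPy vectorisation and C/C++ extensions pay off: the hot loop leaves Python entirely.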

Hardware Scaling — This is kind of a cop-out, but when all else fails, we can always just throw more computers at the problem!


Communication

Any seasoned engineer will tell you that communication is key to delivering a successful project, and by that, we of course mean communication within our system (God forbid we ever have to talk to our colleagues). Some good rules of thumb are:

No Idle Time — all of your available hardware must be utilised at all times, otherwise you are leaving performance gains on the table. Idle time is usually due to the complications and overhead of communication across your system.

Stay Local — Keep everything on a single machine for as long as possible before moving to a distributed system. This keeps your system simple as well as avoids the communication overhead of a distributed system.

Async > Sync — identify anything that can be done asynchronously. This helps hide the cost of communication by keeping work moving while data is being transferred.
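
A toy `asyncio` sketch of overlapping three pretend network calls:

```python
import asyncio
import time

async def fetch(i):
    await asyncio.sleep(0.05)  # pretend network call
    return i

async def main():
    # All three "requests" wait concurrently instead of back to back.
    return await asyncio.gather(fetch(0), fetch(1), fetch(2))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # roughly 0.05s rather than 0.15s
```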

Avoid Moving Data — moving data from CPU to GPU, or from one process to another, is expensive! Do as little of this as possible, or reduce its impact by carrying it out asynchronously.


Memory

Last but not least is memory. Many of the areas mentioned above can be helpful in relieving your bottleneck, but it might not be possible if you have no memory available! Let’s look at some things to consider.

Data Types — keep these as small as possible. Smaller types reduce the cost of communication and memory, and with modern accelerators they also reduce computation.
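
For instance, with the standard-library `array` module, halving the element width halves the buffer:

```python
from array import array

values = list(range(1_000))
f64 = array("d", values)  # C double: 8 bytes per element
f32 = array("f", values)  # C float: 4 bytes per element

# Same element count, half the bytes.
print(f64.itemsize * len(f64), "vs", f32.itemsize * len(f32))
```

The same trade applies to float32 vs float16/bfloat16 tensors on accelerators, as long as your training remains numerically stable at the lower precision.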

Caching — similar to reducing computation, smart caching can help save you memory. However, make sure your cached data is being used frequently enough to justify the caching.

Pre-Allocate — not something we are used to in Python, but being strict about pre-allocating memory means you know exactly how much memory you need, reduces the risk of fragmentation and, if you are able to write to shared memory, reduces communication between your processes!

Garbage Collection — luckily, Python handles most of this for us, but it is important to make sure you are not keeping large values in scope without needing them or, worse, creating a circular dependency that can cause a memory leak.
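
A quick demonstration of the cycle problem using the `gc` module:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

# Build a reference cycle: neither object's refcount ever drops to zero,
# so plain refcounting alone can never free them.
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b

collected = gc.collect()  # the cycle collector has to step in
print(f"collected {collected} unreachable objects")
```

If those `Node`s held large tensors, the memory would sit around until a collection pass happened to run; breaking cycles explicitly (or using `weakref`) avoids the surprise.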

Be Lazy — Evaluate expressions only when necessary. In Python, you can use generator expressions instead of list comprehensions for operations that can be lazily evaluated.
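
A small comparison of a list comprehension versus a generator expression:

```python
import sys

squares_list = [i * i for i in range(100_000)]  # materialises everything now
squares_gen = (i * i for i in range(100_000))   # computes values on demand

# The generator object is tiny; the list holds all 100,000 results.
print(sys.getsizeof(squares_list), "vs", sys.getsizeof(squares_gen))

same = sum(squares_gen) == sum(squares_list)
print(same)  # the generator still yields exactly the same values
```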

Repeat

So, when are we finished? Well, that really depends on your project, what the requirements are and how long it takes before your dwindling sanity finally breaks!

As you remove bottlenecks, you will get diminishing returns on the time and effort you are putting in to optimize your system. As you go through the process you need to decide when good is good enough. Remember, speed is a means to an end, don’t get caught in the trap of optimizing for the sake of it. If it is not going to have an impact on users, then it is probably time to move on.

Building large-scale ML systems is HARD. It’s like playing a twisted game of “Where’s Waldo” crossed with Dark Souls. If you do manage to find the problem you have to take multiple attempts to beat it and you end up spending most of your time getting your ass kicked, asking yourself “Why am I spending my Friday night doing this?”. Having a simple and principled approach can help you get past that final boss battle and taste those sweet, sweet theoretical max FLOPs.
