Serving Deep Learning Models to Process a Billion Impressions per Day with Sub-Second Response | by Sudhahar Thiagarajan | Oct, 2023

In today’s digital landscape, personalized recommendations have become an integral part of user experiences. Whether it’s e-commerce platforms, streaming services, or news websites, recommendation engines play a vital role in engaging users and driving conversions. However, serving recommendations at scale, especially on web pages with billions of impressions, poses significant technical challenges. In this article, we will explore the intricacies of building and scaling recommendation engines to cater to such high traffic scenarios.

The data scientists have built a recommendation engine that combines three models:

  • Collaborative filtering algorithm: looks at products purchased by several users and provides recommendations for each user, in the form of a cookie id and its recommended products.
  • Computer vision algorithm: analyzes product images and provides recommendations for every product, in the form of a target product id and its recommended products.
  • Generic recommendations: look at purchase behavior, browsing patterns, and so on, and provide aggregated recommendations at the location and device level.

These models are served to users as follows:

  • If it’s a brand-new user: serve generic recommendations based on the location and device of the cookie.
  • If the user has made purchases before: serve the collaborative filtering results based on cookie id.
  • If the user is browsing right now (viewing products, adding products to the cart, etc.): take the products the user is browsing and recommend from the computer vision algorithm based on product id.
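
The serving rules above can be sketched as a small routing function. This is a minimal illustration, not the author's implementation: the class names, field names, and in-memory dictionaries are hypothetical, and the precedence (browsing first, then purchase history, then generic) and the fall-through on a lookup miss are assumptions, since the article does not say which rule wins when several apply.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RequestContext:
    """Hypothetical per-request context; field names are illustrative."""
    cookie_id: str
    location: str
    device: str
    has_purchases: bool = False
    current_product_id: Optional[str] = None  # set while the user is browsing


@dataclass
class RecommendationStore:
    """In-memory stand-ins for the three precomputed model outputs."""
    by_cookie: Dict[str, List[str]] = field(default_factory=dict)    # collaborative filtering
    by_product: Dict[str, List[str]] = field(default_factory=dict)   # computer vision
    by_segment: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)  # generic


def recommend(ctx: RequestContext, store: RecommendationStore) -> List[str]:
    # 1. Active browsing (assumed highest priority): look up by product id.
    if ctx.current_product_id is not None:
        recs = store.by_product.get(ctx.current_product_id)
        if recs:
            return recs
    # 2. Returning buyer: look up collaborative filtering results by cookie id.
    if ctx.has_purchases:
        recs = store.by_cookie.get(ctx.cookie_id)
        if recs:
            return recs
    # 3. Brand-new user, or no hit above: generic results by location and device.
    return store.by_segment.get((ctx.location, ctx.device), [])
```

Because each branch is a key lookup against precomputed results, the request path does no model inference in this sketch, which is one way to keep response times low.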

Serving recommendations at such a massive scale presents unique challenges that require careful consideration. Here are some key challenges to address:

1. Latency and Response Time: With billions of impressions, low latency becomes crucial. Users expect near-instantaneous response times, and any delays can lead to frustration or abandonment. Delivering personalized recommendations in real-time requires optimizing algorithms, infrastructure, and data processing pipelines.

2. Scalability and Throughput: Accommodating high traffic volumes necessitates a scalable architecture capable of handling concurrent requests efficiently. A robust infrastructure, distributed computing, and load balancing techniques are essential to ensure smooth operations even during peak traffic.

3. Data Processing and Storage: Recommendation engines rely on large volumes of user data for analysis. Efficient data processing frameworks like Apache Hadoop, Apache Spark, or distributed databases are essential for handling massive datasets and extracting relevant insights.

4. Dynamic Content and Freshness: Web pages are often dynamic, with content changing frequently. Updating recommendations in real-time, reflecting changes in user preferences or inventory availability, is crucial to maintain relevance. A well-designed caching strategy can significantly reduce the processing overhead while keeping recommendations up to date.

5. Personalization and Diversity: Recommendation systems should strike a balance between personalization and diversity. Over-personalization can create filter bubbles and limit user exposure to new content, while too much diversity may result in irrelevant recommendations. Implementing diversity metrics and fine-tuning recommendation algorithms can ensure an optimal balance.

To overcome the challenges of serving recommendations at scale, here are some strategies to consider:

1. Distributed Computing: Utilize distributed computing frameworks like Apache Spark or Hadoop, or the distributed training support in TensorFlow, to parallelize recommendation computations. This allows for faster processing and scalability across multiple machines or clusters.

2. Caching and Precomputation: Leverage caching mechanisms to store precomputed recommendations, reducing the need for real-time computation. Cache invalidation techniques can be employed to refresh recommendations periodically or when significant changes occur.

3. Microservices Architecture: Decompose the recommendation engine into microservices, each responsible for a specific task. This enables independent scaling of different components, better fault isolation, and easier maintenance.

4. Asynchronous Processing: Employ asynchronous processing to handle large volumes of requests without blocking the system. Queue-based message brokers like Apache Kafka or RabbitMQ can ensure reliable message delivery and smooth integration with downstream systems.

5. Load Balancing and Auto-scaling: Distribute the load evenly across multiple servers using load balancing techniques. Auto-scaling allows the infrastructure to dynamically adjust resources based on traffic patterns, ensuring optimal performance during peak times.
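
As one illustration of the caching and precomputation strategy in the list above, here is a minimal in-process cache with time-based expiry and explicit invalidation. The class and method names are hypothetical, and a production system would more likely use a shared cache service; this sketch only shows the mechanics of serving a stored result while it is fresh and recomputing it otherwise.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Tiny in-process cache with time-based expiry (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get_or_compute(self, key: str, compute: Callable[[str], List[str]]) -> List[str]:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: skip recomputation
        value = compute(key)  # miss or expired: recompute and store
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry early, e.g. when inventory for a product changes."""
        self._entries.pop(key, None)
```

The TTL covers periodic refresh, and `invalidate` covers the "when significant changes occur" case; the right TTL depends on how quickly recommendations go stale for the product catalog in question.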

Model parallelism

  • Different layers of a deep learning model may be trained in parallel on different GPUs.
  • For example, Elephas (Keras) runs distributed deep learning models at scale with Spark’s RDDs and DataFrames, supporting:
    – Data-parallel training of deep learning models
    – Distributed hyperparameter optimization
    – Distributed training of ensemble models

Data parallelism

  • The training data is divided into multiple subsets, and each subset is run on a replica of the same model on a different GPU (worker).
  • For example, in Spark: thread pools, parallel jobs and schedules, Pandas UDFs, a dedicated ExecutionContext, asynchronous functions (concurrent.futures), and partitioning.


  • PyTorch: torch.nn.parallel.DistributedDataParallel
  • TensorFlow: tf.distribute.MirroredStrategy
  • CUDA
  • Keras, RNN, Faster R-CNN, YOLO
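
The data-parallel idea can be sketched on a CPU with only the standard library: split the data into shards, compute a gradient per shard on a replica of the same model, then average the gradients before updating. This is a toy simulation under stated assumptions (a one-parameter model `y_hat = w * x`, threads standing in for GPU workers, and hypothetical function names); it is not how the frameworks listed above are implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


def shard(data: Sequence[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the data round-robin into at most num_workers non-empty shards."""
    shards = [list(data[i::num_workers]) for i in range(num_workers)]
    return [s for s in shards if s]


def local_gradient(w: float, samples: Sequence[Sample]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_step(w: float, data: Sequence[Sample], num_workers: int, lr: float) -> float:
    shards = shard(data, num_workers)
    # Every worker sees the same weight w (the replicated model) but its own shard.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        grads = list(pool.map(lambda s: local_gradient(w, s), shards))
    # Average weighted by shard size, so the result equals the full-batch gradient.
    total = sum(len(s) for s in shards)
    avg_grad = sum(g * len(s) for g, s in zip(grads, shards)) / total
    return w - lr * avg_grad
```

In PyTorch's `DistributedDataParallel`, the equivalent averaging step happens through a gradient all-reduce across processes during the backward pass.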

Model Monitoring

  • Publish ongoing, live measurements of model accuracy (RMSLE, RMSE) to a time-series monitoring system.
  • Set threshold alerts on those measurements.
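
A minimal sketch of the two metrics and a threshold check follows. The function names and the alert message format are hypothetical, and the thresholds would have to come from the model's own baseline; in practice the metric values would be written to whatever time-series system is in use rather than returned from a function.

```python
import math
from typing import Dict, List, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmsle(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


def check_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one alert message per metric that exceeds its threshold."""
    return [
        f"{name}={value:.4f} exceeds threshold {thresholds[name]}"
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]
```

RMSLE penalizes relative rather than absolute error, which can be the more useful signal when the predicted quantity spans several orders of magnitude.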

Data Assumptions

  • Different groups of customers can have unique but isolated spending habits.
  • A single customer can have extremely different spending habits, such as gifts vs. purchases for themselves, or a one-time expensive buy in one product category vs. other categories.
  • Each customer may shop across Men’s, Women’s, Kids, and Home.
  • Treating the gender registered on the website and app as indicative of purchase behavior can produce false positives.
  • Price points can be highly variable within a particular product category.

For each customer’s viewing and purchase history:

  • Aggregate and sum all activity for each member by location, device, session id, and cookie id.
  • Generate the dense input vector of user, item, and (implicit) rating in the form of (user_id, location, device, session id, cookie id,
  • Partition the table by day or hour.
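
The aggregation step above can be sketched as a group-and-sum over raw events. Several things here are assumptions: the tuple in the article is cut off, so the `item_id` field and the rating column are guesses at what follows; the action names and their weights are invented for illustration; and a real pipeline at this scale would run the same logic in a distributed engine rather than a Python loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Event(NamedTuple):
    """One viewing or purchase event; item_id and action are assumed fields."""
    user_id: str
    location: str
    device: str
    session_id: str
    cookie_id: str
    item_id: str
    action: str  # "view", "add_to_cart", or "purchase"


# Assumed weights for turning actions into an implicit rating.
ACTION_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

Key = Tuple[str, str, str, str, str, str]


def aggregate(events: Iterable[Event]) -> Dict[Key, float]:
    """Sum weighted activity per (user, location, device, session, cookie, item)."""
    ratings: Dict[Key, float] = defaultdict(float)
    for e in events:
        key = (e.user_id, e.location, e.device, e.session_id, e.cookie_id, e.item_id)
        ratings[key] += ACTION_WEIGHTS.get(e.action, 0.0)  # unknown actions count as 0
    return dict(ratings)
```

Each resulting key and value pair is one row of the model input, with the summed weight serving as the implicit rating.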

Fault Tolerance & Recovery

  • Input sources must be replayable.
  • Output sinks must support transactional updates.
  • State, such as running counts, must be managed in a storage system such as S3 or HDFS.
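
These three requirements fit together as follows: if the read offset and the running counts are committed in one atomic write, a restarted job can replay the input from the stored offset without double counting. The sketch below assumes a Python list as the replayable source and a local JSON file as a stand-in for S3 or HDFS; the function names and the state layout are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_state(path: str) -> Dict:
    """Load the last committed checkpoint, or start from scratch."""
    if not os.path.exists(path):
        return {"offset": 0, "counts": {}}
    with open(path) as f:
        return json.load(f)


def save_state(path: str, state: Dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def process(log: List[str], path: str, batch_size: int = 2) -> Dict[str, int]:
    """Count product ids in a replayable log, resuming from the stored offset."""
    state = load_state(path)
    while state["offset"] < len(log):
        batch = log[state["offset"]: state["offset"] + batch_size]
        for product_id in batch:
            state["counts"][product_id] = state["counts"].get(product_id, 0) + 1
        state["offset"] += len(batch)
        save_state(path, state)  # offset and counts are committed together
    return state["counts"]
```

If the process dies between batches, rerunning `process` with the same log and path picks up at the last committed offset, so each event is counted exactly once.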

Scheduling, Workflow and Monitoring

  • Airflow, Argo, Kubernetes


  • A distributed cluster of n GPU servers connected with InfiniBand

Platform Considerations

  • Data pipeline and ML pipeline on AWS or Azure
  • Recommendation by customer journey context
  • Log
    – customer journey
    – recommendations
    – customer actions
  • Performance (SLA of 500 ms), scalability, failover
  • Role of MLOps
  • Monitoring, logging, alerting and tracing
  • Microservices serving the recommendations.
  • Assumptions
    – A billion impressions on the web page per day
    – Dynamic recommendations as per user context.
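
To make these assumptions concrete, the arithmetic below converts a billion impressions per day and a 500 ms SLA into request rates. It assumes one recommendation request per impression and a peak-to-average traffic ratio of 3, neither of which comes from the article; the ratio in particular should be replaced with a measured value.

```python
IMPRESSIONS_PER_DAY = 1_000_000_000  # from the assumptions above
SLA_SECONDS = 0.5                    # 500 ms SLA
PEAK_TO_AVERAGE = 3.0                # assumed; measure this from real traffic

SECONDS_PER_DAY = 24 * 60 * 60

average_rps = IMPRESSIONS_PER_DAY / SECONDS_PER_DAY
peak_rps = average_rps * PEAK_TO_AVERAGE

# Little's law: requests in flight = arrival rate * time each request spends in the system.
# Using the SLA as the time in system gives an upper bound on concurrency at peak.
max_in_flight_at_peak = peak_rps * SLA_SECONDS

print(f"average: {average_rps:,.0f} req/s")              # average: 11,574 req/s
print(f"peak:    {peak_rps:,.0f} req/s")                 # peak:    34,722 req/s
print(f"in flight at peak: {max_in_flight_at_peak:,.0f}")  # in flight at peak: 17,361
```

Numbers of this size are what motivate precomputed lookups, caching, and horizontal scaling rather than running model inference on every request.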


Once you have designed the pipelines, you need to scale the system to handle a billion impressions. There are a number of factors that can affect the scalability of your system, including the following:

  • The amount of data: The more data you have, the more processing power you will need.
  • The number of users: The more users you have, the more requests you will need to handle.
  • The desired level of performance: If you need to generate recommendations in real time, you will need to use a system that can handle a high volume of requests.

There are a number of ways to scale a recommendation system. One common approach is to use a distributed architecture. This can be done by distributing the data across multiple servers or by using a cloud-based service. Another approach is to use a caching system. This can help to reduce the load on the recommendation engine by storing frequently requested recommendations in memory.
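
One simple way to distribute data across multiple servers is to hash a stable key and take the result modulo the number of shards. The sketch below assumes the cookie id is the key and uses a hypothetical `shard_for` helper; it illustrates the idea only and is not a description of any particular database's partitioning scheme.

```python
import hashlib


def shard_for(cookie_id: str, num_shards: int) -> int:
    """Map a cookie id to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # hashlib is used instead of the built-in hash(), which is randomized per process.
    digest = hashlib.sha256(cookie_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every server that routes requests computes the same shard for the same cookie id, so no lookup table is needed. The trade-off is that changing `num_shards` remaps most keys; consistent hashing is the usual alternative when shards are added or removed often.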

Serving a real-time recommendation engine and pipelines for a billion impressions can be challenging, but it is possible with the right architecture, design, and scaling techniques. By following the tips in this article, you can build a system that can handle the demands of a large-scale application.

