
Optimizing ML Serving Systems at Pinterest with GPU-Based Models


Introduction to Pinterest's Machine Learning Framework

Pinterest, a platform that fuels inspiration for its users, relies heavily on machine learning (ML) to personalize the content it recommends. The core of this functionality is a set of recommender models that suggest Pins based on each user's interests and past interactions.

Evolution of ML Serving Systems at Pinterest

Initially, Pinterest used a CPU-based ML serving system built around a scatter-gather architecture: requests were fanned out to multiple leaf servers, each handling tasks such as feature hydration and inference batching, and the partial results were merged at the root. However, as the demand for faster and more efficient processing grew, Pinterest recognized the need to shift toward GPU-based serving.
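
For intuition, here is a minimal Python sketch of the scatter-gather pattern described above. The leaf addresses, payload shape, and simulated RPC are all illustrative assumptions, not Pinterest's actual serving code.

```python
# Minimal scatter-gather root: fan a request out to leaf servers, wait for
# all partial results, then merge them. LEAF_ADDRS and fetch_leaf_scores are
# hypothetical names used only for illustration.
import asyncio

LEAF_ADDRS = ["leaf-0:8080", "leaf-1:8080", "leaf-2:8080"]  # hypothetical shards

async def fetch_leaf_scores(addr: str, request: dict) -> dict:
    # In a real system this would be an RPC call; here we only simulate latency.
    await asyncio.sleep(0.01)
    return {pin: 0.5 for pin in request["candidate_pins"]}

async def scatter_gather(request: dict) -> dict:
    # Scatter: every leaf scores its shard of candidates concurrently.
    partials = await asyncio.gather(
        *(fetch_leaf_scores(addr, request) for addr in LEAF_ADDRS)
    )
    # Gather: merge the per-shard score maps into one response.
    merged: dict = {}
    for part in partials:
        merged.update(part)
    return merged

if __name__ == "__main__":
    req = {"user_id": 123, "candidate_pins": ["p1", "p2", "p3"]}
    print(asyncio.run(scatter_gather(req)))
```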

Transitioning to GPU-Based Serving

The transition began around 2021, when Pinterest moved from its TensorFlow-based CPU serving stack to PyTorch on CPUs. By 2022, they started rolling out GPU serving for recommender models. The shift was driven by the need for higher throughput and lower latency when processing large volumes of requests.

Challenges Faced

One significant challenge was adapting the existing CPU infrastructure to support GPU operations efficiently. GPUs are throughput-oriented and perform best on large batches, whereas the CPU serving path had been tuned around many small, low-latency requests. This required rethinking the scatter-gather model, which had been optimized for CPUs.
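
As a rough illustration of this batching shift, the sketch below coalesces many small per-request tensors into one large batch before running the model on the GPU. The queue, batch size, and timeout values are invented for illustration and are not Pinterest's implementation.

```python
# Toy request coalescing: wait briefly after the first request arrives so
# that many small requests can be served as one large GPU batch.
import queue

import torch

def collect_batch(pending: "queue.Queue",
                  max_batch: int = 256, max_wait_s: float = 0.005) -> list:
    """Drain up to max_batch feature tensors, waiting at most max_wait_s
    after the first one arrives (to bound added latency)."""
    items = [pending.get()]                # block until at least one request
    try:
        while len(items) < max_batch:
            items.append(pending.get(timeout=max_wait_s))
    except queue.Empty:
        pass                               # timeout hit; serve what we have
    return items

@torch.no_grad()
def serve_batch(model: torch.nn.Module, items: list) -> list:
    # Stack the small per-request tensors into one large batch for the GPU.
    batch = torch.stack(items).cuda(non_blocking=True)
    scores = model(batch)
    # Return one score row per original request.
    return list(scores.cpu().split(1))
```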

Optimizing GPU Serving System

To maximize efficiency, several optimizations were implemented:

  • Data Loading: Adjustments were made to accommodate the larger batch sizes that GPUs favor. This involved disabling data sharding and using a hybrid in-memory and SSD-based caching system (a simplified two-tier cache is sketched after this list).

  • Memory Management: Memory arenas were introduced to manage allocations dynamically at runtime, significantly reducing allocation overhead.

  • Batch Processing: Keeping the batch workers fast was crucial. CUDA graphs were used to cut kernel launch overhead by capturing a sequence of operations and replaying them as a single graph execution (see the capture/replay sketch below).

  • Feature Handling: Processing a large number of features efficiently was another hurdle. Custom CUDA operators and fused embedding lookups streamlined the handling of multi-categorical features (an EmbeddingBag-style illustration follows).
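
The hybrid cache mentioned under Data Loading can be pictured as a small in-memory dictionary backed by files on local SSD. The sketch below is a deliberately simplified illustration with a made-up cache directory and capacity, not Pinterest's system.

```python
# Toy two-tier feature cache: RAM first, then SSD, otherwise a miss that the
# caller must hydrate from the remote feature store.
import os
import pickle

CACHE_DIR = "/mnt/ssd/feature_cache"    # hypothetical SSD-backed directory
MEM_CAPACITY = 100_000                  # entries kept in RAM

_mem_cache: dict = {}

def get_features(key: str):
    # 1. RAM hit: cheapest path.
    if key in _mem_cache:
        return _mem_cache[key]
    # 2. SSD hit: deserialize and promote into RAM if there is room.
    path = os.path.join(CACHE_DIR, f"{key}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            value = pickle.load(f)
        if len(_mem_cache) < MEM_CAPACITY:
            _mem_cache[key] = value
        return value
    # 3. Miss: signal the caller to fetch from the feature store.
    return None
```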
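
The CUDA-graph idea from the Batch Processing bullet can be shown with PyTorch's public torch.cuda.CUDAGraph API. The model, batch shape, and warm-up loop below are placeholders; the point is that replay() launches all captured kernels at once instead of paying a launch cost per kernel.

```python
# Capture an inference pass into a CUDA graph once, then replay it per batch.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1)
).cuda().eval()

static_input = torch.zeros(256, 512, device="cuda")   # fixed batch shape

# Warm up on a side stream before capture, as the capture API expects.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    with torch.no_grad():
        for _ in range(3):
            model(static_input)
torch.cuda.current_stream().wait_stream(side)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

def infer(batch: torch.Tensor) -> torch.Tensor:
    # Copy the new batch into the captured input buffer, then replay the graph.
    static_input.copy_(batch)
    graph.replay()
    return static_output.clone()
```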
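
Pinterest's custom CUDA operators are not public, but the kind of fused lookup used for multi-categorical features resembles torch.nn.EmbeddingBag, which pools the embeddings of a variable-length list of category ids in a single operation. The table size, dimension, and ids below are arbitrary.

```python
# One fused lookup-and-pool over a variable-length categorical feature.
import torch

NUM_CATEGORIES, DIM = 10_000, 32
bag = torch.nn.EmbeddingBag(NUM_CATEGORIES, DIM, mode="sum").cuda()

# Three examples whose category-id lists have lengths 2, 1, and 3.
ids = torch.tensor([4, 17, 9, 256, 3, 3], device="cuda")
offsets = torch.tensor([0, 2, 3], device="cuda")   # start of each example's bag

pooled = bag(ids, offsets)   # shape (3, 32): one pooled vector per example
print(pooled.shape)
```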

Results Achieved

By mid-2024, all major recommender model use cases at Pinterest had been migrated to the optimized GPU serving system. The migration delivered better throughput and latency while keeping infrastructure costs roughly neutral.

Future Directions and Continuous Optimization

The journey does not end here; continuous optimization remains essential as model complexity and infrastructure demands evolve. For instance, remote inference has been explored, in which computation-heavy parts of a request are offloaded to specialized servers so that resource usage is better matched to each type of workload.
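
A minimal sketch of the remote-inference idea might look like the following: the serving host forwards the compute-heavy scoring step to a GPU service and keeps lightweight work local. The endpoint, payload format, and timeout are invented for illustration.

```python
# Hypothetical remote-inference client: send features to a GPU scoring service.
import json
import urllib.request

REMOTE_INFERENCE_URL = "http://gpu-inference.internal:8000/score"  # hypothetical

def remote_score(user_features: dict, candidate_pins: list) -> list:
    payload = json.dumps(
        {"user_features": user_features, "candidates": candidate_pins}
    ).encode("utf-8")
    req = urllib.request.Request(
        REMOTE_INFERENCE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Offload the heavy model evaluation; only parse the returned scores here.
    with urllib.request.urlopen(req, timeout=0.2) as resp:
        return json.loads(resp.read())["scores"]
```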

Article created from: https://youtu.be/bg2Cfk649Mg?si=BmeAxCQR0YCegvKA
