(Day 286) MLE - Model deployment from Andriy Burkov

Ivan Ivanov · October 13, 2024

Hello :) Today is Day 286!

A quick summary of today:

  • read the 2nd to last chapter of Andriy Burkov’s MLE book

Chapter 6 from Andriy Burkov’s MLE book


Deploying a model means making it available to accept queries generated by the users of the production system. Once the production system accepts a query, the query is transformed into a feature vector. The feature vector is then sent to the model as input for scoring, and the result of the scoring is returned to the user.

A trained model can be deployed in various ways. It can be deployed on a server or on a user’s device, and it can be rolled out to all users at once or to a small fraction of them.

A model can be deployed following several patterns:

  • statically, as a part of an installable software package
  • dynamically on the user’s device
  • dynamically on a server
  • via model streaming

Static deployment

Static deployment involves packaging a machine learning model as part of traditional software deployment. The model and feature extractor are bundled into an installable binary, such as a dynamic-link library (DLL), shared object file, or serialized resource for virtual machine systems (e.g. Java, .NET). This approach offers fast execution, maintains user privacy, supports offline usage, and transfers model maintenance to the user. However, it complicates model upgrades, as updating the model often requires updating the entire application. Additionally, models with specific computational requirements, like GPU access, may limit deployment flexibility.
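To make the static pattern a bit more concrete for myself, here is a minimal Python sketch of loading a model that ships inside the installed application package (the package name `myapp` and the file `model.pkl` are made up for illustration, not from the book):

```python
# Static deployment sketch: the serialized model is bundled with the installed
# package at build time, so loading it needs no network access.
# "myapp" and "model.pkl" are hypothetical names.
import pickle
from importlib import resources

def load_bundled_model():
    # Read the model file that was shipped inside the package.
    data = resources.files("myapp").joinpath("model.pkl").read_bytes()
    return pickle.loads(data)

model = load_bundled_model()
# The model can now be used fully offline, e.g. model.predict(feature_vector).
```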

Dynamic deployment on the user’s device

Dynamic deployment allows users to run part of a machine learning system on their device, but unlike static deployment, the model is not embedded in the application’s binary. This approach enables easier model updates without requiring a full application update. It can also select the appropriate model based on the device’s compute resources. There are several methods for dynamic deployment:

  • deployment of model parameters: the model file only contains learned parameters, while the device runs the model in a pre-installed environment. Examples include TensorFlow Lite for mobile and Apple’s Core ML

  • deployment of a serialized object: the model is deployed as a serialized object that is deserialized on the device, avoiding the need for a runtime environment but potentially leading to large updates (a rough sketch of this is shown after the list)

  • deployment to browser: using frameworks like TensorFlow.js, models can be run in a browser, leveraging JavaScript and possibly a GPU
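As a sketch of the serialized-object option (my own illustration, with a made-up URL), the application could fetch the latest model file at startup and deserialize it locally:

```python
# Dynamic deployment sketch: fetch the newest serialized model from a server
# at startup instead of bundling it into the application binary.
# The URL is hypothetical.
import pickle
import urllib.request

MODEL_URL = "https://models.example.com/churn/latest.pkl"

def fetch_latest_model(url: str = MODEL_URL):
    # A real app would cache the download and check a version header
    # to avoid pulling the whole file on every launch.
    with urllib.request.urlopen(url) as resp:
        return pickle.loads(resp.read())

model = fetch_latest_model()
```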

Advantages and drawbacks

The main advantages of dynamic deployment are faster model access for users and reduced server load for the organization. Browser-based deployments further lower infrastructure requirements, since only web pages with model parameters need to be served. However, drawbacks include higher bandwidth and startup costs due to repeated model downloads, and the difficulty of keeping model versions consistent across users. Additionally, model security is at risk, as the model can be reverse-engineered by third parties.

Dynamic deployment on a server

Dynamic deployment on a server is the most common approach, offering scalability and ease of model updates. Models are made available via REST API or gRPC services and deployed on virtual machines, containers, or serverless platforms.
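Here is a minimal sketch of what such a REST service could look like with Flask (the route, file name, and request format are my assumptions, not code from the book):

```python
# REST deployment sketch: a tiny Flask service that loads a serialized model
# once at startup and scores feature vectors sent as JSON.
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:  # hypothetical serialized model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # In a real system the same feature extractor used at training time
    # would turn the raw request into this feature vector.
    features = [payload["feature_vector"]]
    score = model.predict(features)[0]
    return jsonify({"prediction": float(score)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```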

Deployment on a virtual machine

In a virtual machine (VM) deployment, a web service running on a VM processes incoming HTTP requests, applying the model to the input data and returning predictions in formats like JSON or XML. Multiple VMs can be used in parallel, with a load balancer routing requests based on resource availability. VMs can be scaled manually or automatically to handle varying traffic loads. The simplicity of this architecture makes it a common choice, but maintaining VMs adds operational complexity and cost. Additionally, network latency can slow down responses, and there is some overhead from running a separate operating system on each VM.


Deployment in a container

Container-based deployment is a more modern approach where the model and necessary services are packaged within containers (like Docker). Containers share the host machine’s operating system, making them more lightweight and efficient compared to VMs. Containers can be managed using orchestration systems like Kubernetes, which can automatically scale the number of running containers based on demand. Containers can even scale to zero when idle, significantly reducing resource usage and costs. However, setting up and managing containerized deployments is more complex and requires specialized expertise.


Serverless deployment

In serverless deployment, models are packaged into functions (e.g. AWS Lambda) that automatically scale based on requests. This is highly cost-efficient and easy to maintain, but it is limited by file size, runtime memory, and the unavailability of GPUs for deep learning models.
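A sketch of what a serverless handler could look like in the AWS Lambda style (the event shape and file name are assumptions on my part):

```python
# Serverless deployment sketch: the model is loaded outside the handler so
# that warm invocations can reuse it; the handler just scores one request.
import json
import pickle

with open("model.pkl", "rb") as f:  # bundled in the deployment package (hypothetical)
    MODEL = pickle.load(f)

def handler(event, context):
    body = json.loads(event["body"])
    score = MODEL.predict([body["feature_vector"]])[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": float(score)}),
    }
```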

Model streaming

Model streaming is an alternative to REST API-based deployment, designed for complex workflows involving multiple models. In streaming deployments, data flows continuously through a series of models and transformations managed by a stream-processing engine (e.g. Apache Flink or Apache Storm). Rather than handling individual requests and responses, streaming applications process data in real time, providing faster insights for scenarios that require multiple intermediate predictions, such as natural language processing or recommendation systems. This approach reduces latency, improves resource efficiency, and is particularly suited for applications that follow a predictable sequence of operations. However, setting up a streaming infrastructure requires careful design and is more suitable for specific use cases where data is continuously processed.
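This is not actual Flink or Storm code, but here is a plain-Python toy to illustrate the idea of records flowing through a chain of models rather than going through a request/response API (the two "models" are stand-in functions):

```python
# Streaming sketch: each record passes through a pipeline of stages as soon
# as it arrives, producing intermediate predictions along the way.
def detect_language(record):
    record["lang"] = "en"  # stand-in for a real language-detection model
    return record

def classify_sentiment(record):
    record["sentiment"] = "positive"  # stand-in for a real sentiment model
    return record

def stream(records, stages):
    for record in records:
        for stage in stages:
            record = stage(record)
        yield record

incoming = ({"text": t} for t in ["great book", "long chapter"])
for out in stream(incoming, [detect_language, classify_sentiment]):
    print(out)
```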


Deployment strategies

Single deployment

This is the simplest and quickest strategy, where the new model directly replaces the old one. You serialize the new model, upload it (along with a new feature extractor if needed), and replace the previous version on the server or device. In cloud environments, this involves updating virtual machines or containers. Though easy to implement, it’s the riskiest since all users are affected if the new model contains bugs.
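A small sketch of what "just replace the file" could look like on a server (the paths are made up); using an atomic rename means a request never reads a half-written model:

```python
# Single deployment sketch: copy the new serialized model next to the live
# one, then atomically swap it in with os.replace.
import os
import shutil

def replace_model(new_model_path: str, live_model_path: str = "/srv/app/model.pkl"):
    tmp_path = live_model_path + ".tmp"
    shutil.copyfile(new_model_path, tmp_path)
    # os.replace is atomic on the same filesystem: readers see either the old
    # model or the complete new one, never a partial file.
    os.replace(tmp_path, live_model_path)
```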

Silent deployment

In this approach, both the old and new models run in parallel. The new model’s predictions are logged, but users continue seeing results from the old model. This allows engineers to detect issues with the new model without affecting users. However, running two models simultaneously increases resource consumption, and certain applications may require real-world feedback to properly evaluate the new model.
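A rough sketch of the idea (the models and logger are placeholders):

```python
# Silent (shadow) deployment sketch: both models score every request, the new
# model's output is only logged, and the user still gets the old model's answer.
import logging

logger = logging.getLogger("shadow")

def predict(features, old_model, new_model):
    old_pred = old_model.predict([features])[0]
    new_pred = new_model.predict([features])[0]
    # Log the candidate's output (and any disagreement) for offline analysis.
    logger.info("shadow prediction old=%s new=%s", old_pred, new_pred)
    return old_pred  # users only ever see the old model's prediction
```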

Canary deployment

With canary deployment, the new model is rolled out to a small subset of users while the majority continue using the old model. This method enables validation of the model’s real-world performance with minimal risk to the broader user base. However, rare bugs can be missed if the subset of users is too small to trigger those errors.
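A sketch of one common way to do the routing, using a deterministic hash of the user id so the same user always gets the same model (the 5% fraction is an arbitrary example):

```python
# Canary deployment sketch: send a small, stable fraction of users to the new
# model and everyone else to the old one.
import hashlib

CANARY_FRACTION = 0.05  # 5% of users, an arbitrary example value

def pick_model(user_id: str, old_model, new_model):
    # Hashing the user id gives a stable bucket in [0, 100).
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return new_model if bucket < CANARY_FRACTION * 100 else old_model
```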

Multi-armed bandits (MAB)

This strategy involves deploying multiple versions of a model and using a multi-armed bandit algorithm to optimize their use in real time. Initially, the system explores all versions, but over time it prioritizes the best-performing model and automatically routes users to it. MAB solves the challenge of online evaluation and deployment simultaneously, ensuring most users experience the optimal model once the system has gathered enough performance data.
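As a toy illustration (not the book's code), an epsilon-greedy bandit over model versions would mostly route traffic to the best-performing version so far while still exploring occasionally:

```python
# Multi-armed bandit sketch: epsilon-greedy routing across model versions.
import random

class EpsilonGreedyRouter:
    def __init__(self, models, epsilon=0.1):
        self.models = models
        self.epsilon = epsilon
        self.counts = [0] * len(models)
        self.rewards = [0.0] * len(models)

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best
        # average reward observed so far.
        if random.random() < self.epsilon:
            return random.randrange(len(self.models))
        averages = [r / c if c else 0.0 for r, c in zip(self.rewards, self.counts)]
        return max(range(len(self.models)), key=lambda i: averages[i])

    def update(self, index, reward):
        # Feed back the observed reward, e.g. a click or a correct prediction.
        self.counts[index] += 1
        self.rewards[index] += reward
```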

Best practices

  • algorithmic efficiency (reason about it with Big O notation)
  • deployment of deep models - inference might require a GPU, so deploy the model on a GPU instance and the rest of the app on CPU instances (GPU instances are costly, so keep anything that does not need a GPU off them)
  • caching and memoization (a small example is shown after this list)
  • use appropriate serialization tools for the model and code
  • start with a simple model
  • test on outsiders

Today I woke up at 6am and had to travel to Seoul and back. While travelling, I read the chapter and took the notes above. I had a look at the last chapter on ‘Model Serving, Monitoring, and Maintenance’ which I should finish next week - maybe even on stream (will see). Also next week I will be able to stream every day 🥳

Unfortunately, I could not do the GPT-4o fine-tuning, which I will do tomorrow when I get to the lab ^^

That is all for today!

See you tomorrow :)