Hello :) Today is Day 311!
A quick summary of today:
- joining some sessions at NODES 2024
- reading chapter 10 of the LLM Engineer’s Handbook
- reading about some of the less popular linear models in scikit-learn’s docs
Before I start, I got my certificate today:
NODES 2024
Today is NODES - Neo4j’s free 24-hour online conference. I joined some of the sessions.
Neo4j Product Vision and Roadmap
They are introducing CDC (Change Data Capture), which is nice.
Coming soon - text-to-Cypher models will be made available. Judging by this LinkedIn post, it seems such natural-language-to-Cypher prompting will be available in AuraDB.
Most of the things were not for me. If I were working for a company that was trying to use or already using Neo4j, then I think everything would have been interesting for me 😆 but as I am just a student fan for now, I was mostly looking forward to any news regarding their Graph Data Science library. Sadly, there was none :(
Neo4j Visualisation Library (NVL)
I joined a session on NVL. Not sure how new it is, but it is a JavaScript library for … visualisations.
LLM Engineer’s Handbook Chapter 10 - Inference pipeline deployment
Criteria for choosing deployment types
Throughput and latency
Throughput is generally measured in requests per second (RPS), and is crucial when deploying ML models that are expected to process many requests.
Latency is the time it takes a system to process a single inference request, from when it is received until the result is returned. It is critical in real-time applications where quick response times are essential, such as live user interactions, fraud detection, or any system requiring immediate feedback.
A lower latency translates to higher throughput when the service successfully processes multiple queries in parallel. For example, if the service takes 100 ms to process a request, this translates to a throughput of 10 requests per second; if the latency drops to 10 ms per request, the throughput rises to 100 requests per second.
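As a quick sanity check on that arithmetic, here is a tiny sketch of my own (not from the book) of the latency-to-throughput relationship; the workers parameter is my assumption, just to show how parallelism scales it:

```python
# Back-of-the-envelope estimate: throughput (RPS) ~= workers / latency_in_seconds.
# 'workers' is an assumed knob to illustrate parallel processing, not something from the book.
def estimated_throughput(latency_ms: float, workers: int = 1) -> float:
    return workers * 1000.0 / latency_ms

print(estimated_throughput(100))             # 10.0 RPS at 100 ms per request
print(estimated_throughput(10))              # 100.0 RPS at 10 ms per request
print(estimated_throughput(100, workers=4))  # 40.0 RPS with 4 parallel workers
```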
Data
The model’s input and output data—including its format, volume, and complexity—are crucial to the inference process. Data characteristics like size and type directly impact latency and throughput, with more complex or larger data increasing processing time. Different model types require distinct handling based on input-output formats.
Before choosing a specific deployment type, we should ask ourselves:
- what are the throughput requirements?
- how many requests must the system handle simultaneously?
- what are the latency requirements?
- how should the system scale?
- what are the cost requirements?
- what is the size of the data we work with?
Inference deployment types
Monolithic vs. microservices architecture in model serving
The choice between monolithic and microservices architectures for serving ML models largely depends on the application’s specific needs. A monolithic approach might be ideal for smaller teams or more straightforward applications where ease of development and maintenance is a priority. It’s also a good starting point for projects without frequent scaling requirements. Also, if the ML models are smaller, don’t require a GPU, or require only smaller and cheaper GPUs, the trade-off between reducing costs and complicating your infrastructure is worth considering.
On the other hand, microservices, with their adaptability and scalability, are well suited for larger, more complex systems where different components have varying scaling needs or require distinct tech stacks. This architecture is particularly advantageous when scaling specific system parts, such as GPU-intensive LLM services.
LLMTwin’s inference pipeline
It is defined by the feature/training/inference (FTI) architecture. For the pipeline to run, it needs two things:
- real-time features used for RAG, generated by the feature pipeline and queried from our online feature store, more concretely from the Qdrant vector DB
- a fine-tuned LLM, generated by the training pipeline and pulled from our model registry
The flow is (a rough code sketch follows the list):
- a user sends a query through an HTTP request
- the user’s input retrieves the proper context by leveraging the advanced RAG retrieval module
- the user’s input and retrieved context are packed into the final prompt using a dedicated prompt template
- the prompt is sent to the LLM microservice through an HTTP request
- the business microservice waits for the generated answer
- after the answer is generated, it is sent to the prompt monitoring pipeline along with the user’s input and other vital info to monitor
- ultimately, the generated answer is sent back to the user
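To make the flow concrete, here is a minimal sketch of how I picture the business microservice orchestrating these steps. The helper names, the URL, and the JSON shapes are my own placeholders, not the book’s actual code:

```python
# Minimal sketch of the business microservice flow (placeholder names, not the book's code).
import requests

PROMPT_TEMPLATE = "Answer using only this context:\n{context}\n\nQuestion: {query}\n"

def retrieve_context(query: str) -> list[str]:
    # In the real pipeline this goes through the advanced RAG retrieval module,
    # which queries the Qdrant vector DB (stubbed here).
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"]

def call_llm_service(prompt: str) -> str:
    # HTTP request to the LLM microservice; URL and payload shape are assumptions.
    response = requests.post("http://llm-service:8000/generate", json={"prompt": prompt})
    return response.json()["answer"]

def answer_query(query: str) -> str:
    context = "\n".join(retrieve_context(query))                    # retrieve context
    prompt = PROMPT_TEMPLATE.format(context=context, query=query)   # pack the final prompt
    answer = call_llm_service(prompt)                               # wait for the generated answer
    print({"query": query, "prompt": prompt, "answer": answer})     # stand-in for prompt monitoring
    return answer                                                   # send the answer back to the user
```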
The final part of the chapter talks about deploying the model service using SageMaker and FastAPI.
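I have not copied the book’s deployment code, but as a rough sketch of what a FastAPI wrapper around a SageMaker endpoint could look like (the endpoint name and request/response JSON shapes are assumptions on my part):

```python
# Hedged sketch: a FastAPI endpoint forwarding the prompt to a SageMaker endpoint.
# 'llm-twin-endpoint' and the payload format are hypothetical, not the book's exact code.
import json

import boto3
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
sm_runtime = boto3.client("sagemaker-runtime")

class Query(BaseModel):
    text: str

@app.post("/generate")
def generate(query: Query) -> dict:
    response = sm_runtime.invoke_endpoint(
        EndpointName="llm-twin-endpoint",          # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": query.text}),
    )
    result = json.loads(response["Body"].read())
    return {"answer": result}
```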
Stream for 2hrs
I went over the linear models section of scikit-learn’s docs. There were some familiar ‘faces’, but also some concepts I had not heard of before, like Least Angle Regression, Orthogonal Matching Pursuit (OMP), and others, as the list is quite long.
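Both of those live in sklearn.linear_model, so here is a quick sketch of trying them on toy data; the dataset and the n_nonzero_coefs value are arbitrary choices just for illustration:

```python
# Trying two of the less common linear models from sklearn.linear_model on toy data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars, OrthogonalMatchingPursuit

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=0.1, random_state=0)

lars = Lars().fit(X, y)                                       # Least Angle Regression
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)  # OMP keeps at most 5 nonzero coefficients

# Compare how many coefficients each model actually uses.
print((lars.coef_ != 0).sum(), (omp.coef_ != 0).sum())
```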
That is all for today!
See you tomorrow :)