Hello :) Today is Day 137!
A quick summary of today:
- attended the ‘data analysis’ and ‘business tech’ sessions during Day 2 of the AWS Summit in Seoul
I tried to take more notes on my laptop during the sessions today.
First session - ‘Data strategy for successful GenAI on AWS’
A GenAI app is like an iceberg.
Under the water it consists of, among other things, data storage, databases, a data lake, data manipulation, and data governance.
The session introduced us to RAG, and how we can either use an out-of-the-box language model, pre-train a model, or train a model from scratch.
As for vector search, AWS offers plenty of options: Amazon OpenSearch Service, OpenSearch Serverless, Aurora PostgreSQL, RDS for PostgreSQL, DocumentDB, DynamoDB (via zero-ETL), MemoryDB for Redis, and Neptune.
As for databases, there are plenty of structured and unstructured options.
Amazon Bedrock - lets us use out-of-the-box models for GenAI applications. It exposes each model behind a simple API and supports model customization, RAG, agents, and privacy and security protections.
If we use Amazon Bedrock, the vector databases we can use are Amazon OpenSearch Serverless, Redis Enterprise Cloud, Pinecone, Amazon Aurora, MongoDB.
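To make that "simple API" bit concrete for future me, here is a minimal sketch of calling a Bedrock model with boto3 - assuming credentials are configured and the Titan text model is enabled in the account; the model ID and prompt are just examples:

```python
import json

import boto3

# Bedrock runtime client; region and credentials assumed to be configured.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Example model ID - any text model enabled in the account works similarly.
response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({"inputText": "Explain RAG in one sentence."}),
)

# Titan text models return their completions under results/outputText.
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```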
Second session - ‘How to best use vector databases on AWS’
Start with a pdf - chunk - embed with Amazon Titan - store in Amazon Aurora PostgreSQL (or another db)
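A rough sketch of that pipeline as I understood it - assuming the Titan embeddings model is enabled in Bedrock and the Postgres side has the pgvector extension; the table, column, and connection details are made up:

```python
import json

import boto3
import psycopg2

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    """Embed one chunk of text with Amazon Titan Embeddings via Bedrock."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

# Hypothetical chunks; real chunking would split the PDF by pages or tokens.
chunks = ["First chunk of the PDF...", "Second chunk of the PDF..."]

conn = psycopg2.connect("dbname=docs user=postgres")  # Aurora endpoint in practice
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # Titan Embeddings v1 produces 1536-dimensional vectors.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS doc_chunks ("
        "id serial PRIMARY KEY, content text, embedding vector(1536))"
    )
    for chunk in chunks:
        vec = embed(chunk)
        cur.execute(
            "INSERT INTO doc_chunks (content, embedding) VALUES (%s, %s)",
            (chunk, "[" + ",".join(str(x) for x in vec) + "]"),
        )
```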
Things we need to look for when embedding.
- embedding time
- size
- query time
Questions to ask when choosing a db system
- for my workflow, which vector storage is the best?
- how much data am I going to store?
- among storage, efficiency, connectivity, and cost, which matters most to me?
- what are the pros and cons of the different indexing strategies, query times, and schema designs?
What is pgvector?
- offers open source similarity search, vector storage, indexing and metadata
From pgvector’s GitHub, it has two index strategies:
- HNSW: An HNSW index creates a multilayer graph. It has better query performance than IVFFlat (in terms of speed-recall tradeoff), but has slower build times and uses more memory. Also, an index can be created without any data in the table since there isn’t a training step like IVFFlat.
- IVFFlat: An IVFFlat index divides vectors into lists, and then searches a subset of those lists that are closest to the query vector. It has faster build times and uses less memory than HNSW, but has lower query performance (in terms of speed-recall tradeoff).
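For reference, here is how those two index types would be created on the hypothetical doc_chunks table from the earlier sketch (cosine distance chosen arbitrarily as the metric; in practice you'd pick one of the two indexes, not both):

```python
import psycopg2

conn = psycopg2.connect("dbname=docs user=postgres")
with conn, conn.cursor() as cur:
    # HNSW: better speed-recall tradeoff, slower build, more memory;
    # no training step, so it can be created on an empty table.
    cur.execute("CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops)")
    # IVFFlat: faster build, less memory, lower query performance;
    # `lists` controls how many partitions the vectors are divided into.
    cur.execute(
        "CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops) "
        "WITH (lists = 100)"
    )
```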
Third session - ‘Successful data stories from the business’
Presented many cases of companies that, after integrating Snowflake into their operations, reduced costs and improved customer satisfaction, profits, data volumes, and customer service - companies like KFC, AT&T, Pfizer, and SCANIA.
We also got an overview of the kinds of services offered.
Basically, it was a sales pitch for Snowflake haha
Fourth session - ‘Improving GenAI search performance through Amazon OpenSearch’
What is Amazon OpenSearch? The most optimized search and log solution
- offers similarity search
- seamless data analysis and visualisation
- connects data sources
- low cost solution
- ease of use
Introduced vector search and TF-IDF scoring, and the process of taking a PDF, chunking it, embedding it, saving it, then finding passages similar to a query and retrieving them - using ANN algorithms such as Hierarchical Navigable Small World (HNSW) and IVF.
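From what I know of the OpenSearch API, an approximate k-NN query looks roughly like this (opensearch-py client; the endpoint, index name, and field name are placeholders, and the query vector would come from the embedding step):

```python
from opensearchpy import OpenSearch

# Placeholder endpoint; an Amazon OpenSearch domain would use HTTPS + auth.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# ANN search over a knn_vector field; OpenSearch builds the underlying
# structure (e.g. HNSW or IVF) when the index is created.
query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {                   # assumed knn_vector field name
                "vector": [0.1, 0.2, 0.3],   # the embedded user query
                "k": 5,
            }
        }
    },
}

results = client.search(index="doc-chunks", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("content"))
```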
Showed an example of semantic search:
- external model hosting and service connection
- creating an OpenSearch document collection using a neural (NN) search pipeline
- search request through the API gateway
- call the backend through the API gateway
- searching the backend for related documents and then returning them to the client
We saw a demo of using Amazon OpenSearch and Bedrock for text2sql - similar to my chat-with-your-PDFs project, but of course on a much higher level, using OpenSearch and Streamlit. Also, every query, response, and extra bit of metadata gets saved in the backend.
We also got a high-level view of what happens under the hood.
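I don't have the demo's code, but the core text2sql step presumably looks something like this with the Bedrock Converse API - hand the model a schema plus a question and get SQL back (the model ID, schema, and question are all made up):

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Made-up schema; the demo presumably pulled the real one from the database.
schema = "CREATE TABLE sales (id int, region text, amount numeric, sold_at date);"
question = "What was the total sales amount per region last month?"

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
    messages=[{
        "role": "user",
        "content": [{
            "text": f"Given this schema:\n{schema}\n"
                    f"Write one PostgreSQL query that answers: {question}\n"
                    "Return only the SQL."
        }],
    }],
)

sql = response["output"]["message"]["content"][0]["text"]
print(sql)  # the demo then ran the SQL and logged query + response + metadata
```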
Fifth session (from the business tech sessions, because there is no data analysis session in this time slot) - ‘Easy data lake creation using AWS’
What is a data lake? A place that stores all kinds of data (structured and unstructured)
Data lake vs Data warehouse

| Data lake | Data warehouse |
| --- | --- |
| Data saved in any format (raw) | Data saved in a fixed format |
| No schema needed | Schema needed |
| Lower trustworthiness | Higher trustworthiness |
| Better storage and volume cost | Fast and efficient queries |
Data lake key features (example - Amazon S3)
- data catalogue and searching
- user access
- data gathering
- manipulation and analysis
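The "catalogue and searching" part clicked for me with Athena on top of S3 - a rough sketch of querying a data lake through boto3, with the database, table, and results bucket all invented:

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Database and table come from the Glue Data Catalog; names are invented.
qid = athena.start_query_execution(
    QueryString="SELECT region, count(*) FROM events GROUP BY region",
    QueryExecutionContext={"Database": "my_lake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes (fine for a demo; real code would back off).
while True:
    status = athena.get_query_execution(QueryExecutionId=qid)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```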
There are also data lakehouses, which combine the benefits of both lakes and warehouses
Things to consider for data lakes in finance
- regulation compliance
- multi-user support
- user authentication and ease of use/maintenance
- authorization
- protection against attackers
- private network connection and traffic
Data lake security levels
- identity and access management
- application security
- data protection
- infrastructure security
- network security
- identity and access management (platform)
- multi-account management
AWS Lake Formation - AWS's managed service for setting up and securing a data lake
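Tying it to the authorization bullet above - as far as I understand it, Lake Formation centralizes grants like this (the role ARN and catalog names are invented):

```python
import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant an analyst role SELECT on one catalog table; all names/ARNs invented.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "my_lake", "Name": "events"}},
    Permissions=["SELECT"],
)
```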
Sixth session - ‘Using Amazon SageMaker Canvas for GenAI’
Data prep challenges
- too many different tools
- various code dependencies
- scalability
- operations management
Data prep/manipulation to get a clean version of the data for a model can take 60-80% of an ML project's time.
SageMaker Canvas has us covered with a new feature - chat for data prep:
- run code from plain-text prompts, edit/remove/add data, and get insights - the whole data prep process done with no code, ending with data ready for a model
After our data is ready, there are ready-to-use models for text, image, and audio content.
Finetuning? That can be done without writing any code as well: select a model (or up to 3) and the data, and we get training and validation results, plus extra info if we want to keep finetuning with Python.
Great day. I am glad I attended ^^
That is all for today!
See you tomorrow :)