(Day 224) Learning about Snowflake and starting the book - Deep Learning at Scale

Ivan Ivanov · August 12, 2024

Hello :) Today is Day 224!

A quick summary of today:

  • a bit more about k8s
  • read chapter 1 from Deep Learning at Scale
  • learn about Snowflake

Kubernetes for the Absolute Beginners - Hands-On

Deployments in K8s

In Kubernetes, deployments manage the deployment and scaling of applications. They ensure multiple instances of an application are running, allow seamless updates via rolling updates (updating instances one by one to avoid downtime), and support rollback if an update causes issues. Deployments also allow pausing and resuming changes to batch multiple updates together. A deployment creates a ReplicaSet, which in turn creates pods that run the application. Deployment configuration is defined in a YAML file similar to a ReplicaSet, with the key difference being the kind set to “Deployment.” Once applied, the deployment manages ReplicaSets and pods, ensuring the application runs as specified.
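
As a rough sketch, the YAML for such a deployment might look like this (the app name and container image here are just placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deployment
  labels:
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: nginx-container
          image: nginx:1.25

Applying it with kubectl apply -f deployment-definition.yml creates the Deployment, which in turn creates the ReplicaSet and its pods.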

image

Deployment Strategies

When a deployment is first created, it triggers a rollout, generating a deployment revision (e.g., revision one). Updating the application, such as changing the container version, triggers a new rollout and creates a subsequent revision (e.g., revision two). This versioning allows for easy tracking of changes and the ability to roll back if needed.

There are two deployment strategies:

  1. Recreate Strategy: This involves shutting down all instances of the application before deploying the new version, causing temporary downtime.
  2. Rolling Update Strategy (default): Instances are updated one by one, ensuring the application remains accessible during the upgrade.

Commands like kubectl rollout status and kubectl rollout history are used to check the progress of a rollout and to view its revision history.
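
For reference, the relevant commands look roughly like this (using the placeholder names from the sketch above):

kubectl create -f deployment-definition.yml          # first rollout, creates revision 1
kubectl apply -f deployment-definition.yml           # e.g. after changing the image, triggers a new rollout (revision 2)
kubectl rollout status deployment/myapp-deployment   # watch the progress of the current rollout
kubectl rollout history deployment/myapp-deployment  # list the deployment's revisions
kubectl rollout undo deployment/myapp-deployment     # roll back to the previous revision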

image

Deep Learning at Scale

This is one of the books on my to-read list on O'Reilly, so today I decided to start it. Below is a summary of chapter 1, which uses lots of references to nature to explain scalability.

Scaling in deep learning is a multidimensional challenge, generally consisting of three aspects:

  1. Generalizability: Enhancing the ability of models to generalize across diverse tasks or demographics.
  2. Training and Development: Improving model training techniques to reduce development time while meeting resource needs.
  3. Inference: Boosting the serviceability of the model during deployment and serving.

This book focuses primarily on the first two aspects, with inference being beyond its scope.

Six Development Considerations

When developing a deep learning solution, six key considerations are critical for scaling:

1. Well-Defined Problem

  • Understanding the Problem: Clearly define the problem you are solving and ensure you have the right data.
  • Example: Scaling a roof detection model from Sydney to global use requires defining if the task is just detecting roofs or also identifying roof styles.

2. Domain Knowledge (Constraints)

  • Domain-Specific Constraints: Understand how the problem space, like geographic differences in roof materials, impacts model development.

3. Ground Truth

  • Dataset Representation: Ensure the dataset represents global variability, balancing quality and generalization.
  • Example: Consider unique roofs, such as Scandinavian sod roofs, when curating data.

4. Model Development

  • Scaling Techniques: Develop models that can handle increased data and complexity as the task scales.
  • Questions: Plan experiments, adopt efficient training strategies, and consider distributed training when necessary.

5. Deployment

  • Deployment Strategy: Though not covered in this book, it’s essential to consider deployment constraints and performance from the early stages.

6. Feedback

  • Real-Time Feedback: Implement systems for collecting user feedback to continuously improve the model.
  • Example: Automate feedback collection to create a data engine that iteratively improves the system.

Scaling Considerations

Scaling is complex and requires careful planning and strategy. Key considerations include:

Questions to Ask Before Scaling

  • What are we scaling?
  • How will we measure success?
  • Do we really need to scale?
  • What are the ripple effects?
  • What are the constraints?
  • How will we scale?
  • Is our scaling technique optimal?

Characteristics of Scalable Systems

  1. Reliability: The system should recover gracefully from failures.
  2. Availability: Strive for high availability, ideally 99.999% uptime (roughly five minutes of downtime per year).
  3. Adaptability: Ensure the system can handle growing demands.
  4. Performance: Monitor and optimize resource consumption as the system scales.

Design Considerations for Scalable Systems

  • Avoid Single Points of Failure (SPOF): Minimize SPOFs to prevent complete system halts.
  • Design for High Availability: Implement reliability, resilience, and redundancy.
  • Scaling Paradigms: Utilize horizontal, vertical, or hybrid scaling based on the needs.
  • Communication: Choose between synchronous and asynchronous communication depending on system requirements.
  • Caching and Storage: Use caching to minimize expensive information retrieval processes.
  • Process State: Opt for stateless processes where possible to simplify scaling.
  • Graceful Recovery and Checkpointing: Implement checkpointing for efficient recovery from failures.
  • Maintainability and Observability: Monitor systems effectively to simplify maintenance and reduce downtime.

Scaling Effectively

  • Measure Twice, Cut Once: Apply careful planning, benchmarking, and iterative improvement to scale efficiently.
  • Efficiency: Efficient use of resources is critical, considering environmental and societal impacts of deep learning practices.

Then I found that Snowflake has a 1-month free trial and also a free 0-to-Snowflake in 90 minutes course

I saw that Snowflake sales engineers offer a 90-minute intro to Snowflake, and that the next session is on the 14th of Aug. I registered, then made a Snowflake account and tried out some of the functionality.

This is the homepage UI:

image

It seems that both SQL and Python can be used for data manipulation. I chose SQL for the default quick intro to Snowflake that is offered, and below are the queries.

  1. We create a table
CREATE OR REPLACE TABLE tasty_bytes_sample_data.raw_pos.menu
(
    menu_id NUMBER(19,0),
    menu_type_id NUMBER(38,0),
    menu_type VARCHAR(16777216),
    truck_brand_name VARCHAR(16777216),
    menu_item_id NUMBER(38,0),
    menu_item_name VARCHAR(16777216),
    item_category VARCHAR(16777216),
    item_subcategory VARCHAR(16777216),
    cost_of_goods_usd NUMBER(38,4),
    sale_price_usd NUMBER(38,4),
    menu_item_health_metrics_obj VARIANT
);
  2. Load data from an S3 bucket
---> create the Stage referencing the Blob location and CSV File Format
CREATE OR REPLACE STAGE tasty_bytes_sample_data.public.blob_stage
url = 's3://sfquickstarts/tastybytes/'
file_format = (type = csv);

---> query the Stage to find the Menu CSV file
LIST @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;
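
To actually load the staged CSV into the menu table, the quickstart then runs a COPY INTO along these lines (a sketch; the exact statement in the tutorial may differ slightly):

---> copy the Menu file into the Menu table
COPY INTO tasty_bytes_sample_data.raw_pos.menu
FROM @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;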

Basically, it is just normal SQL, and we can query and manipulate data and get basic visualisations:

SELECT TOP 10 * FROM tasty_bytes_sample_data.raw_pos.menu;

image

Next, I tried another tutorial: Create a Notebook to visualize data and insights

Snowflake Notebooks enable you to write and execute code in SQL, Python, or Markdown to create iterative flows, visualize results, and more.

Here is what I saw first:

image

On the left side, next to the notebook, there is an environment.yml, which seems to serve a similar purpose to requirements.txt:

name: app_environment
channels:
  - snowflake
dependencies:
  - matplotlib=3.7.2
  - scipy=1.10.1

We can install packages directly as well using the Package button

image

We can also execute Python or SQL in the notebook cells; it is up to us what we do and how we do it.

A cool thing we can do is:

  1. Run this in some cell (say cell4)
-- Generating a synthetic dataset of Snowboard products, along with their price and rating
SELECT CONCAT('SNOW-',UNIFORM(1000,9999, RANDOM())) AS PRODUCT_ID, 
       ABS(NORMAL(5, 3, RANDOM())) AS RATING, 
       ABS(NORMAL(750, 200::FLOAT, RANDOM())) AS PRICE
FROM TABLE(GENERATOR(ROWCOUNT => 100));
  2. Then in cell5, we can reference the SQL cell's result and turn it into a pandas DataFrame:
df = cell4.to_pandas()

And we have the data from the SQL query as a pandas df now 🤯
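
From there it is just regular pandas. A minimal sketch of poking at the synthetic data (matplotlib is already in environment.yml; the plot itself is just an example):

# Python cell: quick look at the generated products
import matplotlib.pyplot as plt

print(df[["RATING", "PRICE"]].describe())  # summary stats of the synthetic columns

plt.scatter(df["PRICE"], df["RATING"], alpha=0.5)
plt.xlabel("PRICE")
plt.ylabel("RATING")
plt.title("Synthetic snowboard products")
plt.show()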

We can also use Jinja syntax to reference Python variables in SQL queries:

threshold = 5
-- Reference the Python variable in SQL via Jinja
SELECT * FROM SNOW_CATALOG WHERE RATING > {{threshold}}

We can also refer to another cell's output using Jinja:

SELECT AVG(RATING) FROM {{cell23}}
WHERE RATING > 5

Where cell23 returns a table.

There is also an AI & ML Studio section

image

I am looking forward to the 0-to-Snowflake 90 minute course on Wednesday ^^

Build an ELT Pipeline (dbt, Snowflake, Airflow)

I wanted to watch a tutorial related to Snowflake and found this video, which uses dbt (which I know) and Airflow (which I have not used), so I decided to give it a go.

It covered:

  1. Setting up Snowflake (warehouse, roles, schemas, databases, users)
  2. Configuring dbt
  3. Setting up staging and source models
  4. dbt macros
  5. dbt generic and singular tests
  6. Orchestration using Airflow

Setting up Snowflake

-- Switch to the ACCOUNTADMIN role, which has full access to the Snowflake account
use role accountadmin;

-- Create a new warehouse named 'dbt_wh' with a size of 'x-small'
create warehouse dbt_wh with warehouse_size='x-small';

-- Create a database named 'dbt_db' if it does not already exist
create database if not exists dbt_db;

-- Create a role named 'dbt_role' if it does not already exist
create role if not exists dbt_role;

-- Display the grants (privileges) on the 'dbt_wh' warehouse
show grants on warehouse dbt_wh;

-- Grant the 'usage' privilege on the 'dbt_wh' warehouse to the 'dbt_role' role
grant usage on warehouse dbt_wh to role dbt_role;

-- Grant the 'dbt_role' role to the user 'divakaivan12'
grant role dbt_role to user divakaivan12;

-- Grant all privileges on the 'dbt_db' database to the 'dbt_role' role
grant all on database dbt_db to role dbt_role;

-- Switch to the 'dbt_role' role
use role dbt_role;

-- Create a schema named 'dbt_schema' within the 'dbt_db' database
create schema dbt_db.dbt_schema;
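
These are exactly the objects dbt will connect to. For reference, a dbt profiles.yml pointing at them could look roughly like this (a sketch, not the exact configuration from the video; the account identifier and password are placeholders, and the profile name has to match the profile set in dbt_project.yml):

# profiles.yml (sketch)
dbt_db:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account_identifier>
      user: divakaivan12
      password: <password>
      role: dbt_role
      warehouse: dbt_wh
      database: dbt_db
      schema: dbt_schema
      threads: 4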

Set up dbt

The data used is the TPC-H sample dataset provided by Snowflake.
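
The staging models below select from dbt sources. A minimal source definition matching them could look like this (a sketch: the source name tpch and the snowflake_sample_data location are assumptions, chosen to match the source() references in the models):

# models/staging/tpch_sources.yml (sketch)
version: 2

sources:
  - name: tpch
    database: snowflake_sample_data
    schema: tpch_sf1
    tables:
      - name: orders
      - name: lineitem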

  • stg_tpch_line_items.sql
select
    -- surrogate key: the Jinja expression was stripped by my blog's renderer; this is the usual dbt_utils pattern
    {{ dbt_utils.generate_surrogate_key(['l_orderkey', 'l_linenumber']) }} as order_item_key,
    l_orderkey as order_key,
    l_partkey as part_key,
    l_linenumber as line_number,
    l_quantity as quantity,
    l_extendedprice as extended_price,
    l_discount as discount_percentage,
    l_tax as tax_rate
from
    {{ source('tpch', 'lineitem') }} -- source name assumed; see the source definition sketch above
  • stg_tpch_orders.sql
select
    o_orderkey as order_key,
    o_custkey as customer_key,
    o_orderstatus as status_code,
    o_totalprice as total_price,
    o_orderdate as order_date
from
    {{ source('tpch', 'orders') }} -- source name assumed; Jinja stripped by my blog's renderer
  • int_order_items.sql
select
    line_item.order_item_key,
    line_item.part_key,
    line_item.line_number,
    line_item.extended_price,
    orders.order_key,
    orders.customer_key,
    orders.order_date,
    -- pricing macro call (Jinja stripped by my blog's renderer; macro name assumed, see the sketch below)
    {{ discounted_amount('line_item.extended_price', 'line_item.discount_percentage') }} as item_discount_amount
from
    {{ ref('stg_tpch_orders') }} as orders
join
    {{ ref('stg_tpch_line_items') }} as line_item
        on orders.order_key = line_item.order_key
order by
    orders.order_date
  • int_order_items_summary.sql
select
    order_key,
    sum(extended_price) as gross_item_sales_amount,
    sum(item_discount_amount) as item_discount_amount
from
    {{ ref('int_order_items') }}
group by
    order_key
  • fct_orders.sql
select
    orders.*,
    order_item_summary.gross_item_sales_amount,
    order_item_summary.item_discount_amount
from
    {{ ref('stg_tpch_orders') }} as orders -- ref assumed; Jinja stripped by my blog's renderer
join
    {{ ref('int_order_items_summary') }} as order_item_summary
        on orders.order_key = order_item_summary.order_key
order by
    order_date
  • pricing macro*

image

(this has to be a picture because of a Jinja setup in my blog that tries to execute the macro)
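
Since the macro itself can only appear as a picture here, below is a purely illustrative sketch of what a discount-style pricing macro can look like in dbt (the actual macro in the video may be named and written differently):

{% macro discounted_amount(extended_price, discount_percentage, scale=2) %}
    (-1 * {{ extended_price }} * {{ discount_percentage }})::decimal(16, {{ scale }})
{% endmacro %}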

  • singular tests on specific fields
select
    *
from
    {{ ref('fct_orders') }} -- ref assumed; Jinja stripped by my blog's renderer
where
    date(order_date) > CURRENT_DATE()
    or date(order_date) < date('1990-01-01')

and

select
    *
from
    {{ ref('fct_orders') }} -- ref assumed
where
    item_discount_amount > 0
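
The video also covered generic tests, which are declared in YAML rather than written as SQL. A minimal sketch of what that could look like for fct_orders (column names taken from the models above; the exact tests in the video may differ):

# models/marts/generic_tests.yml (sketch)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('stg_tpch_orders')
              field: order_key
      - name: status_code
        tests:
          - accepted_values:
              values: ['P', 'O', 'F']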

Running the pipeline in Airflow

Airflow was set up with the Astro CLI and astronomer-cosmos: astro dev init scaffolds the project and astro dev start starts Airflow.

To include the dbt pipeline I created, all I needed to do was put the dbt project directory into the Astro project's dags folder, and I could see it in Airflow.
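
Under the hood, cosmos turns the dbt project into Airflow tasks via a DbtDag. A rough sketch of what that DAG file can look like (the project path, connection id, and schedule here are assumptions, not the exact code from the video):

# dags/dbt_dag.py (sketch)
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Map an Airflow Snowflake connection to a dbt profile that cosmos generates at runtime
profile_config = ProfileConfig(
    profile_name="dbt_db",
    target_name="dev",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",  # Airflow connection id (assumed)
        profile_args={"database": "dbt_db", "schema": "dbt_schema"},
    ),
)

# cosmos renders every dbt model and test as its own Airflow task inside this DAG
dbt_snowflake_dag = DbtDag(
    dag_id="dbt_dag",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/data_pipeline"),  # path assumed
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2024, 8, 1),
    catchup=False,
)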

image

Here is the data lineage from Airflow:

image

After an initial failed run due to some access errors, the pipeline ran successfully

image

All the code I wrote is on my repo.

That is all for today!

See you tomorrow :)