(Day 224) Learning about Snowflake and starting the book - Deep Learning at Scale

Ivan Ivanov · August 12, 2024

Hello :) Today is Day 224!

A quick summary of today:

  • a bit more about k8s
  • read chapter 1 from Deep Learning at Scale
  • learn about Snowflake

Kubernetes for the Absolute Beginners - Hands-On

Deployments in K8s

In Kubernetes, deployments manage the deployment and scaling of applications. They ensure multiple instances of an application are running, allow seamless updates via rolling updates (updating instances one by one to avoid downtime), and support rollback if an update causes issues. Deployments also allow pausing and resuming changes to batch multiple updates together. A deployment creates a ReplicaSet, which in turn creates pods that run the application. Deployment configuration is defined in a YAML file similar to a ReplicaSet, with the key difference being the kind set to “Deployment.” Once applied, the deployment manages ReplicaSets and pods, ensuring the application runs as specified.
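
As a rough sketch, the YAML for such a deployment might look like this (the app name and container image here are just placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deployment
  labels:
    app: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: nginx-container
          image: nginx:1.25

Applying it with kubectl apply -f deployment-definition.yml creates the Deployment, which in turn creates the ReplicaSet and its pods.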

image

Deployment Strategies

When a deployment is first created, it triggers a rollout, generating a deployment revision (e.g., revision one). Updating the application, such as changing the container version, triggers a new rollout and creates a subsequent revision (e.g., revision two). This versioning allows for easy tracking of changes and the ability to roll back if needed.

There are two deployment strategies:

  1. Recreate Strategy: This involves shutting down all instances of the application before deploying the new version, causing temporary downtime.
  2. Rolling Update Strategy (default): Instances are updated one by one, ensuring the application remains accessible during the upgrade.

Commands like kubectl rollout status and kubectl rollout history are used to check the progress of a rollout and to view its revision history.
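
For reference, the relevant commands look roughly like this (using the placeholder names from the sketch above):

kubectl create -f deployment-definition.yml          # first rollout, creates revision 1
kubectl apply -f deployment-definition.yml           # e.g. after changing the image, triggers a new rollout (revision 2)
kubectl rollout status deployment/myapp-deployment   # watch the progress of the current rollout
kubectl rollout history deployment/myapp-deployment  # list the deployment's revisions
kubectl rollout undo deployment/myapp-deployment     # roll back to the previous revision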

image

Deep Learning at Scale

This is one of the books on my to-read list on O'Reilly, so today I decided to start it. Below is a summary of chapter 1, which uses lots of references to nature to explain scalability.

Scaling in deep learning is a multidimensional challenge, generally consisting of three aspects:

  1. Generalizability: Enhancing the ability of models to generalize across diverse tasks or demographics.
  2. Training and Development: Improving model training techniques to reduce development time while meeting resource needs.
  3. Inference: Boosting the serviceability of the model during deployment and serving.

This book focuses primarily on the first two aspects, with inference being beyond its scope.

Six Development Considerations

When developing a deep learning solution, six key considerations are critical for scaling:

1. Well-Defined Problem

  • Understanding the Problem: Clearly define the problem you are solving and ensure you have the right data.
  • Example: Scaling a roof detection model from Sydney to global use requires defining if the task is just detecting roofs or also identifying roof styles.

2. Domain Knowledge (Constraints)

  • Domain-Specific Constraints: Understand how the problem space, like geographic differences in roof materials, impacts model development.

3. Ground Truth

  • Dataset Representation: Ensure the dataset represents global variability, balancing quality and generalization.
  • Example: Consider unique roofs, such as Scandinavian sod roofs, when curating data.

4. Model Development

  • Scaling Techniques: Develop models that can handle increased data and complexity as the task scales.
  • Questions: Plan experiments, adopt efficient training strategies, and consider distributed training when necessary.

5. Deployment

  • Deployment Strategy: Though not covered in this book, it’s essential to consider deployment constraints and performance from the early stages.

6. Feedback

  • Real-Time Feedback: Implement systems for collecting user feedback to continuously improve the model.
  • Example: Automate feedback collection to create a data engine that iteratively improves the system.

Scaling Considerations

Scaling is complex and requires careful planning and strategy. Key considerations include:

Questions to Ask Before Scaling

  • What are we scaling?
  • How will we measure success?
  • Do we really need to scale?
  • What are the ripple effects?
  • What are the constraints?
  • How will we scale?
  • Is our scaling technique optimal?

Characteristics of Scalable Systems

  1. Reliability: The system should recover gracefully from failures.
  2. Availability: Strive for high availability, ideally 99.999% uptime (roughly five minutes of downtime per year).
  3. Adaptability: Ensure the system can handle growing demands.
  4. Performance: Monitor and optimize resource consumption as the system scales.

Design Considerations for Scalable Systems

  • Avoid Single Points of Failure (SPOF): Minimize SPOFs to prevent complete system halts.
  • Design for High Availability: Implement reliability, resilience, and redundancy.
  • Scaling Paradigms: Utilize horizontal, vertical, or hybrid scaling based on the needs.
  • Communication: Choose between synchronous and asynchronous communication depending on system requirements.
  • Caching and Storage: Use caching to minimize expensive information retrieval processes.
  • Process State: Opt for stateless processes where possible to simplify scaling.
  • Graceful Recovery and Checkpointing: Implement checkpointing for efficient recovery from failures.
  • Maintainability and Observability: Monitor systems effectively to simplify maintenance and reduce downtime.

Scaling Effectively

  • Measure Twice, Cut Once: Apply careful planning, benchmarking, and iterative improvement to scale efficiently.
  • Efficiency: Efficient use of resources is critical, considering environmental and societal impacts of deep learning practices.

Then I found that Snowflake has a 1-month free trial and also a free 0-to-Snowflake in 90 minutes course

I saw that Snowflake sales engineers offer a 90-minute intro to Snowflake, and that the next session is on the 14th of Aug. I registered, then made a Snowflake account and tried out some of the functionality.

This is the homepage UI:

image

It seems that both SQL and Python can be used for data manipulation. I chose SQL for the default quick intro to Snowflake that is offered, and below are the queries.

  1. We create a table
CREATE OR REPLACE TABLE tasty_bytes_sample_data.raw_pos.menu
(
    menu_id NUMBER(19,0),
    menu_type_id NUMBER(38,0),
    menu_type VARCHAR(16777216),
    truck_brand_name VARCHAR(16777216),
    menu_item_id NUMBER(38,0),
    menu_item_name VARCHAR(16777216),
    item_category VARCHAR(16777216),
    item_subcategory VARCHAR(16777216),
    cost_of_goods_usd NUMBER(38,4),
    sale_price_usd NUMBER(38,4),
    menu_item_health_metrics_obj VARIANT
);
  2. Load data from an S3 bucket
---> create the Stage referencing the Blob location and CSV File Format
CREATE OR REPLACE STAGE tasty_bytes_sample_data.public.blob_stage
url = 's3://sfquickstarts/tastybytes/'
file_format = (type = csv);

---> query the Stage to find the Menu CSV file
LIST @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;
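
To actually load the staged CSV into the menu table, the quickstart then runs a COPY INTO along these lines (a sketch; the exact statement in the tutorial may differ slightly):

---> copy the Menu file into the Menu table
COPY INTO tasty_bytes_sample_data.raw_pos.menu
FROM @tasty_bytes_sample_data.public.blob_stage/raw_pos/menu/;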

Basically, it is just normal SQL, and we can query and manipulate data and get basic visualisations:

SELECT TOP 10 * FROM tasty_bytes_sample_data.raw_pos.menu;

image

Next, I tried another tutorial: Create a Notebook to visualize data and insights

Snowflake Notebooks enable you to write and execute code in SQL, Python, or Markdown to create iterative flows, visualize results, and more.

Here is what I saw first:

image

On the left side, next to the notebook, there is an environment.yml, which seems to serve a similar purpose to requirements.txt:

name: app_environment
channels:
  - snowflake
dependencies:
  - matplotlib=3.7.2
  - scipy=1.10.1

We can install packages directly as well using the Package button

image

We can also execute Python or SQL in the notebook cells; it is up to us what we do and how we do it.

A cool thing we can do is:

  1. Run this in some cell (say cell4)
-- Generating a synthetic dataset of Snowboard products, along with their price and rating
SELECT CONCAT('SNOW-',UNIFORM(1000,9999, RANDOM())) AS PRODUCT_ID, 
       ABS(NORMAL(5, 3, RANDOM())) AS RATING, 
       ABS(NORMAL(750, 200::FLOAT, RANDOM())) AS PRICE
FROM TABLE(GENERATOR(ROWCOUNT => 100));
  2. Then in cell5, we can reference the SQL cell's result and turn it into a pandas DataFrame:
df = cell4.to_pandas()

And we have the data from the SQL query as a pandas df now 🤯
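
From there it is just regular pandas. A minimal sketch of poking at the synthetic data (matplotlib is already in environment.yml; the plot itself is just an example):

# Python cell: quick look at the generated products
import matplotlib.pyplot as plt

print(df[["RATING", "PRICE"]].describe())  # summary stats of the synthetic columns

plt.scatter(df["PRICE"], df["RATING"], alpha=0.5)
plt.xlabel("PRICE")
plt.ylabel("RATING")
plt.title("Synthetic snowboard products")
plt.show()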

We can also use Jinja syntax to reference Python variables in SQL queries:

threshold = 5
-- Reference the Python variable in SQL via Jinja
SELECT * FROM SNOW_CATALOG WHERE RATING > {{threshold}}

We can also refer to another cell's output using Jinja:

SELECT AVG(RATING) FROM {{cell23}}
WHERE RATING > 5

Where cell23 returns a table.

There is also an AI & ML Studio section

image

I am looking forward to the 0-to-Snowflake 90 minute course on Wednesday ^^

Build an ELT Pipeline (dbt, Snowflake, Airflow)

I wanted to watch a tutorial related to Snowflake and found this video, which uses dbt (which I know) and Airflow (which I have not used), so I decided to give it a go.

It covered:

  1. Setting up Snowflake (warehouse, roles, schemas, databases, users)
  2. Configuring dbt
  3. Setting up staging and source models
  4. dbt macros
  5. dbt generic and singular tests
  6. Orchestration using Airflow

Setting up Snowflake

-- Switch to the ACCOUNTADMIN role, which has full access to the Snowflake account
use role accountadmin;

-- Create a new warehouse named 'dbt_wh' with a size of 'x-small'
create warehouse dbt_wh with warehouse_size='x-small';

-- Create a database named 'dbt_db' if it does not already exist
create database if not exists dbt_db;

-- Create a role named 'dbt_role' if it does not already exist
create role if not exists dbt_role;

-- Display the grants (privileges) on the 'dbt_wh' warehouse
show grants on warehouse dbt_wh;

-- Grant the 'usage' privilege on the 'dbt_wh' warehouse to the 'dbt_role' role
grant usage on warehouse dbt_wh to role dbt_role;

-- Grant the 'dbt_role' role to the user 'divakaivan12'
grant role dbt_role to user divakaivan12;

-- Grant all privileges on the 'dbt_db' database to the 'dbt_role' role
grant all on database dbt_db to role dbt_role;

-- Switch to the 'dbt_role' role
use role dbt_role;

-- Create a schema named 'dbt_schema' within the 'dbt_db' database
create schema dbt_db.dbt_schema;
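
These are exactly the objects dbt will connect to. For reference, a dbt profiles.yml pointing at them could look roughly like this (a sketch, not the exact configuration from the video; the account identifier and password are placeholders, and the profile name has to match the profile set in dbt_project.yml):

# profiles.yml (sketch)
dbt_db:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account_identifier>
      user: divakaivan12
      password: <password>
      role: dbt_role
      warehouse: dbt_wh
      database: dbt_db
      schema: dbt_schema
      threads: 4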

Set up dbt

The data used is the TPC-H sample dataset provided by Snowflake.
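
The staging models below select from dbt sources. A minimal source definition matching them could look like this (a sketch: the source name tpch and the snowflake_sample_data location are assumptions, chosen to match the source() references in the models):

# models/staging/tpch_sources.yml (sketch)
version: 2

sources:
  - name: tpch
    database: snowflake_sample_data
    schema: tpch_sf1
    tables:
      - name: orders
      - name: lineitem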

  • stg_tpch_line_items.sql
select
    -- surrogate key: the Jinja expression was stripped by my blog's renderer; this is the usual dbt_utils pattern
    {{ dbt_utils.generate_surrogate_key(['l_orderkey', 'l_linenumber']) }} as order_item_key,
    l_orderkey as order_key,
    l_partkey as part_key,
    l_linenumber as line_number,
    l_quantity as quantity,
    l_extendedprice as extended_price,
    l_discount as discount_percentage,
    l_tax as tax_rate
from
    {{ source('tpch', 'lineitem') }} -- source name assumed; see the source definition sketch above
  • stg_tpch_orders.sql
select
    o_orderkey as order_key,
    o_custkey as customer_key,
    o_orderstatus as status_code,
    o_totalprice as total_price,
    o_orderdate as order_date
from
    {{ source('tpch', 'orders') }} -- source name assumed; Jinja stripped by my blog's renderer
  • int_order_items.sql
select
    line_item.order_item_key,
    line_item.part_key,
    line_item.line_number,
    line_item.extended_price,
    orders.order_key,
    orders.customer_key,
    orders.order_date,
    -- pricing macro call (Jinja stripped by my blog's renderer; macro name assumed, see the sketch below)
    {{ discounted_amount('line_item.extended_price', 'line_item.discount_percentage') }} as item_discount_amount
from
    {{ ref('stg_tpch_orders') }} as orders
join
    {{ ref('stg_tpch_line_items') }} as line_item
        on orders.order_key = line_item.order_key
order by
    orders.order_date
  • int_order_items_summary.sql
select
    order_key,
    sum(extended_price) as gross_item_sales_amount,
    sum(item_discount_amount) as item_discount_amount
from
    {{ ref('int_order_items') }}
group by
    order_key
  • fct_orders.sql
select
    orders.*,
    order_item_summary.gross_item_sales_amount,
    order_item_summary.item_discount_amount
from
    {{ ref('stg_tpch_orders') }} as orders -- ref assumed; Jinja stripped by my blog's renderer
join
    {{ ref('int_order_items_summary') }} as order_item_summary
        on orders.order_key = order_item_summary.order_key
order by
    order_date
  • pricing macro*

image

(this has to be a picture because of a Jinja setup in my blog that tries to execute the macro)
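
Since the macro itself can only appear as a picture here, below is a purely illustrative sketch of what a discount-style pricing macro can look like in dbt (the actual macro in the video may be named and written differently):

{% macro discounted_amount(extended_price, discount_percentage, scale=2) %}
    (-1 * {{ extended_price }} * {{ discount_percentage }})::decimal(16, {{ scale }})
{% endmacro %}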

  • singular tests on specific fields
select
    *
from
    {{ ref('fct_orders') }} -- ref assumed; Jinja stripped by my blog's renderer
where
    date(order_date) > CURRENT_DATE()
    or date(order_date) < date('1990-01-01')

and

select
    *
from
    {{ ref('fct_orders') }} -- ref assumed
where
    item_discount_amount > 0
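
The video also covered generic tests, which are declared in YAML rather than written as SQL. A minimal sketch of what that could look like for fct_orders (column names taken from the models above; the exact tests in the video may differ):

# models/marts/generic_tests.yml (sketch)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('stg_tpch_orders')
              field: order_key
      - name: status_code
        tests:
          - accepted_values:
              values: ['P', 'O', 'F']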

Running the pipeline in Airflow

Airflow was set up with the Astro CLI and astronomer-cosmos: astro dev init scaffolds the project and astro dev start starts Airflow.

To include the dbt pipeline I created, all I needed to do was put the dbt project directory into the Astro project's dags folder, and I could see it in Airflow.
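
Under the hood, cosmos turns the dbt project into Airflow tasks via a DbtDag. A rough sketch of what that DAG file can look like (the project path, connection id, and schedule here are assumptions, not the exact code from the video):

# dags/dbt_dag.py (sketch)
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Map an Airflow Snowflake connection to a dbt profile that cosmos generates at runtime
profile_config = ProfileConfig(
    profile_name="dbt_db",
    target_name="dev",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",  # Airflow connection id (assumed)
        profile_args={"database": "dbt_db", "schema": "dbt_schema"},
    ),
)

# cosmos renders every dbt model and test as its own Airflow task inside this DAG
dbt_snowflake_dag = DbtDag(
    dag_id="dbt_dag",
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/data_pipeline"),  # path assumed
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2024, 8, 1),
    catchup=False,
)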

image

Here is the data lineage from Airflow:

image

After an initial failed run due to some access errors, the pipeline ran successfully

image

All the code I wrote is on my repo.

That is all for today!

See you tomorrow :)