Hello :) Today is Day 324!
A quick summary of today:
- 1:1 chat with Zach Wilson (trans-continental video chat)
- first homework submitted
- small company reviewer update
- debugging SQL
- graph data modelling in Postgres
I got a free spot at Zach Wilson’s 2025 data engineering bootcamp 🤯
This morning, at 6:30am, I had a 30-min call with Zach. Why 6:30am? Well, he is in Utah, US and I am in South Korea, and since he is a busy person with a fixed schedule, I picked the latest possible slot for me. Thankfully, I was not sleepy during the chat - he is very energetic and I got some of his energy 😆.
We chatted about bootcamps - how the free bootcamp is going, other bootcamps out there - what I'm doing, and also a bit about this blog. He said a lot of positive things about this blog, and I appreciate that because it does not really reach a big audience and I am using it on my CV as well ~ so it was nice hearing a compliment about this blog journey :)
Anyway ~ at the end, because I am so active on his free bootcamp's Discord (helping others, having discussions about the material, etc.), he said he can offer me a free spot in his paid bootcamp in January 2025. I mean… it's like a dream. Learning from Zach Wilson's best content… I'm really just speechless. There is also so much content on his platform, and I will be able to learn sooo much from the bootcamp as well. Super excited for January 2025!!! I'm over the moon 🥳🥳🥳🥳🥳
Since there is other content on the platform, I will watch some of it before the bootcamp. HOWEVER, I will not talk about it at all - just like I did for Stanford's XCS ML for graphs course - it is behind a paywall and nothing can go out to the public. If I watch a video about PySpark (for example), I will just share 'I watched a video about PySpark', as I do not want to break the platform's policies.
First homework for the free bootcamp submitted
Here is some feedback from the AI grader:
Company reviewer LLM update
All the above models use 15 training epochs. I decided to take the best of them (llama3.2-3b) and train it for 20 epochs, but the result was bad. I guess it starts to overfit, so maybe 15 is the sweet spot, though 10 epochs is worth exploring as well (and of course other hyperparameters need to be considered too). Also, I got an email today from OpenAI that I have access to their o1-preview and o1-mini models, so I will be fine-tuning those as well soon-ish.
The prompt tuning failed again, so I looked online, found some examples, and decided to completely 'send it' by using 50 training epochs. The training time is 24 hours 😆. If it overfits - fine; if it doesn't… well, we go back to the drawing board.
Debugging SQL
insert into actors_test
with years as (
    -- one row per calendar year in the range we care about
    select *
    from generate_series(1970, 2020) as year
), first_year as (
    -- each actor's first release year
    select actor, actorid, min(year) as first_release_year
    from public.actor_films
    group by actor, actorid
), actor_all_film_release_years as (
    -- one row per actor per year, from their first release year onwards
    select *
    from first_year fy
    join years y
        on fy.first_release_year <= y.year
), yearly_released_films as (
    select afy.actor,
        afy.actorid,
        afy.year as current_year,
        array_remove(
            array_agg(
                case when af.year is not null then
                    row(af.film,
                        af.votes,
                        af.rating,
                        af.filmid)::films
                end
            ) over (partition by afy.actorid
                    order by af.year -- af.year, not coalesce(afy.year, af.year); see the notes below
            ), null) as films,
        round(
            avg(case when af.year is not null then af.rating end)
                over (partition by afy.actorid
                      order by af.year)::numeric, 2) as avg_rating_year,
        case when af.year is not null then true else false end as is_active
    from actor_all_film_release_years afy
    left join actor_films af
        on afy.actorid = af.actorid
        and afy.year = af.year
    order by afy.actorid, afy.year
)
select distinct -- the distinct matters here; see the notes below
    actorid,
    actor,
    films,
    case when avg_rating_year > 8 then 'star'
        when avg_rating_year > 7 and avg_rating_year <= 8 then 'good'
        when avg_rating_year > 6 and avg_rating_year <= 7 then 'average'
        else 'bad' end::quality_class as quality_class,
    current_year,
    is_active
from yearly_released_films
order by actorid;
So… this query is used to populate an actors table in an 'efficient' way. Whether the result is actually efficient is not the important part of this particular write-up - it's about the query itself.
Another student messaged me with a similar query, saying they get duplicate rows in some cases. I think debugging cases like this really helps me understand the errors and what to expect from certain queries. TLDR: the above query works, and the differences compared to the query that does not work are the DISTINCT in the final select and - what took me the most time to grasp - the window functions ordering by af.year instead of coalesce(afy.year, af.year). I think it comes down to the left join. After the left join, the af.year column is often not matched, because afy.year is a generated series (one row for every year from the actor's first release year onwards) and may not align with the actual years in af.year. Since afy.year is derived from GENERATE_SERIES and is never NULL, COALESCE(afy.year, af.year) always resolves to afy.year. That gives incorrect ordering inside the window partitions, because the actual release years (af.year) are ignored. Consequently, the calculated aggregates and film arrays do not correctly follow the chronological order of film releases.
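To make the coalesce point concrete, here is a minimal sketch with made-up toy data (not the bootcamp tables). After a left join against a generated series, the series year is never NULL, so the coalesce always resolves to it, while the actual film year stays NULL for years with no release:
-- toy illustration: series_year plays the role of afy.year, film_year of af.year
with years as (
    select generate_series(2000, 2003) as year
), films(year, film) as (
    values (2000, 'film_a'), (2002, 'film_b')
)
select
    y.year as series_year, -- never null (comes from the generated series)
    f.year as film_year, -- null for 2001 and 2003 (no release those years)
    coalesce(y.year, f.year) as coalesced_year -- always equals series_year
from years y
left join films f on f.year = y.year
order by y.year;
So a window ordered by the coalesced value is effectively ordered by the series year, which ignores when the films actually came out.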
Lab 3 - graph data modelling
I covered lab 3 today, related to modelling data as nodes (or vertices, as they are called in DE, I guess) and edges.
Here is my code from the lab:
-- build a graph data model to see which NBA players play with each other
-- and which players play for what team at each time
create type vertex_type as enum ('player', 'team', 'game');
create table vertices (
identifier text,
type vertex_type,
properties json, -- postgres doesn't have a MAP type
primary key (identifier, type)
);
create type edge_type as
enum ('plays_against', 'shares_team', 'plays_in', 'plays_on');
create table edges (
subject_identifier text,
subject_type vertex_type,
object_identifier text,
object_type vertex_type,
edge_type edge_type,
properties json,
-- specific id cols might be included and used for the PK
-- for this example this PK is fine
primary key (subject_identifier,
subject_type,
object_identifier,
object_type,
edge_type)
);
insert into vertices
select
game_id as identifier,
'game'::vertex_type as type,
json_build_object(
'pts_home', pts_home,
'pts_away', pts_away,
'winning_team', case when home_team_wins = 1 then home_team_id else visitor_team_id end
) as properties
from games;
insert into vertices
with players_agg as (
select
player_id as identifier,
max(player_name) as player_name,
count(1) as number_of_games,
sum(pts) as total_points,
array_agg(distinct team_id) as teams
from game_details
group by 1
)
select identifier, 'player'::vertex_type,
json_build_object(
'player_name', player_name,
'number_of_games', number_of_games,
'total_points', total_points,
'teams', teams
)
from players_agg;
insert into vertices
with teams_deduped as (
select *, row_number() over (partition by team_id) as row_num
from teams
)
select
team_id as identifier,
'team'::vertex_type as type,
json_build_object(
'abbreviation', abbreviation,
'nickname', nickname,
'city', city,
'arena', arena,
'year_founded', yearfounded
)
from teams_deduped
where row_num = 1;
select type, count(1)
from vertices
group by 1;
-- 30 team, 9384 game, 1496 player
insert into edges
with deduped as (
select *, row_number() over (partition by player_id, game_id) as row_num
from game_details
)
select
player_id as subject_identifier,
'player'::vertex_type as subject_type,
game_id as object_identifier,
'game'::vertex_type as object_type,
'plays_in'::edge_type as edge_type,
json_build_object(
'start_position', start_position,
'pts', pts, -- this is treated as a str so later we need to convert it to int if we want to use it
'team_id', team_id,
'team_abbreviation', team_abbreviation
) as properties
from deduped
where row_num = 1;
select
v.properties->>'player_name',
max(cast(e.properties->>'pts' as integer))
from vertices v
join edges e
on e.subject_identifier = v.identifier
and e.subject_type = v.type
group by 1
order by 2 desc;
insert into edges
with deduped as (
select *, row_number() over (partition by player_id, game_id) as row_num
from game_details
),
filtered as (
select * from deduped
where row_num = 1
),
aggregated as (
select
f1.player_id as subject_player_id,
f2.player_id as object_player_id,
case when f1.team_abbreviation = f2.team_abbreviation
then 'shares_team'::edge_type
else 'plays_against'::edge_type
end as edge_type,
max(f1.player_name) as subject_player_name, -- maybe they changed their name
max(f2.player_name) as object_player_name,
count(1) as num_games,
sum(f1.pts) as subject_points,
sum(f2.pts) as object_points
from filtered f1 join filtered f2
on f1.game_id = f2.game_id
and f1.player_name <> f2.player_name
where f1.player_name > f2.player_name -- remove double edges
group by f1.player_id,
f2.player_id,
case when f1.team_abbreviation = f2.team_abbreviation
then 'shares_team'::edge_type
else 'plays_against'::edge_type
end
)
select
subject_player_id as subject_identifier,
'player'::vertex_type as subject_type,
object_player_id as object_identifier,
'player'::vertex_type as object_type,
edge_type as edge_type,
json_build_object(
'num_games', num_games,
'subject_points', subject_points,
'object_points', object_points
)
from aggregated;
-- we can calculate avg points, points when X plays with Y
-- or points when X plays vs Y, etc.
select
    v.properties->>'player_name',
    e.object_identifier,
    -- average points per game for the subject player (guard against zero games)
    cast(v.properties->>'total_points' as real) /
        case when cast(v.properties->>'number_of_games' as real) = 0 then 1
        else cast(v.properties->>'number_of_games' as real) end,
    e.properties->>'subject_points',
    e.properties->>'num_games'
from vertices v join edges e
    on v.identifier = e.subject_identifier
    and v.type = e.subject_type
where e.object_type = 'player'::vertex_type;
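Following the comment above about points when X plays with (or against) Y, here is a quick sketch of a query I might run next. It assumes the same vertices and edges tables defined in this lab and simply derives a per-game scoring rate from each player-player edge:
-- per-edge scoring rate: the subject player's points per game across the games
-- shared with (shares_team) or played against (plays_against) the other player
select
    v.properties->>'player_name' as subject_player,
    e.object_identifier as other_player_id,
    e.edge_type,
    cast(e.properties->>'subject_points' as real) /
        greatest(cast(e.properties->>'num_games' as real), 1) as subject_pts_per_game -- defensive guard; num_games is always >= 1
from vertices v
join edges e
    on v.identifier = e.subject_identifier
    and v.type = e.subject_type
where e.object_type = 'player'::vertex_type
order by subject_pts_per_game desc;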
That is all for today!
See you tomorrow :)