Hello :) Today is Day 324!
A quick summary of today:
- 1:1 chat with Zach Wilson (trans-continental video chat)
- first homework submitted
- small company reviewer update
- debugging SQL
- graph data modelling in Postgres
I got a free spot at Zach Wilson’s 2025 data engineering bootcamp 🤯
This morning, at 6:30am, I had a 30-min call with Zach. Why 6:30am? Well, he is in Utah, US and I am in South Korea, and since he is a busy person with a fixed schedule, I picked the latest possible slot for me. Thankfully, I was not sleepy during the chat - he is very energetic and I got some of his energy 😆.
We chatted about bootcamps - how the free bootcamp is going, other bootcamps out there - what I'm doing, and also a bit about this blog. He said a lot of positive things about this blog, and I appreciate that because it does not really reach a big audience and I am using it on my CV as well ~ so it was nice hearing a compliment about this blog journey :)
Anyway ~ at the end, because I am so active on his free bootcamp's Discord (helping others, having discussions about the material, etc.), he said he can offer me a free spot in his paid bootcamp in January 2025. I mean… it's like a dream. Learning from Zach Wilson's best content… I'm really just speechless. There is also so much content on his platform, and I will be able to learn sooo much from the bootcamp as well. Super excited for January 2025!!! I'm over the moon 🥳🥳🥳🥳🥳
Since there is other content on the platform, I will watch some of it before the bootcamp. HOWEVER, I will not talk about it at all - just like I did for Stanford's XCS ML for graphs course - it is behind a paywall and nothing can go out to the public. If I watch a video about PySpark (for example), I will just share 'I watched a video about PySpark', as I do not want to break the platform's policies.
First homework for the free bootcamp submitted
Here is some feedback from the AI grader:
Company reviewer LLM update
All the above models use 15 training epochs. I decided to take the best of them (llama3.2-3b) and train it for 20 epochs, but the result was bad. I guess it starts to overfit, so maybe 15 is the sweet spot, though 10 epochs is worth exploring as well (and of course other hyperparameters need to be considered too). Also, I got an email today from OpenAI that I have access to their o1-preview and o1-mini models, so I will be fine-tuning those as well soon-ish.
The prompt tuning failed again, so I looked online, found some examples, and decided to completely 'send it' by using 50 training epochs. The training time is 24 hours 😆. If it overfits - fine; if it doesn't… well, we go back to the drawing board.
Debugging SQL
insert into actors_test
with years as (
    -- one row per calendar year in the range we care about
    select *
    from generate_series(1970, 2020) as year
), first_year as (
    -- each actor's first release year
    select actor, actorid, min(year) as first_release_year
    from public.actor_films
    group by actor, actorid
), actor_all_film_release_years as (
    -- one row per actor per year, from their first release year onwards
    select *
    from first_year fy
    join years y
        on fy.first_release_year <= y.year
), yearly_released_films as (
    select afy.actor,
        afy.actorid,
        afy.year as current_year,
        array_remove(
            array_agg(
                case when af.year is not null then
                    row(af.film,
                        af.votes,
                        af.rating,
                        af.filmid)::films
                end
            ) over (partition by afy.actorid
                    order by af.year -- af.year, not coalesce(afy.year, af.year); see the notes below
            ), null) as films,
        round(
            avg(case when af.year is not null then af.rating end)
                over (partition by afy.actorid
                      order by af.year)::numeric, 2) as avg_rating_year,
        case when af.year is not null then true else false end as is_active
    from actor_all_film_release_years afy
    left join actor_films af
        on afy.actorid = af.actorid
        and afy.year = af.year
    order by afy.actorid, afy.year
)
select distinct -- the distinct matters here; see the notes below
    actorid,
    actor,
    films,
    case when avg_rating_year > 8 then 'star'
        when avg_rating_year > 7 and avg_rating_year <= 8 then 'good'
        when avg_rating_year > 6 and avg_rating_year <= 7 then 'average'
        else 'bad' end::quality_class as quality_class,
    current_year,
    is_active
from yearly_released_films
order by actorid;
So… this query is used to populate an actors table in an 'efficient' way. Whether the result is actually efficient is not the important part of this particular write-up - it's about the query itself.
Another student messaged me with a similar query, saying they get duplicate rows in some cases. I think debugging cases like this really helps me understand the errors and what to expect from certain queries. TLDR: the above query works, and the differences compared to the query that does not work are the DISTINCT in the final select and - what took me the most time to grasp - the window functions ordering by af.year instead of coalesce(afy.year, af.year). I think it comes down to the left join. After the left join, the af.year column is often not matched, because afy.year is a generated series (one row for every year from the actor's first release year onwards) and may not align with the actual years in af.year. Since afy.year is derived from GENERATE_SERIES and is never NULL, COALESCE(afy.year, af.year) always resolves to afy.year. That gives incorrect ordering inside the window partitions, because the actual release years (af.year) are ignored. Consequently, the calculated aggregates and film arrays do not correctly follow the chronological order of film releases.
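To make the coalesce point concrete, here is a minimal sketch with made-up toy data (not the bootcamp tables). After a left join against a generated series, the series year is never NULL, so the coalesce always resolves to it, while the actual film year stays NULL for years with no release:
-- toy illustration: series_year plays the role of afy.year, film_year of af.year
with years as (
    select generate_series(2000, 2003) as year
), films(year, film) as (
    values (2000, 'film_a'), (2002, 'film_b')
)
select
    y.year as series_year, -- never null (comes from the generated series)
    f.year as film_year, -- null for 2001 and 2003 (no release those years)
    coalesce(y.year, f.year) as coalesced_year -- always equals series_year
from years y
left join films f on f.year = y.year
order by y.year;
So a window ordered by the coalesced value is effectively ordered by the series year, which ignores when the films actually came out.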
Lab 3 - graph data modelling
I covered lab 3 today, related to modelling data as nodes (or vertices, as they are called in DE, I guess) and edges.
Here is my code from the lab:
-- build a graph data model to see which NBA players play with each other
-- and which players play for what team at each time
create type vertex_type as enum ('player', 'team', 'game');
create table vertices (
identifier text,
type vertex_type,
properties json, -- postgres doesn't have a MAP type
primary key (identifier, type)
);
create type edge_type as
enum ('plays_against', 'shares_team', 'plays_in', 'plays_on');
create table edges (
subject_identifier text,
subject_type vertex_type,
object_identifier text,
object_type vertex_type,
edge_type edge_type,
properties json,
-- specific id cols might be included and used for the PK
-- for this example this PK is fine
primary key (subject_identifier,
subject_type,
object_identifier,
object_type,
edge_type)
);
insert into vertices
select
game_id as identifier,
'game'::vertex_type as type,
json_build_object(
'pts_home', pts_home,
'pts_away', pts_away,
'winning_team', case when home_team_wins = 1 then home_team_id else visitor_team_id end
) as properties
from games;
insert into vertices
with players_agg as (
select
player_id as identifier,
max(player_name) as player_name,
count(1) as number_of_games,
sum(pts) as total_points,
array_agg(distinct team_id) as teams
from game_details
group by 1
)
select identifier, 'player'::vertex_type,
json_build_object(
'player_name', player_name,
'number_of_games', number_of_games,
'total_points', total_points,
'teams', teams
)
from players_agg;
insert into vertices
with teams_deduped as (
select *, row_number() over (partition by team_id) as row_num
from teams
)
select
team_id as identifier,
'team'::vertex_type as type,
json_build_object(
'abbreviation', abbreviation,
'nickname', nickname,
'city', city,
'arena', arena,
'year_founded', yearfounded
)
from teams_deduped
where row_num = 1;
select type, count(1)
from vertices
group by 1;
-- 30 team, 9384 game, 1496 player
insert into edges
with deduped as (
select *, row_number() over (partition by player_id, game_id) as row_num
from game_details
)
select
player_id as subject_identifier,
'player'::vertex_type as subject_type,
game_id as object_identifier,
'game'::vertex_type as object_type,
'plays_in'::edge_type as edge_type,
json_build_object(
'start_position', start_position,
'pts', pts, -- this is treated as a str so later we need to convert it to int if we want to use it
'team_id', team_id,
'team_abbreviation', team_abbreviation
) as properties
from deduped
where row_num = 1;
select
v.properties->>'player_name',
max(cast(e.properties->>'pts' as integer))
from vertices v
join edges e
on e.subject_identifier = v.identifier
and e.subject_type = v.type
group by 1
order by 2 desc;
insert into edges
with deduped as (
select *, row_number() over (partition by player_id, game_id) as row_num
from game_details
),
filtered as (
select * from deduped
where row_num = 1
),
aggregated as (
select
f1.player_id as subject_player_id,
f2.player_id as object_player_id,
case when f1.team_abbreviation = f2.team_abbreviation
then 'shares_team'::edge_type
else 'plays_against'::edge_type
end as edge_type,
max(f1.player_name) as subject_player_name, -- maybe they changed their name
max(f2.player_name) as object_player_name,
count(1) as num_games,
sum(f1.pts) as subject_points,
sum(f2.pts) as object_points
from filtered f1 join filtered f2
on f1.game_id = f2.game_id
and f1.player_name <> f2.player_name
where f1.player_name > f2.player_name -- remove double edges
group by f1.player_id,
f2.player_id,
case when f1.team_abbreviation = f2.team_abbreviation
then 'shares_team'::edge_type
else 'plays_against'::edge_type
end
)
select
subject_player_id as subject_identifier,
'player'::vertex_type as subject_type,
object_player_id as object_identifier,
'player'::vertex_type as object_type,
edge_type as edge_type,
json_build_object(
'num_games', num_games,
'subject_points', subject_points,
'object_points', object_points
)
from aggregated;
-- we can calculate avg points, points when X plays with Y
-- or points when X plays vs Y, etc.
select
    v.properties->>'player_name',
    e.object_identifier,
    -- average points per game for the subject player (guard against zero games)
    cast(v.properties->>'total_points' as real) /
        case when cast(v.properties->>'number_of_games' as real) = 0 then 1
        else cast(v.properties->>'number_of_games' as real) end,
    e.properties->>'subject_points',
    e.properties->>'num_games'
from vertices v join edges e
    on v.identifier = e.subject_identifier
    and v.type = e.subject_type
where e.object_type = 'player'::vertex_type;
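Following the comment above about points when X plays with (or against) Y, here is a quick sketch of a query I might run next. It assumes the same vertices and edges tables defined in this lab and simply derives a per-game scoring rate from each player-player edge:
-- per-edge scoring rate: the subject player's points per game across the games
-- shared with (shares_team) or played against (plays_against) the other player
select
    v.properties->>'player_name' as subject_player,
    e.object_identifier as other_player_id,
    e.edge_type,
    cast(e.properties->>'subject_points' as real) /
        greatest(cast(e.properties->>'num_games' as real), 1) as subject_pts_per_game -- defensive guard; num_games is always >= 1
from vertices v
join edges e
    on v.identifier = e.subject_identifier
    and v.type = e.subject_type
where e.object_type = 'player'::vertex_type
order by subject_pts_per_game desc;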
That is all for today!
See you tomorrow :)