Hello :) Today is Day 329!
A quick summary of today:
- read Designing data-intensive applications - chapter 3
- watched a bit more of the dataexpert.io paid content
- did the homework (but am still not 100% confident in the validity of my solution)
Designing data-intensive applications - chapter 3 - storage and retrieval
Data structures that power your db
Basics of databases
- key-value stores are simple database structures similar to dictionaries with db_set to store key-value pairs and db_get to retrieve them
- appending key-value pairs to a file is easy for writing but slows down reading as the file size increases
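To make this concrete, here is a minimal sketch of the append-only idea in Postgres (the kv_log table and the sample key/value are mine, not from the book): writes only ever append a new row, and a read scans for the most recent entry for a key.
-- hypothetical append-only key-value log
create table kv_log (
    id bigserial primary key, -- insertion order stands in for the log offset
    key text not null,
    value text not null
);

-- "db_set": always append, never update in place
insert into kv_log (key, value) values ('user:42', '{"name": "Ana"}');

-- "db_get": return the most recently written value for the key;
-- without an index this is a scan, which gets slower as the log grows
select value
from kv_log
where key = 'user:42'
order by id desc
limit 1;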
Indexing and optimization
- indexes improve read speeds but add overhead to writes and should be chosen based on query patterns
- hash index keeps an in-memory hash map from keys to byte offsets for fast lookups, requiring enough memory to hold all the keys (see the example after this list)
- segmented files divide data into manageable pieces, with compaction and merging reducing redundancy and optimizing storage
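Postgres happens to ship a disk-based hash index type. It's not the in-memory Bitcask-style map the chapter describes, but it shows the access pattern a hash index serves: exact-match lookups only. A small sketch, reusing the hypothetical kv_log table from above:
-- hash indexes answer equality predicates; they cannot answer range predicates
create index kv_log_key_hash on kv_log using hash (key);

-- can use the hash index
select value from kv_log where key = 'user:42';

-- cannot use the hash index; a range like this needs a sorted structure
select count(*) from kv_log where key between 'user:40' and 'user:49';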
Advanced index structures
- sstables store sorted key-value pairs for efficient merging, minimal in-memory index needs, and compression support
- data is first written to an in-memory balanced tree (memtable) and then persisted as sstable segments
- lsm-trees are based on sstables, offering high write throughput with sequential writes and crash safety using write-ahead logs
- b-trees organize data into balanced structures for predictable o(log n) lookups, using write-ahead logs to prevent corruption during updates
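SSTables, memtables and LSM compaction are storage-engine internals you can't really poke at from SQL, but the B-tree half maps directly onto the default index type in Postgres. A sketch on the same hypothetical kv_log table:
-- the default index type is a B-tree: keys are kept sorted in pages,
-- so both point lookups and range scans cost O(log n) page reads
create index kv_log_key_btree on kv_log (key);

-- point lookup
select value from kv_log where key = 'user:42' order by id desc limit 1;

-- range scan over a sorted key range, which the hash index above cannot serve
select key, value from kv_log where key between 'user:40' and 'user:49';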
Trade-offs between lsm-trees and b-trees
- lsm-trees are faster for writes but compaction can interfere with performance
- b-trees offer predictable read performance and suit transactional workloads because each key lives in exactly one place, but they are slower for writes
Specialized indexes
- secondary indexes enable queries on non-unique values, useful for relational joins
- clustered and covering indexes store data within or alongside the index for faster reads but increase write overhead
- multi-column indexes support multi-field queries using concatenated or multi-dimensional approaches
- full-text search handles fuzzy queries with techniques like edit distance and automata
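Most of these map directly onto index options in Postgres. A quick sketch on a hypothetical users table (table and columns are made up for illustration; Postgres has no true clustered index, so that one is left out):
create table users (
    user_id bigint primary key,
    first_name text,
    last_name text,
    city text,
    bio text
);

-- secondary index on a non-unique column
create index users_city_idx on users (city);

-- multi-column (concatenated) index: helps queries filtering on last_name,
-- or on last_name + first_name, but not on first_name alone
create index users_name_idx on users (last_name, first_name);

-- covering index: INCLUDE stores an extra column in the index itself,
-- so a matching query can be answered by an index-only scan
create index users_city_covering_idx on users (city) include (last_name);

-- full-text search via a GIN index on a tsvector expression
create index users_bio_fts_idx on users using gin (to_tsvector('english', bio));
select user_id
from users
where to_tsvector('english', bio) @@ to_tsquery('english', 'database & storage');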
In-memory databases
- store the entire dataset in ram for speed, with durability through snapshots, logging, or replication
- performance comes from optimized data structures and avoiding disk i/o
- anti-caching extends in-memory models to handle datasets larger than available memory, similar to virtual memory systems
Transaction processing or analytics?
Transaction processing (OLTP)
- focuses on low-latency reads and writes for real-time updates
- typical operations: looking up small records by key and modifying them based on user input
- primarily used by end users for current data states (e.g., sales or orders)
- dataset size: gigabytes to terabytes
Analytics (OLAP)
- scans large datasets to calculate aggregates like sums or averages
- queries analyze historical events and support decision-making
- mainly used by analysts, with datasets ranging from terabytes to petabytes
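The difference shows up in the shape of the queries themselves. A contrived pair on a hypothetical sales table (schema and values are mine, just for contrast):
-- hypothetical table for the OLTP vs OLAP contrast
create table sales (
    sale_id bigint primary key,
    customer_id bigint,
    product_id bigint,
    sale_date date,
    quantity int,
    amount numeric
);

-- OLTP: touch a handful of rows by key, low latency, often followed by a write
select * from sales where sale_id = 1234567;
update sales set quantity = quantity + 1 where sale_id = 1234567;

-- OLAP: scan a large slice of history and aggregate a few columns
select product_id,
    date_trunc('month', sale_date) as month,
    sum(amount) as revenue
from sales
where sale_date >= date '2023-01-01'
group by product_id, date_trunc('month', sale_date);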
Divergence of OLTP and OLAP
- oltp systems prioritize availability and low latency but struggle with resource-heavy analytic queries
- separate data warehouses emerged for analytics to avoid impacting oltp performance
Data warehousing
- uses extract-transform-load (etl) to consolidate and clean data from various oltp systems
- optimized for analytic patterns like bulk reads and aggregations
- schema types:
- star schema: central fact table linked to dimension tables for simplicity
- snowflake schema: normalized dimensions for complex hierarchies
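A minimal star-schema sketch (tables and columns are hypothetical, just to make the shape concrete): one wide fact table of events, with foreign keys pointing out to dimension tables. In a snowflake schema, an attribute like category would be normalized further into its own table referenced from dim_product.
-- dimension tables: descriptive attributes
create table dim_date (
    date_sk date primary key,
    year int,
    month int,
    day_of_week int
);
create table dim_product (
    product_sk bigint primary key,
    product_name text,
    category text
);

-- fact table: one row per sale, mostly foreign keys plus numeric measures
create table fact_sales (
    date_sk date references dim_date,
    product_sk bigint references dim_product,
    quantity int,
    amount numeric
);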
Tools and systems
- modern data warehouses include commercial systems (e.g., teradata, redshift) and open-source platforms (e.g., hive, presto)
- some databases support both oltp and olap but often separate their engines for efficiency
Column-oriented storage
Column-oriented storage is crucial for efficiently handling massive datasets, common in data warehouses with trillions of rows and petabytes of data. Unlike traditional row-oriented storage, which stores all attributes of a row together, columnar storage organizes data by columns. This structure brings significant benefits for analytics, where queries typically focus on a subset of columns.
It allows queries to load only the relevant columns, reducing unnecessary disk I/O. It also facilitates compression, as columns often contain repetitive values. For instance, bitmap encoding compresses data by representing values as compact bitmaps, while run-length encoding further optimizes storage, especially for sorted columns.
Sorting plays a key role in column stores. Organizing data by frequently queried columns, like dates or product IDs, improves query performance and enhances compression. Some systems replicate datasets sorted differently across machines to optimize for diverse query patterns.
However, columnar storage poses challenges for writes. Updating compressed, sorted columns is computationally expensive. Systems like Vertica address this by temporarily storing writes in memory and periodically merging them with on-disk data.
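To make the benefit concrete: an analytic query like the one below touches only two of what could easily be 100+ columns in a real fact table. A column store reads (and decompresses) just those two column files, while a row store has to pull every row in full. Reusing the hypothetical fact_sales table from the star-schema sketch above:
-- only date_sk and amount are needed, so a column store skips every other column on disk
select date_sk, sum(amount) as revenue
from fact_sales
where date_sk between date '2023-01-01' and date '2023-01-31'
group by date_sk;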
Data accessibility and Airflow
I covered 2 lectures + 2 labs on the dataexpert.io platform related to the above two topics. As usual, I cannot share anything about it or from it. I can just share that it’s amazing 😆
Fact Data Modelling homework
After watching all the lectures + labs from Zach Wilson’s free YouTube bootcamp on fact data modelling yesterday, I decided to try the homework today.
Surprisingly, it was more straightforward than expected. Maybe the fact that I did hw1 and all the sql practice got me rolling. Here are my solutions at the moment. I am not 100% sure they are correct, but I am getting similar results to 1-2 other people on the community discord.
-- 1. A query to deduplicate game_details from Day 1 so there are no duplicates
with deduped_game_details as (
select *,
row_number() over (partition by game_id, team_id, player_id) as row_num
from game_details
)
select *
from deduped_game_details
where row_num = 1;
-- 2. A DDL for a user_devices_cumulated table
create table user_devices_cumulated (
user_id numeric,
device_activity date[],
browser_type text,
date date,
primary key (user_id, browser_type, date)
);
-- 3. A cumulative query to generate device_activity_datelist from events
insert into user_devices_cumulated
with yesterday as (
select *
from user_devices_cumulated
where date = date('2023-01-30')
), today as (
select
user_id,
browser_type,
date(cast(event_time as timestamp)) as date_active
from events left join devices
on events.device_id = devices.device_id
where
date(cast(event_time as timestamp)) = date('2023-01-31')
and user_id is not null
and devices.browser_type is not null
group by user_id, date(cast(event_time as timestamp)), browser_type
)
select
coalesce(t.user_id, y.user_id) as user_id,
case
when y.device_activity is null then array[t.date_active]
when t.date_active is null then y.device_activity
else array[t.date_active] || y.device_activity
end as device_activity,
coalesce(y.browser_type, t.browser_type) as browser_type,
coalesce(t.date_active, y.date + interval '1 day') as date
from today t full outer join yesterday y
on t.user_id = y.user_id and t.browser_type = y.browser_type;
-- sanity check on the number of rows loaded
select count(1)
from user_devices_cumulated;
-- 4. A datelist_int generation query.
-- Convert the device_activity_datelist column into a datelist_int column
with users as (
select * from user_devices_cumulated
where date = date('2023-01-31')
),
series as (
select * from generate_series(date('2023-01-01'), date('2023-01-31'), interval '1 day') as series_date
),
place_holder_ints as (
select
case
when device_activity @> array[date(series_date)]
then cast(pow(2, 31 - (date - date(series_date))) as bigint)
else 0
end as placeholder_int_value,
*
from users cross join series
)
SELECT user_id,
browser_type,
cast(cast(sum(placeholder_int_value) as bigint) as bit(32)) as datelist_int_binary,
bit_count(cast(cast(sum(placeholder_int_value) as bigint) as bit(32))) as cnt_active_dates,
cast(sum(placeholder_int_value) as bigint) as datelist_int,
date
FROM place_holder_ints
where user_id = '444502572952128450' -- spot-check a single user
GROUP BY user_id, browser_type, date;
-- 31,30,29,28,27,25,24,19,18,17,16,15,14,13,12,11,7,6,5,4,3,2
-- select * from user_devices_cumulated
-- where user_id = '444502572952128450'
-- and date = date('2023-01-31')
-- 5. A DDL for hosts_cumulated table
-- a host_activity_datelist which logs to see which dates each host is experiencing any activity
create table hosts_cumulated (
host text,
host_activity_datelist date[],
today_date date,
PRIMARY KEY (host, today_date)
);
-- 6. The incremental query to generate host_activity_datelist
insert into hosts_cumulated
with yesterday as (
select *
from hosts_cumulated
where today_date = date('2023-01-30')
),
today as (
select
host,
cast(cast(event_time as timestamp) as date) as today_date
from events
where cast(cast(event_time as timestamp) as date) = date('2023-01-31')
and host is not null
group by host, cast(cast(event_time as timestamp) as date)
)
select
coalesce(y.host, t.host) as host,
case
when y.host_activity_datelist is null then array[t.today_date]
when t.today_date is not null then array[t.today_date] || y.host_activity_datelist
else y.host_activity_datelist
end as host_activity_datelist,
coalesce(t.today_date, y.today_date + interval '1 day') as today_date
from yesterday y full outer join today t
on y.host = t.host;
-- 7. A monthly, reduced fact table DDL host_activity_reduced
create table host_activity_reduced (
month date,
host text,
hit_metric text,
hit_array integer[],
unique_visitors_metric text,
unique_visitors_array numeric[],
primary key (month, host, hit_metric, unique_visitors_metric)
);
-- 8. An incremental query that loads host_activity_reduced
insert into host_activity_reduced
with daily_agg as (
select
host,
date(event_time) as date,
count(1) as num_site_hits,
COUNT(DISTINCT user_id) as unique_visitors
from events
where date(event_time) = date('2023-01-31')
and user_id is not null
group by host, date(event_time)
),
yesterday_array as (
select *
from host_activity_reduced
where month = date('2023-01-01') -- hardcoded
)
select
coalesce(ya.month, date_trunc('month', da.date)) as month,
coalesce(da.host, ya.host) as host,
'site_hits' as hit_metric,
case
when ya.hit_array is not null then ya.hit_array || array[coalesce(da.num_site_hits, 0)]
when ya.hit_array is null then array_fill(0, array[coalesce(date - date(date_trunc('month', date)), 0)]) || array[coalesce(da.num_site_hits, 0)]
end as hit_array,
'unique_visitors' as unique_visitors_metric,
case
when ya.unique_visitors_array is not null then ya.unique_visitors_array || array[coalesce(da.unique_visitors, 0)]
when ya.unique_visitors_array is null then array_fill(0, array[coalesce(date - date(date_trunc('month', date)), 0)]) || array[coalesce(da.unique_visitors, 0)]
end as unique_visitors_array
from daily_agg da full outer join yesterday_array ya
on da.host = ya.host
on conflict (month, host, hit_metric, unique_visitors_metric)
do
update set hit_array = excluded.hit_array,
unique_visitors_array = excluded.unique_visitors_array;
-- not part of the exercise just for fun
with agg as (
select
host,
month,
array [
sum(hit_array[1]),
sum(hit_array[2]),
sum(hit_array[3])
] as summed_hit_array,
array [
avg(unique_visitors_array[1]),
avg(unique_visitors_array[2]),
avg(unique_visitors_array[3])
] as avg_unique_visitors_array
from host_activity_reduced
group by host, month
),
daily_records as (
select
host,
month,
generate_series(1, array_length(summed_hit_array, 1)) as day_offset,
summed_hit_array,
avg_unique_visitors_array
from agg
),
unnested as (
select
dr.host,
dr.month + (dr.day_offset - 1) * interval '1 day' as date,
dr.summed_hit_array[dr.day_offset] as hit_value,
dr.avg_unique_visitors_array[dr.day_offset] as unique_visitor_value
from daily_records dr
)
select *
from unnested
order by host, date;
That is all for today!
See you tomorrow :)