(Day 330) Who are the top 10 NBA players by consecutive 20+ point seasons?

Ivan Ivanov · November 26, 2024

Hello :) Today is Day 330!

A quick summary of today:

  • applying some of my SCD knowledge
  • learning more on the dataexpert.io platform
  • got an A on the 2nd homework for the free DE bootcamp
  • read/listened to chapter 4 of Designing Data-Intensive Applications

Doing Hard SQL questions

Today I decided to try to finish the questions on the dataexpert.io platform. All of them are completely free, and there’s also a built-in SQL editor, which is amazing.

I just want to mention the very last question, because it feels so good knowing how to solve it. I learned the approach in week 1 of Zach Wilson’s free DE bootcamp - it is related to slowly changing dimensions.

Here is the question:

Who are the top 10 NBA players by consecutive 20+ point seasons?
 
Using the table bootcamp.nba_player_seasons, find the most consecutive seasons a player has scored 20+ points and find the top 10 players! Make sure to handle ties correctly (you might have more than 10 records!). Sort the output data set by consecutive_seasons descending and player_name ascending!

These are the tables to query for this question:
bootcamp.nba_player_seasons
player_name string
age int
height string
weight int
college string
country string
draft_year string
draft_round string
draft_number string
gp double
pts double
reb double
ast double
netrtg double
oreb_pct double
dreb_pct double
usg_pct double
ts_pct double
ast_pct double
season int
Your answer should include these columns:
player_name varchar
consecutive_seasons integer

And here is my solution:

with player_20pt_seasons as (
    -- flag each season where the player averaged 20+ points
    select
        player_name,
        season,
        case when pts >= 20 then 1 else 0 end as has_20pts
    from bootcamp.nba_player_seasons
),
prev_season as (
    -- pull in the flag from the player's previous season (null for the first one)
    select
        player_name,
        season,
        has_20pts,
        lag(has_20pts, 1) over (partition by player_name order by season) as previous_season_has_20pts
    from player_20pt_seasons
),
indicators as (
    -- mark the seasons where the flag flips compared to the previous season
    select *,
        case
            when has_20pts <> previous_season_has_20pts then 1
            else 0
        end as change_ind
    from prev_season
),
streaks as (
    -- a running sum of the flips gives every streak its own identifier
    select *,
        sum(change_ind) over (partition by player_name order by season) as streak_identifier
    from indicators
),
almost_final as (
    -- collapse each streak into its first and last season
    select 
        player_name,
        has_20pts,
        min(season) as start_season,
        max(season) as end_season
    from streaks
    group by player_name, streak_identifier, has_20pts
    order by player_name, streak_identifier
),
ranked_results as (
    -- keep only the 20+ point streaks, measure their length, and rank them (ties share a rank)
    select
        player_name,
        end_season + 1 - start_season as consecutive_seasons,
        rank() over (order by end_season + 1 - start_season desc) as rank
    from almost_final
    where has_20pts = 1
)
select
    player_name,
    consecutive_seasons
from ranked_results
where rank <= 10
order by rank, player_name

Got an A on the 2nd homework on fact data modelling


Submitted it today and got an immediate answer since it’s AI graded ^^

I keep watching videos on the dataexpert.io platform

Just mentioning that they cover Airflow/dbt/Spark and are awesome 😆, but I cannot share anything since the content is completely non-public.

Designing Data-Intensive Applications - Chapter 4 - Encoding and Evolution

Formats for encoding data

Binary encoding formats like Thrift, Protocol Buffers, and Avro provide efficient and flexible alternatives to textual formats like JSON, XML, and CSV, especially when handling schema evolution—the inevitable changes in data structure over time.

Textual formats, while popular, have notable shortcomings. JSON struggles with ambiguity in encoding numbers, often leading to precision issues with large numbers. XML and JSON offer optional and complex schema support, which is frequently ignored, while CSV lacks schema support entirely, making data interpretation more error-prone and labor-intensive. Additionally, handling binary data in these formats necessitates Base64 encoding, inflating data size and increasing complexity.
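
To make the number-precision point concrete, here is a tiny Python illustration (the decoder behavior is an assumption - it mimics JSON parsers that, like JavaScript's, read every number as an IEEE-754 double):

# Integers above 2**53 cannot be represented exactly as IEEE-754 doubles,
# which is how JavaScript-style JSON decoders parse every number.
big_id = 2**53 + 1                 # 9007199254740993, e.g. a large event ID
print(float(big_id))               # 9007199254740992.0 - the last digit is gone
print(float(big_id) == 2**53)      # True: now indistinguishable from a neighboring ID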

In contrast, binary formats are more compact and enforce schema use, which doubles as documentation. They enable schema evolution by allowing compatibility checks and ensuring consistency in data interpretation. Thrift and Protocol Buffers achieve this through field tags—unique numbers assigned to fields in the schema. These tags allow for adding or removing optional fields without breaking compatibility. However, care must be taken when changing data types or field tags, as these can introduce risks such as data truncation.
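
To see why field tags enable this, here is a minimal Python sketch of a toy tag/length/value format (my own invention, not the real Thrift or Protocol Buffers wire encoding): every field is identified by a numeric tag, so a reader that doesn't recognize a tag can simply skip it.

import struct

# Toy encoding (NOT the real Thrift/Protobuf wire format): each field is
# written as a 2-byte tag, a 4-byte length, and the raw value bytes.
def encode(fields: dict) -> bytes:
    out = b""
    for tag, value in fields.items():
        out += struct.pack(">HI", tag, len(value)) + value
    return out

# A reader keeps only the tags it knows about; unknown tags are skipped,
# which is what lets writers add new optional fields without breaking old readers.
def decode(buf: bytes, known_tags: set) -> dict:
    result, offset = {}, 0
    while offset < len(buf):
        tag, length = struct.unpack_from(">HI", buf, offset)
        offset += 6
        if tag in known_tags:
            result[tag] = buf[offset:offset + length]
        offset += length
    return result

# a "v2" writer adds tag 3; a "v1" reader that only knows tags 1 and 2 still works
payload = encode({1: b"Ivan", 2: b"Bulgaria", 3: b"new optional field"})
print(decode(payload, known_tags={1, 2}))   # {1: b'Ivan', 2: b'Bulgaria'}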

Avro, unlike Thrift and Protocol Buffers, doesn’t use field tags. Instead, it employs the writer’s schema and reader’s schema, which allow encoding and decoding with differing but compatible schemas. This approach supports adding or removing fields with default values, converting compatible data types, and dynamic schema generation from sources like relational databases. Avro’s flexibility extends to typed languages, offering optional code generation while remaining highly usable in dynamic environments.
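
Here is a similarly hedged sketch of the writer's/reader's schema idea - the schema below is just a Python dict shaped loosely like an Avro record schema, and the resolve function only imitates what an Avro library does when the reader's schema contains a new field with a default value:

# Loose imitation of Avro schema resolution (not the real avro/fastavro API):
# data decoded with the writer's schema is mapped onto the reader's schema,
# and fields the writer did not know about are filled in from their defaults.
reader_schema = {
    "name": "player_season",
    "fields": [
        {"name": "player_name"},
        {"name": "pts"},
        {"name": "country", "default": "unknown"},   # added later, with a default
    ],
}

def resolve(decoded_with_writer_schema: dict, reader: dict) -> dict:
    out = {}
    for field in reader["fields"]:
        if field["name"] in decoded_with_writer_schema:
            out[field["name"]] = decoded_with_writer_schema[field["name"]]
        else:
            out[field["name"]] = field["default"]     # new field -> use its default
    return out

# the writer's schema only had player_name and pts
print(resolve({"player_name": "Ivan", "pts": 21.5}, reader_schema))
# {'player_name': 'Ivan', 'pts': 21.5, 'country': 'unknown'}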

The use of schemas in binary formats ensures compact data encoding, consistent documentation, and robust compatibility checks. These features provide stronger data guarantees and better tooling, making binary formats an appealing choice for managing evolving data structures in modern applications.

Modes of Dataflow

Dataflow through databases

  • databases act as a shared communication medium, requiring backward compatibility for decoding older data and forward compatibility for concurrent applications using different schema versions
  • preserving unknown fields ensures data integrity when older applications access newer data. formats like Avro handle this effectively, though application-level precautions may still be necessary (a small sketch follows this list)
  • since data outlives code, databases must manage schema evolution without rewriting existing data. archival storage, such as data dumps, can use the latest schema for consistency, especially in analytics
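
Here is the tiny, hypothetical sketch mentioned above for the "preserve unknown fields" point - an old application that only knows about two fields updates a record without dropping a field that newer code added:

# Hypothetical old application: written when records only had player_name and
# pts, while newer code has started adding a "country" field to the same records.
def update_points(record: dict, new_pts: float) -> dict:
    updated = dict(record)      # copy everything, including fields we don't know about
    updated["pts"] = new_pts    # change only what this version understands
    return updated

row = {"player_name": "Ivan", "pts": 18.0, "country": "Bulgaria"}  # written by newer code
print(update_points(row, 21.0))   # the unknown "country" field survives the round trip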

Dataflow through services: REST and RPC

  • services enable client-server communication, often using HTTP-based REST APIs for simplicity, resource identification, and microservices integration
  • SOAP, a more complex XML-based protocol, has declined in popularity due to its tooling requirements and interoperability issues
  • RPC (remote procedure call) frameworks like gRPC address network complexities with features like futures and streams. however, RESTful APIs remain preferred for their ease of use and platform support
  • ensuring backward compatibility for requests and forward compatibility for responses is vital for service evolvability (a rough sketch follows this list). API versioning is critical, especially for inter-organizational communication, often requiring multiple active API versions
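
Here is the rough sketch mentioned above (the endpoint shape and field names are made up): the server tolerates requests from older clients by falling back to defaults, and the client tolerates newer responses by ignoring fields it doesn't know.

# Made-up "players" API: a server-side request handler and a client-side parser.
def handle_request(params: dict) -> dict:
    # "per_page" was added in a later API version; old clients that omit it still work
    per_page = params.get("per_page", 10)        # backward-compatible request handling
    return {
        "players": ["Ivan Ivanov"],
        "per_page": per_page,
        "added_in_v3": "old clients never look at this field",
    }

def parse_response(body: dict) -> list:
    # the client reads only the fields it knows and ignores everything else,
    # which keeps it forward compatible with newer server responses
    return body["players"]

print(parse_response(handle_request({})))        # an old-style client call still works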

Message-passing Dataflow

  • asynchronous message-passing systems, relying on a message broker, facilitate decoupled, one-way communication with features like buffering and message redelivery (a toy sketch follows this list)
  • responses in message-passing can be achieved through separate channels, supporting use cases where direct replies are unnecessary
  • distributed actor frameworks use message-passing to scale applications across nodes, with frameworks like Akka and Orleans handling message encoding compatibility during rolling upgrades
  • these systems balance flexibility and reliability, making them well-suited for evolving, distributed applications
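
And the toy sketch mentioned in the list - an in-process queue.Queue stands in for the message broker, just to show the one-way, decoupled shape of the dataflow:

import queue
import threading

broker = queue.Queue()   # toy stand-in for a real broker like Kafka or RabbitMQ

def producer():
    # the producer fires and forgets; it never waits for (or knows about) the consumer
    for season in (2021, 2022, 2023):
        broker.put({"player_name": "Ivan", "season": season, "pts": 21.0})
    broker.put(None)     # sentinel: nothing more to send

def consumer():
    # the consumer processes messages whenever it gets to them
    while (msg := broker.get()) is not None:
        print("processing", msg)

threading.Thread(target=producer).start()
consumer()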

That is all for today!

See you tomorrow :)