Hello :) Today is Day 197!
A quick summary of today:
- finally decided to learn about Kafka and streaming data
I started the day by watching and replicating this video on Building a Real-Time Data Streaming Pipeline using Kafka, Postgres and Streamlit.
Kafka, and streaming (data/models) in general, have been in my zone of interest for a while, and this video finally gave me a good starting point.
The video helped me set up a Kafka service, a Zookeeper (which manages metadata), a producer (which reads data from a streaming source like a live weather feed and sends it to Kafka), and a consumer (which reads data from Kafka).
The whole demo revolved around the sentiment of generated sentences.
First we generate sentences and send them to Kafka using the producer; the consumer reads them from Kafka and runs a sentence sentiment analyser on each one to get a score -> then the sentence + score are uploaded to Postgres -> then, as each new record comes in, it is shown in a UI (Streamlit). I took some pics, but I do not have the code for it as I started overwriting it to fit my project (a rough sketch of what it looked like is below, after the screenshots).
This is what the producer produced:
The consumer read the data and added a sentiment score:
Then each line, as it was processed, was added to the DB:
And then a live feed of new sentences could be seen in Streamlit:
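Since I overwrote the code, here is only a rough sketch of what the demo pipeline looked like - the topic name, table name and DB credentials are placeholders, and TextBlob stands in for whatever sentiment analyser the video actually used:

```python
import json
import random
import time

import psycopg2
from kafka import KafkaConsumer, KafkaProducer
from textblob import TextBlob

TOPIC = "sentences"  # placeholder topic name


# --- producer side: generate a sentence every few seconds and send it to Kafka ---
def run_producer():
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    words = ["great", "terrible", "okay", "wonderful", "boring"]
    while True:
        sentence = f"The weather today is {random.choice(words)}"
        producer.send(TOPIC, {"sentence": sentence})
        time.sleep(3)


# --- consumer side: read sentences, score them, write sentence + score to Postgres ---
def run_consumer():
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )
    conn = psycopg2.connect(
        host="localhost", dbname="sentiment", user="postgres", password="postgres"
    )
    for message in consumer:
        sentence = message.value["sentence"]
        score = TextBlob(sentence).sentiment.polarity  # between -1 and 1
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO sentences (sentence, score) VALUES (%s, %s)",
                (sentence, score),
            )


# In the demo these would run as two separate processes / containers.
```

The Streamlit side can then just poll that table and redraw the feed of the latest sentences.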
Now ~ for my project idea
This was my idea: get full transaction data from the Stripe API (Kafka Python producer) > read it with a Kafka Python consumer > use PySpark to clean it and keep only what I need > store the results in a DB > use Grafana to visualise the data from the DB. Maybe also use Mage for orchestration.
I spent most of my day setting up Java/PySpark/Spark in Docker. Once I finally managed to get PySpark/Spark running, I started with batch processing, but I wanted to use the Spark streaming API, so I tried that, and I kept getting empty data without being sure what was going on.
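For reference, this is roughly the shape of the Structured Streaming read from Kafka I was attempting - a sketch with placeholder topic/server names, not my exact code. One thing worth noting is that streaming reads start from the latest offsets by default, which can make a stream look empty if nothing new is being produced while it runs:

```python
from pyspark.sql import SparkSession

# Sketch of a Structured Streaming read from Kafka (needs the
# spark-sql-kafka connector package on the Spark classpath).
spark = (
    SparkSession.builder
    .appName("stripe-transactions-stream")  # placeholder app name
    .getOrCreate()
)

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")      # placeholder topic
    .option("startingOffsets", "earliest")    # default is "latest" for streaming reads
    .load()
)

# Kafka values arrive as bytes; cast to string before any cleaning.
transactions = raw.selectExpr("CAST(value AS STRING) AS json_str")

query = (
    transactions.writeStream
    .format("console")    # just print micro-batches while debugging
    .outputMode("append")
    .start()
)
query.awaitTermination()  # without this the script can exit before any data shows up
```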
My producer was a basic while loop that sends data to Kafka every 3 seconds, and after the consumer reads it I wanted to do some basic manipulation. However, given that I could not see any data when using the streaming API, I gave up on it and decided not to use Spark. So I went back to just basic cleaning of the data in Python and then sending it to Postgres. I was not happy with that. I did not want to give up on using the Spark API, so I decided to try a different route - use it in the producer to process the data from Stripe (to potentially utilise Spark’s parallelism in case millions of transactions come in per second - this is not actually possible, because the Stripe API limits me to 25 transactions at a time, but still).
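The producer side is roughly this shape - a minimal sketch, where `fetch_latest_transactions` is a hypothetical placeholder for the Stripe call and the topic name is made up:

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def fetch_latest_transactions():
    """Hypothetical placeholder for the Stripe API call,
    which only returns a small page of transactions per request."""
    return [{"id": "txn_123", "amount": 1999, "currency": "gbp"}]


# Basic while loop: poll for transactions and push them to Kafka every 3 seconds.
while True:
    for txn in fetch_latest_transactions():
        producer.send("stripe-transactions", txn)  # placeholder topic name
    producer.flush()
    time.sleep(3)
```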
All my latest code (now at 1.30am right before I sleep) is on this repo.
While I appreciate the fact that I learned a lot about Kafka today, I have not given up on using Spark (even a little bit). I know the tech might not be that important - I also need to do documentation and data modelling. I need to think about the transferable skills and how the things I am doing might be scaled to millions of transactions per second. But rather than implementing anything that can handle that, maybe just knowing about it is fine too …
That is all for today!
See you tomorrow :)