Hello :) Today is Day 173!
A quick summary of today:
- learned more about terraform and how to set up a GCP VM and connect to it locally
- used mage for some data engineering pipelines with GCP
Last videos from Module 1: terraform variables, GCP setup
Turns out there is a bit more terraform in the data eng zoomcamp, and today I covered it.
After learning how to connect to GCP using terraform and create a storage bucket, the first thing today was creating a BigQuery dataset.
Adding that resource to main.tf, which now looks like:
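(A rough sketch - the project id, region, creds path, bucket name and provider version here are placeholders, not my actual values.)

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"   # example version
    }
  }
}

provider "google" {
  credentials = file("./keys/my-creds.json")   # example path to the service account key
  project     = "my-project-id"                # placeholder
  region      = "us-central1"
}

resource "google_storage_bucket" "demo-bucket" {
  name          = "my-unique-bucket-name"      # placeholder, bucket names must be globally unique
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = "demo_dataset"
  location   = "US"
}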
Running terraform apply now creates the demo_dataset as well.
Then I learned about variables in terraform
Create a variables.tf file and put some variables in it, and in main.tf we can then use them directly.
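A minimal sketch of what that looks like (the variable names and defaults here are examples, apart from demo_dataset):

# variables.tf
variable "location" {
  description = "Project location"
  default     = "US"
}

variable "bq_dataset_name" {
  description = "BigQuery dataset name"
  default     = "demo_dataset"
}

# main.tf - reference them with var.<name>
resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
  location   = var.location
}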
Great intro to terraform - being able to define infrastructure as code, create resources, and destroy them.
The next part was a walkthrough of setting up GCP (cloud VM + SSH access)
First was creating an ssh key locally
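The key generation follows the usual format from the GCP docs (USERNAME is a placeholder):

# creates ~/.ssh/gcp (private key) and ~/.ssh/gcp.pub (the public key that goes into GCP)
ssh-keygen -t rsa -f ~/.ssh/gcp -C USERNAME -b 2048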
And added the public key to the metadata in GCP's Compute Engine (hiding the username just in case)
Then I created a VM and connected to it locally using that ssh key: ssh -i ~/.ssh/gcp username@gcp_vm_external_ip
For a quick connection to the VM, I set up a config which includes Host, HostName, User and IdentityFile, so now I can just run ssh Host
and I am connected to the VM through my terminal. Nice.
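The entry in ~/.ssh/config looks roughly like this (the host alias, IP and username below are placeholders):

Host de-zoomcamp
    HostName <vm external ip>
    User <username>
    IdentityFile ~/.ssh/gcp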
Also set up VS Code to connect to the created VM over ssh.
Then, I installed anaconda.
And docker
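Roughly what those two installs look like on an Ubuntu VM (the anaconda installer version/URL is just an example):

# anaconda: download and run the installer
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
bash Anaconda3-2023.09-0-Linux-x86_64.sh

# docker from the Ubuntu repos
sudo apt-get update
sudo apt-get install docker.io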
Then docker-compose
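For this one I grabbed the standalone binary into a ~/bin directory, something like (the version is an example - check the releases page):

mkdir -p ~/bin
wget https://github.com/docker/compose/releases/download/v2.24.6/docker-compose-linux-x86_64 -O ~/bin/docker-compose
chmod +x ~/bin/docker-compose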
And make it executable from anywhere by adding the below to .bashrc
export PATH="${HOME}/bin:${PATH}"
And now we have it
Then installed pgcli with conda
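Something along the lines of (pgcli is available via conda-forge):

conda install -c conda-forge pgcli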
(random note - I am using a VM from my local terminal like this for the 1st time and it's kind of cool)
And just like before (2 days ago, first part of the data eng zoomcamp), I can run docker-compose up -d and then pgcli to connect to my db.
In VS Code, which is connected to the VM, we can forward the port to the db.
And now when I run pgcli from my own PC’s terminal, I can connect to it too.
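A rough sketch of that flow (the user and database names are examples from that earlier setup):

# on the VM: start the containers defined in docker-compose.yml
docker-compose up -d

# locally, after forwarding port 5432 in VS Code (user/db names are examples)
pgcli -h localhost -p 5432 -u root -d ny_taxi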
By adding port 8080 as well in VS Code, I can now access pgadmin from my browser too (even though it is all running on that GCP VM)
Same for jupyter - added port 8888 in VS Code, then I can run jupyter notebook on the VM and access it in my browser.
Next, I installed terraform for linux
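Roughly like this (the version is an example - check the hashicorp releases page):

wget https://releases.hashicorp.com/terraform/1.7.5/terraform_1.7.5_linux_amd64.zip
sudo apt-get install unzip
unzip terraform_1.7.5_linux_amd64.zip -d ~/bin/
rm terraform_1.7.5_linux_amd64.zip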
When I was setting up terraform, I created a my-creds.json file with the credentials from GCP. Now, using sftp, I transferred the json file from my local machine to the VM (sftp - another tool I am using for the 1st time)
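A sketch of the transfer (de-zoomcamp stands in for whatever Host alias is in the ssh config; the remote .gc directory matches the credentials path used below):

# from my local machine, in the directory that has the json file
sftp de-zoomcamp
mkdir .gc
put my-creds.json .gc/my-creds.json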
And then I could run the same terraform apply and destroy to create and destroy resources.
And if I want to stop and restart the instance, I can do it through the terminal (sudo shutdown now) or the GCP console. When I start it again, the external IP changes, so for the quick ssh connection command to work I need to edit the HostName in the config file I created earlier.
I found that if I restart the VM and want to use terraform, I need to set my credentials and gcloud auth again using the commands below (also just saving them here for later):
export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/my-creds.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
Next onto Module 2: workflow orchestration
My good old friend mage.ai. Let's hope for at least fewer errors than when I covered it in the MLOps zoomcamp.
The first bit was to establish a connection with the postgres database that runs alongside mage in docker-compose.yml
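A sketch of how that can be wired up - assuming a dev profile in mage's io_config.yaml that reads the env vars defined for the postgres service in docker-compose.yml:

dev:
  POSTGRES_CONNECT_TIMEOUT: 10
  POSTGRES_DBNAME: "{{ env_var('POSTGRES_DBNAME') }}"
  POSTGRES_SCHEMA: "{{ env_var('POSTGRES_SCHEMA') }}"
  POSTGRES_USER: "{{ env_var('POSTGRES_USER') }}"
  POSTGRES_PASSWORD: "{{ env_var('POSTGRES_PASSWORD') }}"
  POSTGRES_HOST: "{{ env_var('POSTGRES_HOST') }}"
  POSTGRES_PORT: "{{ env_var('POSTGRES_PORT') }}"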
Creating a block in a new pipeline to test the connection:
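A minimal sketch of such a test block, assuming a trivial query through mage's Postgres IO client and the dev profile from above:

from os import path
from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def test_postgres_connection(*args, **kwargs):
    # a trivial query - if this returns, the connection works
    query = 'SELECT 1;'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'dev'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        return loader.load(query)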
So far so good.
Next is writing a simple ETL pipeline - loading data from an API to postgres: I load taxi data with explicit data types in the first block, do a little preprocessing in the 2nd block, and then connect to my db and load the data there in the 3rd block.
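A sketch of that first block - reading the CSV with explicit dtypes and parsed dates (the URL is a placeholder and the column names are examples from the NYC yellow taxi data):

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_taxi_data(*args, **kwargs):
    # example: a compressed monthly yellow taxi CSV
    url = 'https://.../yellow_tripdata_2021-01.csv.gz'

    # declaring dtypes up front keeps memory usage down and fails fast on bad data
    taxi_dtypes = {
        'VendorID': pd.Int64Dtype(),
        'passenger_count': pd.Int64Dtype(),
        'trip_distance': float,
        'fare_amount': float,
        'total_amount': float,
    }
    parse_dates = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']

    return pd.read_csv(url, compression='gzip', dtype=taxi_dtypes, parse_dates=parse_dates)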
Next is connecting my gcp service account to mage (using the creds.json file), and a connection is made.
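That boils down to a line like this in io_config.yaml (the path is an example for wherever the json file is mounted inside the mage container):

GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/home/src/my-creds.json"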
Loading data to google cloud storage is very easy - just adding a google cloud storage (GCS) data exporter block, putting my info down, and it's done.
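The exporter block is roughly mage's stock GCS template with the details filled in (the bucket and object names here are placeholders):

from os import path
from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_gcs(df, **kwargs) -> None:
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'my-unique-bucket-name'   # placeholder
    object_key = 'nyc_taxi_data.parquet'    # placeholder

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )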
Can be seen in gcs
However, with larger files we should not load data into a single parquet file. We should partition it.
I learned how to use pyarrow for that (it abstracts the chunking logic), as in the below block:
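Roughly, it partitions by pickup date and writes straight to the bucket via pyarrow's GCS filesystem (the bucket, table and column names, and the creds path, are examples):

import os

import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

# pyarrow picks up the service account credentials from this env var (example path)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/my-creds.json'

bucket_name = 'my-unique-bucket-name'   # placeholder
table_name = 'nyc_taxi_data'            # placeholder
root_path = f'{bucket_name}/{table_name}'


@data_exporter
def export_data(data, *args, **kwargs):
    # partition column: one folder per pickup date
    data['tpep_pickup_date'] = data['tpep_pickup_datetime'].dt.date

    table = pa.Table.from_pandas(data)
    gcs = pa.fs.GcsFileSystem()

    # write_to_dataset handles the chunking into one parquet file per partition
    pq.write_to_dataset(
        table,
        root_path=root_path,
        partition_cols=['tpep_pickup_date'],
        filesystem=gcs,
    )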
And the partitioned data is in gcs now. Awesome
And then loaded it into BigQuery.
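That last step can be another exporter block, e.g. through mage's BigQuery IO client (the table id here is a placeholder):

from os import path
from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df, **kwargs) -> None:
    table_id = 'my-project-id.demo_dataset.nyc_taxi_data'   # placeholder
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table if it already exists
    )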
My experience using mage in this course compared to the MLOps zoomcamp is completely different haha. Now, the teacher Matt Palmer did a great job ^^
That is all for today!
See you tomorrow :)