(Day 173) Terraform, GCP, virtual machines, data pipelines

Ivan Ivanov · June 22, 2024

Hello :) Today is Day 173!

A quick summary of today:

  • learned more about Terraform and how to set up a GCP VM and connect to it locally
  • used Mage for some data engineering pipelines with GCP

Last videos from Module 1: Terraform variables, GCP setup

Turns out there is a bit more Terraform content in the data engineering zoomcamp, and today I covered it.

After learning how to connect to GCP using Terraform and create a storage bucket, the first thing today was creating a BigQuery dataset.

image
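Since the screenshots are not reproduced here, a minimal sketch of that kind of resource block (the dataset name and location are assumptions, not necessarily the exact values from the course):

```hcl
resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = "demo_dataset" # assumed name
  location   = "EU"           # assumed location
}
```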

Adding the above to main.tf, which now looks like:

image

Running terraform apply now creates the demo_dataset as well:

image

Then I learned about variables in Terraform.

Create a variables.tf file and define a variable like:

image

and in main.tf we can use the created variables directly, like:

image
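Roughly, the pattern looks like this (the variable names and defaults here are placeholders, not the exact ones from the screenshots):

```hcl
# variables.tf
variable "location" {
  description = "Project location"
  default     = "EU"
}

variable "bq_dataset_name" {
  description = "My BigQuery dataset name"
  default     = "demo_dataset"
}

# main.tf -- resources reference the variables via var.<name>
resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
  location   = var.location
}
```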

A great intro to Terraform: being able to define infrastructure as code, create resources, and destroy resources.

The next part was a walkthrough of setting up GCP (cloud VM + SSH access).

First was creating an SSH key locally.

image
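The key generation follows GCP's documented format; roughly (the key file name and username are placeholders):

```bash
ssh-keygen -t rsa -f ~/.ssh/gcp -C <USERNAME> -b 2048
```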

And adding it to the SSH keys metadata in GCP's Compute Engine (hiding the username just in case):

image

Then I created a VM and connected to it locally using that SSH key: ssh -i ~/.ssh/gcp username@gcp_vm_external_ip

image image

For a quick connection to the VM, I set up an entry in ~/.ssh/config with Host, HostName, User and IdentityFile, so now I can just run ssh <host alias> and I am connected to the VM from my terminal. Nice.
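The entry looks roughly like this (the host alias, IP and username are placeholders):

```
Host de-zoomcamp
    HostName <vm external ip>
    User <username>
    IdentityFile ~/.ssh/gcp
```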

Also set up VS Code to connect to the created VM over SSH.

image

Then, I installed Anaconda.

image

And Docker.

image

Then docker-compose

image

And made it callable from anywhere by adding the line below to .bashrc:

export PATH="${HOME}/bin:${PATH}"

And now we have it

image
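Putting those docker-compose steps together, the setup is roughly this (the release version and paths are assumptions):

```bash
mkdir -p ~/bin && cd ~/bin
# download a docker-compose release binary (version is an assumption)
wget https://github.com/docker/compose/releases/download/v2.27.0/docker-compose-linux-x86_64 -O docker-compose
chmod +x docker-compose
# make ~/bin available from anywhere
echo 'export PATH="${HOME}/bin:${PATH}"' >> ~/.bashrc
source ~/.bashrc
docker-compose version
```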

Then I installed pgcli with conda.

image

(Random note: I am using a VM from my local terminal like this for the first time and it's kind of cool.)

And just like before (two days ago, in the first part of the data engineering zoomcamp), I can run docker-compose up -d and then pgcli to connect to my db.

image image
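The connection command is along these lines (the user, port and database name are placeholders matching the docker-compose setup):

```bash
docker-compose up -d
pgcli -h localhost -p 5432 -u root -d ny_taxi
```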

In VS Code, which is connected to the VM, we can forward the database port to the local machine:

image

And now when I run pgcli from my own PC’s terminal, I can connect to it too.

image

By adding port 8080 as well in VS Code, I can now access pgAdmin from my browser too (even though it is all running on that GCP VM).

image

Same for Jupyter: after adding port 8888 in VS Code, I can run jupyter notebook on the VM and access it in my browser.
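For reference, the VS Code forwarding is the same idea as plain SSH local port forwarding; using the host alias from my SSH config, something like this would also do it:

```bash
ssh -L 5432:localhost:5432 -L 8080:localhost:8080 -L 8888:localhost:8888 de-zoomcamp
```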

Next, I installed Terraform for Linux.

image
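A sketch of one way to do that, dropping the Linux binary into the same ~/bin folder as before (the version and URL are assumptions):

```bash
cd ~/bin
wget https://releases.hashicorp.com/terraform/1.8.5/terraform_1.8.5_linux_amd64.zip
unzip terraform_1.8.5_linux_amd64.zip && rm terraform_1.8.5_linux_amd64.zip
terraform -version
```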

When I was setting up Terraform, I created a my-creds.json file with the credentials from GCP. Using sftp, I transferred the JSON file from my local machine to the VM (sftp being another tool I am using for the first time).

image
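The transfer itself is just a couple of sftp commands (the host alias is a placeholder; the target folder matches the ~/.gc path used below):

```bash
# from the local machine, in the folder containing my-creds.json
sftp de-zoomcamp
mkdir .gc
cd .gc
put my-creds.json
```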

And then I could run the same terraform apply and terraform destroy to create and destroy resources.

If I want to stop and restart the instance, I can do it through the terminal (sudo shutdown now) or the GCP console. When I start it again, for the quick SSH connection command to work I need to update the HostName (the new external IP) in the config file I created earlier. I also found that after a restart, to use Terraform on the VM I need to set my credentials and authenticate gcloud again (saving the commands here for later):

export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/my-creds.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS

Next onto Module 2: workflow orchestration

My good old friend mage.ai. Let's hope for fewer errors, at least, than when I covered it in the MLOps zoomcamp.

The first bit was to establish a connection with the Postgres database that runs alongside Mage in docker-compose.yml.

image

Creating a block in a new pipeline to test the connection:

image
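The test block itself is trivial. As a hedged standalone equivalent outside Mage (credentials and database name are placeholders matching the docker-compose setup), the same check looks like:

```python
# Standalone sanity check of the Postgres connection (not the Mage block itself).
from sqlalchemy import create_engine, text

# placeholder credentials -- these come from docker-compose.yml
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

with engine.connect() as conn:
    # prints 1 if the connection works
    print(conn.execute(text("SELECT 1;")).scalar())
```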

So far so good.

Next was writing a simple ETL pipeline that loads data from an API into Postgres: the first block loads the taxi data with explicit data types, the second does a little bit of preprocessing, and the third connects to my db and loads the data there.

image
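A rough sketch of what the first (loader) block looks like, following Mage's standard block template; the URL, columns and dtypes here are illustrative rather than the exact ones from the course:

```python
import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_taxi_data(*args, **kwargs):
    # Illustrative URL -- the course loads NYC yellow taxi trip data.
    url = 'https://example.com/yellow_tripdata_2021-01.csv.gz'

    # Declaring dtypes up front keeps memory usage down and catches schema drift early.
    taxi_dtypes = {
        'VendorID': pd.Int64Dtype(),
        'passenger_count': pd.Int64Dtype(),
        'trip_distance': float,
        'fare_amount': float,
    }
    parse_dates = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']

    return pd.read_csv(url, compression='gzip', dtype=taxi_dtypes, parse_dates=parse_dates)
```

The second block then does the preprocessing (for example, dropping rides with zero passengers), and the third uses a Postgres data exporter block to write the dataframe to a table in the db.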

Next was connecting my GCP service account to Mage (using the creds.json file), and the connection was made.

Loading data to Google Cloud Storage is very easy: just add a Google Cloud Storage (GCS) data exporter block, fill in the connection details, and it's done.

image

It can be seen in GCS:

image

However, with larger files we should not load the data into a single Parquet file; we should partition it.

I learned how to use pyarrow (as it abstracts the chunking logic) for that, as in the block below:

image
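The gist of it (the bucket name, partition column and credentials path are placeholders):

```python
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.fs as pafs

# pyarrow's GCS filesystem picks up the service account via this env var (placeholder path)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/my-creds.json'

bucket_name = 'my-demo-bucket'            # placeholder
root_path = f'{bucket_name}/nyc_taxi_data'


def export_partitioned(df: pd.DataFrame) -> None:
    # Partition by pickup date so each day lands in its own folder of parquet files.
    df['tpep_pickup_date'] = df['tpep_pickup_datetime'].dt.date

    table = pa.Table.from_pandas(df)
    gcs = pafs.GcsFileSystem()
    pq.write_to_dataset(
        table,
        root_path=root_path,
        partition_cols=['tpep_pickup_date'],
        filesystem=gcs,
    )
```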

And the partitioned data is in GCS now. Awesome.

image

And then I loaded it into BigQuery.

image
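In Mage this is just another exporter block. As a hedged standalone equivalent, loading the partitioned Parquet files from GCS into BigQuery with the official Python client would look roughly like this (project, dataset, table and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS

table_id = 'my-project.demo_dataset.nyc_taxi'  # placeholder destination table
uri = 'gs://my-demo-bucket/nyc_taxi_data/*'    # the partitioned files from the previous step

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows)
```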

My experience using Mage in this course is completely different compared to the MLOps zoomcamp, haha. This time the teacher, Matt Palmer, did a great job ^^

That is all for today!

See you tomorrow :)