Hello :) Today is Day 173!
A quick summary of today:
- learned more about terraform and how to set up a GCP VM and connect to it locally
- used mage for some data engineering pipelines with GCP
Last videos from Module 1: terraform variables, GCP setup
Turns out there is a bit more terraform in the data eng zoomcamp, and today I covered it.
After learning how to connect to GCP using terraform and create a storage bucket, the first thing today was creating a BigQuery dataset.
Adding that resource to main.tf, which now looks like:
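(A rough sketch - the project id, region, creds path, bucket name and provider version here are placeholders, not my actual values.)

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "5.6.0"   # example version
    }
  }
}

provider "google" {
  credentials = file("./keys/my-creds.json")   # example path to the service account key
  project     = "my-project-id"                # placeholder
  region      = "us-central1"
}

resource "google_storage_bucket" "demo-bucket" {
  name          = "my-unique-bucket-name"      # placeholder, bucket names must be globally unique
  location      = "US"
  force_destroy = true
}

resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = "demo_dataset"
  location   = "US"
}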
Running terraform apply now creates the demo_dataset as well.
Then I learned about variables in terraform
Create a variables.tf file and put some variables in it, and in main.tf we can then use them directly.
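A minimal sketch of what that looks like (the variable names and defaults here are examples, apart from demo_dataset):

# variables.tf
variable "location" {
  description = "Project location"
  default     = "US"
}

variable "bq_dataset_name" {
  description = "BigQuery dataset name"
  default     = "demo_dataset"
}

# main.tf - reference them with var.<name>
resource "google_bigquery_dataset" "demo_dataset" {
  dataset_id = var.bq_dataset_name
  location   = var.location
}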
Great intro to terraform - being able to define infrastructure as code, create resources, and destroy them.
The next part was a walkthrough of setting up GCP (cloud VM + SSH access)
First was creating an ssh key locally
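The key generation follows the usual format from the GCP docs (USERNAME is a placeholder):

# creates ~/.ssh/gcp (private key) and ~/.ssh/gcp.pub (the public key that goes into GCP)
ssh-keygen -t rsa -f ~/.ssh/gcp -C USERNAME -b 2048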
And added the public key to the metadata in GCP's Compute Engine (hiding the username just in case)
Then I created a VM and connected to it locally using that ssh key: ssh -i ~/.ssh/gcp username@gcp_vm_external_ip
For a quick connection to the VM, I set up a config which includes Host, HostName, User and IdentityFile, so now I can just run ssh Host
and I am connected to the VM through my terminal. Nice.
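The entry in ~/.ssh/config looks roughly like this (the host alias, IP and username below are placeholders):

Host de-zoomcamp
    HostName <vm external ip>
    User <username>
    IdentityFile ~/.ssh/gcp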
Also set up VS Code to connect to the created VM over ssh.
Then, I installed anaconda.
And docker
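Roughly what those two installs look like on an Ubuntu VM (the anaconda installer version/URL is just an example):

# anaconda: download and run the installer
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
bash Anaconda3-2023.09-0-Linux-x86_64.sh

# docker from the Ubuntu repos
sudo apt-get update
sudo apt-get install docker.io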
Then docker-compose
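For this one I grabbed the standalone binary into a ~/bin directory, something like (the version is an example - check the releases page):

mkdir -p ~/bin
wget https://github.com/docker/compose/releases/download/v2.24.6/docker-compose-linux-x86_64 -O ~/bin/docker-compose
chmod +x ~/bin/docker-compose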
And make it executable from anywhere by adding the below to .bashrc
export PATH="${HOME}/bin:${PATH}"
And now we have it
Then installed pgcli with conda
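Something along the lines of (pgcli is available via conda-forge):

conda install -c conda-forge pgcli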
(random note - I am using a VM from my local terminal like this for the 1st time and it's kind of cool)
And just like before (2 days ago, first part of the data eng zoomcamp), I can run docker-compose up -d and then pgcli to connect to my db.
In VS Code, which is connected to the VM, we can forward the port to the db.
And now when I run pgcli from my own PC’s terminal, I can connect to it too.
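A rough sketch of that flow (the user and database names are examples from that earlier setup):

# on the VM: start the containers defined in docker-compose.yml
docker-compose up -d

# locally, after forwarding port 5432 in VS Code (user/db names are examples)
pgcli -h localhost -p 5432 -u root -d ny_taxi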
By adding port 8080 as well in VS Code, I can now access pgadmin from my browser too (even though it is all running on that GCP VM)
Same for jupyter - added port 8888 in VS Code, then I can run jupyter notebook on the VM and access it in my browser.
Next, I installed terraform for linux
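Roughly like this (the version is an example - check the hashicorp releases page):

wget https://releases.hashicorp.com/terraform/1.7.5/terraform_1.7.5_linux_amd64.zip
sudo apt-get install unzip
unzip terraform_1.7.5_linux_amd64.zip -d ~/bin/
rm terraform_1.7.5_linux_amd64.zip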
When I was setting up terraform, I created a my-creds.json file with the credentials from GCP. Now, using sftp, I transferred the json file from my local machine to the VM (sftp - another tool I am using for the 1st time)
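A sketch of the transfer (de-zoomcamp stands in for whatever Host alias is in the ssh config; the remote .gc directory matches the credentials path used below):

# from my local machine, in the directory that has the json file
sftp de-zoomcamp
mkdir .gc
put my-creds.json .gc/my-creds.json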
And then I could run the same terraform apply and destroy to create and destroy resources.
And if I want to stop and restart the instance, I can do it through the terminal (sudo shutdown now) or the GCP console. When I start it again, the external IP changes, so for the quick ssh connection command to work I need to edit the HostName in the config file I created earlier.
I found that if I restart the VM and want to use terraform, I need to set my credentials and gcloud auth again using the commands below (also just saving them here for later):
export GOOGLE_APPLICATION_CREDENTIALS=~/.gc/my-creds.json
gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
Next onto Module 2: workflow orchestration
My good old friend mage.ai. Let's hope for at least fewer errors than when I covered it in the MLOps zoomcamp.
The first bit was to establish a connection with the postgres database that runs alongside mage in docker-compose.yml
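A sketch of how that can be wired up - assuming a dev profile in mage's io_config.yaml that reads the env vars defined for the postgres service in docker-compose.yml:

dev:
  POSTGRES_CONNECT_TIMEOUT: 10
  POSTGRES_DBNAME: "{{ env_var('POSTGRES_DBNAME') }}"
  POSTGRES_SCHEMA: "{{ env_var('POSTGRES_SCHEMA') }}"
  POSTGRES_USER: "{{ env_var('POSTGRES_USER') }}"
  POSTGRES_PASSWORD: "{{ env_var('POSTGRES_PASSWORD') }}"
  POSTGRES_HOST: "{{ env_var('POSTGRES_HOST') }}"
  POSTGRES_PORT: "{{ env_var('POSTGRES_PORT') }}"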
Creating a block in a new pipeline to test the connection:
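A minimal sketch of such a test block, assuming a trivial query through mage's Postgres IO client and the dev profile from above:

from os import path
from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def test_postgres_connection(*args, **kwargs):
    # a trivial query - if this returns, the connection works
    query = 'SELECT 1;'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'dev'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        return loader.load(query)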
So far so good.
Next is writing a simple ETL pipeline - loading data from an API to postgres: I load taxi data with explicit data types in the first block, do a little preprocessing in the 2nd block, and then connect to my db and load the data there in the 3rd block.
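A sketch of that first block - reading the CSV with explicit dtypes and parsed dates (the URL is a placeholder and the column names are examples from the NYC yellow taxi data):

import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_taxi_data(*args, **kwargs):
    # example: a compressed monthly yellow taxi CSV
    url = 'https://.../yellow_tripdata_2021-01.csv.gz'

    # declaring dtypes up front keeps memory usage down and fails fast on bad data
    taxi_dtypes = {
        'VendorID': pd.Int64Dtype(),
        'passenger_count': pd.Int64Dtype(),
        'trip_distance': float,
        'fare_amount': float,
        'total_amount': float,
    }
    parse_dates = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']

    return pd.read_csv(url, compression='gzip', dtype=taxi_dtypes, parse_dates=parse_dates)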
Next is connecting my gcp service account to mage (using the creds.json file), and a connection is made.
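That boils down to a line like this in io_config.yaml (the path is an example for wherever the json file is mounted inside the mage container):

GOOGLE_SERVICE_ACC_KEY_FILEPATH: "/home/src/my-creds.json"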
Loading data to google cloud storage is very easy - just adding a google cloud storage (GCS) data exporter block, putting my info down, and it's done.
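The exporter block is roughly mage's stock GCS template with the details filled in (the bucket and object names here are placeholders):

from os import path
from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.google_cloud_storage import GoogleCloudStorage

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_gcs(df, **kwargs) -> None:
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    bucket_name = 'my-unique-bucket-name'   # placeholder
    object_key = 'nyc_taxi_data.parquet'    # placeholder

    GoogleCloudStorage.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        bucket_name,
        object_key,
    )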
Can be seen in gcs
However, with larger files we should not load data into a single parquet file. We should partition it.
I learned how to use pyarrow for that (it abstracts the chunking logic), as in the below block:
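Roughly, it partitions by pickup date and writes straight to the bucket via pyarrow's GCS filesystem (the bucket, table and column names, and the creds path, are examples):

import os

import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter

# pyarrow picks up the service account credentials from this env var (example path)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/src/my-creds.json'

bucket_name = 'my-unique-bucket-name'   # placeholder
table_name = 'nyc_taxi_data'            # placeholder
root_path = f'{bucket_name}/{table_name}'


@data_exporter
def export_data(data, *args, **kwargs):
    # partition column: one folder per pickup date
    data['tpep_pickup_date'] = data['tpep_pickup_datetime'].dt.date

    table = pa.Table.from_pandas(data)
    gcs = pa.fs.GcsFileSystem()

    # write_to_dataset handles the chunking into one parquet file per partition
    pq.write_to_dataset(
        table,
        root_path=root_path,
        partition_cols=['tpep_pickup_date'],
        filesystem=gcs,
    )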
And the partitioned data is in gcs now. Awesome
And then loaded it into BigQuery.
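That last step can be another exporter block, e.g. through mage's BigQuery IO client (the table id here is a placeholder):

from os import path
from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df, **kwargs) -> None:
    table_id = 'my-project-id.demo_dataset.nyc_taxi_data'   # placeholder
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table if it already exists
    )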
My experience using mage in this course compared to the MLOps zoomcamp is completely different haha. Now, the teacher Matt Palmer did a great job ^^
That is all for today!
See you tomorrow :)