Hello :) Today is Day 172!
A quick summary of today:
- preprocessed and loaded more audio into the Glaswegian audio dataset on Hugging Face
- learned about Terraform from Data Eng Zoomcamp
As for the Glaswegian dataset
Preprocessing took a bit longer this time because I was cutting up multiple >5 min audio clips and matching them to the transcriptions my Scottish collaborator had written down. I only got through 4 of the 6 clips, so I still have 2 more to do from Limmy - a famous Scottish comedian. The dataset currently sits at 63 minutes of audio in total.
I started fine-tuning whisper-small again, but it turns out the Colab subscription I got (the cheapest one) only includes a limited number of compute units. So right now I am fine-tuning on the free allowance instead. It says it will take about 4 hours… hopefully it finishes before the free TPU hours run out.
As for Terraform
Continuing from yesterday’s videos: the final part of Module 1 of DataTalksClub’s Data Engineering Zoomcamp is an intro to Terraform. Fortunately, I had come across Terraform before - at the Microsoft Azure hackathon - but back then I had no idea what I was running; I was just running it. So today it got (a little bit) clearer.
First, I created a service account in GCP and gave it the access it needs.
Then I created and downloaded a key for it (a JSON key file, used to authenticate later).
Next, I created a main.tf file and used the Google provider to set up Terraform (initialising it with terraform init).
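At this stage my main.tf looked roughly like the sketch below - the project ID and region are placeholders, not my real values:

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  # placeholder values - swap in your own project ID and region
  project = "my-gcp-project-id"
  region  = "europe-west2"

  # alternatively, the provider can be pointed straight at the service
  # account's JSON key instead of using the GOOGLE_APPLICATION_CREDENTIALS
  # environment variable:
  # credentials = file("keys/my-creds.json")
}
```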
After initialising Terraform with my service account, the next step was creating a Cloud Storage bucket resource by adding the below to main.tf and then executing terraform plan.
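The bucket resource is only a few lines. This is a minimal sketch rather than my exact config - the bucket name (which has to be globally unique) and the location are placeholders:

```hcl
resource "google_storage_bucket" "demo_bucket" {
  # placeholder name - GCS bucket names must be globally unique
  name     = "my-unique-demo-bucket"
  location = "EU"

  # lets terraform destroy delete the bucket even if it still holds objects
  force_destroy = true
}
```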
First I needed to set my GOOGLE_APPLICATION_CREDENTIALS env var (pointing at the service account’s JSON key), and then terraform plan worked.
Next, to actually create the bucket, I executed terraform apply, and then I could see the created bucket in GCP.
We can destroy the created bucket with terraform destroy. The main.tf file is in my repo.
That is all for today!
See you tomorrow :)