(Day 172) Learning about Terraform + adding more data to the Glaswegian audio dataset

Ivan Ivanov · June 21, 2024

Hello :) Today is Day 172!

A quick summary of today:

  • preprocessed and loaded more audio into the Glaswegian audio dataset on Hugging Face
  • learned about Terraform from the Data Engineering Zoomcamp

As for the Glaswegian dataset

image

Preprocessing took a bit longer this time because I was cutting several audio clips longer than 5 minutes and matching each cut to the transcription that my Scottish collaborator had written down. I actually got through only 4 of the 6 clips, so I have 2 more to do from Limmy, a famous Scottish comedian. The total audio in the dataset at the moment is 63 minutes.
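
The cutting step can be sketched roughly like this. This is only a minimal illustration, not the exact script I used: it assumes the source audio is a WAV file and that each transcription line comes with start and end times in seconds. The function names (cut_clip, cut_segments) and the metadata keys (file_name, transcription) are made up for the example.

```python
import wave
from pathlib import Path


def cut_clip(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) into a new WAV file."""
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp both ends so a slightly-too-long timestamp does not raise.
        start_frame = min(int(start_s * rate), total)
        end_frame = min(int(end_s * rate), total)
        src.setpos(start_frame)
        frames = src.readframes(max(end_frame - start_frame, 0))
    with wave.open(str(dst_path), "wb") as dst:
        # Same channels / sample width / rate as the source; the frame count
        # in the header is corrected automatically when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)


def cut_segments(src_path, segments, out_dir):
    """Cut one clip per (start_s, end_s, text) tuple and return metadata rows."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for i, (start_s, end_s, text) in enumerate(segments):
        name = f"{Path(src_path).stem}_{i:03d}.wav"
        cut_clip(src_path, out_dir / name, start_s, end_s)
        # Hypothetical metadata layout: one row per clip with its transcription.
        rows.append({"file_name": name, "transcription": text})
    return rows
```

Calling cut_segments("long_clip.wav", [(0.0, 4.2, "first line"), (4.2, 9.0, "second line")], "clips") would write two short WAV files into clips/ and return one row per clip, which can then be saved next to the audio before uploading.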

I started fine-tuning whisper-small again, but it turns out the Colab subscription I got (the cheapest one) includes only a limited number of compute units. So right now I am fine-tuning it on the limited free tier. It says it will take about 4 hours… but hopefully it finishes before the free TPU hours run out.

As for Terraform

Continuing yesterday’s videos. The final part of Module 1 of DataTalksClub’s Data Engineering Zoomcamp is an intro to Terraform. Fortunately, I have met Terraform before, at the Microsoft Azure hackathon, but back then I had no idea what I was running, I was just running it. So today it got (a little bit) clearer.

First, I created a service account in GCP and gave it access

image

Then I created a key for it (under Manage keys)

image

Then I created a main.tf file and used the Google provider to set up Terraform (by running terraform init)

image

The above is after initialising Terraform with my service account.

Next was creating a Cloud Storage bucket resource (by adding the below to main.tf) and then running terraform plan.
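
To give a rough idea of what such a configuration contains, here is a small Python sketch that writes an equivalent config in Terraform's JSON syntax (a main.tf.json file) instead of HCL. This is not my actual main.tf: the resource label demo_bucket, the default region and location, and the force_destroy setting are all assumptions for illustration, and the project ID and bucket name have to be supplied by you.

```python
import json


def build_config(project_id, bucket_name, region="us-central1", location="US"):
    """Build a Terraform config (JSON syntax) with the Google provider and one bucket."""
    return {
        # Tell Terraform which provider to download on `terraform init`.
        "terraform": {
            "required_providers": {"google": {"source": "hashicorp/google"}}
        },
        # Provider settings; credentials are taken from the environment.
        "provider": {"google": {"project": project_id, "region": region}},
        # One Cloud Storage bucket; "demo_bucket" is a hypothetical local label.
        "resource": {
            "google_storage_bucket": {
                "demo_bucket": {
                    "name": bucket_name,
                    "location": location,
                    # Assumption: allow `terraform destroy` to delete a non-empty bucket.
                    "force_destroy": True,
                }
            }
        },
    }


def write_config(config, path="main.tf.json"):
    """Write the config to disk so the terraform CLI can pick it up."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```

Running write_config(build_config("my-project-id", "my-unique-bucket-name")) in an empty folder gives a file that terraform init and terraform plan can read, in the same way as an HCL main.tf.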

terraform plan only worked once I had set my GOOGLE_APPLICATION_CREDENTIALS env var to the path of the service account key file.
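
A quick way to check that env var before running Terraform could look like the sketch below. It is only a sanity check under my own assumptions (the function name check_credentials is made up): it verifies that the variable is set, that it points to a file, and that the file is a JSON key of type service_account.

```python
import json
import os


def check_credentials(env=None):
    """Return None if the credentials env var looks fine, else a short error message."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"no file found at {path}"
    try:
        with open(path) as f:
            key = json.load(f)
    except json.JSONDecodeError:
        return f"{path} is not valid JSON"
    # Service account key files carry a "type" field with this value.
    if not isinstance(key, dict) or key.get("type") != "service_account":
        return f"{path} does not look like a service account key"
    return None


if __name__ == "__main__":
    problem = check_credentials()
    print(problem or "credentials look OK")
```

The check never prints the key contents, only whether the file looks usable.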

image image

Next, to create the bucket, I executed terraform apply, and then I could see the created bucket in GCP

image

We can destroy the created bucket with terraform destroy. The main.tf file is in my repo.

That is all for today!

See you tomorrow :)