Hello :) Today is Day 172!
A quick summary of today:
- preprocessed and loaded more audio into the Glaswegian audio dataset on Hugging Face
- learned about Terraform from Data Eng Zoomcamp
As for the Glaswegian dataset
Preprocessing took a bit longer this time because I was cutting up multiple >5 min audio clips and matching them to the transcriptions my Scottish collaborator had written down. I only got through 4 of the 6 clips, so I still have 2 more to do from Limmy - a famous Scottish comedian. The dataset currently sits at 63 minutes of audio in total.
I started fine-tuning whisper-small again, but it turns out the Colab subscription I got (the cheapest one) only includes a limited number of compute units. So right now I am fine-tuning on the free allowance instead. It says it will take about 4 hours… hopefully it finishes before the free TPU hours run out.
As for Terraform
Continuing from yesterday’s videos: the final part of Module 1 of DataTalksClub’s Data Engineering Zoomcamp is an intro to Terraform. Fortunately, I had come across Terraform before - at the Microsoft Azure hackathon - but back then I had no idea what I was running; I was just running it. So today it got (a little bit) clearer.
First, I created a service account in GCP and gave it the access it needs.
Then I created and downloaded a key for it (a JSON key file, used to authenticate later).
Next, I created a main.tf file and used the Google provider to set up Terraform (initialising it with terraform init).
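At this stage my main.tf looked roughly like the sketch below - the project ID and region are placeholders, not my real values:

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "google" {
  # placeholder values - swap in your own project ID and region
  project = "my-gcp-project-id"
  region  = "europe-west2"

  # alternatively, the provider can be pointed straight at the service
  # account's JSON key instead of using the GOOGLE_APPLICATION_CREDENTIALS
  # environment variable:
  # credentials = file("keys/my-creds.json")
}
```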
After initialising Terraform with my service account, the next step was creating a Cloud Storage bucket resource by adding the below to main.tf and then executing terraform plan.
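The bucket resource is only a few lines. This is a minimal sketch rather than my exact config - the bucket name (which has to be globally unique) and the location are placeholders:

```hcl
resource "google_storage_bucket" "demo_bucket" {
  # placeholder name - GCS bucket names must be globally unique
  name     = "my-unique-demo-bucket"
  location = "EU"

  # lets terraform destroy delete the bucket even if it still holds objects
  force_destroy = true
}
```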
First I needed to set my GOOGLE_APPLICATION_CREDENTIALS env var (pointing at the service account’s JSON key), and then terraform plan worked.
Next, to actually create the bucket, I executed terraform apply, and then I could see the created bucket in GCP.
We can destroy the created bucket with terraform destroy. The main.tf file is in my repo.
That is all for today!
See you tomorrow :)