Hello :) Today is Day 210!
A quick summary of today:
- final audio clip preprocessing to reach our audio dataset target
Final dataset for the Glaswegian voice assistant AI (link to HuggingFace). Today I preprocessed the final audio from 2 of Limmy's YouTube videos (Limmy accidentally kills the city and The writer of Saw called Limmy a …). Just an update on how the process goes now ~
Since our transcription AI is pretty good (according to my Glaswegian-speaking project partner), we pass the full raw audio to our fine-tuned Whisper model hosted on HuggingFace Spaces. The transcript then goes into a docs file, where I first check it over for obvious mistakes and flag anything odd that I can't resolve by re-listening to the audio, and then split it into sensible (small) bits while listening along, like:
(this is the start of Limmy accidentally kills the city)
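For anyone curious what that transcription step looks like in code, here is a minimal local equivalent of what the Space does, assuming the fine-tuned checkpoint lives on the Hub (the model id and file name below are placeholders, not the real ones):

```python
from transformers import pipeline

# Placeholder model id; swap in the actual fine-tuned Whisper checkpoint.
asr = pipeline(
    "automatic-speech-recognition",
    model="my-org/whisper-small-glaswegian",
    chunk_length_s=30,  # long raw audio gets chunked into 30s windows
)

# Transcribe the full raw audio in one call.
result = asr("limmy_kills_the_city.wav")
print(result["text"])
```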
Then, using an audio tool, I cut the full audio into clips matching the split text. I match each clip name to its transcript in Excel, then use Python to get each clip's length and sampling rate. Finally I add static info like gender, age, class, location and speaker id, which gives me a CSV that I upload to the HuggingFace dataset along with the cut audio clips.
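A rough sketch of that Python step (the folder name, metadata values and dataset repo id here are placeholders, not the real ones):

```python
import soundfile as sf
import pandas as pd
from pathlib import Path
from huggingface_hub import HfApi

CLIP_DIR = Path("clips")  # hypothetical folder holding the cut .wav clips

rows = []
for wav in sorted(CLIP_DIR.glob("*.wav")):
    info = sf.info(str(wav))  # reads the header only, no full decode
    rows.append({
        "file_name": wav.name,
        "duration_s": round(info.duration, 2),
        "sampling_rate": info.samplerate,
        # static speaker metadata, identical for every clip in this batch
        "speaker_id": "limmy",
        "gender": "male",
        "location": "Glasgow",
    })

# a metadata.csv next to the clips follows the Hub's audiofolder convention
pd.DataFrame(rows).to_csv(CLIP_DIR / "metadata.csv", index=False)

# push clips + csv to the dataset repo (repo id is a placeholder)
HfApi().upload_folder(folder_path=str(CLIP_DIR),
                      repo_id="my-org/glaswegian-dataset",
                      repo_type="dataset")
```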
Next steps are to train a final Whisper model, and then a Text-To-Speech model using this final dataset. Maybe tomorrow when we go to breakfast I will leave Whisper to fine-tune.
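In case it helps future-me kick that off quickly, a sketch of the data-loading side of the fine-tune, following the standard Hugging Face Whisper recipe (the dataset repo id and the transcript column name are assumptions):

```python
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

# Placeholder repo id; "transcript" as the column name is an assumption too.
ds = load_dataset("my-org/glaswegian-dataset", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

def prepare(batch):
    audio = batch["audio"]
    # log-mel input features for the encoder
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # tokenised transcript as decoder labels
    batch["labels"] = processor.tokenizer(batch["transcript"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)
# from here, the usual Seq2SeqTrainer setup takes over
```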
That is all for today!
See you tomorrow :)