Getting the blog back in order
This site has been down for a good while. The host I was using decided they didn't want to deal with shared hosting anymore, so they sold off all their customers to a third party. The transition didn't go smoothly, so I took the opportunity to move to a better host. The old site ran WordPress, but I never liked it much. I may or may not port the old content over.
2017-12-10
Mozilla's Common Voice project has released data
Get the data here. According to the readme file, the following information is available for each sample. I've also listed out the directory sizes for fun. I've been waiting for Mozilla to release this dataset for a long time. There's a bunch of analysis I want to do with it, and I'll be talking more about that later on.
* filename - relative path of the audio file
* text - supposed transcription of the audio
* up_votes - number of people who said audio matches the text
* down_votes - number of people who said audio does not match text
* age - age of the speaker, if the speaker reported it
* gender - gender of the speaker, if the speaker reported it
* accent - accent of the speaker, if the speaker reported it
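As a rough sketch of how the fields above might be used, the snippet below parses rows with those column names and keeps the samples where more people voted that the audio matches the text than voted against it. It assumes the metadata is stored as CSV with a header row, which the readme excerpt here does not confirm. The sample rows, file paths, and the helper names `load_samples` and `majority_approved` are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up rows for illustration only; they are NOT taken from the dataset.
# The column names follow the readme fields; the CSV layout is an assumption.
SAMPLE_CSV = """filename,text,up_votes,down_votes,age,gender,accent
cv-valid-dev/sample-000000.mp3,hello world,3,0,twenties,female,us
cv-valid-dev/sample-000001.mp3,good morning,1,2,,,
cv-other-dev/sample-000002.mp3,see you later,0,0,,male,
"""


def load_samples(handle):
    """Parse metadata rows, turning vote counts into ints and blanks into None."""
    samples = []
    for row in csv.DictReader(handle):
        row["up_votes"] = int(row["up_votes"])
        row["down_votes"] = int(row["down_votes"])
        # age, gender, and accent are only present if the speaker reported them
        for key in ("age", "gender", "accent"):
            row[key] = row[key] or None
        samples.append(row)
    return samples


def majority_approved(samples):
    """Keep samples with strictly more up votes than down votes."""
    return [s for s in samples if s["up_votes"] > s["down_votes"]]


samples = load_samples(io.StringIO(SAMPLE_CSV))
approved = majority_approved(samples)
print(len(samples), "samples,", len(approved), "approved by majority vote")
```

To run this against real metadata, swap the `io.StringIO` object for an open file handle, provided the actual files match the assumed layout.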
Ashley$ du -d1 -h
156M ./.dat
1.2G ./cv-invalid
108M ./cv-other-dev
105M ./cv-other-test
5.1G ./cv-other-train
154M ./cv-valid-dev
153M ./cv-valid-test
7.3G ./cv-valid-train
14G .