Use Datastores to Manage Audio Data Sets

Deep learning and machine learning models are widely used to process audio signals. Training these models requires large data sets that contain both audio data and labeling information. For example, when training a model to identify spoken commands, the data is a collection of audio files, and the labels are the ground-truth commands for each file. Datastores are useful for working with large collections of data, and the audioDatastore object lets you manage collections of audio files.

This example shows you how to use datastores to manage three different audio data sets. The first data set uses the names of the folders containing the audio files as labels, the second data set uses the file names as labels, and the third data set contains labels in a metadata file. You can then use these datastores to train machine learning or deep learning models on the audio data.

Data With Folder Name Labels

The Google Speech Commands data set [1] contains files with spoken command words stored in folders whose names are the word labels. Download and extract the data set.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");

Create an audioDatastore that points to the training data.

ads = audioDatastore(fullfile(dataset,"train"),IncludeSubfolders=true);

Extract the labels for each file from the folder names using the folders2labels function. Use countlabels to view the distribution of labels.

labels = folders2labels(ads);
countlabels(labels)
ans=30×3 table
    Label     Count    Percent
    ______    _____    _______

    bed       1340     2.6229 
    bird      1411     2.7619 
    cat       1399     2.7384 
    dog       1396     2.7325 
    down      1842     3.6055 
    eight     1852     3.6251 
    five      1844     3.6095 
    four      1839     3.5997 
    go        1861     3.6427 
    happy     1373     2.6875 
    house     1427     2.7932 
    left      1839     3.5997 
    marvin    1424     2.7873 
    nine      1875     3.6701 
    no        1853     3.6271 
    off       1839     3.5997 
      ⋮

Use combine to create a CombinedDatastore object from the audio data and the labels. Each call to read on the combined datastore returns the next audio signal and its label.

lds = arrayDatastore(labels);
cds = combine(ads,lds);

You can create a separate datastore for validation data by repeating the same steps after creating an audioDatastore that instead points to the validation subfolder of the data set. Alternatively, you can use splitlabels to separate an existing datastore into training and validation sets. Specify UnderlyingDatastoreIndex to indicate which of the underlying datastores in the combined datastore contains the labels.

idxs = splitlabels(cds,0.8,"randomized",UnderlyingDatastoreIndex=2);
trainDs = subset(cds,idxs{1});
valDs = subset(cds,idxs{2});
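
As a quick sanity check, you can confirm that the randomized split preserved the per-label proportions. This is a minimal sketch, assuming the trainDs and valDs datastores created above:

```matlab
% Sketch: count labels in each subset. With a 0.8 split, each label
% should have roughly 80% of its files in the training set.
trainCounts = countlabels(trainDs,UnderlyingDatastoreIndex=2);
valCounts = countlabels(valDs,UnderlyingDatastoreIndex=2);
head(trainCounts)
head(valCounts)
```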

Call read on the training datastore. The function returns the audio signal and its label in a 1-by-2 cell array.

read(trainDs)
ans=1×2 cell array
    {14861×1 double}    {[bed]}
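
Because each read advances through the data set, you can iterate over the whole training set in a loop. A minimal sketch, assuming trainDs was created as shown above:

```matlab
% Sketch: loop over the training datastore one example at a time.
% reset returns the datastore to the first example, and hasdata is
% true while unread examples remain.
reset(trainDs)
while hasdata(trainDs)
    data = read(trainDs);  % 1-by-2 cell array: {audio, label}
    audioIn = data{1};
    label = data{2};
    % ... extract features or train on (audioIn, label) here ...
end
```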

Data With File Name Labels

The Free Spoken Digit Dataset (FSDD) [2] contains recordings of spoken digits in files whose names contain the digit labels as well as speaker labels. Download the data set and create an audioDatastore that points to the data.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD","recordings");
ads = audioDatastore(dataset);

Select a random file from the data set and display its name. The file name is formatted as digitLabel_speakerName_index.

[~,name] = fileparts(ads.Files{randi(length(ads.Files))})
name = 
'1_jackson_45'

Use filenames2labels to extract the digit labels from the file names. Combine the labels with the audio into a CombinedDatastore and see the label distribution of the data set.

labels = filenames2labels(ads,ExtractBefore="_");
lds = arrayDatastore(labels);
cds = combine(ads,lds);

countlabels(cds,UnderlyingDatastoreIndex=2)
ans=10×3 table
    Label    Count    Percent
    _____    _____    _______

      0       200       10   
      1       200       10   
      2       200       10   
      3       200       10   
      4       200       10   
      5       200       10   
      6       200       10   
      7       200       10   
      8       200       10   
      9       200       10   
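
Because the middle token of each file name is the speaker, you can derive speaker labels in a similar way, for example to hold out one speaker for validation. This is a sketch using extractBetween, assuming every file follows the digitLabel_speakerName_index pattern:

```matlab
% Sketch: pull the text between the two underscores of each file name
% to get the speaker for every recording.
[~,names] = fileparts(ads.Files);
speakers = categorical(extractBetween(string(names),"_","_"));
summary(speakers)
```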

Data With Metadata File

The Mozilla Common Voice data set [3] contains recordings of subjects speaking short sentences. The data set has a metadata file with various labels including sentence transcriptions and speaker IDs. Download the data set and create an audioDatastore that points to the training data.

downloadFolder = matlab.internal.examples.downloadSupportFile("audio","commonvoice.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"commonvoice","train");
ads = audioDatastore(fullfile(dataset,"clips"));

Read the metadata file into a table.

metadata = readtable(fullfile(dataset,"train.tsv"),FileType="text");

Assert that the order of the files in the datastore matches the order of the rows in the table. This check ensures that you can associate each row of metadata with the corresponding file in the datastore.

[~,adsFilenames,~] = fileparts(ads.Files);
assert(length(adsFilenames)==length(metadata.path))
assert(all(strcmp(adsFilenames,metadata.path)))
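
If the orders did not match, you could instead reorder the metadata table to follow the datastore. A hedged sketch, assuming every datastore file appears exactly once in the table's path column:

```matlab
% Sketch: look up each datastore file name in the table's path column
% and reorder the metadata rows to match the datastore order.
[tf,loc] = ismember(adsFilenames,metadata.path);
assert(all(tf))            % every file must have a metadata row
metadata = metadata(loc,:);
```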

Create a CombinedDatastore using the transcribed sentences as labels.

sentences = arrayDatastore(string(metadata.sentence));
transcriptDs = combine(ads,sentences);

Create another CombinedDatastore with speaker IDs as labels. For readability, rename the speaker ID categories to sequential natural numbers.

speakerLabels = categorical(metadata.client_id);
speakerIDs = string(1:length(categories(speakerLabels)));
speakerLabels = renamecats(speakerLabels,speakerIDs);

labelsDs = arrayDatastore(speakerLabels);
speakerDs = combine(ads,labelsDs);
countlabels(speakerDs,UnderlyingDatastoreIndex=2)
ans=595×3 table
    Label    Count    Percent
    _____    _____    _______

     1         1       0.05  
     10        1       0.05  
     100       3       0.15  
     101       4        0.2  
     102      36        1.8  
     103       4        0.2  
     104       1       0.05  
     105       2        0.1  
     106       4        0.2  
     107       1       0.05  
     108       1       0.05  
     109       1       0.05  
     11        4        0.2  
     110       1       0.05  
     111       1       0.05  
     112      10        0.5  
      ⋮

Next Steps

You can now use these datastores to train deep learning or machine learning models. Use read and readall to access the data and labels, and use transform to create a new datastore that applies feature extraction to the audio data.
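
For example, this sketch uses transform to replace each raw signal with mel spectrogram features, assuming a combined datastore cds as created earlier. The 16 kHz sample rate is an assumption for illustration, and melSpectrogram requires Audio Toolbox:

```matlab
% Sketch: wrap the combined datastore so that each read returns
% {melSpectrogramFeatures, label} instead of {audio, label}.
fs = 16000;  % assumed sample rate of the recordings
extractFcn = @(data) {melSpectrogram(data{1},fs), data{2}};
featureDs = transform(cds,extractFcn);
featuresAndLabel = read(featureDs);
```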

References

[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/by/4.0/legalcode.

[2] Zohar Jackson, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite. “Jakobovski/free-spoken-digit-dataset: V1.0.8”. Zenodo, August 9, 2018. https://doi.org/10.5281/zenodo.1342401.

[3] Mozilla Common Voice. https://commonvoice.mozilla.org/en.
