Manage Data Sets for Machine Learning and Deep Learning Workflows
Use MATLAB® and Signal Processing Toolbox™ functionality to create a successful artificial intelligence (AI) workflow from labeling to training to deployment.
Common AI Tasks
Common AI tasks are signal classification, sequence-to-sequence classification, and regression. An AI model predicts:
- For signal classification — A discrete class label for each input signal 
- For sequence-to-sequence classification — A label for each time step of the sequence data 
- For regression — A continuous numeric value 
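As a rough sketch of what these three kinds of prediction target look like in MATLAB, the snippet below builds one response of each type for a single synthetic signal. The variable names, the class name "a", and the choice of RMS amplitude as the regression target are assumptions made only for illustration.

```matlab
% Synthetic 1000-sample signal (illustrative data only)
x = randn(1000,1);

% Signal classification: one discrete class label for the whole signal
classLabel = categorical("a");

% Sequence-to-sequence classification: one label per time step
seqLabels = repmat(categorical("a"),numel(x),1);

% Regression: one continuous numeric value (here, the RMS amplitude)
regTarget = sqrt(mean(x.^2));
```

The shape of the response, rather than the signal itself, is what distinguishes the three tasks.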
Data Organization
For many machine learning and deep learning applications, data sets are large and consist of both signal and label variables. Based on how your data set is organized, you can use datastores and functions in MATLAB and Signal Processing Toolbox to manage your data.
There are various methods to collect and store data that influence how you can access it in a workflow. In the data preparation stage, you might come across one or more of these common questions:
- How do I organize my data? 
- How do I access data for training? 
- How do I create labels? 
- How do I combine signal and label data? 
This table provides different data organization scenarios and shows you how to create datastores that correspond to these scenarios, so that you can access and prepare your data for your workflow.
| Data Organization | Task | Related Datastore | Example | 
|---|---|---|---|
| Signal and label variables stored separately in memory | 
 | Consider a data set consisting of signals stored in matrix sig and labels stored in lbls. Create an array datastore for each variable.
ads1 = arrayDatastore(sig);
ads2 = arrayDatastore(lbls);
Use the combine function to combine the two datastores.
cds = combine(ads1,ads2);
Determine the count of each label in the data set. Specify the underlying datastore index to count the labels in the second datastore, ads2.
cnt = countlabels(cds,UnderlyingDatastoreIndex=2)
cnt =
  4×3 table
    Label    Count    Percent
    _____    _____    _______
      a       20        25   
      b       20        25   
      c       20        25   
      d       20        25   
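Use the splitlabels function to split the data set into training, validation, and testing sets that contain 70%, 20%, and the remaining 10% of the data, respectively.
idxs = splitlabels(cds,[0.7 0.2],"randomized",UnderlyingDatastoreIndex=2);
trainDs = subset(cds,idxs{1});
valDs = subset(cds,idxs{2});
testDs = subset(cds,idxs{3});
Count the number of labels in the training subset datastore.
trainCnt = countlabels(trainDs,UnderlyingDatastoreIndex=2)
trainCnt =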
  4×3 table
    Label    Count    Percent
    _____    _____    _______
      a       14        25   
      b       14        25   
      c       14        25   
      d       14        25    | |
| Signal and label variables stored in separate MAT files | 
 | Consider a data set consisting of two sets of MAT files. The first set contains signal data and the second set contains corresponding labels. All files are saved in the same folder and have either "signal" or "label" in their filenames. Create a signal datastore that points to the folder.
sds = signalDatastore(datasetFolder);
Use the subset function to separate the signal files from the label files.
sigds = subset(sds,contains(sds.Files,"signal"));
lblds = subset(sds,contains(sds.Files,"label"));
Read the label data into memory. Convert the labels to a categorical array with categories a, b, and c.
labeldata = readall(lblds);
lblcat = categorical(labeldata,{'a' 'b' 'c'});
Create an array datastore for the labels and combine it with the signal datastore.
ads = arrayDatastore(lblcat);
allds = combine(sigds,ads);
Preview the first signal and the corresponding label in the datastore.
preview(allds)
ans =
  1×2 cell array
    {1000×1 double}    {[a]}
Note
A datastore parses files alphabetically. To ensure that signal variables and label variables stored in separate files are paired correctly, use a matching identifier for corresponding filenames. | |
| Signal and label variables stored in a single MAT file | 
 | Consider a data set consisting of MAT files that contain both signal (sig) and label (lbl) variables. Create a signal datastore that reads both variables from each file.
sds = signalDatastore(datasetFolder,IncludeSubfolders=true, ...
    SignalVariableNames=["sig" "lbl"]);
Read the first pair of signal and label data.
read(sds)
ans =
  2×1 cell array
    {225000×1 double}
    {225000×1 categorical}
Divide the data at random into training and testing sets. Use 80% of the data to train the network and 20% of the data to test the network. Set the validation ratio to 0 so that no files are assigned to a validation set.
[trainIdx,~,testIdx] = dividerand(numel(sds.Files),0.8,0,0.2);
trainds = subset(sds,trainIdx);
testds = subset(sds,testIdx); | |
| Signals stored in MAT files and labels stored in memory | 
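 | Consider a data set consisting of signals stored in MAT files in location folder and labels stored in memory in lbls. Create a signal datastore for the files and an array datastore for the labels.
sds = signalDatastore(folder);
ads = arrayDatastore(lbls);
Use the combine function to combine the two datastores.
cds = combine(sds,ads)
cds = 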
  CombinedDatastore with properties:
      UnderlyingDatastores: {[1×1 signalDatastore]  [1×1 matlab.io.datastore.ArrayDatastore]}
    SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"    …    ] | |
| Signals stored in MAT files saved in folders containing label names | 
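 | Consider a data set consisting of signals stored in MAT files. The files are saved in folders, and each folder name corresponds to a label. Create a signal datastore that points to the location of the files.
sds = signalDatastore(location);
Use the folders2labels function to create a list of labels based on the folder names, and store the labels in an array datastore.
lbls = folders2labels(location,FileExtensions=".mat");
ads = arrayDatastore(lbls);
Combine the signal datastore and the array datastore using the combine function.
cds = combine(sds,ads); | |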
| Signals stored in MAT files and region-of-interest (ROI) limits stored in separate MAT files | 
 | Consider a data set consisting of MAT files that contain signal data and other MAT files that contain label data. The label data is stored as region-of-interest tables that define a label value for different signal regions. Create two separate datastores to consume the data.
sds1 = signalDatastore(FileLocation1,SampleRate=fs);
sds2 = signalDatastore(FileLocation2, ...
    SignalVariableNames=["LabelVals";"LabelROIs"]);
Convert the ROI limits and labels to a categorical sequence that you can use to train a model.
i = 1;
while hasdata(sds1)
    signal = read(sds1);
    label = read(sds2);
    % Convert label values to categorical vector
    labelCats = categorical(label{2,1}.Value,{'a' 'b' 'c' 'd'});
    % Convert label values and ROI limits to table for signalMask input
    roiTable = table(label{2,1}.ROILimits,labelCats);
    m = signalMask(roiTable);
    % Obtain categorical sequence mask
    mask = catmask(m,length(signal));
    lbls{i} = mask;
    i = i+1;
end
% Store categorical sequence mask in array datastore
ads = arrayDatastore(lbls,IterationDimension=2);
Combine the signal datastore and the array datastore.
sds4 = combine(sds1,ads); | |
| Labeled signal set containing signal and label data | 
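 | Consider a labeled signal set lss. Get the names of the labels in the set.
lblnames = getLabelNames(lss)
lblnames = 3×1 string
    "WhaleType"
    "MoanRegions"
    "TrillRegions"
Use the createDatastores function to create a signal datastore and an array datastore from the labeled signal set.
[sds,ads] = createDatastores(lss,lblnames)
sds = 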
  signalDatastore with properties:
    MemberNames:{
                'Whale1';
                'Whale2'
                }
       Members: {2×1 cell}
      ReadSize: 1
    SampleRate: 4000
ads = 
  ArrayDatastore with properties:
              ReadSize: 1
    IterationDimension: 1
            OutputType: "cell" | |
| Input and output signals stored in the same MAT file | 
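 | Consider a data set consisting of MAT files stored in folder. Each file contains an input signal xIn and an output signal xOut. Create a signal datastore that reads both variables.
sds = signalDatastore(folder,SignalVariableNames=["xIn" "xOut"]);
Consider a different data set consisting of MAT files stored in location. Each file contains the input variables a, b, and c and the output variables d and e. Create one signal datastore for the inputs and another for the outputs.
inDs = signalDatastore(location,SignalVariableNames=["a" "b" "c"]);
outDs = signalDatastore(location,SignalVariableNames=["d" "e"]); | 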
When your data is ready, you can use the trainnet (Deep Learning Toolbox) function to train a neural network. Common functions that you can use for network training, like trainnet or minibatchqueue (Deep Learning Toolbox), accept datastores as an input for training data and responses.
net = trainnet(ds,...)
Note
When data is stored in memory, you can input a cell array directly to the trainnet function. If you need to transform in-memory data before training, use a TransformedDatastore.
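A minimal sketch of that idea, assuming the in-memory signals are stored as rows of a small matrix: wrap the data in an array datastore and attach a transformation that runs on each read. The matrix X and the mean-removal step are illustrative choices, not a required preprocessing step.

```matlab
% Illustrative in-memory data: 4 signals stored as rows of a matrix
X = [1 2 3; 4 5 6; 7 8 9; 10 11 12];

% Array datastore; each read returns a 1-by-1 cell holding one row
ads = arrayDatastore(X);

% Transformation applied on each read: remove the mean of the signal.
% The input c is the cell returned by read, so unwrap and rewrap it.
removeMean = @(c) {c{1} - mean(c{1})};
tds = transform(ads,removeMean);

% Read all transformed signals into a 4-by-1 cell array
centered = readall(tds);
```

The transform function returns a TransformedDatastore, so the transformation is deferred until the data is read.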
Data Preprocessing
Some workflows require you to preprocess the data before feeding it to a network. For example, you can resample, resize, or filter signals before or during training. You can precompute features or use datastore transformations to prepare the data for training.
Example: Compute Fourier synchrosqueezed transform (FSST)
Calculate the FSST of each signal in datastore ds.
fsstDs = transform(ds,@fsst);
The transformed data fits in memory. Use the readall function to read all of the data from the TransformedDatastore into memory so that the FSST computations are performed only once during the training step.
transformedData = readall(fsstDs);
Example: Extract time-frequency features from signal data
Obtain the short-time Fourier transform (STFT) of each signal in datastore ds. Call the transform function to compute the STFT and then use the writeall function to write the output to disk.
tds = transform(ds,@stft);
writeall(tds,outputLocation);
Create a new datastore that points to the out-of-memory features.
ds = signalDatastore(outputLocation);
Example: Extract spectral skewness and time-frequency ridges from signal data
Create a datastore that points to a location that contains signal data files. The sample rate is 1000 Hz.
Fs = 1000;
sds = signalDatastore(datasetFolder,IncludeSubfolders=true);
Create a signalTimeFrequencyFeatureExtractor object defining a sample rate. Enable the spectral skewness and time-frequency ridges as features to extract.
tfFE = signalTimeFrequencyFeatureExtractor(SampleRate=Fs, ...
    SpectralSkewness=true,TFRidges=true);
Call the extract function to extract the specified features from each file.
numDataFiles = length(sds.Files);
M = cell(numDataFiles,1);
for i = 1:numDataFiles
    data = read(sds);
    [M{i},infoFeatures] = extract(tfFE,data);
end
Features = cell2mat(M);
Example: Filter and downsample signal data and downsample label data with custom preprocessing function
Create a datastore that points to a location containing both signal data files and label data files.
sds = signalDatastore(location,SignalVariableNames=["data" "labels"]);
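Define a custom preprocessing function that bandpass-filters and downsamples the signal data and downsamples the label data by the same factor.
function dataOut = downsampleData(dataIn)
    % Unpack the signal and label data read from one file
    sig = dataIn{1};
    lbls = dataIn{2};
    % Bandpass-filter the signal from 10 Hz to 400 Hz (sample rate 3000 Hz)
    filtsig = bandpass(sig,[10 400],3000);
    % Keep every third sample of the signal and of the labels
    downsig = downsample(filtsig,3);
    downlbls = downsample(lbls,3);
    dataOut = [downsig,downlbls];
end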
Call transform on sds to apply the custom preprocessing function to each file.
tds = transform(sds,@downsampleData);
For more information about preprocessing in deep learning workflows, see Preprocess Data for Domain-Specific Deep Learning Applications (Deep Learning Toolbox).
Workflow Scenarios
A general workflow for any machine learning or deep learning task involves these steps:
- Data preparation 
- Network training 
- Model deployment 
This table shows examples and functions you can use to go from preparing data to training a network for signal classification tasks.
| Example | Data | Related Functions | Highlights | 
|---|---|---|---|
| Spoken Digit Recognition with Custom Log Spectrogram Layer and Deep Learning | | | Predict labels for audio recordings using deep convolutional neural network (DCNN) and custom log spectrogram layer |
| Hand Gesture Classification Using Radar Signals and Deep Learning | | | Preprocess signals using custom functions and train multiple-input single-output convolutional neural network (CNN) |
| Train Spoken Digit Recognition Network Using Out-of-Memory Features | | | Predict labels for audio recordings using a network trained on mel-frequency spectrograms |
This table shows examples and functions you can use to go from preparing data to training a network for sequence-to-sequence classification tasks.
| Example | Data | Related Functions | Highlights | 
|---|---|---|---|
| Waveform Segmentation Using Deep Learning | | | Segment regions of interest in signals |
| Classify Arm Motions Using EMG Signals and Deep Learning | | | Classify signal ROIs |
This table shows examples and functions you can use to go from preparing data to training a network for regression tasks.
| Example | Data | Related Functions | Highlights | 
|---|---|---|---|
| Denoise EEG Signals Using Differentiable Signal Processing Layers | | | Denoise signals using regression model |
Tip
Use the read, readall, and writeall functions to read data in a datastore or write data from a datastore to files.
- read: Use this function to read data iteratively from a datastore that contains file data or in-memory data.
- readall: Use this function to read all the data in a datastore at once when the data set fits in memory. If the data set is too large to fit in memory, you can transform the data at each training epoch or use the writeall function to store the transformed data that you can then read using a signalDatastore.
- writeall: Use this function to write preprocessed data that does not fit in memory to files. You can then create a new datastore that points to the location of the output files.
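The difference between iterative and all-at-once reading can be sketched with a small in-memory datastore. The three-row matrix is illustrative data only. The writeall function is left out here because it writes files to disk.

```matlab
% Small in-memory datastore with three one-row "signals"
ds = arrayDatastore([1 2; 3 4; 5 6]);

% read: iterate over the datastore one observation at a time
numRead = 0;
while hasdata(ds)
    obs = read(ds);          % 1-by-1 cell holding one row
    numRead = numRead + 1;
end

% reset: return to the start so that read can iterate again
reset(ds);
first = read(ds);

% readall: load every observation at once (only if the data fits in memory)
allObs = readall(ds);
```

Iterating with read keeps only one observation in memory at a time, which is why it suits data sets that readall cannot hold.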
Available Data Sets
There are several data sets readily available for use in an AI workflow:
- QT Database: 210 ECG signals with region labels. Available for download at https://www.mathworks.com/supportfiles/SPT/data/QTDatabaseECGData.zip.
- EEGdenoiseNet: 4514 clean EEG segments and 3400 ocular artifact segments. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/EEGEOGDenoisingData.zip.
- UWB-gestures: 96 multichannel UWB impulse radar signals. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/uwb-gestures.zip.
- Myoelectric Data: 720 multichannel EMG signals with region labels. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/MyoelectricData.zip.
- Mendeley Data: 327 accelerometer signals with class labels. Available for download at https://ssd.mathworks.com/supportfiles/wavelet/crackDetection/transverse_crack.zip.
For additional data sets, see Time Series and Signal Data Sets (Deep Learning Toolbox).
See Also
Topics
- Datastores for Deep Learning (Deep Learning Toolbox)
- Signal Processing Applications (Deep Learning Toolbox)
- Sequence Classification Using Deep Learning (Deep Learning Toolbox)
- Sequence-to-Sequence Classification Using Deep Learning (Deep Learning Toolbox)
- Sequence-to-One Regression Using Deep Learning (Deep Learning Toolbox)