In this example, you create a logical mask for an audio signal where ones correspond to the utterance "yes" and zeros correspond to the absence of the utterance "yes". To create the mask, you use the IBM™ speech-to-text API through the Audio Labeler app.
This example requires that you install the Speech-to-Text Transcription functionality.
Listen to the audio file that you want to label and then visualize it in the time domain.
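The listen-and-plot step can be sketched as follows. This is a minimal sketch that assumes the file is on the MATLAB path; the variable names audioIn, fs, and t are illustrative choices, not names fixed by the app.

```matlab
% Read the audio samples and the sample rate from the file
[audioIn,fs] = audioread('KeywordSpeech-16-16-mono-34secs.flac');

% Play the signal through the default audio output device
sound(audioIn,fs)

% Build a time axis in seconds, one entry per sample
t = (0:size(audioIn,1)-1)/fs;

% Plot the waveform in the time domain
plot(t,audioIn)
xlabel('Time (s)')
ylabel('Amplitude')
```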
Open the Audio Labeler app and load the KeywordSpeech-16-16-mono-34secs.flac file into the Data Browser.
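One way to open the app is from the command line. This assumes a release in which the audioLabeler command is available; you can also open the app from the Apps tab.

```matlab
% Open the Audio Labeler app (assumes Audio Toolbox is installed)
audioLabeler
```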
Under Automation, click Speech to Text. On the Speech to Text tab, select your preferred speech-to-text API. This example uses the IBM speech-to-text API. Select Segment Words so that the text labels are divided into individual words instead of sentences. Click Run to interface with the speech-to-text API and create a new region of interest (ROI) label. The ROI label contains words detected and labeled by IBM's speech-to-text API.
Close the Speech to Text tab and then export the labeled signal set to the workspace.
The labels are exported to the workspace as a labeledSignalSet object whose variable name includes a time stamp. Set the variable labeledSet to that time-stamped labeledSignalSet object.
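The assignment might look like the sketch below. The name labeledSet_104620 is a placeholder: the time stamp in your workspace will differ, and the labeledSet_ prefix is assumed here, so check the actual name first.

```matlab
% List workspace variables whose names start with "labeledSet_"
% (assumed prefix; check the Workspace browser if nothing is listed)
who labeledSet_*

% Placeholder name: replace labeledSet_104620 with the variable
% that the app exported to your workspace
labeledSet = labeledSet_104620;
```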
Inspect the SpeechContent label.
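A sketch of the inspection step, assuming labeledSet holds a single labeled signal and that the ROI label is named SpeechContent as in the text. Omitting the semicolon displays the table.

```matlab
% The Labels property is a table with one row per signal; each ROI label
% is stored in a cell that contains a table of limits and values
speechContent = labeledSet.Labels.SpeechContent{1}
```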
speechContent=52×2 table
     ROILimits        Value
    ____________    _________
    0.87    1.31    "first"
    1.31    1.41    "you"
    1.41    1.63    "said"
    1.63    2.22    "yes"
    2.25    2.52    "then"
    2.52    3.03    "no"
    3.09    3.22    "and"
    3.22    3.32    "you"
    3.32    3.52    "said"
    3.52    3.94    "yes"
    3.94    4.16    "then"
    4.16    4.66    "no"
    4.83    5.39    "yes"
    5.42    5.57    "the"
    5.57    6.07    "no"
    6.15    6.56    "driving"
      ⋮
The speech-to-text API returns the limits of the ROI labels in seconds. Use the SpeechContent table to create a logical vector that is true for the samples where "yes" is spoken and false elsewhere.
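One possible way to build the mask is sketched below. It assumes speechContent is in the workspace from the inspection step; the names keyword, isKeyword, limits, idx, and mask are illustrative. Because the limits are in seconds, the sketch multiplies them by the sample rate and rounds to get sample indices.

```matlab
% Read the signal again so that its length and sample rate are available
[audioIn,fs] = audioread('KeywordSpeech-16-16-mono-34secs.flac');

% Find the rows whose label matches the keyword (case-insensitive)
keyword = "yes";
isKeyword = strcmpi(speechContent.Value,keyword);

% ROI limits for the keyword, in seconds: one row per utterance
limits = speechContent.ROILimits(isKeyword,:);

% Convert seconds to sample indices and clamp them to the valid range
idx = round(limits*fs);
idx = min(max(idx,1),size(audioIn,1));

% Start with an all-false mask and mark each keyword region as true
mask = false(size(audioIn));
for k = 1:size(idx,1)
    mask(idx(k,1):idx(k,2)) = true;
end
```

Clamping the indices guards against a limit that rounds to sample 0 or falls past the end of the signal.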
Plot the speech signal and the keyword spotting mask.
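A sketch of the plot, assuming audioIn, fs, and a logical mask of the same length as audioIn are already in the workspace. The mask is converted to double before plotting so that it is drawn as a 0/1 trace over the waveform.

```matlab
% Time axis in seconds, one entry per sample
t = (0:numel(audioIn)-1)/fs;

% Overlay the keyword mask on the speech signal
figure
plot(t,audioIn,t,double(mask))
xlabel('Time (s)')
legend('Speech signal','Keyword mask')
```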