OpenL3

OpenL3 embeddings extraction network

  • Library: Audio Toolbox / Deep Learning

Description

The OpenL3 block uses a pretrained convolutional neural network to extract feature embeddings from audio signals. You can use these embeddings as input features for tasks such as classification. This block requires Deep Learning Toolbox™.

Ports

Input

Spectrograms generated from audio, specified as an N-by-M matrix or an N-by-M-by-1-by-K array. K is the number of spectrograms, and N-by-M is the size of each spectrogram, which depends on the value of the Spectrum type parameter.

  • Mel (128 bands) –– The network accepts mel spectrograms of size 128-by-199, where 128 is the number of mel bands, and 199 is the number of time hops.

  • Mel (256 bands) –– The network accepts mel spectrograms of size 256-by-199, where 256 is the number of mel bands, and 199 is the number of time hops.

  • Linear –– The network accepts positive one-sided spectrograms of size 257-by-197, where 257 is the number of frequency bins in the one-sided spectrum, and 197 is the number of time hops.
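
The size rules above can be sketched as a small shape check. This is an illustrative Python sketch, not part of the block or of any MathWorks API: the names EXPECTED_SIZE and count_spectrograms are hypothetical, and the sizes are copied from the list above.

```python
# Expected spectrogram size (N, M) for each Spectrum type setting,
# copied from the list above. Both names below are hypothetical
# helpers used only for illustration.
EXPECTED_SIZE = {
    "Mel (128 bands)": (128, 199),
    "Mel (256 bands)": (256, 199),
    "Linear": (257, 197),
}


def count_spectrograms(shape, spectrum_type):
    """Return K for an input of the given shape, or raise ValueError.

    Accepts (N, M) for a single spectrogram or (N, M, 1, K) for a batch.
    """
    n, m = EXPECTED_SIZE[spectrum_type]
    shape = tuple(shape)
    if shape == (n, m):
        return 1
    if len(shape) == 4 and shape[:3] == (n, m, 1):
        return shape[3]
    raise ValueError(
        f"{spectrum_type} expects {n}-by-{m} or {n}-by-{m}-by-1-by-K, got {shape}"
    )


print(count_spectrograms((128, 199), "Mel (128 bands)"))  # single spectrogram
print(count_spectrograms((257, 197, 1, 8), "Linear"))     # batch of 8
```

A check like this, run before the signal reaches the block, can catch a mismatch between the preprocessing settings and the Spectrum type parameter early.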

Data Types: single | double

Output

Output embeddings, returned as a K-by-L matrix, where K is the number of input spectrograms, and L is specified by the Embedding length parameter.

Data Types: single
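
The output size follows directly from K and L. As a rough sketch (Python used only for illustration; the function names are hypothetical and not part of any toolbox), the shape and memory footprint of the single-precision output can be estimated as:

```python
BYTES_PER_SINGLE = 4  # a single-precision value occupies 4 bytes

VALID_EMBEDDING_LENGTHS = (512, 6144)  # options of the Embedding length parameter


def embedding_output_size(num_spectrograms, embedding_length):
    """Return the (K, L) size of the output embedding matrix."""
    if embedding_length not in VALID_EMBEDDING_LENGTHS:
        raise ValueError("Embedding length must be 512 or 6144")
    return (num_spectrograms, embedding_length)


def embedding_output_bytes(num_spectrograms, embedding_length):
    """Return the storage needed for the K-by-L single-precision output."""
    k, l = embedding_output_size(num_spectrograms, embedding_length)
    return k * l * BYTES_PER_SINGLE


print(embedding_output_size(8, 512))    # (8, 512)
print(embedding_output_bytes(8, 512))   # 16384
print(embedding_output_bytes(8, 6144))  # 196608
```

For the same number of spectrograms, the 6144-element embedding takes twelve times the storage of the 512-element embedding.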

Parameters

Type of spectrum generated from audio and used as input to the neural network, specified as Mel (128 bands), Mel (256 bands), or Linear. This parameter determines the required size of the input at Port_1.

Type of audio content the neural network was trained on, specified as Environmental sounds or Musical sounds. Set this parameter to Environmental sounds to use a neural network pretrained on environmental audio data, and set it to Musical sounds to use a network pretrained on musical data.

Length of output embedding, specified as 512 or 6144.

Size of mini-batches to use for prediction, specified as a positive integer. Larger mini-batch sizes require more memory but can lead to faster predictions.
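
To illustrate the trade-off, the sketch below splits K spectrograms into mini-batches. It assumes the usual meaning of mini-batch prediction (inputs processed in consecutive groups of at most the mini-batch size); how the block schedules batches internally is not stated here, and mini_batch_ranges is a hypothetical helper, not a toolbox function.

```python
def mini_batch_ranges(num_spectrograms, mini_batch_size):
    """Return (start, stop) index pairs, one per mini-batch.

    Each pair covers spectrograms start..stop-1; the last batch may be
    smaller than mini_batch_size.
    """
    if mini_batch_size < 1:
        raise ValueError("mini-batch size must be a positive integer")
    return [
        (start, min(start + mini_batch_size, num_spectrograms))
        for start in range(0, num_spectrograms, mini_batch_size)
    ]


# 10 spectrograms with a mini-batch size of 4 need three prediction passes.
print(mini_batch_ranges(10, 4))   # [(0, 4), (4, 8), (8, 10)]
# A mini-batch size of 16 handles all 10 in one pass, at a higher memory cost.
print(mini_batch_ranges(10, 16))  # [(0, 10)]
```

Under this assumption, a larger mini-batch size means fewer passes through the network, while each pass holds more spectrograms in memory at once.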

Block Characteristics

Data Types: double | single

Direct Feedthrough: no

Multidimensional Signals: no

Variable-Size Signals: no

Zero-Crossing Detection: no

References

[1] Cramer, Jason, et al. "Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings." In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 3852-56. https://doi.org/10.1109/ICASSP.2019.8682475.

Version History

Introduced in R2022b