Machine learning: extracting calls from an audio file

Question

ytzhak goussha, 8 October 2018

Direct link to this question

https://jp.mathworks.com/matlabcentral/answers/422878-machine-learning-extracting-calls-from-an-audio-file

I'll explain how I currently work and then describe how I would like to work. I'm working on a project in which I record animal vocalizations in order to study them and extract parameters such as dominant frequency, call duration, and call rate. We record 10 minutes every hour for 12 hours at high resolution (250k samples per second) using Spike2. The animals communicate with "calls", and each call is made up of syllables. Each recording has about 20 calls; each call is 1-20 seconds long and contains 3-20 syllables, and each syllable is about 50 milliseconds long.

The files are then imported into MATLAB in chunks of 10-100 minutes each (about 200-2000 MB), and a script performs the following steps:

Load the files, split them into 10-minute chunks, and save them with their timestamps. Each 10-minute clip is a sample from a different hour.
Load the 10-minute chunks, split them into 5-second files, and save them.
Create a spectrogram for each file and save it as a JPG file.
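The script steps above can be sketched roughly as follows. This is only a minimal illustration, not the actual script: the signal is synthetic (a noise background with one 50 ms tone burst), the clip is 12 seconds instead of 10 minutes, and the FFT length, hop size, Hann window, and file names are all assumptions. With the Signal Processing Toolbox, the manual short-time FFT could be replaced by the `spectrogram` function.

```matlab
% Synthetic stand-in for one recording (assumption: real data come from Spike2).
fs      = 250e3;                      % sampling rate from the post, samples/s
clipSec = 12;                         % short demo clip; real chunks are 10 min
t = (0:clipSec*fs - 1) / fs;
x = 0.01 * randn(size(t));            % background noise
burst = t >= 6.0 & t < 6.05;          % one 50 ms "syllable" at 6 s
x(burst) = x(burst) + sin(2*pi*40e3*t(burst));

% Split into 5-second segments (an incomplete tail segment is dropped).
segSec = 5;
segLen = segSec * fs;
nSeg   = floor(numel(x) / segLen);

% Short-time FFT settings (assumed values).
nfft = 1024;                                   % about 244 Hz per frequency bin
hop  = 512;                                    % about 2 ms between frames
win  = 0.5 - 0.5*cos(2*pi*(0:nfft-1)'/nfft);   % Hann window, column vector
nFrames = floor((segLen - nfft) / hop) + 1;
idx = (1:nfft)' + (0:nFrames-1) * hop;         % sample indices of every frame

segStart = zeros(1, nSeg);            % start time of each segment, seconds
peakHz   = zeros(1, nSeg);            % strongest frequency in each segment
for k = 1:nSeg
    seg = x((k-1)*segLen + (1:segLen));
    segStart(k) = (k-1) * segSec;

    S   = abs(fft(seg(idx) .* win));           % one column per time frame
    S   = S(1:nfft/2 + 1, :);                  % keep 0 .. fs/2
    SdB = 20*log10(S + eps);                   % spectrogram in dB

    [~, iMax] = max(max(SdB, [], 2));
    peakHz(k) = (iMax - 1) * fs / nfft;

    % Scale to [0, 1] so the matrix can be written as a grayscale image.
    img = (SdB - min(SdB(:))) / (max(SdB(:)) - min(SdB(:)));
    imgName = sprintf('seg_%03d.jpg', k);      % hypothetical file name
    % imwrite(flipud(img), imgName);           % uncomment to write the JPG
end
```

In this demo the tone burst falls in the second segment, so `peakHz(2)` lands near 40 kHz, while the first segment contains only noise.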

Then I do the following steps manually:

Go through the hundreds of images and see when the animals are communicating.
Type the timestamps of each call (onset and offset) into an Excel sheet, sorting the calls into 2 different types.
Load the original 10-minute clips (each representing an hour), cut the calls out of them, and save each call individually with its timestamps as the file name. Every clip represents one call.
Load each clip's spectrogram, find the beginning and end of each syllable, and record the timestamps in an Excel sheet.
Run a function that calculates the parameters I need. It takes the call clips and the Excel file and produces objects of the classes "Calls" and "Sessions"; a Sessions object is built from Calls objects.
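To make the last step concrete, here is a minimal sketch of how such parameters could be computed for one call from its syllable timestamps. It is not the actual function or the "Calls" class: the timestamps, the variable names, and the synthetic 30 kHz audio are all made up for illustration, and "dominant frequency" is assumed here to mean the strongest FFT bin within each syllable.

```matlab
fs = 250e3;                           % sampling rate from the post, samples/s

% Hypothetical syllable timestamps (seconds from the start of the call clip),
% standing in for the values typed into the Excel sheet.
onset  = [0.10 0.30 0.55];
offset = [0.15 0.36 0.60];

% Synthetic call clip: noise plus a 30 kHz tone during each syllable.
t = (0:round(0.7*fs) - 1) / fs;
callAudio = 0.01 * randn(size(t));
for k = 1:numel(onset)
    in = t >= onset(k) & t < offset(k);
    callAudio(in) = callAudio(in) + sin(2*pi*30e3*t(in));
end

% Timing parameters follow directly from the timestamps.
sylDur  = offset - onset;             % duration of each syllable, s
callDur = offset(end) - onset(1);     % first onset to last offset, s
sylRate = numel(onset) / callDur;     % syllables per second within the call

% Dominant frequency of each syllable: strongest FFT bin, ignoring DC.
domFreq = zeros(size(onset));
for k = 1:numel(onset)
    i1  = round(onset(k)  * fs) + 1;
    i2  = round(offset(k) * fs);
    syl = callAudio(i1:i2);
    P   = abs(fft(syl)).^2;
    nHalf = floor(numel(syl)/2) + 1;
    [~, iMax] = max(P(2:nHalf));          % P(2) is the first non-DC bin
    domFreq(k) = iMax * fs / numel(syl);  % bin number times bin width, Hz
end
```

The frequency resolution of this estimate is `fs / numel(syl)`, so a 50 ms syllable at 250k samples per second gives bins 20 Hz wide.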

What I want to do is make this automatic.

I want to take a machine learning tool and train it with all the data we already obtained manually (my team and I have already analyzed thousands of syllables and hundreds of calls), so that it can identify when a syllable begins and ends, and when a call begins and ends.
Then feed in the audio file in 10-minute chunks and get the timestamps of all calls and syllables: their onsets and offsets.
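As an illustration of the output I am after, the sketch below turns a per-frame yes/no decision into syllable and call timestamps. It does not show how the decisions would be produced; the `isVocal` vector is a hand-made stand-in for whatever a trained model would output, and the frame step and the gap threshold that separates two calls are assumed values chosen only to keep the example small.

```matlab
frameDt = 0.002;                      % assumed time step between frames, s

% Hypothetical per-frame decisions (1 = vocalization present in that frame).
isVocal = logical([0 0 1 1 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0]);

% Rising and falling edges of the padded vector mark each run of active frames.
d      = diff([0 double(isVocal) 0]);
onIdx  = find(d ==  1);               % first active frame of each syllable
offIdx = find(d == -1) - 1;           % last active frame of each syllable

sylOnset  = (onIdx - 1) * frameDt;    % start of the first active frame, s
sylOffset = offIdx * frameDt;         % end of the last active frame, s

% Group syllables into calls: a silent gap longer than maxGap starts a new call.
maxGap  = 0.01;                       % assumed threshold, s (demo value only)
gaps    = sylOnset(2:end) - sylOffset(1:end-1);
newCall = [true, gaps > maxGap];      % true where a syllable opens a call
callId  = cumsum(newCall);            % call number of each syllable

callOnset  = sylOnset(newCall);                 % onset of each call's first syllable
callOffset = sylOffset([newCall(2:end) true]);  % offset of each call's last syllable
```

With these inputs the three syllables fall into two calls: the first two syllables are 4 ms apart and stay together, while the 12 ms gap before the third one starts a new call.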

I only know one way to do this: feed in the spectrogram images (each representing 5 seconds, or each representing a call) and train a neural network to identify the onset and offset of each call and each syllable. If it could also sort the calls into the 2 different types, that would be great, but even if it can't, it would be good enough.

Is there a better way? Can it even be done? Apologies for the bad English. Needless to say, I am not a programmer, so I need the simple version.