Word frequency for each label

Question

Rachele Franceschini, July 7, 2022

https://jp.mathworks.com/matlabcentral/answers/1755040-frequency-words-for-each-labels

Commented: Rachele Franceschini, July 7, 2022

I have a dataset with two columns: text and data. The data column contains two labels, 0 and 1. I would like to calculate the frequency of each word for each label. For example, how many times does "damage" appear within class 1 and within class 0? How can I do this? Also, I don't understand whether I need to use tokens or not. Could I use a for loop? I'm not sure.

Here is a small image with a similar result. I would like a table like this one.

Answer 1

Karim, July 7, 2022

https://jp.mathworks.com/matlabcentral/answers/1755040-frequency-words-for-each-labels#answer_1001990

Edited: Karim, July 7, 2022

dati_classificati.xlsx

Edited so that the code works with the example data that was added later.

% read the file
data = readtable("dati_classificati.xlsx",'TextType','string');
% split each sentence into words, assuming that spaces are used as delimiter...
cell_text = arrayfun(@(x) data.text(x,:),1:size(data.text,1),'UniformOutput',false)';
cell_text = cellfun(@(x) split(x,' '), cell_text,'UniformOutput',false);
% count the number of words in each sentence
numWords = cellfun(@numel, cell_text);
% expand the labels to match the number of words for each sentence
expandedLabels = repelem( data.label ,numWords);
% gather the words in 1 big string array
expandedWords = vertcat(cell_text{:});
% list a few words to count the frequency...
MyWords = ["strada" "il" "Via" "donne" "della"];
% allocate a table for the results
varTypes = ["string","double","double"]; % data type for each column
varNames = ["Words","Ones","Zeros"]; % variable name for each column
MyResult = table('Size',[numel(MyWords) 3],'VariableTypes',varTypes,'VariableNames',varNames);
MyResult.Words = MyWords(:);
% count the labels for each word
% (contains matches substrings, so "il" also counts longer words that contain "il")
for i = 1:numel(MyWords)
    currLabels = expandedLabels( contains(expandedWords,MyResult.Words(i)) );
    MyResult.Ones(i) = sum(currLabels==1);
    MyResult.Zeros(i) = sum(currLabels==0);
end
% display the results
MyResult
MyResult = 5×3 table
     Words      Ones    Zeros
    ________    ____    _____

    "strada"     48       1  
    "il"         34      20  
    "Via"        53       0  
    "donne"       0       2  
    "della"       3      14  

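One thing to keep in mind with the loop above: `contains` also matches substrings, so a short word such as "il" is counted inside longer words too. Below is a minimal sketch of an exact-match variant that counts every unique word instead of a hand-picked list. It assumes the words are separated by whitespace, and the toy `sentences` and `labels` values are made up for illustration; they stand in for `data.text` and `data.label`.

```matlab
% Toy data (hypothetical values) standing in for data.text and data.label
sentences = ["il ponte della strada"; "illuminazione della strada"; "il ponte"];
labels    = [1; 0; 1];

% Split every sentence on whitespace; each cell holds a column of words
words    = arrayfun(@(s) split(s), sentences, 'UniformOutput', false);
numWords = cellfun(@numel, words);

% Stack all words and repeat each sentence label once per word
allWords  = vertcat(words{:});
allLabels = repelem(labels, numWords);

% Group identical words; idx maps each occurrence to its row in uWords
[uWords, ~, idx] = unique(allWords);

% Sum the occurrences per unique word, separately for each label
Ones  = accumarray(idx, double(allLabels == 1));
Zeros = accumarray(idx, double(allLabels == 0));
result = table(uWords, Ones, Zeros, 'VariableNames', ["Words","Ones","Zeros"])
```

Because the grouping uses `unique` rather than `contains`, "il" and "illuminazione" end up in separate rows. Punctuation and capitalization are not handled here, so "Via" and "via" would still be counted as different words unless the text is cleaned first.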

Rachele Franceschini, July 7, 2022

I used your code and attached an image of the result. I also tried adding a preprocessing step to clean the data. What I would like to get, though, is how many times a word such as "ciao" appears within classes 1 and 0, and so on.

% first generate some random data..
MyWords = daticlassificati.text;
% now create a big list from the set of words
numItems = 1000; 
BigList = MyWords ( randi(numel(MyWords),1,numItems) )
% create a list with random labels 0 or 1
RandomLabel = daticlassificati.label
uWords = unique(BigList);
% allocate a table for the results
varTypes = ["string","double","double"]; % data type for each column
varNames = ["Words","Ones","Zeros"]; % variable name for each column
MyResult = table('Size',[numel(uWords) 3],'VariableTypes',varTypes,'VariableNames',varNames);
MyResult.Words = uWords(:);
% count the labels for each word
for i = 1:numel(uWords)
    currLabels = RandomLabel(contains(BigList,MyResult.Words(i)));
    MyResult.Ones(i) = sum(currLabels==1);
    MyResult.Zeros(i) = sum(currLabels==0);
end
% display the results
MyResult

Here is my code with the preprocessing step for cleaning the dataset:

% input file (Excel or text)
filename = "dati_classificati.xlsx";
data = readtable(filename,'TextType','string');
% remove the rows of the table with empty text (classify text data using deep learning)
idx = strlength(data.text) == 0;
data(idx,:) = [];
% extract all rows of the text column
textData = data.text;
% clean data (remove punctuation etc.); preprocessText is a helper function not shown here
Train_pr = preprocessText(textData);
Train_bag = bagOfWords(Train_pr)
Train_bag = removeInfrequentWords(Train_bag,5);
[Train_bag,idx] = removeEmptyDocuments(Train_bag);
Train_bag
tbl_train = topkwords(Train_bag,2000);
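
For the per-class counts asked about in this thread, the bag-of-words model can also be used directly, since its `Counts` property is a documents-by-vocabulary matrix. The following is only a sketch under some assumptions: it needs Text Analytics Toolbox, it uses plain `tokenizedDocument` in place of the `preprocessText` helper, and the toy `sentences` and `labels` values are made up and stand in for `data.text` and `data.label`.

```matlab
% Toy data (hypothetical values) standing in for data.text and data.label
sentences = ["il ponte della strada"; "illuminazione della strada"; "il ponte"];
labels    = [1; 0; 1];

% Tokenize and build the bag-of-words model (Text Analytics Toolbox)
documents = tokenizedDocument(sentences);
bag = bagOfWords(documents);

% Remove empty documents and drop the same rows from the labels,
% so that the rows of bag.Counts stay aligned with labels
[bag, removedIdx] = removeEmptyDocuments(bag);
labels(removedIdx) = [];

% bag.Counts is documents-by-vocabulary; sum the rows that belong to each class
Ones  = full(sum(bag.Counts(labels == 1, :), 1))';
Zeros = full(sum(bag.Counts(labels == 0, :), 1))';
result = table(bag.Vocabulary', Ones, Zeros, 'VariableNames', ["Words","Ones","Zeros"])
```

The important detail is keeping the labels in step with the documents: any step that removes documents (such as `removeEmptyDocuments`) has to remove the matching label rows as well. Steps that only change the vocabulary, such as `removeInfrequentWords`, do not affect the alignment.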

Karim, July 7, 2022

I modified the original answer according to the file you provided; see the top of the answer. Note that I just used the raw text and only included a few words, but it should show how the concept works.

Rachele Franceschini, July 7, 2022

Thank you very, very much! I also tried it with the preprocessing step and it works!
