How to improve K-means clustering with TF-IDF?
Geovane Gomes
7 Oct 2024
Commented: Christopher Creutzig
22 Oct 2024
Hi all,
I’m currently working on a project where I need to classify company segments based on their activity descriptions.
I’ve implemented K-means clustering using TF-IDF for feature extraction from text data. However, the current clustering results aren’t entirely accurate, especially when it comes to grouping semantically similar segments (e.g., "cars" and "vehicles" are placed into separate clusters). Is it possible to optimise this, or to use another approach rather than TF-IDF?
See cluster 13. More than 50% of the items were assigned to this cluster. I also tried using other distance parameters, but the results didn't improve.
Here is my code:
clear
close
% load and preprocess
d = readtable("segmentos95Translated.xlsx");
t = d.TRANSLATED;
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'EXCEPT');
    t{i} = strtrim(splitStr{1});
end
for i = 1:height(t)
    str = t{i};
    splitStr = strsplit(str, 'WITHOUT PREDOMINANCE');
    t{i} = strtrim(splitStr{1});
end
% tokenization
t = lower(t);
t = tokenizedDocument(t);
t = removeStopWords(t);
t = normalizeWords(t);
customStopWords = ["manufactur","activ",",","rental","(",")","*","exempt", ...
    "commerci","repres","agent","trade","product","retail","sale","waiv","special","wholesal"];
t = removeWords(t,customStopWords);
% bag of words and TF-IDF
bag = bagOfWords(t);
tfidfMatrix = tfidf(bag);
X = full(tfidfMatrix);
% kmeans
rng(1)
numClusters = 25; % about 10%
[idx, C, sumd, D] = kmeans(X, numClusters);
d.clusters = idx;
% display results
for i = 1:numClusters
    fprintf('Cluster %d:\n', i);
    disp(d.TRANSLATED(idx == i));
end
sortrows(groupcounts(d,"clusters"),"Percent","descend")
Accepted Answer
Sandeep Mishra
8 Oct 2024
Hi Geovane,
It looks like you want more semantically coherent clusters from your K-means implementation.
TF-IDF treats every distinct token as an independent dimension, so it cannot capture semantic similarity: synonyms and related terms such as "cars" and "vehicles" share no features and can end up in different clusters.
To address this, you can use word embeddings such as 'fastText', which represent words as vectors in a continuous space where semantically similar words lie close together.
In MATLAB, the 'Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding' add-on provides a pretrained 'fastText' word embedding.
Consider the following implementation:
% Converting tokenized documents to cell array
textData = arrayfun(@(doc) joinWords(doc), t, 'UniformOutput', false);
% Loading fastText word embedding
emb = fastTextWordEmbedding;
% Converting text to embedding
X = zeros(numel(textData), emb.Dimension);
for i = 1:numel(textData)
    words = split(textData{i});
    validWords = words(isVocabularyWord(emb, words));
    if ~isempty(validWords)
        vecs = word2vec(emb, validWords);
        X(i, :) = mean(vecs, 1);
    end
end
[idx, C] = kmeans(X, numClusters);
For more information, see the File Exchange page for the ‘Text Analytics Toolbox Model for fastText English 16 Billion Token Word Embedding’ add-on: https://www.mathworks.com/matlabcentral/fileexchange/66229-text-analytics-toolbox-model-for-fasttext-english-16-billion-token-word-embedding
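One caveat when combining the embedding snippet in this answer with the preprocessing from the question: there, t has already been passed through normalizeWords, so many tokens are stems such as "manufactur" or "activ" that are unlikely to appear in the fastText vocabulary and would be silently skipped. The following is a minimal sketch of one way around that, assuming the descriptions are re-tokenized without stemming. The rawText sample strings and numClusters = 2 are made up for illustration, and the switch to cosine distance is a suggestion rather than part of the original answer.

```matlab
% Hypothetical sample descriptions; in the thread these come from d.TRANSLATED.
rawText = [
    "manufacture of cars and light trucks"
    "wholesale of motor vehicles"
    "retail sale of bread and pastries"
    "bakery products manufacturing"];

% Tokenize WITHOUT normalizeWords so tokens remain real dictionary words.
docs = tokenizedDocument(lower(rawText));
docs = erasePunctuation(docs);
docs = removeStopWords(docs);

% Requires the fastText word embedding support package.
emb = fastTextWordEmbedding;

% One string array of tokens per document.
tokens = doc2cell(docs);

% Average the vectors of all in-vocabulary words in each document.
X = zeros(numel(tokens), emb.Dimension);
for i = 1:numel(tokens)
    words = tokens{i};
    words = words(isVocabularyWord(emb, words));
    if ~isempty(words)
        X(i, :) = mean(word2vec(emb, words), 1);
    end
end

% Documents with no in-vocabulary word keep an all-zero row. Cosine
% distance is undefined for zero vectors, so cluster only the rest.
hasVector = any(X ~= 0, 2);

rng(1)
numClusters = 2;                  % hypothetical value for this tiny sample
idx = nan(numel(tokens), 1);      % NaN marks documents that were skipped
idx(hasVector) = kmeans(X(hasVector, :), numClusters, ...
    'Distance', 'cosine', 'Replicates', 5);
```

Cosine distance compares only the direction of the averaged vectors, which tends to matter more than their length for short descriptions. Whether it actually improves these particular clusters would need to be checked on the real data.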
I hope this helps.
4 Comments
Christopher Creutzig
22 Oct 2024
Also worth checking out are documentEmbedding and, for a different workflow with “soft clustering,” fitlda.
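The two suggestions in this comment could be tried roughly as follows. This is only a sketch under stated assumptions: the rawText sample strings, numTopics = 2 and the cluster count are made up for illustration; documentEmbedding needs a recent MATLAB release (R2024a or later) and, as far as I know, an additional pretrained model support package; the call pattern is my reading of the documentation, not something confirmed in this thread.

```matlab
% Hypothetical sample descriptions; in the thread these come from d.TRANSLATED.
rawText = [
    "manufacture of cars and light trucks"
    "wholesale of motor vehicles"
    "retail sale of bread and pastries"
    "bakery products manufacturing"];

docs = tokenizedDocument(lower(rawText));
docs = erasePunctuation(docs);
docs = removeStopWords(docs);

% --- Soft clustering with LDA ---------------------------------------
% Each document gets a probability for every topic instead of a single
% hard cluster label.
bag = bagOfWords(docs);
numTopics = 2;                                 % hypothetical value
rng(1)
mdl = fitlda(bag, numTopics, 'Verbose', 0);
topicMix = mdl.DocumentTopicProbabilities;     % numDocs-by-numTopics
[~, mainTopic] = max(topicMix, [], 2);         % most likely topic per document
disp(topkwords(mdl, 5, 1))                     % top words of topic 1

% --- Sentence-level document embedding ------------------------------
% Embeds each description as a whole, so no per-word averaging is needed.
docEmb = documentEmbedding;
Xdoc = embed(docEmb, rawText);                 % one row per document
idxDoc = kmeans(Xdoc, 2, 'Distance', 'cosine', 'Replicates', 5);
```

With LDA, a description that mixes two activities can show noticeable probability for both topics, which a hard K-means assignment would hide.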