Measuring term frequency of words
I have been able to obtain a bag of words from a document. How can I work with the bag-of-words array so that I can calculate the frequency of terms within each document?
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">"); % drop the "<P ID=n>" from the beginning
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
I have this result:
For clarification, I believe the columns are the terms, while the rows are the documents. Am I right?
I loaded two txt files (one document set and one query set). I want to evaluate the similarity between each document and each query using cosine similarity, tf-idf, or any other suitable measure.
3 Comments
Christopher Creutzig
24 Apr 2020
If I understand your question correctly, you can simply divide the counts, aka term frequency, by the document length. You may need to adapt the orientation of the vectors a bit, and also transpose everything if you want to, as I did here, display them in a table:
>> str = ["This is a short document.",...
"This is a longer document. With more tokens. Maybe that is about enough?"];
>> td = tokenizedDocument(str)
td =
1×2 tokenizedDocument:
6 tokens: This is a short document .
16 tokens: This is a longer document . With more tokens . Maybe that is about enough ?
>> bow = bagOfWords(td);
>> relFreq = bow.Counts ./ doclength(td).';
>> table(bow.Vocabulary.', relFreq.', 'VariableNames',["Word","relative Frequency"])
ans =
  15×2 table
       Word       relative Frequency
    __________    __________________
    "This"        0.16667    0.0625
    "is"          0.16667    0.125
    "a"           0.16667    0.0625
    "short"       0.16667    0
    "document"    0.16667    0.0625
    "."           0.16667    0.125
    "longer"      0          0.0625
    "With"        0          0.0625
    "more"        0          0.0625
    "tokens"      0          0.0625
    "Maybe"       0          0.0625
    "that"        0          0.0625
    "about"       0          0.0625
    "enough"      0          0.0625
    "?"           0          0.0625
D. Frank
16 Oct 2020
Can I ask, is there any way to find the frequency and the number of repeated letters, pairs of letters, and spaces in a note, Word, or PDF file?
Accepted Answer
Christopher Creutzig
4 Dec 2017
See the bagOfWords documentation. For example, you can weight the counts with the tfidf function; extract bag.Counts and call pdist(bag.Counts,'cosine') to get pairwise cosine distances between documents; use fitlsa, which is essentially a principal component analysis, for dimensionality reduction; or use fitlda to fit a topic model.
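As a rough illustration of the first two options, the sketch below combines tfidf with pdist on a toy corpus. The three strings and the names M, D and S are invented for the example, and the full() call is only a precaution in case pdist does not accept sparse input. Note that pdist with 'cosine' returns distances (one minus the cosine), so similarity is recovered as 1 - D.

```matlab
% Toy corpus, invented for this sketch.
docs = tokenizedDocument([
    "the cat sat on the mat"
    "the dog sat on the log"
    "stock prices fell sharply"]);
bag = bagOfWords(docs);

% tf-idf weights: one row per document, one column per vocabulary word.
M = tfidf(bag);

% pdist returns the pairwise cosine distances in condensed form;
% squareform expands them into a documents-by-documents matrix.
D = squareform(pdist(full(M),'cosine'));

% Cosine similarity: 1 on the diagonal, 0 for documents with no shared words.
S = 1 - D;
```

Here the third document shares no words with the other two, so its similarity to them comes out as zero, while the first two overlap on "the", "sat" and "on".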
2 Comments
Christopher Creutzig
15 Oct 2018
Edited: Christopher Creutzig
15 Oct 2018
John, you need to encode both sets of documents with the same bag-of-words model. (That model contains not only counts but also a specific mapping of which word goes into which position, and if you use tfidf, you need to use the same idf factors for consistency within your analysis.) Something like this:
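% corpusData and queryData are assumed to be string arrays of raw text
corpus = tokenizedDocument(corpusData);
bow = bagOfWords(corpus);              % fixes the vocabulary and word order
query = tokenizedDocument(queryData);
queryVectors = encode(bow,query);      % query counts in the same word order
% one row per query, one column per corpus document;
% full() is a precaution in case pdist2 does not accept sparse input
dists = pdist2(full(queryVectors),full(bow.Counts),'cosine');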
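If you want tf-idf weights instead of raw counts, a sketch along the following lines keeps the idf factors consistent by passing the query documents to tfidf together with the corpus bag. The corpus and query strings, and the names docVectors and best, are assumptions made for this example; full() is again only a precaution for pdist2.

```matlab
% Toy data, invented for this sketch.
corpusData = ["the cat sat on the mat"; "stock prices fell sharply"];
queryData = "cat on a mat";

corpus = tokenizedDocument(corpusData);
bow = bagOfWords(corpus);
query = tokenizedDocument(queryData);

% Corpus rows weighted with the idf factors computed from the corpus.
docVectors = tfidf(bow);

% Queries encoded with the same vocabulary and the same corpus idf factors;
% query words that are not in the vocabulary (here "a") are ignored.
queryVectors = tfidf(bow,query);

% Cosine distance: one row per query, one column per corpus document.
dists = pdist2(full(queryVectors),full(docVectors),'cosine');

% Index of the closest corpus document for each query.
[~,best] = min(dists,[],2);
```

A query whose words are all missing from the vocabulary would produce an all-zero vector and a NaN cosine distance, so it may be worth checking for empty rows before ranking.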