Using an LDA Model with a tall table

Question

Kathryn Janiuk, 23 March 2023


https://jp.mathworks.com/matlabcentral/answers/1934399-using-an-lda-model-with-a-tall-table

Answered: Christopher Creutzig, 19 January 2024

Hello,

I am trying to read a .txt file that is about 39 GB and consists of unstructured data. I want to run the file through an LDA model for natural language processing. (I cannot post the actual file, but it is simply a large file whose data has no specific structure.)

Here is a draft of the code I have so far. I understand that some of the functions/commands may not work correctly, but I am not sure what to do and am wondering how I should approach this:

opts = delimitedTextImportOptions("NumVariables", 1);
% Specify range and delimiter
opts.DataLines = [1, Inf];
opts.Delimiter = "";
% Specify column names and types
opts.VariableNames = "Var";
opts.VariableTypes = "char";
% Specify file level properties
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
% Specify variable properties
opts = setvaropts(opts, "Var", "WhitespaceRule", "preserve");
opts = setvaropts(opts, "Var", "EmptyFieldRule", "auto");
% Import the data
ds = readtable("********", opts);
%% Clear temporary variables
clear opts
tt = tall(ds) %tt = tall table 
tall2text = table2array(tt);
%%extract text data from the table variable (named "Var" in the import options above)
textData = tt.Var;
textData(1:10)
%%preprocess text data 
documents = preprocessText(textData);
documents(1:5)
%%create bag of words from tokenized data 
bag = bagOfWords(documents)
%%remove infrequent words (words appearing 2 or fewer times), remove any
%%empty docs 
bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag)
%%fit LDA model with 7 topics (change numTopics if different num topics
%%desired)
rng("default")
numTopics = 7;
mdl = fitlda(bag,numTopics,Verbose=0);
%%visualize topics in word cloud - view words with highest probabilities in
%%word clouds 
figure(1)
t = tiledlayout("flow");
title(t,"LDA Topics")
for i = 1:numTopics
    nexttile
    wordcloud(mdl,i);
    title("Topic " + i)
end
%%Preprocess function 
function documents = preprocessText(textData)
% Tokenize the text.
documents = tokenizedDocument(textData);
% Lemmatize the words.
documents = normalizeWords(documents,Style="lemma");
% Erase punctuation.
documents = erasePunctuation(documents);
% Remove a list of stop words.
documents = removeStopWords(documents);
% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

I have looked at the documentation for parallel processing and haven't been able to figure out how to import the file - importing it as a dataframe hasn't worked due to how the file is set up. Any help would be appreciated!


Answer 1

Nikhilesh, 27 March 2023


https://jp.mathworks.com/matlabcentral/answers/1934399-using-an-lda-model-with-a-tall-table#answer_1201509


Here's an example of how you could modify your code to load the data in chunks:

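% Open the file and stop early if it cannot be read
fid = fopen('large_file.txt');
if fid == -1
    error('Could not open large_file.txt');
end
% Read the file in chunks
chunk_size = 10*1024*1024; % 10 MB
data = '';
while ~feof(fid)
    % fread returns a column vector; transpose it so that successive
    % chunks concatenate into a single row of characters
    chunk = fread(fid, chunk_size, 'uint8=>char')';
    data = [data chunk];
    % Preprocess once enough text has accumulated
    % (note that a fixed-size chunk can end in the middle of a word)
    if numel(data) > 10*chunk_size
        documents = preprocessText(data);
        % Add code to process the chunk here
        data = '';
    end
end
% Process whatever is left after the last full batch
if ~isempty(data)
    documents = preprocessText(data);
    % Add code to process the final chunk here
end
% Close the file
fclose(fid);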
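
One detail the chunked loop above leaves open is what happens when a chunk ends in the middle of a line or word. A minimal sketch of one way to handle that, assuming the file is newline-separated and uses a single-byte encoding such as ASCII, is to carry the incomplete tail of each chunk over to the next read. The file name and the variable names here are placeholders, not taken from the original post:

```matlab
% Sketch only. Assumptions: placeholder file name, newline-separated
% text, and a single-byte encoding such as ASCII.
fid = fopen('large_file.txt','r');
assert(fid ~= -1,'Could not open the file.');

chunkSize = 10*1024*1024;        % bytes per read
leftover = '';                   % incomplete last line of the previous chunk
while ~feof(fid)
    chunk = fread(fid,chunkSize,'uint8=>char')';   % transpose to a row vector
    data = [leftover chunk];
    lastBreak = find(data == newline,1,'last');
    if feof(fid)
        % End of file: everything that remains is complete
        complete = data;
        leftover = '';
    elseif isempty(lastBreak)
        % No line break yet: keep accumulating
        complete = '';
        leftover = data;
    else
        % Cut at the last newline so that no line is split
        complete = data(1:lastBreak);
        leftover = data(lastBreak+1:end);
    end
    if ~isempty(complete)
        textLines = splitlines(string(complete));
        % Process textLines here, for example by tokenizing them
    end
end
fclose(fid);
```

Whether a line is the right unit depends on how documents are delimited in the actual file, which the question does not specify.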


Answer 2

Christopher Creutzig, 19 January 2024


https://jp.mathworks.com/matlabcentral/answers/1934399-using-an-lda-model-with-a-tall-table#answer_1392851


As Nikhilesh said, you probably want to process your data in chunks of some form. Since you want to fit an LDA model, the most obvious choice of chunk is the document, which you already need to define in some way while reading your data. (An LDA model of a single huge document doesn't make semantic sense, because LDA looks for differences between documents.)

Now, the important thing to realize is that while fitlda needs to look at all your data at once and cannot learn incrementally, it does not need all of your text data, just the whole bag of words. And that is something you can collect incrementally:

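% myDataSource and getNextDocument are placeholders for however you read
% and tokenize one document at a time.
bag = bagOfWords;
while hasdata(myDataSource)
    document = getNextDocument(myDataSource);
    bag = addDocument(bag,document);
end
bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag);
% fitlda requires the number of topics; 7 matches the code in the question.
numTopics = 7;
mdl = fitlda(bag,numTopics);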

As a bonus link, see the doc example doing the same thing in parallel.
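
To make the incremental loop above concrete, here is a rough sketch that treats each non-empty line of the file as one document and adds the lines to the bag in batches. The one-document-per-line rule, the file name, and the batch size are all assumptions made for illustration; preprocessText is the function from the question and would need to be on the path or at the end of the script:

```matlab
% Sketch only. Assumptions: one document per line, a placeholder file
% name, and preprocessText (from the question) available to this script.
fid = fopen('large_file.txt','r');
assert(fid ~= -1,'Could not open the file.');

batchSize = 10000;               % lines tokenized per batch (assumed value)
batch = strings(batchSize,1);
n = 0;
bag = bagOfWords;

while true
    textLine = fgetl(fid);
    atEnd = ~ischar(textLine);   % fgetl returns -1 at end of file
    if ~atEnd && ~isempty(textLine)
        % Skip blank lines; they would become empty documents anyway
        n = n + 1;
        batch(n) = textLine;
    end
    % Flush a full batch, or the remainder once the file is exhausted
    if n == batchSize || (atEnd && n > 0)
        documents = preprocessText(batch(1:n));
        bag = addDocument(bag,documents);
        n = 0;
    end
    if atEnd
        break
    end
end
fclose(fid);

bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag);
```

Only one batch of raw text is held in memory at a time, while the bag keeps just the word counts per document. The resulting bag can then be passed to fitlda together with the desired number of topics.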
