
Reading/fetching text from text/PDF file for pre-processing

moin khan on 21 Mar 2021
Commented: moin khan on 21 Mar 2021
I have text/PDF files that contain millions of words. If I use str = extractFileText(filename), MATLAB first becomes very slow and sometimes hangs. A single variable also cannot hold that much data.
I want to read the file word by word so that I can filter the text and build a smaller array of filtered data. Alternatively, I want to write the filtered data to a temporary file for the next processing step (as it will be small).
I need help with this. If you have any other solution to my problem, please reply.
  2 Comments
Ive J on 21 Mar 2021
Edited: Ive J on 21 Mar 2021
Did you try calling the function with a name-value pair?
% pages: vector of the page numbers to read (define it before the loop)
for i = 1:numel(pages)
    str = extractFileText(filename, 'Pages', pages(i)); % read only one page at a time
    % do whatever you want with str
end
moin khan on 21 Mar 2021
Kindly have a look at my solution below.


Accepted Answer

moin khan on 21 Mar 2021
I first tried extractFileText on my file (a text file with 19 million words). It was really slow and did not work because everything went into a single variable. Now I read the data line by line and save it in a cell array. It takes a few seconds, which is fine for such a large file.
Code:
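fid = fopen(filename);
assert(fid ~= -1, 'Could not open %s', filename);
inputData = cell(0,1);
while ~feof(fid)
    tline = fgetl(fid);                    % read one line, without the newline
    if ischar(tline) && ~isempty(tline)    % skip empty lines and the -1 end-of-file value
        inputData{end+1,1} = tline;
    end
end
fclose(fid);
clear('ans','fid','tline');
documents = tokenizedDocument(inputData);  % tokenize each stored line
clear('inputData');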
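
The question also asked about filtering while reading and saving the filtered text to a temporary file. A minimal sketch of that idea, using the same fgetl loop, might look like the following. The stop-word list, the minimum word length, and the small demo input file are assumptions made up for illustration and are not from the thread; the filter rule itself is only a placeholder for whatever filtering is actually needed.

```matlab
% Sketch: stream a text file line by line, keep only the words that pass
% a filter, and write them to a smaller temporary file.
% Assumptions (hypothetical, for illustration only): stopWords, minLength,
% and the demo input created below.

% Create a small demo input so the sketch runs on its own.
inFile = [tempname '.txt'];
fid = fopen(inFile, 'w');
fprintf(fid, 'The quick brown fox\n\njumps over the lazy dog\n');
fclose(fid);

stopWords = {'the', 'over'};   % hypothetical words to drop
minLength = 3;                 % hypothetical minimum word length

outFile = [tempname '.txt'];   % temporary file for the filtered text
fin  = fopen(inFile, 'r');
fout = fopen(outFile, 'w');
assert(fin ~= -1 && fout ~= -1, 'Could not open input or output file.');

nKept = 0;                     % number of words written to the temp file
tline = fgetl(fin);
while ischar(tline)            % fgetl returns numeric -1 at end of file
    % Split the lower-cased line into runs of letters.
    words = regexp(lower(tline), '[a-z]+', 'match');
    if ~isempty(words)
        keep  = cellfun(@numel, words) >= minLength & ~ismember(words, stopWords);
        words = words(keep);
    end
    if ~isempty(words)
        fprintf(fout, '%s\n', strjoin(words, ' '));
        nKept = nKept + numel(words);
    end
    tline = fgetl(fin);
end
fclose(fin);
fclose(fout);

% Read the (much smaller) filtered file back for the next processing step.
filteredLines = strsplit(strtrim(fileread(outFile)), newline);
```

Only one line is held in memory at a time, so the size of the input file does not matter much; only the filtered output has to fit in memory when it is read back. The filtered lines could then be passed to tokenizedDocument as in the answer above.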

More Answers (0)
