Reading/fetching text from text/PDF file for pre-processing

Question

moin khan on 21 Mar 2021

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/779012-reading-fetching-text-from-text-pdf-file-for-pre-processing

Commented: moin khan on 21 Mar 2021

I have text/PDF files that contain millions of words. If I use str = extractFileText(filename), MATLAB becomes very slow and sometimes hangs. Also, a single variable cannot hold that much data.

I want to read the file word by word so that I can filter the text and build a smaller array of filtered data. Alternatively, I want to write the filtered data to a temporary file for the next processing step (as it will be small).

I need help with this. If you have any other solution to my problem, please reply.

2 Comments

Ive J on 21 Mar 2021

Edited: Ive J on 21 Mar 2021

Did you try calling the function with a name-value pair?

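% Assumption: pages is not defined in the original comment; here it is taken
% from pdfinfo (Text Analytics Toolbox, R2020b or later).
info = pdfinfo(filename);
pages = 1:info.NumPages;
for i = 1:numel(pages)
    str = extractFileText(filename, 'Pages', pages(i)); % read only one page at a time
    % do whatever you want with str
end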

moin khan on 21 Mar 2021

Kindly have a look at my solution below.


Answer 1

moin khan on 21 Mar 2021

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/779012-reading-fetching-text-from-text-pdf-file-for-pre-processing#answer_653842

I first tried extractFileText on my file (a text file with 19 million words). It was really slow and did not work because everything went into a single variable. Now I read the data line by line and store it in a cell array. It takes a few seconds, which is fine for such a large file.

Code:

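fid = fopen(filename);
inputData = cell(0,1);
while ~feof(fid)
    tline = fgetl(fid);                      % read one line, without the newline
    if ischar(tline) && ~isempty(tline)      % skip empty lines and the -1 returned at end of file
        inputData{end+1,1} = tline;
    end
end
fclose(fid);
clear('ans','fid','tline');
documents = tokenizedDocument(inputData);    % requires Text Analytics Toolbox
clear('inputData');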
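
The question also asked about filtering while reading and saving the result to a temporary file. A minimal sketch of that idea, building on the loop above: each line is filtered as soon as it is read and only the kept words are written out, so the full text never sits in memory. The sample input, the file names, and the minLength threshold are assumptions for illustration and are not part of the original post.

```matlab
% Illustrative input: a small temporary text file standing in for the real one.
inFile = [tempname '.txt'];
fid = fopen(inFile, 'w');
fprintf(fid, 'The quick brown fox\n\njumps over a lazy dog\n');
fclose(fid);

outFile = [tempname '.txt'];   % filtered output for later processing
minLength = 4;                 % hypothetical filter: keep words with >= 4 characters

fin = fopen(inFile, 'r');
fout = fopen(outFile, 'w');
while true
    tline = fgetl(fin);
    if ~ischar(tline)          % fgetl returns -1 at end of file
        break
    end
    words = strsplit(strtrim(tline));                    % split the line on whitespace
    keep = words(cellfun(@numel, words) >= minLength);   % apply the filter
    if ~isempty(keep)
        fprintf(fout, '%s\n', strjoin(keep, ' '));       % one filtered line per input line
    end
end
fclose(fin);
fclose(fout);

% The filtered file is small, so it can be read back in one go.
filtered = strsplit(strtrim(fileread(outFile)), newline);
```

The filtered lines can then be passed to tokenizedDocument in the same way as inputData above. Any other per-line test (a stop-word list, a regular expression) could replace the length check.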
