Extract a keyword from a list of PDFs and store the sentences containing the keyword in an Excel document with two columns

Dear Matlab community,
I am trying to extract all sentences that contain a given keyword (for example "%") from a list of PDF files (say A1.pdf, A2.pdf, A3.pdf), and I would like to export those sentences to Excel. The Excel document would have two columns: the first with the extracted sentence containing the keyword, and the second with the name of the PDF document the sentence was taken from.
Any idea how to do that? Thank you very much in advance.
Best regards,
Greenmamba

Answers (1)

Jemima Pulipati on 28 Nov 2020
Hello,
From my understanding, you are trying to read data from a group of PDF files and then write that data to an Excel file.
You can loop through the PDF files and call extractFileText() on each one to extract the text and store it in a variable. Afterwards, you can use writetable() to write the collected data to an Excel file.
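As a rough sketch of that loop, assuming Text Analytics Toolbox is installed (extractFileText and splitSentences both come from it), something like the following could collect the matching sentences together with their file names. The folder 'XXX' is the placeholder used elsewhere in this thread, and the output name 'sentences.xlsx' and the column names 'Sentence' and 'File' are only illustrative choices.

```matlab
% Sketch: collect every sentence containing a keyword, plus its source file.
% Assumes Text Analytics Toolbox (extractFileText, splitSentences).
pdfFolder = 'XXX';              % placeholder folder with the PDF files
keyword   = "%";                % keyword from the question
outFile   = 'sentences.xlsx';   % hypothetical output file name

files     = dir(fullfile(pdfFolder,'*.pdf'));
sentences = strings(0,1);       % column 1: matching sentences
sources   = strings(0,1);       % column 2: file each sentence came from

for j = 1:numel(files)
    txt  = extractFileText(fullfile(pdfFolder,files(j).name));
    sent = splitSentences(txt);              % one sentence per element
    hit  = sent(contains(sent,keyword));     % keep sentences with the keyword
    sentences = [sentences; hit(:)];
    sources   = [sources; repmat(string(files(j).name),numel(hit),1)];
end

% One row per sentence, two columns as requested in the question.
T = table(sentences,sources,'VariableNames',{'Sentence','File'});
writetable(T,outFile);
```

How well this works depends on how cleanly the text comes out of each PDF, since sentence splitting on text with odd line breaks may need extra cleanup first.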
The following links from the community and documentation may help you get started.
  2 Comments
Greenmamba on 1 Dec 2020
Edited: Greenmamba on 1 Dec 2020
pth = 'XXX';                        % folder containing the PDF files
nam = '*.pdf';
S = dir(fullfile(pth,nam));
M = {};                             % snippets collected across all files
for j = 1:numel(S)
    tmp = fullfile(pth,S(j).name);
    str = extractFileText(tmp);     % requires Text Analytics Toolbox
    ii = strfind(str,"XXX");        % start index of every keyword match
    for k = 1:numel(ii)
        start = ii(k);
        st = max(start-120,1);              % clamp to the first character
        en = min(start+30,strlength(str));  % clamp to the last character
        M{end+1,1} = extractBetween(str,st,en);
    end
end
mat = vertcat(M{:});                % one snippet per row
xlswrite('XXX.xlsx',mat);
I still have two problems:
1- Is there any way to count the total number of characters in a PDF?
2- Right now I get a single column as output. Is there any option to get a second one with the name of the file the sentence was taken from?
Jemima Pulipati on 14 Dec 2020
  1. Since 'extractFileText' returns the content of a PDF as a string, you can call strlength() on that string to get the total number of characters in the PDF.
  2. The 'tmp' variable in your code holds the full path of the file the sentences are taken from. If you store this variable as a second column of the 'mat' variable, the output will have another column showing the file name.
Example for a single PDF file:
mat(:,2) = tmp;
xlswrite('XXX.xlsx',mat);
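For more than one PDF, one option is to fill the second column while each file is being processed, so every row keeps the name of its own file. The sketch below is only an illustration: keywordWindows is a hypothetical helper (not a MATLAB built-in), and the two text strings and file names are made-up stand-ins for what extractFileText(tmp) would return, so the snippet runs without any PDFs. It also shows strlength() giving the character count per document.

```matlab
% Stand-ins for the text of two PDFs; in the thread these would come from
% extractFileText(tmp). File names and contents are made up for the demo.
names = ["A1.pdf"; "A2.pdf"];
texts = ["Sales rose 5% in May."; "No change. Costs fell 2% and 3%."];

nChars = strlength(texts);      % total number of characters per document
mat = strings(0,2);             % column 1: snippet, column 2: file name
for j = 1:numel(texts)
    mat = [mat; keywordWindows(texts(j),names(j),"%",10,3)];
end
% To export, uncomment the next line (the column names are illustrative):
% writetable(array2table(mat,'VariableNames',{'Snippet','File'}),'XXX.xlsx');

function rows = keywordWindows(str,fileName,keyword,before,after)
% Hypothetical helper: returns one row per keyword match, holding the
% surrounding characters in column 1 and the file name in column 2.
    idx  = strfind(str,keyword);            % start index of each match
    rows = strings(numel(idx),2);
    for k = 1:numel(idx)
        st = max(idx(k)-before,1);              % clamp to first character
        en = min(idx(k)+after,strlength(str));  % clamp to last character
        rows(k,1) = extractBetween(str,st,en);
        rows(k,2) = fileName;
    end
end
```

In the code from the comment above, the same idea would mean calling the helper with str and S(j).name inside the loop over S, using 120 and 30 as the window sizes.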



Release

R2020b
