Text Extraction and retrieval

<P ID=1>
A LITTLE BLACK BIRD.
</P>
<P ID=2>
Story about a bird,
(1811)
</P>
<P ID=3>
Part 1.
</P>
As I am new to text extraction, I need help with:
  1. Writing code to count the delimiters (</P>)
  2. Removing all punctuation
  3. Breaking the text into individual documents at each delimiter, knowing that ID=1 refers to document 1, ID=2 refers to document 2, etc.

Accepted Answer

Akira Agata on 25 Oct 2017

1 vote

I put together a script to do that. Here is the result (assuming the maximum ID is 10).
% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
% 1. Count the delimiters '</P>'
idx = strfind(C,'</P>');
n = nnz(cellfun(@(x) ~isempty(x), idx));
% 2. Remove all punctuation
C2 = regexprep(C,'[.,!?:;]','');
% 3. Break the text into individual documents at each delimiter
idx2 = find(strcmp(C,'</P>'));
for kk = 1:10
    str = ['<P ID=',num2str(kk),'>'];
    idx_s = find(strcmp(C,str));
    if ~isempty(idx_s)
        idx_e = idx2(find(idx2>idx_s,1));
        fileName = ['document',num2str(kk),'.txt'];
        fid = fopen(fileName,'w');
        fprintf(fid,'%s\r\n',C(idx_s:idx_e));
        fclose(fid);
    end
end

6 Comments

John on 26 Oct 2017
Thank you very much for the quick response; I'm grateful. Now that I have a head start and some motivation, I'll try to continue on my own to learn. I'll come back if I need more help later.
John on 28 Oct 2017
Edited: per isakson on 9 Nov 2017
Thanks for your help. I proceeded with the task, and now I am stuck again. Please save me. I am trying to break the contents of the documents into tokens so I can:
  1. Find the total number of words in the document
  2. Find the total number of distinct words (repetitions are not re-counted)
  3. Find the number of times each word is found in the main document, and total number of documents it's found in.
Here is where I lose it:
%Split string (Tokenizer)
fileName = 'Tokens.txt';
fid = fopen(fileName,'w');
Tk=strtok(C2);
fprintf(fid,'%s\r\n',Tk);
fclose(fid);
Here, MATLAB tokenizes ONLY the first word on each line, and I can't seem to proceed. I need help!
I tried to count the words with this, but it doesn't work:
for kk = 1:n
str = ['<p id=',num2str(kk),'>'];
del = find(strcmp(C2,str));
No_Delimiter= regexprep(C2,del,''); %trying to remove the document IDs so I don't count them
end
N = doclength(No_Delimiter);
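A note on the tokenizer above: strtok returns only the first token of each element, which is why only the first word of every line is written out. The following is a minimal sketch, not the poster's code, of one way to get every word instead. The variable names and the two sample lines are made up for illustration, and string arrays are assumed to be available (R2016b or later).

```matlab
% Hypothetical sample lines standing in for C2 (not the poster's data)
sampleLines = ["a little black bird"; "story about a bird"];

% strtok returns only the first token of each element
firstTokens = strtok(sampleLines);

% regexp with 'match' returns every word found on every line,
% as a cell array with one cell per line
perLine = regexp(sampleLines, '[a-zA-Z\-]+', 'match');

% Flatten the per-line results into a single row of tokens
allTokens = [perLine{:}];
```

With this input, firstTokens holds one word per line, while allTokens holds all eight words, so numel(allTokens) can serve as the word count.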
Akira Agata on 28 Oct 2017
I don't fully understand your question, so let me clarify some points.
If my understanding is correct, what you want to do is:
(1) Count the total number of words in your original text file without counting the '<P ID=**>' and '</P>'
(2) Count the total number of distinct words
(3) Find the number of times each word is used in the original text
Is my understanding correct?
John on 28 Oct 2017
Edited: John on 28 Oct 2017
You clearly understand 1 and 2, and part of 3. For number 3, because ID=1 means Document 1, ID=2 means Document 2, etc., I also want to count the number of documents each word appears in.
Akira Agata on 30 Oct 2017
Edited: Akira Agata on 30 Oct 2017
Thanks for your reply. I've just made a script that does items 1 to 3, as follows. I hope it helps.
Regarding your last question ("count the number of documents each word appears in"), I think you can do that by combining the following script with my previous one.
% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
C = regexprep(C,'<[\w \=\/]+>',''); % Remove tags
C = regexprep(C,'[.,!?:;()]',''); % Remove punctuation and brackets
C = regexprep(C,'[0-9]+',''); % Remove numbers
C = lower(C); % Convert to lower case
% Extract every word
words = regexp(C,'[a-z\-]+','match');
words = [words{:}];
% (1) Count total number of words
numOfWords = numel(words); % --> 9
% (2) Count the total number of distinct words
numOfDistWords = numel(unique(words)); % --> 7
% (3) Find the number of times each word is used in the original text
wordList = unique(words);
wordCount = arrayfun(@(x) nnz(strcmp(x,words)), wordList);
% Show the result
figure
bar(wordCount)
xticklabels(wordList)
John on 7 Nov 2017
Thanks. I am stuck running the counter.
for kk = 1:n
    str = ['<p id=',num2str(kk),'>'];
    idx_s = find(strcmp(C,str));
    if ~isempty(idx_s)
        idx_e = idx2(find(idx2>idx_s,1));
        Doc=C(idx_s:idx_e); %May need to remove tags later
        Doc = regexp(Doc,'[a-z0-9\-]+','match');
        Doc = [Doc{:}];
        Unique_Doc_count = arrayfun(@(x) nnz(strcmp(x,Doc)), Unique);
        Unique_Doc_freq=[Unique;Unique_Doc_count];
    end
end
I want to check whether the elements of the string array 'Unique' exist in 'Doc'. 'Unique_Doc_count' gives me the number of occurrences of each word, but I need only 1 or 0 (exists or does not exist). The aim is to loop 'kk' over multiple documents and find the number of documents that contain each word in 'Unique': not how many times the word occurs, but how many documents it appears in.


More Answers (2)

Cedric on 26 Oct 2017

2 votes

Here is another approach based on pattern matching:
>> data = regexp(fileread('data.txt'), '(?<=<P[^>]+>\s*)[\w ]+', 'match' )
data =
1×3 cell array
{'A LITTLE BLACK BIRD'} {'Story about a bird'} {'Part 1'}
If you don't need the IDs (e.g. if they always run from 1 to the number of P tags), you are done.
If you need the IDs, you can get both IDs and content as follows:
>> data = regexp(fileread('data.txt'), '<P ID=(\d+)>\s*([\w ]+)', 'tokens' ) ;
data = vertcat( data{:} ) ;
ids = str2double( data(:,1) )
data = data(:,2)
ids =
1
2
3
data =
3×1 cell array
{'A LITTLE BLACK BIRD'}
{'Story about a bird' }
{'Part 1' }

6 Comments

John on 28 Oct 2017
Thanks for your help. I proceeded with the task, and now I am stuck again. Please save me. I am trying to break the contents of the documents into tokens so I can:
  1. Find the total number of words in the document
  2. Find the total number of distinct words (repetitions are not re-counted)
  3. Find the number of times each word is found in the main document, and the total number of documents it's found in.
Here is where I lose it:
%Split string (Tokenizer)
fileName = 'Tokens.txt';
fid = fopen(fileName,'w');
Tk=strtok(C2);
fprintf(fid,'%s\r\n',Tk);
fclose(fid);
Here, MATLAB tokenizes ONLY the first word on each line, and I can't seem to proceed. I need help!
I tried to count the words with this, but it doesn't work:
for kk = 1:n
str = ['<p id=',num2str(kk),'>'];
del = find(strcmp(C2,str));
No_Delimiter= regexprep(C2,del,''); %trying to remove the document IDs so I don't count them
end
N = doclength(No_Delimiter);
Cedric on 28 Oct 2017
Edited: Cedric on 28 Oct 2017
data = regexp( lower( fileread( 'data.txt' )), '(?<=<p[^>]+>\s*)[^<]+', 'match' ) ;
data = regexp( data, '[a-z\-]+', 'match' ) ;
allWords = [data{:}] ;
[allUniqueWords, ~, ic] = unique( allWords ) ;
counts = accumarray( ic, 1 ) ;
After running this, you'll have all words in the allWords cell array (so numel(allWords) is the total number of words), a list of unique words in allUniqueWords (so numel(allUniqueWords) is the number of unique words), and a count of occurrence of unique words in counts.
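To make the unique/accumarray step concrete, here is a small sketch on a hard-coded word list rather than on the file. The word list and the summary table are illustrative assumptions, not part of the original answer.

```matlab
% Hypothetical word list, as if already extracted and lower-cased
allWords = {'a','little','black','bird','story','about','a','bird','part'};

% ic maps every word to the index of its entry in allUniqueWords
[allUniqueWords, ~, ic] = unique(allWords);

% Adding 1 at each index gives the number of occurrences per unique word
counts = accumarray(ic(:), 1);

% Optional: pair each word with its count for easier reading
summary = table(allUniqueWords(:), counts, 'VariableNames', {'word','count'});
```

Because unique returns the words in sorted order, counts(k) is the number of occurrences of allUniqueWords{k}.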
MATLAB R2017b has a Text Analytics Toolbox that may do this better, but I am not using it. Maybe Akira is and can elaborate on this. For now, I think your best option is to learn the basics and study Akira's solution carefully, as it is the most natural approach using base MATLAB features. Mine relies more on pattern matching; while it is fairly concise, spending hours understanding regular expressions will not teach you MATLAB.
John on 31 Oct 2017
Thanks a lot, I'm grateful.
Cedric on 31 Oct 2017
My pleasure!
John on 7 Nov 2017
Thanks. I am stuck running a counter.
for kk = 1:n
    str = ['<p id=',num2str(kk),'>'];
    idx_s = find(strcmp(C,str));
    if ~isempty(idx_s)
        idx_e = idx2(find(idx2>idx_s,1));
        Doc=C(idx_s:idx_e); %May need to remove tags later
        Doc = regexp(Doc,'[a-z0-9\-]+','match');
        Doc = [Doc{:}];
        Unique_Doc_count = arrayfun(@(x) nnz(strcmp(x,Doc)), Unique);
        Unique_Doc_freq=[Unique;Unique_Doc_count];
    end
end
I want to check whether the elements of the string array 'Unique' exist in 'Doc'. 'Unique_Doc_count' gives me the number of occurrences of each word, but I need only 1 or 0 (exists or does not exist). The aim is to loop 'kk' over multiple documents and find the number of documents that contain each word in 'Unique': not how many times the word occurs, but how many documents it appears in.
Cedric on 9 Nov 2017
Edited: Cedric on 9 Nov 2017
If you have a count per document, finding the number of documents a keyword is in is easy:
counts = [7, 0 ,3] ;
hasKey = counts > 0 ; % [1,0,1]
nDocs = sum( hasKey ) ; % 2
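The same idea extends to every word at once. The sketch below is only an illustration under assumed inputs: the docs cell array, the vocab variable, and the sample words are hypothetical, and it uses ismember to get the 1/0 "exists" flag directly instead of thresholding counts.

```matlab
% Hypothetical documents, already lower-cased and stripped of tags
docs = {["a" "little" "black" "bird"], ["story" "about" "a" "bird"], "part"};

% Distinct words across all documents (sorted)
vocab = unique([docs{:}]);

% present(d, w) is true when word w occurs at least once in document d
present = false(numel(docs), numel(vocab));
for d = 1:numel(docs)
    present(d, :) = ismember(vocab, docs{d});
end

% Number of documents containing each word in vocab
docFreq = sum(present, 1);
```

Each row of present is the 1/0 vector asked for above, and summing down the columns gives the per-word document count.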


Christopher Creutzig on 2 Nov 2017
Edited: Christopher Creutzig on 2 Nov 2017

0 votes

It's probably easiest to split the text with string functions and then count the resulting pieces:
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
Then, numel(paras) will give you the number of </P>.
If you do not have extractFileText, calling string(fileread('file.txt')) should work just fine, too.
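For trying the split approach without a file or the toolbox, here is a minimal sketch that builds a two-paragraph sample inline. The sample string and the extra strtrim call are assumptions added for illustration; split, extractAfter, and newline are base MATLAB functions (R2016b or later).

```matlab
% Hypothetical inline sample mirroring the first two paragraphs of the question
str = "<P ID=1>" + newline + "A LITTLE BLACK BIRD." + newline + "</P>" + newline + ...
      "<P ID=2>" + newline + "Story about a bird," + newline + "(1811)" + newline + "</P>" + newline;

% Split at every closing tag; the last piece is what follows the final </P>
paras = split(str, "</P>");
paras(end) = [];

% Drop the opening "<P ID=n>" tag, then trim surrounding line breaks
paras = strtrim(extractAfter(paras, ">"));

% One paragraph per closing tag
nDelims = numel(paras);
```

Note that extractAfter cuts at the first ">" in each piece, so this relies on the opening tag being the first thing in every paragraph.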
In one of the comments, you indicated you also need to count the frequency of words in documents. That is what bagOfWords is for:
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
bag =
bagOfWords with 13 words and 3 documents:
a little black bird .
1 1 1 1 1
1 0 0 1 0

2 Comments

John on 7 Nov 2017
Thanks. I am stuck running a counter.
for kk = 1:n
    str = ['<p id=',num2str(kk),'>'];
    idx_s = find(strcmp(C,str));
    if ~isempty(idx_s)
        idx_e = idx2(find(idx2>idx_s,1));
        Doc=C(idx_s:idx_e); %May need to remove tags later
        Doc = regexp(Doc,'[a-z0-9\-]+','match');
        Doc = [Doc{:}];
        Unique_Doc_count = arrayfun(@(x) nnz(strcmp(x,Doc)), Unique);
        Unique_Doc_freq=[Unique;Unique_Doc_count];
    end
end
I want to check whether the elements of the string array 'Unique' exist in 'Doc'. 'Unique_Doc_count' gives me the number of occurrences of each word, but I need only 1 or 0 (exists or does not exist). The aim is to loop 'kk' over multiple documents and find the number of documents that contain each word in 'Unique': not how many times the word occurs, but how many documents it appears in.
shilpa patil on 23 Sep 2019
Edited: shilpa patil on 23 Sep 2019
How can the above code be rewritten for a document image instead of a text file?



Asked: 24 Oct 2017

Edited: 23 Sep 2019
