searching a string for a word

Question

Lauren Harkness, 13 Oct 2017

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/361096-searching-a-string-for-a-word

Edited: Cedric, 13 Oct 2017

I have a text file, and I am looking for how often certain words appear in it. I have used strfind, but the problem is that if one of the words I am searching for is short, say "and", it is also found inside other words like "band". I only want it counted when it stands alone. I tried searching for the word with a space before and after it (so when it stands alone), but this misses the word when it is first or last on a line in the text file. Code is attached.

A = fileread(txt)
fh = fopen(txt,'r')
B = strfind(A, firstword);
line = fgetl(fh)
C = strfind(A,secondword);
vec = [length(B),length(C)];

1 Comment

Cedric, 13 Oct 2017

Edited: Cedric, 13 Oct 2017


Part of the code is useless. The following

A = fileread(txt)

already opens the file, reads it as text, and closes it. After it executes, A contains the full content of the file, so there is no need to open the file again and read one line (and forget to close it).

As explained by Per below, STRFIND matches only exact occurrences of the text that you are looking for. As you observed, that makes it difficult to use for matching patterns (situations a little more flexible than a fixed sequence of letters). Looking for white space before and after the word was a good first attempt, but it fails in some cases, such as a word at the start or end of a line, and there is also the upper/lower case issue.
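The limitations described above can be reproduced on a small string. This is only a sketch: the sample text and the variable names (`txt`, `nSubstring`, `nPadded`, `nCapital`) are made up for illustration and are not from the original post.

```matlab
% Hypothetical sample text: "and" is a whole word 3 times and also
% occurs once inside "band".
txt = sprintf('and the band played\nsalt and pepper and');

% Plain substring search also counts the "and" inside "band".
nSubstring = numel(strfind(txt, 'and'));        % 4

% Requiring a space on both sides misses the first and the last word.
nPadded = numel(strfind(txt, ' and '));         % 1

% strfind is case sensitive, so a capitalised "And" is not found.
nCapital = numel(strfind('And so on', 'and'));  % 0
```

Neither count equals the three stand-alone occurrences, which is why a pattern-based search is worth considering here.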

All these considerations are a good sign that you need a slightly more elaborate approach based on pattern matching with regular expressions. This is what Per develops. Note that he uses REGEXPI and not REGEXP, to provide a case-insensitive solution.

Your code should look a bit like the following:

 textContent = fileread( textFile ) ;
 countWord1  = length( regexpi( textContent, ... )) ;
 countWord2  = length( regexpi( textContent, ... )) ;
 counts      = [countWord1, countWord2] ;

where ... are appropriate arguments (at least the pattern). Even better:

 wordsToFind = {'and', 'here', 'not'} ;
 textFile    = 'MyFile.txt' ;
 counts      = zeros( size( wordsToFind )) ;
 textContent = fileread( textFile ) ;
 for wordId = 1 : numel( wordsToFind )
    pattern = sprintf( '\\<%s\\>', wordsToFind{wordId} ) ;
    counts(wordId) = length( regexpi( textContent, pattern )) ;
 end

where we loop over a series of words defined in a cell array, and we build the pattern proposed by Per dynamically.
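To check the loop without a file on disk, the same logic can be run on an in-memory string. This is a sketch under assumptions: the sample text below is invented for illustration and stands in for the result of `fileread`.

```matlab
% Hypothetical text standing in for the content returned by fileread.
textContent = sprintf('And here we are, and not there.\nHere and there, not anywhere.');

wordsToFind = {'and', 'here', 'not'};
counts      = zeros(size(wordsToFind));
for wordId = 1 : numel(wordsToFind)
    % Build '\<word\>' so that only whole words match.
    pattern = sprintf('\\<%s\\>', wordsToFind{wordId});
    counts(wordId) = numel(regexpi(textContent, pattern));
end
% counts is [3 2 2]: "there" and "anywhere" do not count as "here".
```

The words in this sketch contain only letters. A word containing regular-expression metacharacters would be interpreted as part of the pattern, so it would need escaping first.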


Answer 1

per isakson, 13 Oct 2017

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/361096-searching-a-string-for-a-word#answer_285576

Edited: per isakson, 13 Oct 2017


Try

>> regexpi( 'And, and other words and_ 2and and', '(^|\W)and(\W|$)', 'start' )
ans =
     1     5    31

The pattern includes the character before the word "and", so the returned start index will often point at that preceding character (typically a space) rather than at the word itself.
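A further consequence of matching the surrounding characters is worth illustrating: each match consumes the separator that follows it, so an immediately following occurrence can be skipped. The sketch below uses a made-up test string and hypothetical variable names (`str`, `startA`, `matchA`, `startB`).

```matlab
str = 'and and and';

% Each match includes its neighbouring non-word characters. The first match
% takes the space after it, so the second "and" has no leading \W left.
[startA, matchA] = regexpi(str, '(^|\W)and(\W|$)', 'start', 'match');
% startA is [1 8], matchA is {'and ', ' and'}: only 2 of the 3 words are found.

% Word-boundary anchors match positions, not characters, so nothing is consumed.
startB = regexpi(str, '\<and\>', 'start');
% startB is [1 5 9]: all 3 words are found.
```

For counting purposes this is a stronger reason to prefer the word-boundary form than the shifted start index alone.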

Better

>> regexpi( 'And, and other words and_ 2and and', '\<and\>', 'start' )
ans =
     1     6    32

Why read line by line rather than the entire text in one go?

str = fileread( filespec );
pos = regexpi( str, '\<and\>', 'start' );
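Since the question asks for a frequency rather than positions, the number of elements in `pos` gives the count. The following sketch assumes a throwaway file created with `tempname`; the file content and the name `nAnd` are hypothetical.

```matlab
% Hypothetical temporary file standing in for the asker's text file.
filespec = [tempname, '.txt'];
fid = fopen(filespec, 'w');
fprintf(fid, 'And here\nband and sand\nand');
fclose(fid);

str  = fileread(filespec);                 % whole file as one char vector
pos  = regexpi(str, '\<and\>', 'start');   % start index of each whole-word hit
nAnd = numel(pos);                         % 3: "band" and "sand" are ignored

delete(filespec);                          % remove the temporary file
```

Words at the start or end of a line are counted as well, because the word-boundary anchors do not require a space on either side.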

Doc says:

\W Any character that is not alphabetic, numeric, or underscore. For English character sets, \W is equivalent to [^a-zA-Z_0-9]
\<expr Beginning of a word.