Read csv strings, keep or create surrounding whitespace

Ben on 20 Jun 2014
Edited: Cedric on 23 Jun 2014
I have a list of stop words that currently exists as a comma-separated list in a .txt file. The goal is to use that list to remove those words from some target text, but only when a given word (e.g. "and") appears by itself: remove "and", but don't turn "sand" into "s". To that end, I tried manually putting spaces around all the words in the list, so "a,able,about" became " a , able , about ". However, the textscan function stripped the spaces out. Is there a way to prevent it from doing that? Alternatively, if I use the original form of the list, can I tell textscan to surround each string with spaces?
  1 Comment
Cedric on 20 Jun 2014
Edited: Cedric on 20 Jun 2014
Could you give an example, like a sample file, and indicate precisely what you want to achieve? This seems to be a task for REGEXPREP.


Accepted Answer

Cedric on 20 Jun 2014
Edited: Cedric on 20 Jun 2014
Here is an example that I can refine if you provide more information. It rewrites some keywords in upper case.
key = {'lobster', 'and'} ;
str = 'Lobster anatomy includes the cephalothorax which fuses the head and the thorax, both of which are covered by a chitinous carapace, and the abdomen. The lobster''s head bears antennae, antennules, mandibles, the first and second maxillae, and the first, second, and third maxillipeds. Because lobsters live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.' ;
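for kId = 1 : length( key )
    % Match the keyword only if no word character precedes it and if it is
    % followed by an optional 's' and then a non-word character or the end.
    pat = sprintf( '(?<!\\w)%s(?=s?(\\W|$))', key{kId} ) ;
    str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' ) ;
end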
Running this, you get
>> str
str =
LOBSTER anatomy includes the cephalothorax which fuses the head AND the thorax, both of which are covered by a chitinous carapace, AND the abdomen. The LOBSTER's head bears antennae, antennules, mandibles, the first AND second maxillae, AND the first, second, AND third maxillipeds. Because LOBSTERs live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.
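The REGEXP-based approach makes it possible to encode conditions such as:
  • match only if the keyword is framed by non-alphanumeric characters (e.g. a space or a comma),
  • unless the following character is an 's', which is tolerated (plural),
  • unless the keyword is at the beginning of the string, where no preceding character is required.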
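Applied to the original question, the same idea can remove stop words instead of upper-casing them. The following is only a minimal sketch under assumptions: listStr stands for the content of the .txt file (for instance as returned by fileread), the sample words and target sentence are made up for illustration, and the plural 's' is deliberately not tolerated here.

```matlab
% Assumed inputs (hypothetical, not from the thread): the comma-separated
% stop-word list as one string, and some target text.
listStr = 'a,able,about,and' ;
target  = 'Sand and a table are about to be able.' ;

% Split on commas and trim each word; no padding spaces are needed.
stopWords = strtrim( strsplit( listStr, ',' ) ) ;

for k = 1 : numel( stopWords )
    % Escape characters that have a special meaning in regular expressions.
    word   = regexptranslate( 'escape', stopWords{k} ) ;
    % Match the word only when no word character touches it on either side.
    pat    = [ '(?<!\w)', word, '(?!\w)' ] ;
    target = regexprep( target, pat, '', 'ignorecase' ) ;
end

% Collapse the runs of whitespace left behind by the removals.
target = strtrim( regexprep( target, '\s+', ' ' ) ) ;
```

With this sample, "and", "a", "about" and "able" are removed as standalone words, while "Sand" and "table" are left intact because the lookarounds reject matches inside a longer word.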
  21 Comments
Ben on 23 Jun 2014
Ah, I hadn't realized that regexp functions don't do their work all at once, as strrep does. That should do it. Thank you so much!
Cedric on 23 Jun 2014
Edited: Cedric on 23 Jun 2014
You're welcome! Note that it could do its job all at once if you passed a pattern that contains all keywords combined with an OR operation. Yet it is often more efficient to apply a simple pattern several times than to apply a single extra-long, complex pattern once. That should be profiled for your specific case if you want to optimize.
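To illustrate the OR variant, here is a small sketch that compares the loop with a single-pass pattern. The sample text and variable names are assumptions made for this illustration; the single-pass version relies on a dynamic expression, ${upper($1)}, in the replacement string to upper-case whichever keyword was captured.

```matlab
% Assumed sample data (not from the thread).
keys = {'lobster', 'and'} ;
txt  = 'Sand and lobsters: the lobster and the band.' ;

% Loop version: one simple pattern per keyword.
loopOut = txt ;
for k = 1 : numel( keys )
    pat     = sprintf( '(?<!\\w)%s(?=s?(\\W|$))', keys{k} ) ;
    loopOut = regexprep( loopOut, pat, upper( keys{k} ), 'ignorecase' ) ;
end

% Single-pass version: all keywords joined with the OR operator "|".
% The keyword is captured as token 1; the lookahead uses a non-capturing
% group so that it does not create an extra token.
orPat     = [ '(?<!\w)(', strjoin( keys, '|' ), ')(?=s?(?:\W|$))' ] ;
singleOut = regexprep( txt, orPat, '${upper($1)}', 'ignorecase' ) ;

% Both approaches should give the same result on this sample.
sameResult = isequal( loopOut, singleOut ) ;
```

Which of the two is faster depends on the number of keywords and the length of the text, so timing both (for example with tic and toc) on representative data is the safer way to decide.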
