Fastest way to replace multiple substrings with a single new string?

Omar Salah, 6 June 2020
Commented: Omar Salah, 18 June 2020
Hello Everyone,
I'm trying to replace 7k different substrings with the same tag in a 50-million-word dataset (a cell array of 1 million strings averaging 50 words each). As you can see, using replace or regexprep takes a long time. I tried using strrep the same way as replace, but it gives me this error:
Error using strrep
All nonscalar inputs must be the same size.
I want to ask: what is the fastest and least memory-consuming way to do it?
Here is the code:
% using replace
Tag = 'IMPORTANT';
substr = {'very','much'}; % a cell array of 7k+ words
reptag = cell(1,size(substr,2));
tagcell = cellfun(@(x) Tag,reptag,'UniformOutput',false);
maintext = replace(maintext,substr,tagcell);
% using regexprep
ev = '(';
for evi = 1:size(substr,2)
    ev = [ev substr{evi} '|']; % append the evi-th word, not the whole cell array
end
ev = [ev(1:end-1) ')'];
maintext = regexprep(maintext,ev,Tag);
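A note on the regexprep variant above: joining the raw words with '|' treats any regex metacharacters inside them as operators, and it also matches inside longer words (for example 'very' inside 'every'). Below is a minimal sketch of one way to guard against both, assuming whole-word matching is what is wanted. The sample words and sentences are made up for illustration, and this says nothing about speed on the full dataset.

```matlab
% Sketch: build one alternation pattern from the word list.
Tag      = 'IMPORTANT';
substr   = {'very', 'much', 'a.b'};                        % hypothetical sample words
maintext = {'very much so'; 'every muchness'; 'a.b axb'};  % hypothetical sample text

% Escape regex metacharacters so each word is matched literally.
escaped = cellfun(@(s) regexptranslate('escape', s), substr, ...
                  'UniformOutput', false);

% Join the words with '|' inside a non-capturing group and anchor the
% group on word boundaries (\< = start of word, \> = end of word).
pattern = ['\<(?:' strjoin(escaped, '|') ')\>'];

result = regexprep(maintext, pattern, Tag);
disp(result)
```

If plain substring matching is really intended, drop the `\<` and `\>` anchors and keep only the escaping step.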
4 comments
Omar Salah, 10 June 2020
@james I can actually work with both: either a cell array of character vectors or a cell of strings. I move between them easily. Is one type faster than the other?
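Whether one container type is faster probably depends on the MATLAB release and the data, so measuring it directly is the safest route. A rough sketch of such a measurement with timeit follows; the sample sentence, the array size of 1000, and the variable names are all assumptions chosen for illustration, not figures from the real dataset.

```matlab
% Sketch: time the same replace call on a cellstr and on a string array.
Tag    = 'IMPORTANT';
substr = {'very', 'much'};                      % hypothetical word list
base   = 'this is very much a test sentence';   % hypothetical sample text
asCell = repmat({base}, 1000, 1);               % cell array of char vectors
asStr  = string(asCell);                        % string array, same content

% timeit calls each function handle several times and returns a typical time.
tCell = timeit(@() replace(asCell, substr, Tag));
tStr  = timeit(@() replace(asStr,  substr, Tag));
fprintf('cellstr: %.4f s, string: %.4f s\n', tCell, tStr);

% Sanity check: both containers should hold the same text afterwards.
sameResult = isequal(cellstr(replace(asStr, substr, Tag)), ...
                     replace(asCell, substr, Tag));
```

Running this on a representative slice of the real data would give a more meaningful comparison than the toy input used here.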
Omar Salah, 10 June 2020
@stephen I have never worked with C++, but I'm wondering: why would they be faster? Is it because they are compiled, or because C++ functions are generally faster?


Answers (1)

Mohammad Sami, 11 June 2020
After some experimentation, I think that if you tokenize your sentences, you can use a hashmap to look up the words to replace.
Example code is as follows. If you want case-insensitive matching, apply the function lower to both the words and the sentences.
substr = cellstr(substr);
w = containers.Map(substr,substr); % create a hashmap of the substrings you want to replace
m2 = cellstr(sentences);
m5 = cell(length(m2),1);
for i = 1:length(m2)
    m3 = split(m2{i},' ');   % tokenize the sentence
    m4 = w.isKey(m3);        % look up which words to replace
    m3(m4) = {'IMPORTANT'};  % replace the words
    m5(i) = join(m3,' ');    % store the updated sentence
end
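The case-insensitive variant mentioned above can be sketched as a small helper. The function name tagWords, its arguments, and the sample data are hypothetical. The sketch assumes tokens are separated by single spaces, as in the loop above, and it skips empty tokens rather than looking them up.

```matlab
% Usage on hypothetical sample data (script section; local function below).
sentences = {'This is Very much needed'; 'nothing here'};
words     = {'very', 'much'};
tagged    = tagWords(sentences, words, 'IMPORTANT');
disp(tagged)

function out = tagWords(sentences, words, tag)
% tagWords (hypothetical helper): replace every whole word that appears
% in WORDS with TAG, ignoring case. Untouched words keep their case.
    words  = unique(lower(cellstr(words)));      % lower-case, unique keys
    lookup = containers.Map(words, words);       % hashmap used only for isKey
    sentences = cellstr(sentences);
    out = cell(size(sentences));
    for k = 1:numel(sentences)
        tokens   = split(sentences{k}, ' ');     % tokenize on single spaces
        hit      = false(size(tokens));
        nonEmpty = ~cellfun('isempty', tokens);  % skip empty tokens
        if any(nonEmpty)
            hit(nonEmpty) = isKey(lookup, lower(tokens(nonEmpty)));
        end
        tokens(hit) = {tag};                     % replace the matched words
        out{k} = strjoin(tokens.', ' ');         % rebuild the sentence
    end
end
```

Only the lookup is lower-cased, so words that are not replaced come back exactly as they were written.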
1 comment
Omar Salah, 18 June 2020
Wow! Thanks, that's definitely something to try. I will try it tonight and get back to you :)
