How to find strings in a very large array of data?

Steven on 20 Nov 2019

https://jp.mathworks.com/matlabcentral/answers/492192-how-to-find-strings-in-a-very-large-array-of-data

Edited: per isakson on 23 Nov 2019

I have a csv file containing a large number of numbers and a few random strings like 'zgdf'. I need to find them and set them to zero. I cannot use 'csvread' (due to strings), so I use 'textscan' to read the file.

I then convert the data to numbers using str2double. MATLAB turns the string values into NaN, which is fine for me, but it takes a long time, especially because this has to be done for many similar files.

Any faster method to sort this out?

This is how I read the data (the original file has two columns and a large number of rows):

fileID = fopen(filename);
C = textscan(fileID,'%s %s','Delimiter',',');
fclose(fileID); 
for i = 1: length (C{1})
    D(i) = str2double(C{1}{i});
end

Thanks

Adam Danz on 21 Nov 2019

Edited: Adam Danz on 21 Nov 2019

Knowing your MATLAB release is usually helpful, which is why it's included as an optional field when you post a question in this forum.

I've confirmed that the loop method of str2double() is indeed faster than the direct application to the cell array. Sometimes loops are faster.

See method 3 in my answer which applies your sscanf idea and avoids the error you described.

See method 4 for a FEX function that is like str2double() but much faster.

Method 5 is very fast but requires R2019a.

Lastly, whenever you build a variable within a loop, always pre-allocate the variable. Not pre-allocating the variable will definitely slow down your code.

Ridwan Alam on 21 Nov 2019

Edited: Ridwan Alam on 21 Nov 2019

@Steven

I have updated my answer with the textscan syntax that uses the 'TreatAsEmpty' option. It returns NaN in place of those known noise strings. Adding the 'EmptyValue',0 option returns 0 instead of NaN.

I'm not sure how much of a speedup that will give, though :(

Accepted Answer

Adam Danz on 20 Nov 2019

Edited: Adam Danz on 21 Nov 2019

[This answer has been reorganized following the discussion in the comment section under the question]

Method 1

fid = fopen('myCSVfile.csv');             
C = textscan(fid,'%s %s','Delimiter',',');
fclose(fid);                              
A = str2double(C{1});  % Faster than doing the same thing in a loop.           

[Update] The loop method below is actually faster:

A = zeros(size(C{1})); % <--- always pre-allocate!
for i = 1:numel(C{1})
    A(i) = str2double(C{1}{i});   % index into A; a plain "A = ..." would overwrite it on every pass
end
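Whether the loop or the direct call wins can depend on the release and the data, so it is worth timing both on your own machine. The following is only a rough sketch: the synthetic cell array c, its size n and the number of junk entries are assumptions standing in for C{1}, and a single tic/toc run is noisy, so repeat it a few times before drawing conclusions.

```matlab
% Synthetic stand-in for C{1}: numeric text with a few junk strings mixed in.
n = 1e4;                                   % assumed size; adjust to your data
c = cellstr(num2str(rand(n,1), '%.5f'));   % n-by-1 cell array of char vectors
c(randperm(n, 100)) = {'zgdf'};            % sprinkle in 100 non-numeric entries

% Direct call on the whole cell array.
tic
A1 = str2double(c);
tVec = toc;

% Element-by-element conversion into a pre-allocated array.
tic
A2 = zeros(size(c));
for i = 1:numel(c)
    A2(i) = str2double(c{i});
end
tLoop = toc;

fprintf('direct call: %.3f s, loop: %.3f s\n', tVec, tLoop);
```

Both variants should produce the same array, with NaN wherever a junk string was found.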

Method 2

Try this modification of the script generated by the Import Data tool. Rather than importing the data as text and then converting it with str2double(), it imports the data as numeric and replaces non-numeric elements with NaN. I expect it to be faster than your approach, though possibly not by much (or not at all).

The only two things you need to change to adapt it to your data are:

file (the file name or, preferably, the full path to your file)
the NumVariables value (the number of columns of data)

%% Setup the Import Options and import the data
file = "C:\Users\name\Documents\MATLAB\myCSVfile.csv";   % Full path to your file (or just file name)
opts = delimitedTextImportOptions("NumVariables", 2);    % Number of columns of data
opts.VariableTypes(:) = {'double'};                      % read in all data as double (nan for strings)
opts.Delimiter = ",";
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";       
Data = readtable(file, opts);                            % Read in as table
Data = Data{:,:};                                        % Convert to matrix
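Since the question asks for zeros rather than NaN, one extra line finishes the job. Here is a minimal, self-contained sketch of that step; the temporary file and its three rows are made up for illustration and are not the asker's data.

```matlab
% Write a small hypothetical CSV so the example is self-contained.
file = [tempname '.csv'];
fid  = fopen(file, 'w');
fprintf(fid, '0.81472,0.15761\nabc,0.97059\n0.2785,def\n');
fclose(fid);

opts = delimitedTextImportOptions("NumVariables", 2);
opts.VariableTypes(:) = {'double'};        % non-numeric fields are imported as NaN
opts.Delimiter = ",";
Data = readtable(file, opts);              % read in as table
Data = Data{:,:};                          % table -> numeric matrix
Data(isnan(Data)) = 0;                     % the question asks for zeros, not NaN
delete(file);                              % remove the temporary file
```

Note that this also zeroes any genuine NaN or empty fields in the file, which may or may not be what you want.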

Method 3

D = zeros(size(C{1}));     % <--- pre-allocate!
for j = 1: length (C{1})
    s = sscanf(C{1}{j},'%f');
    if ~isempty(s)
        D(j) = s;
    end
end

In my timing this was about 4.5x faster than Method 1.
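One thing to keep in mind is that sscanf and str2double do not treat every field the same way, so Method 3 is not a drop-in replacement for all inputs. The short sketch below uses made-up tokens to show the difference; the helper name isSingle is hypothetical.

```matlab
tokens = {'3.14', 'zgdf', '10a', '1 2'};   % hypothetical field contents

for k = 1:numel(tokens)
    s = sscanf(tokens{k}, '%f');           % may return zero, one or several values
    d = str2double(tokens{k});             % always a scalar, NaN if not a clean number
    fprintf('%-6s sscanf -> %d value(s), str2double -> %g\n', ...
            tokens{k}, numel(s), d);
end

% A slightly stricter test than ~isempty(s): accept only a single parsed value.
% '10a' still passes, because sscanf stops at the first character it cannot parse.
isSingle = @(t) isscalar(sscanf(t, '%f'));
```

With the ~isempty(s) test in Method 3, a field such as '1 2' would make the assignment D(j) = s fail because s has two elements; checking isscalar(s) instead avoids that.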

Method 4

This File Exchange (FEX) function is designed to overcome the slow speed of str2double():

https://www.mathworks.com/matlabcentral/fileexchange/28893-fast-string-to-double-conversion

Method 5

A very fast solution is to read the data with readmatrix(), which automatically converts non-numeric elements to NaN, but it requires R2019a or later.

file = 'myCSVfile.csv'; 
D = readmatrix(file);   %that's it, just 2 lines
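Because the question mentions many similar files, the same call can be wrapped in a loop over a folder. This is a sketch under assumptions: the temporary folder, the file names run1.csv and run2.csv, and their contents are invented for the example, and it relies on readmatrix's automatic format detection treating these mostly numeric files as plain data.

```matlab
% Hypothetical setup: a temporary folder with two small CSV files.
folder = tempname;
mkdir(folder);
for k = 1:2
    fid = fopen(fullfile(folder, sprintf('run%d.csv', k)), 'w');
    fprintf(fid, '0.81472,0.15761\n0.12699,0.95717\nabc,0.97059\n0.2785,def\n0.54688,0.91574\n');
    fclose(fid);
end

files = dir(fullfile(folder, '*.csv'));
data  = cell(numel(files), 1);                 % pre-allocate one cell per file
for k = 1:numel(files)
    D = readmatrix(fullfile(folder, files(k).name));   % R2019a or later
    D(isnan(D)) = 0;                           % zeros instead of NaN
    data{k} = D;
end
rmdir(folder, 's');                            % clean up the temporary folder
```

Storing each matrix in a cell keeps the files separate even if they have different numbers of rows.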

Steven on 21 Nov 2019

Edited: Steven on 21 Nov 2019

Thanks Adam,

I tried it on R2018b, and Method 2 was much faster! Thanks.

On my PC, this is how long each took for a given file:

Method 1: 5.8 s

Method 2: 0.6 s

Method 3: 3.1 s

I couldn't check method 5 though.

Great experience!

Thanks guys

Adam Danz on 21 Nov 2019

Thanks for the feedback!

More Answers (2)

Ridwan Alam on 20 Nov 2019

Edited: Ridwan Alam on 21 Nov 2019

Given that the list of noise strings is {'a', 'b', 'ee'}:

C = cell2mat(textscan(fileID,'%f %f','Delimiter',',','TreatAsEmpty',{'a','b','ee'},'EmptyValue',0));

Try this!!

%% Old Answer

Updated using Method 1 from Adam:

C = textscan(fileID,'%s %s','Delimiter',',');
C = [str2double(C{1}) str2double(C{2})];
C(isnan(C)) = 0;

Steven on 21 Nov 2019

Thank you Ridwan.

Ridwan Alam on 21 Nov 2019

Sure, Steven. Please vote up if you liked the conversation. Thanks!

per isakson on 21 Nov 2019

Edited: per isakson on 23 Nov 2019

"random strings like 'zgdf'" If that means letters of the US alphabet, this code is rather fast.

%%
chr = fileread('cssm.txt');
chr = regexprep( chr, '[A-Za-z]+', '0.0' );
cac = textscan( chr, '%f%f', 'Delimiter',',', 'CollectOutput',true );
num = cac{1};

Result:

>> num(1:10,:)
ans =
      0.81472      0.15761
            0      0.97059
      0.12699      0.95717
      0.91338      0.48538
      0.63236      0.80028
      0.09754      0.14189
       0.2785            0
      0.54688      0.91574
            0      0.79221
      0.96489      0.95949

Where cssm.txt contains

0.81472, 0.15761
abc    , 0.97059
0.12699, 0.95717
0.91338, 0.48538
0.63236, 0.80028
0.09754, 0.14189
0.27850, def
0.54688, 0.91574
zgdf   , 0.79221
0.96489, 0.95949
et cetera

In response to comments

See the caveat in the first line of my answer.

I have not found a regular expression for "not a legal number", and if one exists it might not be that fast.

It's straightforward to add a few extra characters (many becomes impractical), e.g. '^â', and to make sure that the string is followed by a comma or the end of the line.

>> chr = regexprep( '12.3, abc, g^â, 1.0e5, def ', '(?m)[A-Za-zâ^]+(?=\x20*\r?(,|$))', '0.0' )
chr =
    '12.3, 0.0, 0.0, 1.0e5, 0.0 '
>>

Look-ahead, e.g. '(?=\x20*\r?(,|$))', is reasonably fast, but look-behind sometimes ruins performance.

The above regex fails for 'def1', '1deg' and '10a'.
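For fields that mix digits and letters like these, one fallback is to give up on the regex and convert token by token, accepting the str2double cost that the rest of this thread tries to avoid. The sketch below is not part of the answer above; the sample text txt is hypothetical and contains the three failing tokens.

```matlab
% Hypothetical sample text containing the three failure cases.
txt = '12.3, def1, 1deg, 10a, 1.0e5';

% Split on commas, convert each token, and zero out anything that
% str2double cannot parse as a single number.
tok = strtrim( strsplit( txt, ',' ) );   % 1-by-5 cell array of char vectors
num = str2double( tok );                 % NaN for 'def1', '1deg' and '10a'
num( isnan(num) ) = 0;
```

This keeps valid exponent notation such as 1.0e5 intact, which a regex that simply strips letters would not. On a large file it is likely to be much slower than the regexprep approach, so it may be best reserved for the rows or files where the fast path is known to be unsafe.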

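fileread in combination with CRLF line endings poses a problem when using regular expressions: the anchor $ doesn't recognise CRLF as a newline. (Please tell me if I missed something.) The best way to avoid this problem is to replace fileread with a function that opens the file in text mode:

[fid, msg] = fopen( filespec, 'rt' );   % text mode: on Windows, CRLF is read as LF
assert( fid >= 3, msg )                 % stop with fopen's message if the file did not open
chr = fread( fid, inf, '*char' )';      % fread returns a column; transpose to a row like fileread
fclose( fid );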

Steven on 21 Nov 2019

Edited: Steven on 21 Nov 2019

Thanks Per.

Sometimes, characters include something like "g^â".

per isakson on 22 Nov 2019

Edited: per isakson on 22 Nov 2019

I added a response to my answer.
