Why does textscan read only 95% of my data file?

Question

Janice Nelson on 12 Sep 2017

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/356162-why-does-textscan-read-only-95-of-my-data-file

Commented: dpb on 13 Sep 2017

Accepted Answer: Janice Nelson

My .dat file has 423,000+ ASCII lines that look like:

2017-08-30 12:34:56 7.89

When I use textscan, each of the 7 cells in the returned variable has only 413,315 values. If I read the file line by line with fgetl, I get all 423,000+ values. textscan takes a few seconds; fgetl takes several minutes.

DATAX=textscan(fid,'%d-%d-%d %d:%d:%d %f');

The next thing to try is using sed to replace '-', ':', and ' ' with \t, then trying again. Unfortunately, I have almost 400 files like this. An opportunity to improve my shell scripting...

Any help is greatly appreciated.

2 Comments

Janice Nelson on 12 Sep 2017

Edited: Janice Nelson on 12 Sep 2017

Here's what worked:

        filecontent = fileread(TheFileName);
        tokens = regexp(filecontent, ...
          '(?<date>[-0-9]+\s+[0-9:.]+)\s+(?<value>-?[0-9.]+e?+)*', ...
          'names');
        %dates = datetime({tokens.date});
        dates = datenum({tokens.date});
        values = str2double({tokens.value});

It handles both f- and e-formatted values. datenum seems slower than datetime, but semilogy doesn't like datetime values.

Janice Nelson on 12 Sep 2017

Thanks to all for your help!

Answer 1

Janice Nelson on 12 Sep 2017

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/356162-why-does-textscan-read-only-95-of-my-data-file#answer_281229

Here's what worked:

filecontent = fileread(TheFileName);
tokens = regexp(filecontent, ...
  '(?<date>[-0-9]+\s+[0-9:.]+)\s+(?<value>-?[0-9.]+e?+)*', ...
  'names');
%dates = datetime({tokens.date});
dates = datenum({tokens.date});
values = str2double({tokens.value});

It handles both f- and e-formatted values. I used datenum, even though it seems slower than datetime, so that semilogy works. plot() accepts datetime values, but semilogy doesn't.

2 Comments

Cedric on 12 Sep 2017

Edited: Cedric on 12 Sep 2017

Next time, accept the answer of the people who helped you get to a solution.

Depending on how fast you need the approach to be: in my experience it is often faster to find and fix discrepancies in a text buffer using a very short and efficient regular expression, and then to parse it using SSCANF, TEXTSCAN, DATENUM, etc.

If your content is really something like

 2017-08-30 12:34:56 7.89
 2017-08-30 12:34:56 7.89
 ..

why not just split on whitespace using STRSPLIT or REGEXP with a \s+ pattern, reshape, concatenate columns 1 and 2, and convert with DATENUM and STR2DOUBLE?
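A minimal sketch of that split-and-reshape idea, assuming every record has exactly three whitespace-separated fields (date, time, value). The sample buffer and the variable names are made up for illustration; in practice the buffer would come from fileread.

```matlab
% Hypothetical two-record buffer standing in for fileread(TheFileName)
buffer = sprintf('2017-08-30 12:34:56 7.89\n2017-08-30 12:34:57 1.2e-3\n');

% strsplit with no delimiter argument splits on any run of whitespace
fields = strsplit(strtrim(buffer));

% Three fields per record: reshape to 3-by-N, then transpose to N-by-3
fields = reshape(fields, 3, []).';

% Join date and time element by element (the {' '} keeps the space)
stamps = strcat(fields(:,1), {' '}, fields(:,2));

dates  = datenum(stamps, 'yyyy-mm-dd HH:MM:SS');
values = str2double(fields(:,3));
```

The reshape fails with an error if the number of fields is not a multiple of three, which is a cheap way to notice a malformed record, though it does not say which line is at fault.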

Walter Roberson on 12 Sep 2017

semilogy is the same as plot() followed by set() of 'YScale' to 'log' on the axes.
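As a rough illustration of that equivalence, the sketch below plots datetime values with plot() and then switches the y-axis to a log scale. The data are invented, and the figure is created invisible only so the snippet can run unattended.

```matlab
% Made-up data: ten timestamps one second apart, values spanning decades
t = datetime(2017, 8, 30, 12, 0, 0) + seconds(0:9);
y = logspace(-3, 2, 10);

fig = figure('Visible', 'off');
ax  = axes('Parent', fig);

plot(ax, t, y);            % plot() accepts datetime on the x-axis
set(ax, 'YScale', 'log');  % same effect on the y-axis as semilogy
```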

Answer 2

Walter Roberson on 12 Sep 2017

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/356162-why-does-textscan-read-only-95-of-my-data-file#answer_281186

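filecontent = fileread(TheFileName);
% 'lineanchors' makes ^ match at the start of every line, not only at the start of the text
tokens = regexp(filecontent, '^(?<date>[-0-9]+\s+[0-9:.]+)\s+(?<value>-?[0-9.]+)', 'names', 'lineanchors');
dates = datetime({tokens.date});
values = str2double({tokens.value});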

This code is designed to permit negative numeric values, but it assumes that a negative sign is immediately followed by the digits, with no space in between. Also, this code is not designed to recognize exponential format.
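If exponential notation does occur in the file, one possible extension is to add an optional exponent to the value pattern. The sketch below is an assumption about what such a pattern could look like, not the code from this answer; the sample text is invented.

```matlab
% Hypothetical sample covering fixed-point and exponential values
sample = sprintf(['2017-08-30 12:34:56 7.89\n', ...
                  '2017-08-30 12:34:57 -1.5e-03\n', ...
                  '2017-08-30 12:34:58 2E+4\n']);

% Value: optional sign, digits with optional decimal point,
% then an optional exponent such as e-03 or E+4
pattern = ['^(?<date>[-0-9]+\s+[0-9:.]+)\s+', ...
           '(?<value>[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)'];

% 'lineanchors' lets ^ match at the start of each line
tokens = regexp(sample, pattern, 'names', 'lineanchors');
values = str2double({tokens.value});
```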

5 Comments (3 older comments not shown)

Walter Roberson on 12 Sep 2017

When you use %D for a time of day without the date, the result is taken relative to the date on which it was scanned. You cannot then just add that to the date portion. You have to do something like:

DatePortion + (TimePortion - dateshift(TimePortion, 'start', 'day'))
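A small worked sketch of that correction, using dateshift to strip the scan date from the time portion. The two input values are invented to mimic what a date-only and a time-only %D field might return.

```matlab
% Date-only field, as it might come back from textscan
DatePortion = datetime(2017, 8, 30);

% Time-only field: the time of day gets attached to today's date
TimePortion = datetime('today') + hours(12) + minutes(34) + seconds(56);

% Elapsed time since midnight of whatever day TimePortion fell on
sinceMidnight = TimePortion - dateshift(TimePortion, 'start', 'day');

combined = DatePortion + sinceMidnight;
```

Here sinceMidnight is a duration, so adding it to DatePortion gives a datetime on the intended day.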

dpb on 12 Sep 2017

Edited: dpb on 12 Sep 2017

That's what strjoin is for--concatenate the date/time strings into one for datetime to parse as a whole.

I think it's a major wart in implementation of '%D' that it doesn't handle the cases directly; I'm hoping that's because it's still the new kid on the block and just isn't yet ripe but like wine will improve with aging...

Except that, owing to the cell structure, it won't work as written: that would concatenate all the dates followed by all the times into one long string... it's a pain to deal with no matter what you do.

Answer 3

dpb on 12 Sep 2017

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/356162-why-does-textscan-read-only-95-of-my-data-file#answer_281176

There'll be a formatting discrepancy in the file at the offending line that causes textscan to fail. fgetl, on the other hand, reads each line as a character string without interpreting it, so the content is immaterial.

Since what you have is a date/time field, I'd suggest using the '%D' format specifier and returning the data as datetime class instead of a string of separate variables. See the format field description for details on the format.

You might attach the last portion of the file with the offending line so folks can see what might be the actual issue -- only the section in the neighborhood of the place where the code fails is pertinent.
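One way to locate such a line without a slow fgetl loop is to split the file text into lines and test each against the expected layout. This is only a sketch: the sample text, the expected pattern, and the variable names are assumptions, and the pattern would need adjusting to the real file.

```matlab
% Hypothetical sample standing in for fileread(TheFileName);
% the second record has a letter O where a zero should be
filecontent = sprintf(['2017-08-30 12:34:56 7.89\n', ...
                       '2017-08-30 12:34:57 7.9O\n', ...
                       '2017-08-30 12:34:58 1.2e-03\n']);

% One cell per line; \r?\n copes with both Unix and Windows endings
lines = regexp(filecontent, '\r?\n', 'split');

% Assumed layout: date, time (optional fraction), one numeric value
expected = ['^\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}(\.\d+)?\s+', ...
            '[-+]?[0-9.]+([eE][-+]?\d+)?\s*$'];

matched = regexp(lines, expected, 'match', 'once');
isBlank = cellfun(@isempty, strtrim(lines));
isBad   = cellfun(@isempty, matched) & ~isBlank;

badLineNumbers = find(isBad);
disp(lines(isBad).');
```

Because blank lines are kept in the cell array, badLineNumbers refers to line numbers in the original file.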

1 Comment

Walter Roberson on 12 Sep 2017

The %D format specifier is a bit tricky because spaces cannot be present in the date, unless you have set 'whitespace' to exclude space. However if you set 'whitespace' to exclude space then it is not going to be able to recognize the spacing between the time and the value.

Answer 4

Jeremy Hughes on 12 Sep 2017

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/356162-why-does-textscan-read-only-95-of-my-data-file#answer_281271

Edited: Jeremy Hughes on 13 Sep 2017

textscan can read time-of-day as datetime, so a format such as

d = textscan(fid,'%D%D%f','Delimiter',' ','ReturnOnError',false);

might work. Then you'd have to do something like this to post-process:

[date,time,n] = d{:};
date = date + timeofday(time);

textscan will simply stop reading when it encounters an error. To see what data is messing up the read, set 'ReturnOnError' to false; textscan will then issue an error instead of just stopping. This should give an indication of the problem.
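A minimal sketch of that difference, run on an in-memory character vector rather than a file so that it is self-contained. The sample text with its deliberately broken second record is invented, and the format is the one from the question.

```matlab
% Hypothetical sample: the second record has text where a number belongs
sample = sprintf('2017-08-30 12:34:56 7.89\n2017-08-30 12:34:57 bad\n');
fmt = '%d-%d-%d %d:%d:%d %f';

% Default behavior: textscan stops at the bad field without complaint
partial = textscan(sample, fmt);

% With ReturnOnError set to false, the same input raises an error
errMsg = '';
try
    textscan(sample, fmt, 'ReturnOnError', false);
catch err
    errMsg = err.message;
end

disp(errMsg);
```

The caught message is the place to look for a hint about where the read failed.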

1 Comment

dpb on 13 Sep 2017

It would be most interesting if OP would return and show us the offending record in the file to see just what broke what...

Are there plans to fix some of the observed warts with '%D' in the future with the embedded blank issue, etc., ... ?? It's a strong step forward but there are still "issues" that it doesn't handle well as well as the rest of the datetime class.
