Parsing Formatted Text File Quickly

John on 5 Sep 2012
I have a text document with a known format, shown below. I have tried many ways to read it in quickly, but I don't think I'm doing it right, and efficiency is becoming an issue.
It is essentially blocks of data with headers in a known format. The number of rows in each block varies, but the number of columns does not. Any guidance here would be greatly appreciated. I have tried some tricks with textscan, but I'm not sure I'm using its full power.
Example:
Time: Timestamp1
DataString1_1 DataDouble1_1 DataDouble2_1
.
.
.
DataString1_n DataDouble1_n DataDouble2_n
Time: Timestamp2
DataString1_1 DataDouble1_1 DataDouble2_1
.
.
.
DataString1_m DataDouble1_m DataDouble2_m
  6 Comments
per isakson on 5 Sep 2012
OK!
What about the lines, "Time: Timestamp2"?
John on 5 Sep 2012
Thanks I will check that out.


Accepted Answer

per isakson on 6 Sep 2012
Edited: per isakson on 7 Sep 2012
[Comment on textscan deleted.]
Below are three functions that read your file. You might want to modify a function so that text_file_name and number_of_data_columns are input arguments.
The functions return a structure array with one element per data block in the file.
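As a rough illustration of working with such a structure array, here is a minimal sketch. The two-block S below is made up for the example (it only mimics the field layout RowHeader, Data, Time), so the values are assumptions, not output from the real file.

```matlab
% Hypothetical two-block result with the same fields as the structure
% array described above (values invented for illustration).
S = struct( 'RowHeader', { {'row1';'row2'}, {'row1'} } ...
          , 'Data'     , { [1 2 3; 4 5 6],  [7 8 9]  } ...
          , 'Time'     , { '2012-09-06',    '2012-09-07' } );

% Number of data rows in each block
n_rows = arrayfun( @(s) size( s.Data, 1 ), S );

% All numeric data stacked into one matrix (works because the number
% of columns is the same in every block)
all_data = vertcat( S.Data );

% Timestamps of all blocks as a cell array of strings
all_time = { S.Time };
```

Stacking with vertcat relies on the fixed column count mentioned in the question; with a varying column count it would error.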
I use a three-year-old vanilla Dell with R2012a, 64-bit, Windows 7.
.
--- Reads the whole file to a string buffer and parses it in a second step ---
My test below returns a 0.9GB structure in 23 seconds.
>> tic, S = read_huge_CRLF_1(); toc
Elapsed time is 22.613608 seconds.
>> S
S =
1x6060 struct array with fields:
RowHeader
Data
Time
>> S(1)
ans =
RowHeader: {1042x1 cell}
Data: [1042x3 double]
Time: '2012-09-06'
>> whs = whos('S');
>> whs.bytes/1e9
ans =
0.9114
where read_huge_CRLF_1 is
function S = read_huge_CRLF_1()
    % Read the whole file into a string buffer, then parse it block by block.
    str_buf = fileread( 'c:\MyData\Test\huge_CRLF.txt' );
    ix_list = strfind( str_buf, 'Timestamp:' );   % first character of each block
    n_block = numel( ix_list );
    n_col   = 4;
    frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
    S = struct( 'RowHeader' , cell( 1, n_block )  ...
              , 'Data'      , cell( 1, n_block )  ...
              , 'Time'      , cell( 1, n_block )  ...
              );
    for ii = 1 : n_block
        ix1 = ix_list(ii);
        % The block runs up to the character before the next 'Timestamp:'
        if ii == n_block
            buf = str_buf( ix1 : end );
        else
            buf = str_buf( ix1 : ix_list(ii+1)-1 );
        end
        % '%s' stops at the first whitespace, so only the date part is kept
        S(ii).Time = sscanf( buf, 'Timestamp:%s', 1 );
        cac = textscan( buf             , frmt  ...
                      , 'CollectOutput' , true  ...
                      , 'HeaderLines'   , 1     ...
                      );
        S(ii).RowHeader = cac{1};
        S(ii).Data      = cac{2};
    end
end
and where C:\MyData\Test\huge_CRLF.txt is 0.24GB and contains rows like
Header
Timestamp: 2012-09-06 01:15
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
Timestamp: 2012-09-06 01:15
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
Timestamp: 2012-09-06 01:15
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
row1 3.1416 6.2832 9.4248
.
Comments
  1. It is possible to increase the speed somewhat.
  2. The cost of "'HeaderLines', 1" is surprisingly high. That might say more about me than about textscan :).
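Regarding comment 2, one possible way to avoid 'HeaderLines' when parsing from a string buffer is to cut off the first line before calling textscan. The following is only a sketch under that assumption: the block text buf is made up in the same layout as huge_CRLF.txt, and whether this is actually faster on the real file would have to be measured.

```matlab
% Hypothetical block in the same layout as huge_CRLF.txt (CRLF line ends)
crlf = char([13,10]);
buf  = [ 'Timestamp: 2012-09-06 01:15', crlf ...
       , 'row1 3.1416 6.2832 9.4248'  , crlf ...
       , 'row2 3.1416 6.2832 9.4248'  , crlf ];

% Position of the first LF, i.e. the end of the block header line
ix_lf = find( buf == char(10), 1, 'first' );

% Everything after the header line; no 'HeaderLines' option needed
body = buf( ix_lf+1 : end );

cac = textscan( body, '%s%f%f%f', 'CollectOutput', true );
row_header = cac{1};   % cell array of strings, one per data row
data       = cac{2};   % numeric matrix, one row per data row
```

Searching for the LF rather than the CRLF pair keeps the sketch usable for files with LF-only line ends as well.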
.
--- Afterthought ---
The state of the file cache was not well defined when the test above was performed. Thus, I ran the test three times in a row directly after restarting the computer. I assume the text file was not in the cache for the first run. There is no real difference.
% Restart computer
>> tic, S = read_huge_CRLF_1(); toc % Free dropped to zero
Elapsed time is 23.249289 seconds.
>> tic, S = read_huge_CRLF_1(); toc % 0.91GB S in base workspace
Elapsed time is 24.461189 seconds.
>> tic, S = read_huge_CRLF_1(); toc
Elapsed time is 23.933955 seconds.
.
--- Reading blocks from cache is 20% faster. Doesn't read block_header ---
In this test textscan reads the blocks of data from the Windows file cache. In the previous test textscan read from a string buffer in the function workspace. The text file, huge_CRLF_4.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.
>> clear('S'), tic, S = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
Elapsed time is 18.926222 seconds.
>> clear('S'), tic, S = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
Elapsed time is 17.150977 seconds.
>> clear('S'), tic, S = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
Elapsed time is 17.077009 seconds.
>>
where read_huge_CRLF_3 is
function S = read_huge_CRLF_3( file_spec )
    if nargin == 0
        file_spec = 'c:\MyData\Test\huge_CRLF_Sample.txt';
    end
    n_block_header_row = 1;
    % The string buffer is used only to locate the blocks
    str_buf = fileread( file_spec );
    ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
    ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
    is_char_start_block = ismember( ix_char_start_line  ...
                                  , ix_char_timestamp   );
    % Line numbers of the lines that start a block
    ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
    ii_line_start_block( not( is_char_start_block ) ) = [];
    n_block = numel( ii_line_start_block );
    n_col   = 4;
    frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
    S = struct( 'BlockHeader' , cell( 1, n_block )  ...
              , 'Data'        , cell( 1, n_block )  ...
              );
    fid = fopen( file_spec, 'r' );
    % Skip any lines that precede the first block
    if ii_line_start_block(1) >= 2
        cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 );
    end
    iiBlock = 0;
    while not( feof( fid ) )
        iiBlock = iiBlock + 1;
        if iiBlock == n_block
            n_data_row = inf;
        else
            n_data_row = ii_line_start_block( iiBlock+1 )   ...
                       - ii_line_start_block( iiBlock )     ...
                       - n_block_header_row ;
        end
        cac = textscan( fid                                 ...
                      , frmt            , n_data_row        ...
                      , 'CollectOutput' , true              ...
                      , 'HeaderLines'   , n_block_header_row ...
                      );
        S(iiBlock).BlockHeader = cac{1};
        S(iiBlock).Data        = cac{2};
    end
    fclose( fid );
end
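To make the line bookkeeping in read_huge_CRLF_3 easier to follow, here is a small worked sketch of the same index calculation on a made-up string buffer (the content of str_buf is an assumption chosen for illustration; the variable names follow the function above, with find used in place of the delete-by-mask step).

```matlab
% Hypothetical five-line buffer with CRLF line ends
crlf    = char([13,10]);
str_buf = [ 'Header'      , crlf ...
          , 'Timestamp: a', crlf ...
          , 'row1 1 2 3'  , crlf ...
          , 'Timestamp: b', crlf ...
          , 'row1 4 5 6'  , crlf ];

% Character positions where 'Timestamp:' occurs
ix_char_timestamp  = strfind( str_buf, 'Timestamp:' );

% Character position of the first character of every line
% (the last entry points just past the end of the buffer)
ix_char_start_line = [ 1, strfind( str_buf, crlf ) + 2 ];

% Lines whose first character starts a 'Timestamp:' match begin a block
is_char_start_block = ismember( ix_char_start_line, ix_char_timestamp );
ii_line_start_block = find( is_char_start_block );

% Data rows per block (all but the last): line distance minus one header row
n_data_row = diff( ii_line_start_block ) - 1;
```

Here the blocks start on lines 2 and 4, so the first block has one data row; the last block is read with n_data_row = inf in the function.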
Comments
The function works as far as I can tell. However, this construct is erroneous:
fid = fopen( file_spec, 'r' );
...
cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 );
After it is executed, the "file position indicator" is to the left of the EOL characters. See my question "fgetl, textscan, and the file position indicator".
Adding "t" to the permission string, i.e.
fid = fopen( file_spec, 'rt' );
does not solve the problem in my case. EOL is CRLF and the pointer will be positioned between the CR and LF. One solution is to add "'Delimiter', '\n'" to the argument list of textscan.
.
--- My final function to read the file ---
Further refactored. The text file, huge_CRLF_5.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.
clear('S'), tic, S = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
Elapsed time is 19.284555 seconds.
clear('S'), tic, S = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
Elapsed time is 17.210736 seconds.
where read_huge_CRLF_5 is
function S = read_huge_CRLF_5( file_spec )
    if nargin == 0
        file_spec = 'c:\MyData\Test\huge_CRLF_Sample.txt';
    end
    n_block_header_row = 1;
    % The string buffer is used only to locate the blocks
    str_buf = fileread( file_spec );
    ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
    ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
    is_char_start_block = ismember( ix_char_start_line  ...
                                  , ix_char_timestamp   );
    clear('str_buf')
    % Line numbers of the lines that start a block
    ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
    ii_line_start_block( not( is_char_start_block ) ) = [];
    n_block = numel( ii_line_start_block );
    n_col   = 4;
    frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
    S = struct( 'BlockHeader' , cell( 1, n_block )  ...
              , 'Data'        , cell( 1, n_block )  ...
              , 'Time'        , cell( 1, n_block )  ...
              );
    fid = fopen( file_spec, 'r' );
    cup = onCleanup( @() fclose(fid) );   % closes the file when the function exits
    % Skip any lines that precede the first block
    if ii_line_start_block(1) >= 2
        textscan( fid, '%s', ii_line_start_block(1)-1   ...
                , 'Delimiter', '\n' );
    end
    iiBlock = 0;
    while not( feof( fid ) )
        iiBlock = iiBlock + 1;
        if iiBlock == n_block
            n_data_row = inf;
        else
            n_data_row = ii_line_start_block( iiBlock+1 )   ...
                       - ii_line_start_block( iiBlock )     ...
                       - n_block_header_row ;
        end
        % Read the block header line with fgetl instead of 'HeaderLines'
        S(iiBlock).Time = sscanf( fgetl(fid), 'Timestamp:%s' );
        cac = textscan( fid                             ...
                      , frmt            , n_data_row    ...
                      , 'CollectOutput' , true          ...
                      );
        S(iiBlock).BlockHeader = cac{1};
        S(iiBlock).Data        = cac{2};
    end
end
Comments
  1. clear('str_buf') frees memory. It is obviously faster to read from the file cache than to use the string buffer.
  2. "'Delimiter', '\n'" places the "file position indicator" to the right of the EOL characters, whether they are CRLF or LF and whether or not the "t" is added to the "permission string".
  3. fgetl(fid) always places the "file position indicator" to the right of the EOL characters.
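Comment 3 can be checked with a small experiment. The sketch below writes a made-up two-line CRLF file to a temporary location (the file name and content are assumptions for illustration), reads the first line with fgetl, and asks ftell where the file position indicator is.

```matlab
% Write a hypothetical two-line scratch file with CRLF line ends
crlf     = char([13,10]);
tmp_file = [ tempname, '.txt' ];
fid = fopen( tmp_file, 'w' );
fwrite( fid, [ 'Header', crlf, 'Timestamp: 2012-09-06 01:15', crlf ] );
fclose( fid );

% Read the first line and query the file position indicator
fid = fopen( tmp_file, 'r' );
first_line  = fgetl( fid );
pos_after   = ftell( fid );     % bytes from the beginning of the file
second_line = fgetl( fid );
fclose( fid );
delete( tmp_file );
```

If the indicator is to the right of the CRLF pair, as comment 3 states, pos_after equals 8 (the six characters of 'Header' plus CR and LF) and second_line begins with 'Timestamp:'.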
  1 Comment
per isakson on 7 Sep 2012
Bump. Added a third function
