Parsing Formatted Text File Quickly

Question

John, 5 September 2012

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/47477-parsing-formatted-text-file-quickly

I have a text document with a known format, shown below. I have tried many ways to read this in quickly, but I think I'm not doing it right, and efficiency is becoming an issue.

It is essentially blocks of data with headers in a known format. The number of rows in the blocks of data varies but the number of columns does not. Any guidance here would be greatly appreciated. I have tried some tricks with textscan but I'm not sure I know the full power.

Example:

Time: Timestamp1

DataString1_1 DataDouble1_1 DataDouble2_1

.

DataString1_n DataDouble1_n DataDouble2_n

Time: Timestamp2

DataString1_1 DataDouble1_1 DataDouble2_1

.

DataString1_m DataDouble1_m DataDouble2_m

6 comments (4 older comments are not shown)

per isakson, 5 September 2012

OK!

What about the lines, "Time: Timestamp2"?

John, 5 September 2012

Thanks I will check that out.

Answer 1

per isakson, 6 September 2012

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/47477-parsing-formatted-text-file-quickly#answer_58036

Edited: per isakson, 7 September 2012

[Comment on textscan deleted.]

Below are three functions that read your file. You might want to modify a function so that text_file_name and number_of_data_columns are input arguments.

The functions return a structure array with one element per data block in the file.

I use a three-year-old vanilla Dell with R2012a, 64-bit, Windows 7.

.

--- Reads the whole file to a string buffer and parses it in a second step ---

My test below returns a 0.9GB structure in 23 seconds.

    >> tic, S  = read_huge_CRLF_1(); toc
    Elapsed time is 22.613608 seconds.
    >> S
    S = 
    1x6060 struct array with fields:
        RowHeader
        Data
        Time
    >> S(1)
    ans = 
        RowHeader: {1042x1 cell}
             Data: [1042x3 double]
             Time: '2012-09-06'    
    >> whs = whos('S');
    >> whs.bytes/1e9
    ans =
        0.9114

where read_huge_CRLF_1 is

    function    S = read_huge_CRLF_1()
        str_buf = fileread( 'c:\MyData\Test\huge_CRLF.txt' );
        ix_list = strfind( str_buf, 'Timestamp:' );
        n_block = numel( ix_list );
        n_col   = 4; 
        frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
        S   = struct( 'RowHeader'   , cell( 1, n_block )    ...
                    , 'Data'        , cell( 1, n_block )    ...
                    , 'Time'        , cell( 1, n_block )    ...
                    );
        for ii = 1 : n_block
            ix1 = ix_list(ii);
            if ii == n_block
                buf = str_buf( ix1 : end );
            else
                buf = str_buf( ix1 : ix_list(ii+1)-1 );
            end
            S(ii).Time  = sscanf( buf, 'Timestamp:%s', 1 );
            cac = textscan( buf             , frmt  ...
                        ,   'CollectOutput' , true  ...
                        ,   'HeaderLines'   , 1     ...
                        );
            S(ii).RowHeader = cac{1}; 
            S(ii).Data      = cac{2};
        end
    end

and where C:\MyData\Test\huge_CRLF.txt is 0.24GB and contains rows like

    Header
    Timestamp: 2012-09-06 01:15 
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    Timestamp: 2012-09-06 01:15 
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    Timestamp: 2012-09-06 01:15 
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
    row1     3.1416    6.2832    9.4248
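A small file with the same layout is handy for trying the functions before running them on the full data. The following is only a sketch: the file name, the number of blocks, and the rows per block are assumptions chosen for illustration, not the file used in the timings.

```matlab
% Sketch: write a small CRLF test file with the layout shown above.
% File name and block sizes are hypothetical, chosen for illustration.
file_spec = fullfile( tempdir, 'small_CRLF_sample.txt' );
n_rows    = [ 2, 3, 6 ];               % data rows per block (assumed)

fid = fopen( file_spec, 'w' );         % 'w' is binary mode: EOL is written as given
fprintf( fid, 'Header\r\n' );
for n = n_rows
    fprintf( fid, 'Timestamp: 2012-09-06 01:15 \r\n' );
    for jj = 1 : n
        fprintf( fid, 'row1 %10.4f %10.4f %10.4f\r\n', pi*[1,2,3] );
    end
end
fclose( fid );
```

Opening with 'w' rather than 'wt' keeps MATLAB from translating the line endings, so the CRLF pairs end up in the file exactly as written.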

.

Comments

It is possible to increase the speed somewhat.
The cost of "'HeaderLines', 1" is surprisingly high. That might say more about me than about textscan :).
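One way to avoid 'HeaderLines' when parsing from a string buffer is to cut the block header line off before calling textscan. This is a minimal sketch on a made-up two-row block; whether it is actually faster on the large file has not been measured here and is an assumption.

```matlab
% Sketch: parse one block without 'HeaderLines' by slicing off the first line.
% The buffer below is a made-up block in the same layout as the test file.
buf = sprintf( [ 'Timestamp: 2012-09-06 01:15 \r\n'   ...
               , 'row1 3.1416 6.2832 9.4248\r\n'       ...
               , 'row2 3.1416 6.2832 9.4248\r\n' ] );

ix_eol   = find( buf == char(10), 1, 'first' );   % LF that ends the header line
time_str = sscanf( buf( 1 : ix_eol ), 'Timestamp:%s', 1 );

% Only the data rows are handed to textscan, so no header has to be skipped.
cac = textscan( buf( ix_eol+1 : end ), '%s%f%f%f', 'CollectOutput', true );
row_header = cac{1};    % cell array with the leading string of each row
data       = cac{2};    % numeric matrix, one row per data line
```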

.

--- Afterthought ---

The state of the file cache was not well defined when the test above was performed. Thus, I ran the test three times in a row directly after restarting the computer. I assume the text file was not in the cache for the first run. There is no real difference between the runs.

    % Restart computer
    >> tic, S  = read_huge_CRLF_1(); toc     % Free dropped to zero
    Elapsed time is 23.249289 seconds.       
    >> tic, S  = read_huge_CRLF_1(); toc     % 0.91GB S in base workspace 
    Elapsed time is 24.461189 seconds.
    >> tic, S  = read_huge_CRLF_1(); toc
    Elapsed time is 23.933955 seconds.

.

--- Reading blocks from cache is 20% faster. Doesn't read block_header ---

In this test textscan reads the blocks of data from the Windows file cache. In the previous test textscan read from a string buffer in the function workspace. The text file, huge_CRLF_4.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.

    >> clear('S'), tic, S  = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
    Elapsed time is 18.926222 seconds.
    >> clear('S'), tic, S  = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
    Elapsed time is 17.150977 seconds.
    >> clear('S'), tic, S  = read_huge_CRLF_3('c:\...\huge_CRLF_4.txt'); toc
    Elapsed time is 17.077009 seconds.
    >>

where read_huge_CRLF_3 is

    function    S = read_huge_CRLF_3( file_spec )
        if nargin == 0
            file_spec  = 'c:\MyData\Test\huge_CRLF_Sample.txt';
        end
        n_block_header_row  = 1;
        str_buf             = fileread( file_spec );
        ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
        ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
        is_char_start_block = ismember( ix_char_start_line ...
                                      , ix_char_timestamp  ); 
        ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
        ii_line_start_block( not( is_char_start_block ) ) = [];
        n_block = numel( ii_line_start_block ); 
        n_col   = 4; 
        frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
        S   = struct( 'BlockHeader' , cell( 1, n_block )    ...
                    , 'Data'        , cell( 1, n_block )    ...
                    );
        fid = fopen( file_spec, 'r' );
        if ii_line_start_block(1) >= 2
            cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 ); 
        end
        iiBlock = 0;
        while not( feof( fid ) )
            iiBlock     = iiBlock + 1;
            if iiBlock == n_block 
                n_data_row = inf;
            else
                n_data_row  = ii_line_start_block( iiBlock+1 ) ...
                            - ii_line_start_block( iiBlock )   ...
                            - n_block_header_row               ; 
            end
            cac = textscan( fid                                     ...
                        ,   frmt            , n_data_row            ...
                        ,   'CollectOutput' , true                  ...
                        ,   'HeaderLines'   , n_block_header_row    ...
                        );
            S(iiBlock).BlockHeader  = cac{1}; 
            S(iiBlock).Data         = cac{2};
        end
        fclose( fid );
    end
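The line bookkeeping in read_huge_CRLF_3 (character positions of line starts, then line numbers of the block headers, then rows per block) can be checked on a tiny buffer. The buffer content below is made up for illustration, and find and diff are used as shorthand for the index deletion and the subtraction in the function.

```matlab
% Sketch: derive the line numbers of the block headers from a CRLF buffer.
% The buffer is a made-up miniature of the test file.
str_buf = sprintf( [ 'Header\r\n'          ...
                   , 'Timestamp: t1\r\n'   ...
                   , 'row1 1 2 3\r\n'      ...
                   , 'row2 1 2 3\r\n'      ...
                   , 'Timestamp: t2\r\n'   ...
                   , 'row1 1 2 3\r\n' ] );

ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
% A line starts at character 1 and directly after every CRLF pair.
ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
is_char_start_block = ismember( ix_char_start_line, ix_char_timestamp );

ii_line_start_block = find( is_char_start_block );      % here: [2 5]
% Data rows in every block except the last (one header row per block).
n_data_row          = diff( ii_line_start_block ) - 1;  % here: 2
```

The last entry of ix_char_start_line points one character past the end of the buffer when the file ends with CRLF; that is harmless because it never matches a timestamp position.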

Comments

The function works as far as I can tell. However, this construct is erroneous

    fid = fopen( file_spec, 'r' );
    ...
    cac = textscan( fid, '%[^\n\r]', ii_line_start_block(1)-1 );

After it is executed the "file position indicator" is to the left of the EOL characters. See my question "fgetl, textscan, and the file position indicator".

Adding "t" to the permission string, i.e.

    fid = fopen( file_spec, 'rt' );

does not solve the problem in my case. EOL is CRLF and the pointer will be positioned between the CR and LF. One solution is to add "'Delimiter', '\n'" to the argument list of textscan.

.

--- My final function to read the file ---

Further refactored. The text file, huge_CRLF_5.txt, which is a copy of huge_CRLF.txt, was not in the cache before this test.

    clear('S'), tic, S  = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
    Elapsed time is 19.284555 seconds.
    clear('S'), tic, S  = read_huge_CRLF_5('c:\...\huge_CRLF_5.txt'); toc
    Elapsed time is 17.210736 seconds.

where read_huge_CRLF_5 is

    function    S = read_huge_CRLF_5( file_spec )
        if nargin == 0
            file_spec  = 'c:\MyData\Test\huge_CRLF_Sample.txt';
        end
        n_block_header_row  = 1;
        str_buf             = fileread( file_spec );
        ix_char_timestamp   = strfind( str_buf, 'Timestamp:' );
        ix_char_start_line  = [ 1, strfind( str_buf, char([13,10]) ) + 2 ];
        is_char_start_block = ismember( ix_char_start_line ...
                                      , ix_char_timestamp  ); 
        clear('str_buf')
        ii_line_start_block = ( 1 : 1 : size( ix_char_start_line, 2 ) );
        ii_line_start_block( not( is_char_start_block ) ) = [];
        n_block = numel( ii_line_start_block ); 
        n_col   = 4; 
        frmt    = cat( 2, '%s', repmat( '%f', [ 1, n_col-1 ] ) );
        S   = struct( 'BlockHeader' , cell( 1, n_block )    ...
                    , 'Data'        , cell( 1, n_block )    ...
                    , 'Time'        , cell( 1, n_block )    ...
                    );
        fid = fopen( file_spec, 'r' );
        cup = onCleanup( @() fclose(fid) );
        if ii_line_start_block(1) >= 2
            textscan( fid, '%s', ii_line_start_block(1)-1 ...
                        ,   'Delimiter', '\n'             ); 
        end
        iiBlock = 0;
        while not( feof( fid ) )
            iiBlock     = iiBlock + 1;
            if iiBlock == n_block 
                n_data_row = inf;
            else
                n_data_row  = ii_line_start_block( iiBlock+1 ) ...
                            - ii_line_start_block( iiBlock )   ...
                            - n_block_header_row               ; 
            end
            S(iiBlock).Time  = sscanf( fgetl(fid), 'Timestamp:%s' );
            cac = textscan( fid                             ...
                        ,   frmt            , n_data_row    ...
                        ,   'CollectOutput' , true          ...
                        );
            S(iiBlock).BlockHeader  = cac{1}; 
            S(iiBlock).Data         = cac{2};
        end
    end

Comments

clear('str_buf') frees memory. It is obviously faster to read from the file cache than to use the string buffer.
"'Delimiter', '\n'" places the "file position indicator" to the right of the EOL characters, whether they are CRLF or LF and whether or not "t" is added to the "permission string".
fgetl(fid) always places the "file position indicator" to the right of the EOL characters.
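The fgetl behaviour can be tried on a small file. The sketch below skips a leading line with fgetl, reads the block header with a second fgetl, and lets textscan read the rows that follow. The file name and content are hypothetical, and the file holds a single block only.

```matlab
% Sketch: skip a leading line with fgetl, then read one block.
% File name and content are made up for illustration.
file_spec = fullfile( tempdir, 'fgetl_skip_sample.txt' );
fid = fopen( file_spec, 'w' );
fprintf( fid, 'Header\r\nTimestamp: t1\r\nrow1 1 2 3\r\nrow2 4 5 6\r\n' );
fclose( fid );

fid = fopen( file_spec, 'r' );
cup = onCleanup( @() fclose(fid) );

fgetl( fid );                     % skip the leading 'Header' line
pos_after_skip = ftell( fid );    % byte offset at which the next read starts
time_line = fgetl( fid );         % block header line, EOL characters removed
cac  = textscan( fid, '%s%f%f%f', 'CollectOutput', true );
data = cac{2};                    % numeric part of the block

clear('cup')                      % triggers fclose through onCleanup
```

Inspecting pos_after_skip and time_line shows where the indicator ended up after the first fgetl.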