How do I extract a single column from a LARGE text file? Preferably fast.

Question

Erik Lorentzen, 27 June 2013

Direct link to this question: https://jp.mathworks.com/matlabcentral/answers/80444-how-do-i-extract-a-single-column-from-a-large-text-file-preferably-fast

Hello!

I have a (tab-delimited) text file with genetic data. The file is about 8 GB in total and has 963 rows and about 1.5 million columns. (I have another 17 GB file that I have to tackle later...)

The format is: format = ['%s %s ' repmat('%f ', 1, 1.5e6)].

Now, I have to extract all the float columns, transpose them, concatenate them with three additional string values, and write them to a file as rows.

So, in effect, my problem is how to transpose a large text-file dataset.

Obviously I don't want to import the whole file into a cell array, so I have been trying to do it column by column (or row by row in the output file).

More specifically, I have been trying (for the first float-column):

col = textscan(fid, '%*s %*s %f %*[^\n]', 'bufsize', large_value);

and then repeat this in a for-loop, with different FORMAT for each pass.
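As a rough sketch of what "a different FORMAT for each pass" could look like, the format for the k-th float column can be built with repmat. The variable names and the small in-memory sample are illustrative assumptions, not part of the real data:

```matlab
% Tiny in-memory stand-in for the file: 2 label columns + 3 numeric columns.
% (Hypothetical data; the real file would be read through a file identifier.)
sample = sprintf('a\tx\t1\t2\t3\nb\ty\t4\t5\t6\n');

k = 2;  % which numeric column to extract (1-based)

% Skip the two labels and the k-1 numeric columns before the target,
% read the target as a number, then discard the rest of the line.
fmt = ['%*s %*s ' repmat('%*f ', 1, k - 1) '%f %*[^\n]'];

col  = textscan(sample, fmt, 'Delimiter', '\t');
colK = col{1};   % column vector holding the k-th numeric value of every row
```

Each call still scans every line from the start, which is why the cost per pass does not shrink when only one column is kept.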

This works, of course, but textscan still has to read the whole file on every pass of the loop (I think; hence the required BUFSIZE), so it takes a VERY long time.

One textscan takes approximately 100 seconds, so 100 s * 1.5 million = 1.5e8 s, or about 4.75 years.

PLEASE HELP! Is there any way to make this fast?

Do I HAVE to load the full file into a cell array?

Or is there maybe some way to WRITE COLUMNS to a text file? That would do the trick, I think.

Maybe someone has a cool Perl script that takes less time to extract the column?

Cheers! / Erik

1 Comment

Matt J, 27 June 2013

Edited: Matt J, 27 June 2013

"so I have been trying to do it column by column ... One textscan takes approximately 100 seconds"

Whichever file reading method you use, it's probably not a good idea to try to read one column at a time. You should read large chunks of columns of the largest manageable size. Transpose each chunk in the MATLAB workspace and then write/append it to your destination file.
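A minimal sketch of that chunked approach, assuming textscan is the reader and a small hypothetical demo file stands in for the real data. The file names, chunkSize, and the column counts are illustrative; on the real file chunkSize would be chosen as large as memory allows:

```matlab
% Hypothetical demo file: 2 label columns followed by 4 numeric columns.
inFile  = fullfile(tempdir, 'chunk_demo_in.txt');
outFile = fullfile(tempdir, 'chunk_demo_out.txt');
fid = fopen(inFile, 'w');
fprintf(fid, 'a\tx\t1\t2\t3\t4\n');
fprintf(fid, 'b\ty\t5\t6\t7\t8\n');
fprintf(fid, 'c\tz\t9\t10\t11\t12\n');
fclose(fid);

nLabelCols = 2;   % leading string columns to skip
nNumCols   = 4;   % numeric columns in the file
chunkSize  = 3;   % numeric columns read per pass

fout = fopen(outFile, 'w');
for c1 = 1:chunkSize:nNumCols
    c2    = min(c1 + chunkSize - 1, nNumCols);
    nKeep = c2 - c1 + 1;

    % Skip the labels and the columns before the chunk, keep the chunk.
    fmt = [repmat('%*s ', 1, nLabelCols + c1 - 1), repmat('%f ', 1, nKeep)];
    if c2 < nNumCols
        fmt = [fmt, '%*[^\n]'];   % discard the rest of the line
    end

    fin  = fopen(inFile, 'r');
    cols = textscan(fin, fmt, 'Delimiter', '\t', 'CollectOutput', true);
    fclose(fin);
    chunk = cols{1};              % nRows-by-nKeep numeric array

    % fprintf consumes its data column by column, so passing the chunk
    % as read writes each source column as one output row.
    rowFmt = [repmat('%g\t', 1, size(chunk, 1) - 1), '%g\n'];
    fprintf(fout, rowFmt, chunk);
end
fclose(fout);
```

The file is still scanned once per chunk, so the number of passes drops from the number of columns to the number of chunks. The extra string values that have to be prepended to each output row are left out of this sketch.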


Answer 1

Matt J, 27 June 2013

Direct link to this answer: https://jp.mathworks.com/matlabcentral/answers/80444-how-do-i-extract-a-single-column-from-a-large-text-file-preferably-fast#answer_90131

Using DLMREAD, you can read a rectangular block of data of any size:

M = dlmread(filename, delimiter, R, C)
M = dlmread(filename, delimiter, range)

2 Comments

Erik Lorentzen, 27 June 2013

I thought DLMREAD only worked on numeric data?

Matt J, 27 June 2013

Edited: Matt J, 27 June 2013

But a sub-block of your text file does consist of numeric data, right? It's the floats that you want to read and transpose. As long as you use the 'range' argument to designate only a region of numeric data in the file, it should be fine.

I tested it on a text file called 'test.m' containing this:

kkkkk 1 2
llllll 3 4
mmmm 5 6
nnnn 7 8

and it worked fine.

    >> M=dlmread('test.m',' ',[0 1 3 1])
    M =
         1
         3
         5
         7

Of course, you should really be doing this to read batches of columns, instead of single columns as mentioned in my Comment above.
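A sketch of what reading in batches with the 'range' argument might look like. The demo file adds a third numeric column to the test data so that more than one batch is needed; the file name, batchSize, and the in-memory collection of the result are assumptions made only for illustration:

```matlab
% Hypothetical demo file: one label column followed by three numeric
% columns, space-delimited like the test file above.
fname = fullfile(tempdir, 'dlmread_batches_demo.txt');
fid = fopen(fname, 'w');
fprintf(fid, 'kkkkk 1 2 9\nllllll 3 4 10\nmmmm 5 6 11\nnnnn 7 8 12\n');
fclose(fid);

nRows     = 4;   % rows in the file
firstCol  = 1;   % zero-based index of the first numeric column
lastCol   = 3;   % zero-based index of the last numeric column
batchSize = 2;   % columns read per dlmread call

T = [];   % collected in memory only for this demo; a real run would
          % append each transposed batch to the destination file instead
for c1 = firstCol:batchSize:lastCol
    c2 = min(c1 + batchSize - 1, lastCol);
    % Range is [R1 C1 R2 C2], zero-based and inclusive.
    M = dlmread(fname, ' ', [0 c1 nRows-1 c2]);
    T = [T; M.'];   % each source column becomes one row
end
delete(fname);
```

Whether DLMREAD avoids parsing the part of each line outside the requested range is not something this sketch establishes, so timing a few batch sizes on a sample of the real file first would be prudent.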
