Large data file with mixed character strings and numerical formats

Orion on 2 Feb 2017
Edited: Stephen23 on 2 Feb 2017
I have a large data text file with 2,000,000 rows and 100 columns. Some columns have numerical values and some are character strings with variable length. I don't need all the data at once but I need to be able to import different columns (character columns and numerical columns) for my analysis. How should I do that?
The issue is the size of the file rather than the mixed formats. The MATLAB datastore function only reads 20,000 rows at a time, and I don't know whether converting the data into a SQL datatable would help.
Thanks in advance

Accepted Answer

Stephen23 on 2 Feb 2017
Edited: Stephen23 on 2 Feb 2017
Do not use fgetl or fgets: on a file this large they would be very slow. Use textscan, exactly as the MATLAB documentation recommends.
The third textscan input lets you specify a block size, which sets a limit on how many lines are read per call. So to read your required data into MATLAB without reading all of the data at once, you need to do the following in a loop (pseudocode, see the textscan documentation):
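out = {};
k = 0;
fileID = fopen(filename,'r');            % filename, formatSpec, N, cols: set these for your file
while ~feof(fileID)
    k = k+1;
    C = textscan(fileID,formatSpec,N);   % read the next block of up to N rows
    out{k} = C(cols);                    % keep only the columns you need
end
fclose(fileID);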
Each iteration reads one block; you extract and store the columns that you need, and the rest of that block is discarded. In this way, all of the data is read into MATLAB, just not simultaneously! This is really quite fast :)
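To make the loop above concrete, here is a minimal sketch of one way it could look on a small comma-separated demo file. The file contents, the four-column layout, the format string, and the block size N = 4 are all assumptions made for illustration, not details of the original poster's file. It also uses the skip syntax (%*d, %*s) so that textscan drops the unwanted columns while parsing, which is a variation on the answer rather than part of it.

```matlab
% Build a small demo file (hypothetical data standing in for the real file).
fname = [tempname '.csv'];
fid = fopen(fname,'w');
for r = 1:10
    fprintf(fid,'id%d,%d,%s,%.1f\n', r, r, repmat('x',1,r), r/2);
end
fclose(fid);

% Column layout: text, integer, text, float. Keep columns 1 and 4 only;
% the asterisk in %*d and %*s tells textscan to skip those fields.
formatSpec = '%s %*d %*s %f';
N = 4;                                   % rows per block (assumed; tune for memory)

blocks = cell(0,2);                      % one row per block, one cell per kept column
fid = fopen(fname,'r');
while ~feof(fid)
    C = textscan(fid, formatSpec, N, 'Delimiter', ',');
    if isempty(C{1}), break; end         % nothing parsed: stop instead of looping forever
    blocks(end+1,:) = C;                 %#ok<AGROW>
end
fclose(fid);
delete(fname);

% Concatenate the per-block pieces once, after the loop.
names  = vertcat(blocks{:,1});           % cell array of character vectors
values = vertcat(blocks{:,2});           % numeric column vector
```

The isempty guard is there because textscan stops at the first field it cannot parse; without it, a malformed line could leave the file position stuck short of the end and the while loop would never terminate.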
2 Comments
KSSV on 2 Feb 2017
Okay...does this textscan go line by line?
Stephen23 on 2 Feb 2017
Edited: Stephen23 on 2 Feb 2017
@KSSV: textscan could read just one line, but why is that relevant to the question or even to my answer? Reading one line at a time would be incredibly inefficient compared to textscan reading blocks of data. That is why my answer reads blocks of data at once, and gets the requested columns without importing the entire file at once into MATLAB memory.


More Answers (1)

Aaditya Kalsi on 2 Feb 2017
You could use datastore to select the columns and read only those columns in.
ds = datastore('filepath',...);
ds.SelectedVariableNames = {'Var1', ...};
tbl = readall(ds);
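
As a rough, self-contained sketch of the datastore route, the example below writes a small demo table to a temporary file and reads back two of its three columns in blocks. The variable names Id, Label, and Score and the block size of 4 are hypothetical. ReadSize is the property behind the 20,000-row limit mentioned in the question: it is the default number of rows returned per read call, and it can be changed.

```matlab
% Demo file with a header row (hypothetical variable names and data).
fname = [tempname '.csv'];
T = table((1:10)', repmat({'abc'},10,1), (1:10)'*0.5, ...
          'VariableNames', {'Id','Label','Score'});
writetable(T, fname);

ds = datastore(fname, 'Type', 'tabulartext');
ds.SelectedVariableNames = {'Label','Score'};   % import only these columns
ds.ReadSize = 4;                                % rows per read call (default is 20000)

parts = {};
while hasdata(ds)
    parts{end+1,1} = read(ds);                  %#ok<AGROW> one table per block
end
tbl = vertcat(parts{:});                        % combine the blocks into one table
delete(fname);
```

If the selected columns fit in memory, readall(ds) returns the same table in a single call; the read loop is mainly useful when each block is processed and reduced before the next one is loaded.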