Way of conserving memory when extracting data from CSV


Mate 2u on 12 Apr 2013


Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/71837-way-of-conserving-memory-when-extracting-data-from-csv

Hi everybody, I have a few questions. I have some HUGE CSV files which I need in MATLAB for analysis. Each CSV has 5 columns. The columns of relevance are:

Column 1 is the date, from early 2007 until mid 2011, in the form mm/dd/yyyy.

Column 3 is the corresponding price.

Column 5 is the number of trades.

The questions I have are these:

1) How can I extract these 3 columns into a matrix in MATLAB without taking too much memory (bear in mind that some of these CSV files have around 60 million rows)? Is there a way to reduce the memory MATLAB allocates for each element of the matrix? Please help with code.

2) How can I extract all the information into a non-string matrix (for analysis) for a specific year, e.g. only for 2009? I would need to store all information for 2009 in a matrix (bearing in mind the memory limitations in 1).

Thanks so much.


Mate 2u on 13 Apr 2013

Edited: Mate 2u on 13 Apr 2013

In this case, the maximum volume (column 5) would never exceed 2500 (5000 to be safe).

The maximum price (column 3) would never exceed 250-350.

per isakson on 13 Apr 2013


And price will never exceed

    >> intmax('uint32')
    ans =
      4294967295

cents ????


Accepted Answer

per isakson on 13 Apr 2013


Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/71837-way-of-conserving-memory-when-extracting-data-from-csv#answer_82075

Edited: per isakson on 13 Apr 2013


Something like this will do it

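    function mate2u
      % Preallocate compact integer columns for the full 60-million-row file.
      % (This demo only prints each chunk; it does not fill these arrays yet.)
      day_number = zeros( 60*1e6, 1, 'uint16' );  % days since 1/1/2007 (0 for 1/1/2007)
      price      = zeros( 60*1e6, 1, 'uint32' );  % price in 1/100 of a cent
      volume     = zeros( 60*1e6, 1, 'uint16' );  % volume
      pivot_day   = datenum( '1/1/2007', 'mm/dd/yyyy' );
      chunk_size  = 10;  % rows per textscan call; for the real file choose 5*1e6
      fid = fopen( 'mate2u.txt' );
      while not( feof( fid ) )
          % Columns: date (string), time (skipped), price, symbol (skipped), volume
          cac = textscan( fid, '%s%*s%f32%*s%u16', chunk_size, 'Delimiter', ',' );
          uint16( datenum( cac{1}, 'mm/dd/yyyy' ) - pivot_day )
          uint32( cac{2}*10000 )
          cac{3}
      end
      fclose( fid );
    end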

where mate2u.txt is

    04/29/2008,38:52.0,71.35,CTN08,2
    04/29/2008,38:53.0,71.35,CTN08,2
    04/29/2008,38:56.0,71.35,CTN08,3
    04/29/2008,38:56.0,71.35,CTN08,1
    04/29/2008,38:56.0,71.35,CTN08,1
    04/29/2008,38:57.0,71.35,CTN08,1

prints to command window


Mate 2u on 13 Apr 2013

Edited: Mate 2u on 13 Apr 2013

Hi Per Isakson, your example works, but let me demonstrate a case where it doesn't work:

Input as shown from CTTEST20.txt:

    01/03/2007,15:30:06.000,55.90,CTH07,0
    01/03/2007,15:30:30.000,55.75,CTH07,0
    01/03/2007,15:30:42.000,55.80,CTH07,0
    01/03/2007,15:30:53.000,55.85,CTH07,0
    01/03/2007,15:30:57.000,55.75,CTH07,0
    01/03/2007,15:31:17.000,55.70,CTH07,0
    01/03/2007,15:31:23.000,55.65,CTH07,0
    01/03/2007,15:31:36.000,55.55,CTH07,0
    01/03/2007,15:31:38.000,55.60,CTH07,0
    01/03/2007,15:31:43.000,55.55,CTH07,0
    01/03/2007,15:31:44.000,55.60,CTH07,0
    01/03/2007,15:31:50.000,55.70,CTH07,0
    01/03/2007,15:32:07.000,55.55,CTH07,0
    01/03/2007,15:32:07.000,55.90,CTH07,0
    01/03/2007,15:40:41.000,55.30,CTH07,0
    01/03/2007,15:40:43.000,55.40,CTH07,0
    01/03/2007,15:40:52.000,55.30,CTH07,0
    01/03/2007,15:40:54.000,55.50,CTH07,0
    01/03/2007,15:41:33.000,55.15,CTH07,0
    01/03/2007,15:41:34.000,55.20,CTH07,0

Output in cac:

    '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007'
    55.599998 55.700001 55.549999 55.900002 55.299999 55.400002 55.299999 55.500000 55.150002 55.200001
    0 0 0 0 0 0 0 0 0 0

As we can see, we are missing 10 entries. In our larger txt/csv files we get many more missing entries. Additionally, look at the output prices: I am not sure why they differ from the input prices (even if only marginally).

per isakson on 13 Apr 2013

Edited: per isakson on 13 Apr 2013

Firstly, make some experiments with the [{}Code] button.

Secondly:

- Convert the script to a function (I've done it in my answer).
- Step through my code with the debugger and analyze what it does.
- Notice that the twenty lines are indeed printed in the command window, as two chunks of ten entries each.
- The prices are hurt by the single precision of "%f32"; you could change f32 to f64.
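
As a rough sketch of how the chunked loop from the accepted answer could store its results instead of printing them, the following variant copies each chunk into the preallocated arrays and reads prices with "%f64". The temporary file, the `capacity` value, the small `chunk_size`, and the counter `n` are assumptions made for illustration; the sample rows are taken from this thread. The last two lines reproduce the single-precision effect seen in the printed prices.

```matlab
% Build a small sample file from rows quoted in this thread.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '01/03/2007,15:30:06.000,55.90,CTH07,0', ...
    '01/03/2007,15:30:30.000,55.75,CTH07,0', ...
    '04/29/2008,38:52.0,71.35,CTN08,2', ...
    '04/29/2008,38:53.0,71.35,CTN08,2', ...
    '04/29/2008,38:56.0,71.35,CTN08,3', ...
    '04/29/2008,38:56.0,71.35,CTN08,1');
fclose(fid);

capacity   = 100;  % assumed upper bound on rows; the thread uses 60*1e6
chunk_size = 4;    % rows per textscan call; the thread suggests 5*1e6
day_number = zeros(capacity, 1, 'uint16');  % days since 1/1/2007
price      = zeros(capacity, 1, 'uint32');  % price in 1/100 of a cent
volume     = zeros(capacity, 1, 'uint16');
pivot_day  = datenum('1/1/2007', 'mm/dd/yyyy');

n = 0;  % number of rows stored so far
fid = fopen(fname);
while ~feof(fid)
    % date (string), time (skipped), price as double, symbol (skipped), volume
    cac = textscan(fid, '%s%*s%f64%*s%u16', chunk_size, 'Delimiter', ',');
    m = numel(cac{3});
    if m == 0
        break;  % nothing left to read
    end
    idx = n + (1:m);
    day_number(idx) = datenum(cac{1}, 'mm/dd/yyyy') - pivot_day;
    price(idx)      = round(cac{2} * 10000);
    volume(idx)     = cac{3};
    n = n + m;
end
fclose(fid);
delete(fname);

% Trim the unused tail of the preallocated arrays.
day_number = day_number(1:n);
price      = price(1:n);
volume     = volume(1:n);

% Single precision cannot represent 55.60 exactly to 6 decimals; double can.
single_text = sprintf('%.6f', single(55.60));
double_text = sprintf('%.6f', 55.60);
```

Rounding before the integer conversion keeps a value such as 71.35 from landing one unit low because of floating-point representation.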

More Answers (1)

Image Analyst on 12 Apr 2013


Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/71837-way-of-conserving-memory-when-extracting-data-from-csv#answer_82073

What are the classes of each column? Are they all 8 byte (64 bit) doubles? For example, the number of trades might be able to be a 4 byte integer, and most of the floating point numbers could probably be single instead of double. By retrieving it a line at a time and using sscanf() you can place each value into the smallest type of variable that is appropriate for that number. For example, assuming no stock price is over $655.35 you could read in the number and multiply by 100 so that all stock prices are in cents rather than dollars. That way you can use 16 bit unsigned integer instead of a 32 bit single.
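
To make the arithmetic concrete, here is a small sketch of the cents conversion and of the memory it saves. The sample prices and the variable names are assumptions for illustration; the per-row layout in the second half (uint16 day, uint32 price, uint16 volume) is the one used in the accepted answer, and 60 million rows is the figure from the question.

```matlab
% Prices in dollars; 655.35 is the largest value that still fits uint16 cents.
prices_dollars = [71.35; 55.90; 655.35];
price_cents    = uint16(round(prices_dollars * 100));

% Largest value each integer class can hold.
max_uint16 = double(intmax('uint16'));   % 65535
max_uint32 = double(intmax('uint32'));   % 4294967295

% Memory for 60 million rows of three columns.
n_rows        = 60e6;
bytes_double  = n_rows * 3 * 8;          % three double columns, 8 bytes each
bytes_compact = n_rows * (2 + 4 + 2);    % uint16 day, uint32 price, uint16 volume

fprintf('double:  %.2f GB\n', bytes_double  / 1e9);
fprintf('compact: %.2f GB\n', bytes_compact / 1e9);
```

Under these assumptions the compact layout needs a third of the memory of three double columns, and storing the price as uint16 cents would lower it further to 6 bytes per row.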

I don't have the toolboxes, but perhaps the Financial Toolbox or the Fixed Point Designer may have efficient ways of handling numbers like prices of stocks.

Like Matt said, perhaps you don't need all 60 million rows in memory at once - hopefully you can process it in chunks.


Mate 2u on 13 Apr 2013

Thank you. Two things: 1) Price is in the form of 55.1500, does this make a difference?

Additionally, 2) Is there a way to convert to uint16 etc. before it gets into MATLAB, to avoid an out-of-memory message?

Image Analyst on 13 Apr 2013

For example, maybe someone asks about 2010 prices, so you scan the file line by line, throwing away data if it belongs to any other year than 2010. Only if the year is 2010 do you put it into your array. Other years just go into single variables because you used sscanf, but you re-use (overwrite) those variables. So on a line-by-line basis you will have variables thisPrice, thisDay, thisVolume, thisYear, and only when thisYear is 2010 do you add thisPrice, thisDay, thisVolume to priceArray, dayArray, volumeArray.
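
A minimal sketch of that line-by-line filter might look like the following. It keeps 2008 rather than 2010 only because the sample rows quoted in this thread are from 2007 and 2008. The temporary file, `wantedYear`, `maxRows`, and the choice to store the day of the year are assumptions, and `strsplit` requires R2013a or later.

```matlab
% Build a small sample file from rows quoted in this thread.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '01/03/2007,15:30:06.000,55.90,CTH07,0', ...
    '01/03/2007,15:30:30.000,55.75,CTH07,0', ...
    '04/29/2008,38:52.0,71.35,CTN08,2', ...
    '04/29/2008,38:56.0,71.35,CTN08,3');
fclose(fid);

wantedYear  = 2008;
maxRows     = 1000;                          % assumed upper bound on kept rows
dayArray    = zeros(maxRows, 1, 'uint16');   % day of the year (1 = 1 January)
priceArray  = zeros(maxRows, 1, 'uint16');   % price in cents
volumeArray = zeros(maxRows, 1, 'uint16');
n = 0;                                       % number of rows kept so far

fid = fopen(fname, 'r');
thisLine = fgetl(fid);
while ischar(thisLine)                       % fgetl returns -1 at end of file
    parts = strsplit(thisLine, ',');
    mdy   = sscanf(parts{1}, '%d/%d/%d');    % [month; day; year]
    if numel(mdy) == 3 && mdy(3) == wantedYear
        thisDay    = datenum(mdy(3), mdy(1), mdy(2)) - datenum(mdy(3), 1, 1) + 1;
        thisPrice  = round(str2double(parts{3}) * 100);
        thisVolume = str2double(parts{5});
        n = n + 1;
        dayArray(n)    = thisDay;
        priceArray(n)  = thisPrice;
        volumeArray(n) = thisVolume;
    end
    thisLine = fgetl(fid);
end
fclose(fid);
delete(fname);

% Keep only the rows that were actually filled.
dayArray    = dayArray(1:n);
priceArray  = priceArray(1:n);
volumeArray = volumeArray(1:n);
```

Only the kept rows ever occupy array memory, at the cost of parsing every line individually, which is likely to be slower than reading in chunks.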
