How to import big data files
Is there a fast way to import a huge dataset (approx. 10 million rows) into MATLAB? I tried importing my CSV file with the import function, but it has been running for a couple of hours now. Does anyone have useful advice?
2 Comments
Rik
29 Oct 2017
How large are those rows? This shouldn't take this long.
One way to reduce time is to figure out the most direct function to do the job, dlmread or csvread in this case.
Leo
29 Oct 2017
I have approximately 10 million rows. I want to import the data including their headers.
Accepted Answer
per isakson
29 Oct 2017
Edited: per isakson
29 Oct 2017
"Running for a couple of hours" doesn't sound right.
- How many columns are there?
- How much RAM do you have?
- If it's pure numerical data, try load -ascii; otherwise, try textscan. Both are faster.
14 Comments
There are 10 million rows. My RAM is only 8 GB, and I am working on a Mac. I will try it the way you suggested, thanks!
Walter Roberson
29 Oct 2017
Consider using tall arrays.
per isakson
29 Oct 2017
Edited: per isakson
29 Oct 2017
- 8 GB should be enough, assuming you don't run a bunch of other programs at the same time.
- I assume the file doesn't have hundreds of columns.
- With more than something like 25 columns, you might try reading to single precision or integers rather than double (the default) with textscan.
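The single-precision suggestion can be sketched as follows. This is only an illustration under assumptions: the columns are purely numeric, and the two-row sample text is made up rather than taken from the thread. In a textscan format, %f reads double and %f32 reads single.

```matlab
% Hypothetical two-row, three-column numeric sample (not from the thread).
txt = sprintf('1.5,2.5,3.5\n4.5,5.5,6.5\n');

% %f stores each value as double (8 bytes); %f32 stores it as single (4 bytes).
asDouble = textscan(txt, '%f%f%f',       'Delimiter', ',', 'CollectOutput', true);
asSingle = textscan(txt, '%f32%f32%f32', 'Delimiter', ',', 'CollectOutput', true);

disp(class(asDouble{1}))   % double
disp(class(asSingle{1}))   % single
```

Single precision halves the memory for those columns at the cost of roughly seven significant digits, so it only suits data that does not need more.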
Leo
29 Oct 2017
I did quit all the other programs. There are only 5 columns. That's why I don't know what the problem is. How do I use a tall array?
Walter Roberson
29 Oct 2017
10 million rows of 5 columns of double precision should take only about 400 megabytes, which should fit easily in your memory.
If you fire up Activity Monitor and look to see what the CPU use and Memory use are, what kind of behaviour are you seeing?
Does your data involve any strings, or just numbers? Are there any missing observations?
Could you confirm that your data is a text file? Is it comma-separated?
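The 400 megabyte estimate above is simply rows times columns times 8 bytes per double. A minimal sketch of that arithmetic (the variable names are illustrative only):

```matlab
nRows = 10e6;          % about 10 million rows
nCols = 5;             % five numeric columns
bytesPerDouble = 8;    % a double takes 8 bytes

megabytes = nRows * nCols * bytesPerDouble / 1e6;
fprintf('%.0f MB\n', megabytes);   % prints "400 MB"
```

Text columns held in cell arrays take considerably more than this, so the figure is a lower bound for mixed data.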
per isakson
29 Oct 2017
Edited: per isakson
30 Oct 2017
- See Large Files and Big Data; that's a starting point. However, I think it is overkill.
- I'm convinced textscan will do the job without problems. Did you try? What happened?
- Does the file consist of a few header lines and five columns of numerical data? Can you show the first ten rows of the file?
- Do you need all the data in memory simultaneously to work efficiently with it?
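On the tall-array question asked earlier in the thread, a minimal sketch might look like the following. It assumes R2016b or later (where tall and tabularTextDatastore are available), a file with a header line and numeric columns named X, Y and Z, and a small made-up file name, leo_sample.csv, written to the temporary folder purely for illustration.

```matlab
% Write a tiny sample file (three rows only, for illustration).
fname = fullfile(tempdir, 'leo_sample.csv');
fid = fopen(fname, 'w');
fprintf(fid, 'Sample,ID,Date,X,Y,Z\n');
fprintf(fid, '0,"00036020",30jun2002,1.869,12147.04333624268,.0167\n');
fprintf(fid, '0,"00036020",01jul2002,1.869,1156.482648479462,.0169\n');
fprintf(fid, '0,"00036020",02jul2002,1.869,1145.771792739868,.0169\n');
fclose(fid);

ds = tabularTextDatastore(fname);            % describes the file; nothing is read yet
ds.SelectedVariableNames = {'X', 'Y', 'Z'};  % parse only the numeric columns
tt = tall(ds);                               % tall table backed by the datastore

% Operations on tall arrays are deferred; gather triggers the chunked read.
meanY = gather(mean(tt.Y));
```

The design point is that the whole file never has to sit in memory at once, which matters more for files far larger than the one discussed here.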
Leo
30 Oct 2017
Edited: per isakson
30 Oct 2017
Sample,ID,Date,X,Y,Z
0,"00036020",30jun2002,1.869,12147.04333624268,.0167
0,"00036020",01jul2002,1.869,1156.482648479462,.0169
0,"00036020",02jul2002,1.869,1145.771792739868,.0169
0,"00036020",03jul2002,1.869,1294.462498138428,.0169
0,"00036020",04jul2002,1.869,1294.462498138428,.0169
0,"00036020",05jul2002,1.869,141.5848011779785,.0169
0,"00036020",06jul2002,1.869,141.5848011779785,.0169
0,"00036020",07jul2002,1.869,141.5848011779785,.0169
0,"00036020",08jul2002,1.869,1307.917965469361,.017
This is an extract from my data. ID contains string variables and I have a date variable. Other than that, the rest of the variables are numerical data.
I tried importing the file with the following code:
fid = fopen('File.csv');
Data = textscan(fid, '%f %C %d %f %f %f', 'delimiter','\t');
fclose(fid);
I also tried MATLAB's import tool a couple of times.
per isakson
30 Oct 2017
Edited: per isakson
30 Oct 2017
First I created a 2.4-million-row file, leo.csv, by adding many copies of your sample rows.
>> sad = dir('h:\m\cssm\leo.csv');
>> sad.bytes/1e9
ans =
0.1274 (GB)
Then I read the file with this script. (I use 'ReturnOnError',false to make textscan throw an error with a message if it fails.)
tic
fid = fopen( 'c:\tmp\leo.csv', 'r' );
cac = textscan( fid, '%f%s%s%f%f%f', 'Delimiter',',' ...
, 'ReturnOnError',false, 'CollectOutput',true );
fclose( fid );
toc
That took five seconds on my old vanilla desktop with a spinning hard disk. (The timing may be a bit of a cheat, because part of the text file, or even all of it, was already in the system cache.)
Elapsed time is 5.013201 seconds.
>> whos cac
  Name      Size            Bytes  Class    Attributes
  cac       1x3         693633360  cell
What remains is to parse the strings.
>> cac{2}{1,:}
ans =
"00036020"
ans =
30jun2002
The documentation says that one may use %q to read strings within double quotation marks, but that gave an unexpected result on my R2016a/Win7.
Your file has about four times as many rows as my test file, and four times five seconds is twenty seconds. That's reasonable.
Four times 0.69 GB is nearly 3 GB, which your 8 GB computer should be able to handle. It is the strings stored in cell arrays that eat the memory.
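The remaining string parsing could be sketched like this. It is an assumption-laden illustration, not the method used in the thread: it reads two of the sample rows from a character vector instead of the file, strips the quotation marks with strrep, and converts the dates with datetime using the 'ddMMMyyyy' input format and an English locale.

```matlab
% Two rows copied from the sample in this thread, held in memory for illustration.
txt = sprintf('%s\n', ...
    '0,"00036020",30jun2002,1.869,12147.04333624268,.0167', ...
    '0,"00036020",01jul2002,1.869,1156.482648479462,.0169');

cac = textscan(txt, '%f%s%s%f%f%f', 'Delimiter', ',', ...
    'ReturnOnError', false, 'CollectOutput', true);

% cac{2} is an N-by-2 cell array: column 1 holds the IDs, column 2 the dates.
ids   = strrep(cac{2}(:, 1), '"', '');   % remove the surrounding quotation marks
dates = datetime(cac{2}(:, 2), 'InputFormat', 'ddMMMyyyy', 'Locale', 'en_US');
```

If the IDs repeat often, converting ids with categorical afterwards may reduce memory further, though that depends on how many distinct IDs the file contains.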
"running for a couple of hours by now" is still a mystery to me.
per isakson
30 Oct 2017
tic
fid = fopen( 'c:\tmp\leo.csv', 'r' );
cac = textscan( fid, '%f%s%{ddMMMyyyy}D%f%f%f', 'Delimiter',',' ...
, 'ReturnOnError',false, 'CollectOutput',true );
fclose( fid );
toc
Letting textscan parse the date increases the time but saves memory.
Elapsed time is 35.009933 seconds.
>> whos cac
  Name      Size            Bytes  Class    Attributes
  cac       1x4         424673851  cell
Abhishek Singh
21 May 2019
Hi,
I have the same kind of question. I have 3.5 million rows and a little over 50 columns. Do you think this process will work, or should I try something else?
per isakson
21 May 2019
That depends:
- what's in the columns?
- how much RAM?
If it's pure numerical data, the size of the resulting matrix will be 1.4 GB:
>> 3.5*1e6 * 50 * 8 / 1e9
ans =
1.4
Abhishek Singh
22 May 2019
Edited: per isakson
22 May 2019
You are right, the columns mostly contain numbers. I have 8 GB RAM. I also removed columns that I may not need, which brings it down to 38 columns.
Basically, I do not want to import the data and save it to a .mat file. Rather, I would like a snippet that I can run and that filters the data into my workspace accordingly, which is what I think this answer does.
per isakson
23 May 2019
Edited: per isakson
23 May 2019
In this context there's a big difference between "pure numerical data" and "mostly some numbers". Either it is 100% numerical or it's not.
Proposal: Post a new question with a good title and more details on the format of the file. Attach an excerpt of the file. A few lines are enough.
If you post a comment here announcing the question, I'll find it.
Abhishek Singh
24 May 2019
Yes, I guess my question is a little bit different to the one here. Yes, the columns are purely numerical.
