How do I increase reading speed from a multi-gigabyte file?
farzad
17 Jun 2019
Hi all,
How do I increase the reading speed of an Excel file that contains several gigabytes of rows and columns?
18 Comments
dpb
17 Jun 2019
Dunno...'pends on what the data are and how they're saved...getting it out of Excel and into a .mat or stream file would undoubtedly be the fastest.
farzad
17 Jun 2019
The data are floats, and let's say 5 gigabytes.
Why .mat and why a stream file? What would the code look like?
Is using a table useful?
dpb
17 Jun 2019
'Cuz both .mat and stream files are binary representations of the actual bytes in memory, thus eliminating the need for conversion.
You've still not said which form of file it actually is; if it is .xls(x), then xlsread is fairly slow.
A table would be one choice for internal storage in MATLAB; how useful it is depends entirely on what the data are and how they need to be processed, which, like the actual file itself, you're keeping us totally in the dark about, so all we can do is guess...
dpb
17 Jun 2019
Well, with .xlsx files you have the choice between xlsread and readtable. You'll just have to test which is faster--one presumes probably readtable. If you have R2019a, you can try the new readmatrix, which is now recommended instead of xlsread.
For csv files, the historic ways are csvread, textscan, and fscanf, although again with the caveat of requiring R2019a, readmatrix is the TMW-recommended alternative now.
I don't have R2019a installed yet, so I can't comment on the relative performance between it and the alternatives.
Still, if speed matters and this will be done more than once, then doing the text read once and then using .mat or stream files will undoubtedly beat any of the alternatives.
You could, if your application can live with single precision, cut the file size in half by saving single instead of double. That's purely a question of what is required of the data itself as to whether that would be a viable alternative or not.
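A minimal sketch of that suggestion, assuming R2019a or later and a purely numeric sheet (the file names here are hypothetical):

```matlab
% Hypothetical file name; assumes a purely numeric spreadsheet and R2019a+.
M = readmatrix('bigdata.xlsx');      % recommended replacement for xlsread
% T = readtable('bigdata.xlsx');     % alternative if named columns are useful

% If single precision is acceptable, halve the size before saving:
Ms = single(M);
save('bigdata.mat', 'Ms', '-v7.3');  % one-time conversion for fast reloads
```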
Walter Roberson
18 Jun 2019
Edited: Walter Roberson, 18 Jun 2019
I wrote out a 1e6-by-500 array of doubles (= 4 gigabytes) in binary form, and tested how long loading took.
When saved as space-delimited text using save -ascii -double, load() of the resulting 12501000000-byte text file took 1416 seconds.
textscan() of that same file took 265 seconds.
fscanf() of the same file took 371 seconds.
When saved as a .csv file using dlmwrite() with precision 16, load() took 1107 seconds.
When saved as a -v7.3 .mat file, load() of the 3796914266-byte file took 25 seconds.
When saved as a pure binary file, fread(fid, [1e6 500], '*double') took 14.25 seconds the first time, and 2.1 seconds the second time (file in operating-system cache). fread(fid, [1 inf], '*double') takes 4.6 seconds when the file is in the cache, which tells us that there is more memory-management overhead when the size is unknown.
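A sketch of the write/read pair behind the pure-binary timing above (file name hypothetical; the dimensions are the ones from the test):

```matlab
% Write ~4 GB of doubles as raw bytes, then read them back.
data = rand(1e6, 500);

fid = fopen('testdata.bin', 'w');
fwrite(fid, data, 'double');              % no text conversion involved
fclose(fid);

fid = fopen('testdata.bin', 'r');
back = fread(fid, [1e6 500], '*double');  % '*double' avoids an extra cast
fclose(fid);
```

fwrite and fread both work column-major, so the round trip reproduces the original matrix.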
(I will update as I generate more times.)
farzad
18 Jun 2019
Thank you very much, Walter.
That is very much what I was searching for. How do you save as .mat?
Walter Roberson
18 Jun 2019
data = rand(1e6, 50);
save testdata.mat data -v7.3
but this relies upon having the data in the first place to write out as .mat.
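If the data start out as text, a one-time conversion along these lines (file names hypothetical) pays the slow read once and makes every later load fast:

```matlab
% One-time conversion: slow text read once, fast binary loads thereafter.
data = readmatrix('bigdata.csv');    % R2019a+; csvread/textscan on older releases
save('bigdata.mat', 'data', '-v7.3');

% In later sessions:
S = load('bigdata.mat');             % seconds instead of minutes
data = S.data;
```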
Walter Roberson
18 Jun 2019
I am having difficulty creating an Excel file that large. I wrote the file as .csv, but my Excel complains about running out of memory when trying to import it, which does not make sense to me.
Walter Roberson
18 Jun 2019
I have been updating the timings; you might want to have another look, above.
dpb
18 Jun 2019
All of which continues to say "ditch Excel" entirely for such large files...
I do find it interesting that textscan manages to beat fscanf -- one would think they would boil down to the same C runtime library call. Just out of curiosity, what were the two specific commands used, Walter? Oh--did you include the overhead of casting the cell array from textscan to double?
Walter Roberson
18 Jun 2019
Edited: Walter Roberson, 18 Jun 2019
I created a format with repmat of '%f' 50 times. I fopen the file and then
datacell = textscan(fid, fmt, 'collectoutput', 1);
Because this puts everything into a single cell, the overhead to extract the array is trivial.
The timing with CollectOutput 0, without joining the columns afterwards, was a hair higher but not statistically significant.
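For reference, the whole sequence described above might look like this sketch (file name hypothetical; 50 columns as in the description):

```matlab
% Build a 50-column '%f%f...%f' format and scan the whole file in one call.
fmt = repmat('%f', 1, 50);

fid = fopen('testdata.txt', 'r');
datacell = textscan(fid, fmt, 'CollectOutput', 1);
fclose(fid);

data = datacell{1};   % CollectOutput yields one numeric matrix in one cell
```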
dpb
2019 年 6 月 18 日
Yeah, that's kinda' what I suspected, thanks for confirming, Walter.
I still find it more than strange that there's a 30% reduction relative to fscanf -- what are they doing wrong with fscanf, then, is the question, if there's that much room for improvement?
These timings couldn't possibly be related to caching issues, I presume; you're too careful for that! :)
Answers (0)