Loading in data from large files
Hello everyone,
I'm going to try to explain what I'm trying to do below, and I welcome any suggestions the community can provide.
- Problem: I am trying to read a large (>10 GB) binary file and parse specific data. I can already parse the data, but MATLAB runs out of RAM when parsing such large amounts of it.
- Current logic: I have been using memmapfile to load the data, and it worked until I started having to deal with large file sizes. I am aware that memmapfile can skip a specified amount of the file and start at a later point, but I need to load a specific number of bits at a time. I'm trying to avoid using fread since it takes so long.
- Goal: I am looking to parse these binary files in smaller sections, using a command or some programmed logic to read a small percentage of the binary file at a time, run the computations, grab what I need and store it elsewhere, then grab the next small percentage and repeat. I've included some detailed notes below on my intent, but this is a project whose code I am not allowed to share (copyright).
Procedure for what I want to do:
- Load the header info of the file. This is static info that is easy to load, and I can already do this well.
- Load a percentage of the file. For this specific logic, assume the file is 10 GB and I want to read it in 500 MB sections.
- Parse the 500 MB just read. I'm aware that I may only read 492 MB, in which case I need to make sure I read 508 MB the next time.
- Store the parsed data in a structure.
- Clear used variables so they can be reused for the next section processed.
- Repeat.
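The steps above can be sketched as a chunked fread loop. This is only a sketch of the stated procedure, not the poster's actual code: `headerBytes`, `recBytes`, and `parseChunk` are hypothetical names standing in for the (unshared) header size, record size, and parser.

```matlab
% Chunked-read sketch (hypothetical names: headerBytes, recBytes, parseChunk).
fid = fopen('bigfile.bin', 'r');
header = fread(fid, headerBytes, '*uint8');    % static header info, read once

chunkBytes = 500e6;                            % target ~500 MB per pass
% Round the chunk down to a whole number of records so no record is split;
% this is one way to handle the "492 MB this time, 508 MB next time" issue.
chunkBytes = floor(chunkBytes / recBytes) * recBytes;

results = {};                                  % parsed output, one cell per chunk
while ~feof(fid)
    raw = fread(fid, [1 chunkBytes], '*uint8');  % '*' avoids conversion to double
    if isempty(raw), break; end
    results{end+1} = parseChunk(raw);          %#ok<AGROW> % your existing parser
    % raw is overwritten next iteration; no explicit clear is needed
end
fclose(fid);
```

Because `raw` is reassigned each pass, peak memory stays near one chunk plus the accumulated parsed results, rather than the whole 10 GB.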
I hope this helps. I'll try to keep an eye on this post, but it may take me some time to respond as I'll be traveling.
11 Comments
Ilya Dikariev
20 May 2022
That's a huge size; there's no way you can process it whole on a normal computer. What data type are you storing, and how many entries are there?
I would suggest taking every column as a single variable, storing it, and using it for each task separately.
Need more info on what "parsing" entails -- and why a structure? What's the end result to be?
You might look at the "roll your own memmap" solution I posted some time back for a lookup into a large data set -- it was a fixed-record-length file with a computable offset to the next record(s) of interest. Direct counting and fseek beat every other solution using builtin large-file operations by about 10X. Some such ideas might be useful here if we had any clue what the data file structure and required operations were.
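A minimal sketch of that direct-seek idea, assuming a fixed record length; the file name, `headerBytes`, `recBytes`, and `wantedRecords` are all made-up placeholders, since the referenced post and the real file layout aren't available here:

```matlab
% "Roll your own memmap": jump straight to each record of interest with
% fseek instead of reading (or mapping) everything in between.
headerBytes = 256;                 % assumed header size
recBytes    = 1024;                % assumed fixed record length
wantedRecords = [7 300 90210];     % records of interest (1-based, hypothetical)

fid = fopen('bigfile.bin', 'r');
for k = wantedRecords
    fseek(fid, headerBytes + (k-1)*recBytes, 'bof');  % computed byte offset
    rec = fread(fid, [1 recBytes], '*uint8');         % read just that record
    % ... extract the fields of interest from rec ...
end
fclose(fid);
```

The win comes from the offset being computable: only the bytes you actually need are ever transferred.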
MJFcoNaN
21 May 2022
Hello, you said, "I'm trying to avoid using fread since it takes so long."
Will you explain that in detail?
PS: I suggest you post the relevant lines of code, for example how you called memmapfile or fread.
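For reference, a memmapfile call that windows one section of a large file might look like the following -- this is a hedged illustration of the kind of call worth posting, with `headerBytes`, `chunkBytes`, and `chunkIndex` as assumed variables, not the poster's actual code:

```matlab
% Map only one chunkBytes-sized window of the file, starting after the
% header and chunkIndex earlier chunks (all names here are hypothetical).
m = memmapfile('bigfile.bin', ...
    'Offset', headerBytes + chunkIndex*chunkBytes, ...  % skip earlier chunks
    'Format', {'uint8', [chunkBytes 1], 'raw'}, ...     % view window as uint8
    'Repeat', 1);
raw = m.Data.raw;   % touching the data pages in only this window
```

Re-creating the map with a new Offset per chunk keeps the mapped region small, which is one way around the address-space pressure of mapping a >10 GB file at once.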
Jan
21 May 2022
"I'm aware that I may only read 492 MB" - why? If you instruct MATLAB to read 500 MB, it reads 500 MB.
"Clear used variables so they can be reused for the next section processed." - if you mean the clear command: this is rarely useful in MATLAB.
I'd definitely solve this with fread and not with a memory-mapped file.
"I'm trying to avoid using fread since it takes so long."
Wrong. External I/O doesn't get any faster than with fread; anything else adds overhead on top of the straight byte transfer that the OS/hardware will buffer.
It may well be that, having attempted to read very large files into memory, you ran into memory pressure and started page faulting or something similar that made it look slow, but fread itself won't have been the culprit.
It's going to be essentially impossible for anybody on the forum to help significantly with your specific problem unless you can find a way to obfuscate the data/program enough to post details that let us actually see what you're dealing with and where you need to get to.
Failing that, your alternative would seem to be to hire a consultant with whom you can sign nondisclosure agreements, or to figure out something on your own.
Walter Roberson
22 May 2022
fread() with a small size request can be slow if you are looping like that. Larger sizes are more efficient, at least until you start threshing. Remember to use the "*" precision marker to indicate that no datatype conversion is to be done:
fread(fid, [1 ELEMENTS], '*uint8')
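To illustrate the point about request size, here is a sketch contrasting the two patterns (`fid` and `N` are assumed to exist; the timings will vary by system, so none are claimed):

```matlab
% Fast pattern: one large request, data kept as uint8 via the '*' marker.
buf = fread(fid, [1 N], '*uint8');    % single transfer, no double conversion

% Slow pattern: many tiny freads in a loop -- per-call overhead dominates.
% b = zeros(1, N, 'uint8');
% for i = 1:N
%     b(i) = fread(fid, 1, '*uint8');
% end
```

Without the "*", `fread(fid, [1 N], 'uint8')` returns a double array eight times the size of the raw data, which matters at these file sizes.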
dpb
22 May 2022
"..., at least until you start threshing."
<Grin> "Threshing" is what we do during wheat harvest, Walter (if there were going to be any this year which is looking more and more doubtful owing to extreme drought). It's "thrashing" here, though... <VBG>
Walter Roberson
22 May 2022
Ah, but you start threshing "when the cows come home".
dpb
22 May 2022
The cows are all gone for this year... we ran out of grass with the drought and had nothing to cut for hay... it's pretty grim out here. There are chances of rain from tomorrow night through Tuesday AM; we'll just have to see. Since late last summer, SW KS has mostly been missed by one supposed-to-be-good chance after another -- the typical scenario in a drought. The NWS guy in Dodge has the mantra, "When in doubt, drought!", which isn't at all heartening when it comes true so often.
My dad, when asked in such times as these "Is it going to rain?" had the stock answer of "It always has..."
He, of course, went through the Dirty '30s as well as the early '50s and several other such periods; I only remember from the '50s onwards, and was gone for 25-30 years, although I've been back for 20 now.
Walter Roberson
23 May 2022
Where I am, precipitation is running 2 to 3 times normal, and farmers are (overall) happy because this is helping refill the aquifers after years of borderline drought.
dpb
23 May 2022

We're in that big D4 area that stretches from KS all down E NM and W TX.
Answers (0)