Internal problem while evaluating tall expression (requested 40.5 GB array)

Hi, I'm working with a large data set of approximately 500k rows and 6k columns. I'm using a datastore and a tall array to handle the loading. The file is a comma-separated file in which most values are coded as integers or strings. I have a dictionary for decoding these values. What I am trying to do is replace the codes with their actual meanings and save the decoded file locally.
Below is the structure of my program:
classdef myTable < handle
    % ...
    methods
        function this = myTable
        end
        % ...
    end
    methods
        function loadCsv(this)
            % ...
            ds = datastore(this.csvSource);
            ds.SelectedFormats = repmat({'%q'}, 1, length(ds.VariableNames));
            this.csvTable = tall(ds);
        end
        % ...
        function decoding(this)
            % ...
        end
        function export(this)
            % ...
            write([this.outputDir '/' this.csvTableName '_decoded_*.csv'], this.csvTable, 'WriteFcn', @myWriter);
        end
    end
end
%% helper
function myWriter(info, data)
    filename = info.SuggestedFilename;
    writetable(data, filename, 'FileType', 'text', 'Delimiter', ',')
end
The error occurred at this.export:
Error using digraph/distances
Internal problem while evaluating tall expression. The problem was:
Requested 73733x73733 (40.5GB) array exceeds maximum array size preference. Creation of arrays greater than this limit
may take a long time and cause MATLAB to become unresponsive.
Question: I was thinking that the write function should be partitioning the data while exporting. Isn't that true? Why did MATLAB still try to create such a big array?
I am using a Windows machine with 16 GB RAM and MATLAB R2020a (I tried R2019a first and just upgraded to R2020a).
Thank you!

16 Comments

Peng Li on 23 Mar 2020
Hey, is there anybody who could help me out with this, please? Thank you!
Peng Li on 23 Mar 2020
Hoping someone could take a look at this issue, lol.
Peng Li on 24 Mar 2020
Hey, could anybody stop by and help with this, please?
Peng Li on 24 Mar 2020
Are there any MathWorks staff members who could shed some light on this, please?
per isakson on 24 Mar 2020
Edited: per isakson on 24 Mar 2020
You are asking for too much. I've looked at your code and made a working example based on an example in the documentation. It seems to work. I fail to understand what's going wrong for you. Your code includes a lot of irrelevant stuff.
Proposal
Sean de Wolski on 24 Mar 2020
Yes, please provide a few sample rows.
Peng Li on 24 Mar 2020
Well, very briefly, my question is about writing a tall array to local disk. No matter how big my data set is, my understanding is that the write function for a tall array should partition the data and write the partitions one by one. I've tested my write helper on another data set, and it works fine. So I don't quite understand why, in this case, MATLAB needs to create an array of over 40 GB. That's where the issue is. I hope this is clear now.
Peng Li on 25 Mar 2020
Thank you both for your valuable time and attention on my question!
Sean de Wolski on 25 Mar 2020
Your understanding is correct.
But we need to know why digraph is trying to create a 73733x73733 array. It could be that you have something shadowed, so it's not calling a builtin; it could be expected, and you need to partition differently. I don't know.
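One way to check for shadowing, as a minimal sketch: list every definition of the relevant names that MATLAB can see. The list of names below is only a guess at what might matter here; it is not taken from the thread.

```matlab
% List every definition of these names visible to MATLAB.
% Unexpected entries outside the MATLAB installation folder
% may indicate a shadowed function.
names = {'distances', 'digraph', 'write', 'writetable'};  % illustrative choice
for k = 1:numel(names)
    hits = which(names{k}, '-all');   % cell array of locations
    fprintf('%s: %d definition(s) found\n', names{k}, numel(hits));
    disp(hits);
end
```

Several entries for a name are normal when different classes define a method with that name; what matters is whether any entry points to a user folder.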
Peng Li on 25 Mar 2020
Thanks Sean. I tried my program using a subset of the data which contains 100k rows instead of 500k rows, and 1k columns instead of 6k columns. It went through. The program exported 10 .csv files into the folder I defined.
With this, it means there shouldn't be anything that shadowed the builtin. Is that true?
Peng Li on 25 Mar 2020
And why does it go to digraph? Is that a required module for writing tall arrays? Or did you think that something might be conflicting with something else?
Sorry, I'll see if I can provide a sample file. It's human data, so I need to be careful.
Walter Roberson on 25 Mar 2020
A complete error message showing traceback would help.
Peng Li on 25 Mar 2020
Thanks Walter. Agreed! I'm trying to rerun this. I think I might have figured out the reason. It might be due to the categorical function that I used to replace codes with their actual meanings. Some of my codebooks include over 10k codes and their corresponding meanings. I'm not sure whether the categorical function needs to create a bunch of other arrays in order to assign categories to each of the inputValues. Instead of using categorical, I switched to the replace function. Testing right now.
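For illustration only, here is a small in-memory sketch comparing categorical with an exact-match lookup. The codebook and sample values are made up, since the actual codebook and decoding code are not shown in the thread, and whether this carries over unchanged to tall arrays is not verified here.

```matlab
% Hypothetical codebook: codes and their meanings (not from the thread).
codes    = ["1"; "2"; "11"];
meanings = ["male"; "female"; "unknown"];
raw      = ["1"; "11"; "2"; "1"; "99"];      % sample column of coded values

% Approach 1: categorical with an explicit value set and category names.
asCat = categorical(raw, codes, meanings);    % "99" is not in the value set

% Approach 2: exact-match lookup with ismember; unknown codes stay as-is.
[found, idx] = ismember(raw, codes);
decoded = raw;
decoded(found) = meanings(idx(found));

disp(asCat);
disp(decoded);
```

Note that replace operates on substrings, so a code such as "1" can also match inside "11"; matching whole values avoids that ambiguity.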
Sean de Wolski on 26 Mar 2020
Tall uses a digraph to figure out the fewest lower-level operations that need to be done, so it can traverse the data set as few times as possible and without repetition.
Peng Li on 26 Mar 2020
Here is a complete error message. For some reason, it changes from time to time. It is now requesting an array of over 500 GB...
Error using digraph/distances (line 72)
Internal problem while evaluating tall expression. The problem was:
Requested 269757x269757 (542.2GB) array exceeds maximum array size preference. Creation of arrays greater than this
limit may take a long time and cause MATLAB to become unresponsive.
Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadata (line 814)
allDistances = distances(cg.Graph);
Error in matlab.bigdata.internal.lazyeval.LazyPartitionedArray>iGenerateMetadataFillingPartitionedArrays (line 795)
[metadatas, partitionedArrays] = iGenerateMetadata(inputArrays, executorToConsider);
Error in ...
Error in tall/write (line 248)
iDoWrite(location, ta, writeFunction);
Error in myTable/export (line 94)
write([this.outputDir '/' this.csvTableName '_decoded_*.csv'], this.csvTable, 'WriteFcn', @myWriter);
Error in myTable/update (line 33)
this.export;
Error in myTest (line 19)
tab.update;
Caused by:
Error using matlab.internal.graph.MLDigraph/bfsAllShortestPaths
Requested 269757x269757 (542.2GB) array exceeds maximum array size preference. Creation of arrays greater than
this limit may take a long time and cause MATLAB to become unresponsive.
Peng Li on 27 Mar 2020
An update:
Is it possible that, for a tall array whose elements are strings, most of them quite long, MATLAB cannot handle the write to disk using the default partitioning method? The error occurs every time in LazyPartitionedArray, which calls the distances function. That function creates a distance matrix, which in my case is always larger than 10k-by-10k, or even 100k-by-100k.
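The sizes in both error messages are consistent with a dense N-by-N matrix of doubles, which is what distances returns for a graph with N nodes. A quick check of the arithmetic, assuming 8 bytes per element and that "GB" in the message means 2^30 bytes (an assumption, though it reproduces the reported figures):

```matlab
% Memory needed for a dense N-by-N double matrix, as returned by
% distances(G) for a graph with N nodes.
% Assumptions: 8 bytes per double; "GB" in the message means 2^30 bytes.
nodeCounts  = [73733 269757];         % sizes taken from the two error messages
bytesNeeded = nodeCounts.^2 * 8;      % N^2 elements, 8 bytes each
gbNeeded    = bytesNeeded / 2^30;
fprintf('%d nodes -> %.1f GB\n', [nodeCounts; gbNeeded]);
```

If, as the traceback suggests, the graph is the one tall builds for its deferred operations, then N would reflect the number of queued operations rather than the number of rows in the data, though the thread does not confirm this.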


Answers (0)


Asked: 21 Mar 2020

Commented: 27 Mar 2020
