Classification with a huge dataset

I'm trying to run classification on a huge dataset containing data from 6 persons for training, and with just 1 person's data I'm already getting this error: "Requested 248376x39305 (9.1GB) array exceeds maximum array size preference." I'm trying Bagged Tree and Neural Network classifiers, and I'd like to ask how I can do this. Is it possible to train these classifiers on portions of the dataset (i.e., continue training a saved classification model)?

9 Comments

Greg Heath on 7 Nov 2016
Please explain how 248376 x 39305 constitutes a 1 person data set
[ I N ] = size(input)
[ O N ] = size(target)
Thanks,
Greg
Mindaugas Vaiciunas on 7 Nov 2016
Edited: Walter Roberson, 7 Nov 2016
Input matrix size 248376 x 765
Target matrix size 248376 x 1
Then, when I try to build a TreeBagger model, it creates a 248376 x 39305 matrix. P.S. As you can see, one frame has 765 features.
Walter Roberson on 7 Nov 2016
Please show your Tree Bagging code. https://www.mathworks.com/help/stats/treebagger.html does not return matrices.
Mindaugas Vaiciunas on 7 Nov 2016
Right, it doesn't return matrices because it can't even start, due to the error above about running out of RAM. The code is simple:
Mdl = TreeBagger(50,Features,FeaturesTarget);
So I'm thinking about decomposing all the data into smaller files, but I don't know how to train the classifier again and again on those portions of data. I need something that lets me update a classifier with new data without retraining the entire thing from scratch.
Walter Roberson on 7 Nov 2016
Have you considered reducing the number of trees?
Mindaugas Vaiciunas on 8 Nov 2016
Reducing the number of trees doesn't help. I have tried splitting the data across two different models, making them compact, and combining them; at first glance this helps, but I can't reach a high recognition rate. I think I need an "online" algorithm that can continue training a saved model on new data.
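For reference, the compact-and-combine approach described here can be sketched as follows. This is a minimal sketch assuming the Statistics and Machine Learning Toolbox; `Features1`/`Target1` and `Features2`/`Target2` are hypothetical variables holding the two chunks of data:

```matlab
% Train a separate ensemble on each chunk of the data.
Mdl1 = TreeBagger(25, Features1, Target1);
Mdl2 = TreeBagger(25, Features2, Target2);

% Compact the ensembles to drop the stored training data, then
% append the trees of the second ensemble to the first.
C1 = compact(Mdl1);
C2 = compact(Mdl2);
C  = combine(C1, C2);   % CompactTreeBagger with 50 trees total

% Predict with the combined compact ensemble.
labels = predict(C, FeaturesNew);
```

Note that each sub-ensemble only ever sees its own chunk, so its trees are bagged from less data than a single ensemble trained on everything; this is one reason the combined model's recognition rate can lag.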
Greg Heath on 9 Nov 2016
Edited: Greg Heath, 9 Nov 2016
I still don't get it
39305/765
ans =
51.3791
Regardless, I think you should use dimensionality reduction via feature extraction.
Hope this helps,
Greg
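One common way to do the dimensionality reduction Greg suggests is PCA, keeping only the components that explain most of the variance. A sketch assuming the Statistics and Machine Learning Toolbox `pca` function; the 95% threshold is an arbitrary choice:

```matlab
% Project the N x 765 feature matrix onto its principal components.
[coeff, score, ~, ~, explained] = pca(Features);

% Keep just enough components to explain 95% of the variance.
k = find(cumsum(explained) >= 95, 1);
ReducedFeatures = score(:, 1:k);

% Train on the reduced matrix. New data must get the same projection:
%   NewReduced = (NewFeatures - mean(Features)) * coeff(:, 1:k);
Mdl = TreeBagger(50, ReducedFeatures, FeaturesTarget);
```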
Mindaugas Vaiciunas on 9 Nov 2016
One solution is to average some of the features for dimensionality reduction, but that may affect the recognition rate.
Greg Heath on 10 Nov 2016
Of course it will affect it. However, the way to choose is to set a limit on the loss of accuracy.


Answers (1)

Walter Roberson on 7 Nov 2016


Add more memory (RAM) to your computer. Then check or adjust Preferences -> MATLAB -> Workspace -> MATLAB array size limit.
Or, you could set the division ratios so that a much smaller fraction is used for training and validation, with most of it left for test. This effectively uses only a small subset of the data, but a different small subset each time it trains.
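For the neural-network route, the division ratios can be set on the network object before training. A sketch assuming the Deep Learning Toolbox `patternnet`; the ratios and hidden-layer size are arbitrary, and `Features`/`FeaturesTarget` are the matrices described earlier in the thread:

```matlab
% patternnet expects one column per sample, so transpose the
% N x 765 feature matrix and one-hot encode the class labels.
X = Features';                       % 765 x N
T = full(ind2vec(FeaturesTarget'));  % classes x N one-hot targets

net = patternnet(10);                % 10 hidden units (arbitrary)
net.divideFcn = 'dividerand';        % random division on each run
net.divideParam.trainRatio = 0.10;   % train on only 10% of samples
net.divideParam.valRatio   = 0.10;
net.divideParam.testRatio  = 0.80;   % most data held out as test

net = train(net, X, T);
```

Because `dividerand` picks a fresh random split every time `train` is called, each training run effectively uses a different small subset of the data, as described above.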

6 Comments

Mindaugas Vaiciunas on 7 Nov 2016
More memory is not a solution for this; it would need around 36 GB of RAM to hold all the training data. With division ratios, would I be able to keep training the same saved model with small portions of the data again and again?
Walter Roberson on 7 Nov 2016
Amazon Web Services, among other providers, make available machines with more than 36 Gb of RAM. If you had that much RAM your program would run; therefore adding RAM is a solution for the problem.
Mindaugas Vaiciunas on 8 Nov 2016
This project isn't commercial; it's for a university master's degree, so adding RAM is not a solution for me, but thanks for the answer.
Walter Roberson on 8 Nov 2016
https://www.mathworks.com/products/parallel-computing/matlab-parallel-cloud/ 16 workers, 60 Gigabytes, $US 4.32 per hour educational pricing, including compute services.
Or if you provide your own EC2 instance, https://www.mathworks.com/products/parallel-computing/parallel-computing-on-the-cloud/distriben-ec2.html $0.07 per worker per hour for the software licensing from MATLAB. For example you could do https://aws.amazon.com/ec2/pricing/on-demand/ m4.4xlarge, 16 cores, 64 gigabytes, $US 0.958 per hour for the EC2 service. Between that and the $0.07 per worker from Mathworks it would come in less than $US2.50 per hour. About the price of a Starbucks "Grande" coffee.
Remember, your time is not really "free". At the very least you need to take into account "opportunity costs" -- like an hour spent fighting a memory issue is an hour you could have been working on a minimum wage job.
Mindaugas Vaiciunas on 9 Nov 2016
Thanks for the advice; I'll keep it in mind if there is no other solution.
Walter Roberson on 9 Nov 2016
Let me put it this way:
  • You do not wish to reduce the number of trees or the data because doing so might decrease the recognition rate.
  • We do not have a magic low-memory implementation of the TreeBagger available.
  • You do not have enough memory on your system to run the classification using the existing software
Your choices would seem to be:
  • write the classifier yourself, somehow not using as much memory; or
  • obtain more memory for your own system; or
  • obtain use of a system with more memory



Asked: 6 Nov 2016
Last comment: 10 Nov 2016
