multiple for loops split data

12 views (last 30 days)
Owner5566 on 15 Jan 2020
Commented: Owner5566 on 16 Jan 2020
hey guys,
currently my function is really slow because of the sheer amount of data and because it only uses one thread.
Since I have a multicore processor (Ryzen 5 3600, 6 cores / 12 threads), I want to make use of it by splitting my data, running the same function on each piece in parallel, and putting the results back together.
I have found the spmd and parfor commands.
The rough steps I want to take:
  1. split the data (tables) n times
  2. give each worker its share of the split data plus the raw data (which I need for the function)
  3. run a function that modifies the split data on each worker
  4. put all the split data back together
Also, I am limited to the functions available in MATLAB R2015b.
How can I do that? Can you please help me?
This is what I tried:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
split1 = data((data.ID <= divider),:);
split2 = data((data.ID > divider) & (data.ID <= divider*2),:);
split3 = data((data.ID > divider*2) & (data.ID <= divider*3),:);
split4 = data((data.ID > divider*3) & (data.ID <= divider*4),:);
split5 = data((data.ID > divider*4) & (data.ID <= divider*5),:);
split6 = data((data.ID > divider*5) & (data.ID <= divider*6),:);
split7 = data((data.ID > divider*6) & (data.ID <= divider*7),:);
split8 = data((data.ID > divider*7) & (data.ID <= divider*8),:);
split9 = data((data.ID > divider*8) & (data.ID <= divider*9),:);
split10 = data((data.ID > divider*9) & (data.ID <= divider*10),:);
split11 = data((data.ID > divider*10) & (data.ID <= divider*11),:);
split12 = data((data.ID > divider*11) & (data.ID <= specs.numberOfRows),:);
dataset_array={split1, split2,split3,split4,split5,split6,split7,split8,split9,split10,split11,split12};
newDataset_array = cell(1, workers);
parfor i = 1:workers
    newDataset_array{i} = myFunction(dataset_array{i}, data); %braces to extract the table from the cell
end
newData = [];
for i = 1:workers
    newData = [newData; newDataset_array{i}]; %#ok<AGROW>
end
Thanks in advance
  11 comments
Owner5566 on 15 Jan 2020
Okay, I already did it that way; I just wanted to know if I missed anything.
But thanks. Works like a charm ;)
Guillaume on 15 Jan 2020
Comment by Owner5566, mistakenly posted as an Answer, moved here:
Now I just need a way to make the big data available to all workers.
The way I do it now, they all receive it as a function argument, which leads to a lot of memory use.
Can't I make it available to all of them?
I need it for filtering inside the functions.
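One way to avoid re-transferring the table through the function arguments, assuming Parallel Computing Toolbox is available, is parallel.pool.Constant (introduced in R2015b): the data is sent to each worker once when the Constant is built, rather than on every parfor iteration. A minimal sketch, reusing the workers/destination/dowork names from the accepted answer below:
dataConst = parallel.pool.Constant(data); %the big table is transferred to each worker once
processeddata = cell(1, workers);
parfor i = 1:workers
    %dataConst.Value refers to the worker-local copy of the table
    processeddata{i} = dowork(dataConst.Value, destination == i);
end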


Accepted Answer

Guillaume on 15 Jan 2020
For the record, this is my suggested modification to the original code:
workers = 12;
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1)); %split ID into workers bins
dataset_array = splitapply(@(rows) {data(rows, :)}, (1:height(data))', destination);
which is a good demonstration of why numbered variables are bad: 3 lines instead of 15, and dead easy to change the number of workers.
However, that doesn't help at all with your parallel computation. I'm not entirely clear why you'd want to pass the whole dataset to each worker. If all the data is needed by each worker, then you're sort of losing the benefit of parallelisation. In addition, it may well be that the overhead of passing the data to each worker cancels out any speed-up from parallelisation.
If you need to pass the whole table to each worker, then there's not much benefit in passing a section of the table at the same time. You're better off just passing the row indices that the worker should work on and letting the worker extract those rows. That should result in less overhead:
workers = 12;
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1)); %split ID into workers bins
processeddata = cell(1, workers);
parfor i = 1:numel(workers)
processeddata{i} = dowork(data, destination == i); %pass the whole of data and a logical vector indicating which row the worker should work on
end
with
function result = dowork(data, workingrows)
datatoworkon = data(workingrows, :);
%...
end
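For step 4 of the original plan (putting the pieces back together), the cell array filled by the parfor loop can be concatenated in a single call once the loop finishes; a minimal sketch:
newData = vertcat(processeddata{:}); %stack the per-worker result tables vertically
If row order matters, sort afterwards, e.g. sortrows(newData, 'ID'), assuming the results still carry the ID variable.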
But, if you can, I would strongly recommend that you upgrade to a more recent version of MATLAB. R2016b introduced tall arrays and tall tables, which are basically arrays and tables designed for big data. Operations on these are automatically parallelised if you have Parallel Computing Toolbox.
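A minimal sketch of the tall-array workflow, assuming an upgrade to R2016b or later and that the data lives in a CSV file (the file name here is a hypothetical placeholder):
ds = datastore('mydata.csv');  %out-of-memory access to the file
t = tall(ds);                  %tall table; operations on it are deferred
selected = t(t.ID <= 1000, :); %builds a lazy evaluation plan, nothing computed yet
result = gather(selected);     %gather triggers the (possibly parallel) evaluation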
Finally, for processing big data you also have the mapreduce functions, which should be available in your version. Again, mapreduce automatically parallelises the work for you. mapreduce is not suitable for all kinds of processing and can be a bit of a learning curve if you've never used it, but it may be useful for what you're doing.
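A minimal mapreduce sketch; the datastore file name and the mapper/reducer names are hypothetical placeholders, and myFunction is assumed here to take just the chunk (the OP's version also takes the raw data):
ds = datastore('mydata.csv');
outds = mapreduce(ds, @myMapper, @myReducer);
result = readall(outds);

function myMapper(dataChunk, info, intermKVStore)
    %process one chunk and emit the partial result under a common key
    add(intermKVStore, 'partial', myFunction(dataChunk));
end

function myReducer(key, intermValsIter, outKVStore)
    %concatenate all partial results for this key
    parts = {};
    while hasnext(intermValsIter)
        parts{end+1} = getnext(intermValsIter); %#ok<AGROW>
    end
    add(outKVStore, key, vertcat(parts{:}));
end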
  9 comments
Guillaume on 16 Jan 2020
D'oh! I didn't notice the numel, which is clearly a typo. numel(workers) is always going to be 1. It should indeed have been
parfor i = 1:workers
%...
end
or
parfor i = 1:numel(processeddata)
%...
end
Owner5566 on 16 Jan 2020
Okay, then thank you again!


More Answers (0)

Release: R2015b