Randomly select n elements from n arrays

I need to randomly select 50 elements from 50 lists to build a new array, one element per list, where each list contains 100 elements. I then need to apply some post-selection sorting/processing over all possible new arrays to find the indexes of all 50 elements of the best array.
The problem is the number of possible combinations for the new array. How can I realize this in MATLAB? Can we try mapreduce?
The problem feels like this: I have 50 classes and each class has 100 students, and I need to find the 'best' team of 50 students, one from each class. Each student has a unique score/contribution to the team, so I scan all possible teams and sort them to find the one I want, as well as the student ID in each class.

1 Comment

Feng Cheng
Feng Cheng on 4 Aug 2021
Edited: Feng Cheng on 4 Aug 2021
The sorting function is somewhat complex but cheap to compute; it only depends on two parameters: the students' scores and the class indexes.
I don't need to store all the possible solutions in memory. I have done a mapreduce test sorting huge numbers stored separately in 100 GB of text files, and that test did not fit traditional in-memory computation either. My case feels similar, but I have no idea how to handle much bigger numbers of combinations. I know this number is beyond computation; it's an extreme example. I can simplify my problem to a smaller number, e.g., , which might be about 19 GB of data in total. Then it comes back to the same question.
I could set up 8 nested for-loops to save all combination indexes into multiple files on disk and apply mapreduce to solve the problem, although that feels clumsy.
But it's hard to extend that workflow to bigger data that is too big to compute, like the case I mentioned above. I was thinking there might be some way to get around this big number; that's why I came here looking for better tools. I don't know the Optimal Stopping Strategy, but I can check it out anyway.
I'm sorry to take up so much of your energy. Let's stop here.


Accepted Answer

Image Analyst
Image Analyst on 4 Aug 2021

0 votes

I believe what they may be after is the (what should be more famous) "Optimal Stopping Strategy", or the "37% rule". They might have discussed it in your course. Basically it says that you review the first 37% of the possibilities without selecting any of them. Then you continue reviewing with the rule that the first possibility you encounter that is better than everything you have seen so far is the one you pick. If you do that, there is a 37% chance you have picked the best possibility.
Pretty weird, right? But there is actually some theory behind it.
There is a web page with more information, including a table and more complicated twists on the problem.
However, even 37% of 100^50 is too big. If your course did not have some particular strategy in mind, Wikipedia has some you can try: https://en.wikipedia.org/wiki/Optimal_stopping
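As a sanity check of the rule (not part of the original answer), a small Monte-Carlo simulation of the classic secretary problem can be sketched like this; all names and the trial counts are illustrative:

```matlab
% Monte-Carlo sketch of the 37% rule for n candidates in random order.
nTrials = 10000;            % number of simulated runs
n       = 100;              % candidates per run
k       = round(n/exp(1));  % skip the first ~37% without picking
wins    = 0;
for t = 1:nTrials
    scores = randperm(n);          % random candidate order; best score is n
    best0  = max(scores(1:k));     % best seen during the skipped phase
    pick   = scores(end);          % if nothing beats best0, we take the last
    for i = k+1:n
        if scores(i) > best0
            pick = scores(i);      % first candidate better than all skipped
            break
        end
    end
    wins = wins + (pick == n);     % did we pick the overall best?
end
fprintf('Picked the best %.1f%% of the time\n', 100*wins/nTrials);
% Empirically this lands near 37%, matching the theory.
```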

More Answers (4)

Image Analyst
Image Analyst on 4 Aug 2021

0 votes

How do you figure 100^50? You have 50 lists of 100 elements each, from which you are supposed to select half the elements randomly. So you will have 50 output vectors. You can use randperm().
For one vector, let's call it vec1, you can do
randomIndexes = randperm(length(vec1), 50); % Get 50 random indexes.
% Now extract those into vec1Output:
vec1Output = vec1(randomIndexes);
% Apply "some sorting processing":
vec1Output = sort(vec1Output, 'ascend'); % Or whatever you want to do.
There, just do that also for vec2 through vec50 and you'll have 50 output vectors.
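Rather than repeating those three lines by hand for vec2 through vec50, the same steps can be run in a loop if the vectors are kept in a cell array; `lists` and `outputs` below are hypothetical names, not from the answer above:

```matlab
% Hypothetical setup: 50 lists of 100 random values each, in a cell array.
lists   = arrayfun(@(~) rand(1, 100), 1:50, 'UniformOutput', false);
outputs = cell(1, 50);
for k = 1:50
    randomIndexes = randperm(numel(lists{k}), 50); % 50 distinct random indexes
    outputs{k}    = sort(lists{k}(randomIndexes), 'ascend');
end
```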

7 Comments

Chunru
Chunru on 4 Aug 2021
randperm generates distinct indexes. The question seems to say that the index can repeat across different lists (so that there are 100^50 possibilities).
Feng Cheng
Feng Cheng on 4 Aug 2021
Yeah, I'm sorry I did not make this clear. The problem is like this: I have 50 classes and each class has 100 students, and I need to find the 'best' team of 50 students, one from each class. Each student has a unique score/contribution to the team, so I scan all possible teams and sort them to find the one I want, as well as the student ID in each class.
Walter Roberson
Walter Roberson on 4 Aug 2021
Suppose you have a specific list of 50 students, a proposed team. Then how do you find out how much contribution each particular student makes to the team? You cannot possibly have a lookup table that holds information for each possible combination, so there must be some kind of algorithm.
Feng Cheng
Feng Cheng on 4 Aug 2021
Yeah, the student team is just a figure of speech. Let's say each student has one score for something; then I have an algorithm (call it findbestS) to evaluate the overall score of the team, for example the sameness within the team, or the standard deviation of the team. I think this algorithm is independent of this question, since it is non-linear and cannot be exploited to save computation, if that is your concern.
Walter Roberson
Walter Roberson on 4 Aug 2021
Exactly what is computed in findbestS() is important for solving the problem in a reasonable amount of time.
There are strategies such as Dynamic Programming https://en.wikipedia.org/wiki/Dynamic_programming that can make some kind of searches feasible that would otherwise be intractable.
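To illustrate why the structure of findbestS matters so much: in the (probably unrealistic) special case where the team score happened to be a plain sum of independent per-student scores, no enumeration would be needed at all. This is only a sketch of that degenerate case; `scores` is a hypothetical 100-by-50 matrix, not the poster's real data:

```matlab
% If the team score were additive over students, the best team is simply
% the best student per class -- a one-line reduction, not a 100^50 search.
scores = randn(100, 50);                % scores(i,j): student i of class j
[bestScore, bestStudent] = max(scores); % column-wise maxima
teamScore = sum(bestScore);             % total score of the optimal team
% bestStudent(j) is the index of the chosen student in class j.
```

Any exploitable structure in the real scoring function (additivity, monotonicity, overlapping subproblems) can collapse the search space in a similar way; that is exactly what dynamic programming looks for.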
Image Analyst
Image Analyst on 4 Aug 2021
@Chunru, originally, when I answered, the question did not say to get the "best" or to try multiple sets of random indexes, as the original poster admitted.
Peter O
Peter O on 4 Aug 2021
@Feng Cheng are you able to give us a little bit more information on what the sorting function needs to achieve and what the base list parameters are? Are they students with scores or are they actually just a big parameter grid?
As @Walter Roberson brings up, this is sounding a lot more like an optimization problem than a sorting problem, and DP looks like a good candidate. If the cost function can't be solved through a dynamic programming approach then something like a stochastic optimizer (genetic algorithms, particle swarm, simulated annealing, differential evolution) might also be suited to the task. In either case the algorithms are solving to find the "best" solution without needing to keep the entire possible solution space in memory.
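As a concrete illustration of the stochastic-optimizer idea, here is a minimal simulated-annealing sketch for the team-selection problem. The scoring function below (`scoreFcn`) is an arbitrary placeholder standing in for the poster's unspecified findbestS, and all parameters are illustrative:

```matlab
% Minimal simulated annealing over teams: one student index per class.
rng(0);
data     = randn(100, 50);   % placeholder: 100 students x 50 classes
% Placeholder score: prefer teams with low spread (negated so higher = better).
scoreFcn = @(idx) -std(data(sub2ind(size(data), idx, 1:50)));

team = randi(100, 1, 50);    % random starting team
best = team;  bestScore = scoreFcn(team);
T = 1;                       % temperature
for iter = 1:1e5
    cand = team;
    cand(randi(50)) = randi(100);   % mutate one class's pick
    d = scoreFcn(cand) - scoreFcn(team);
    if d > 0 || rand < exp(d / T)   % always accept improvements;
        team = cand;                % sometimes accept worse moves
    end
    if scoreFcn(team) > bestScore
        best = team;  bestScore = scoreFcn(team);
    end
    T = 0.999 * T;                  % cool down
end
% 'best' holds the best team found within the iteration budget.
```

This explores a tiny fraction of the 100^50 space but often finds good solutions; whether it finds near-optimal ones depends entirely on how well-behaved the real scoring function is.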


Chunru
Chunru on 4 Aug 2021
Edited: Chunru on 4 Aug 2021

0 votes

nele = 100; nlist = 50;
data = randn(nele, nlist);           % data in the lists, one list per column
idx  = randi([1 nele], 1, nlist);    % one random row index for each list
idx  = (0:nele:nele*nlist-1) + idx;  % convert to linear indexes
x    = data(idx);                    % selected data, one element per list

7 Comments

Feng Cheng
Feng Cheng on 4 Aug 2021
Thanks for your response. Unfortunately, I think I need to keep all the possible x; the case you present here is just one of them. I need to apply a subsequent algorithm on each x to find the best x for my target, and the indexes composing that optimal x are what I need.
Chunru
Chunru on 4 Aug 2021
Edited: Chunru on 4 Aug 2021
As mentioned, 100^50 = 10^100 elements = 10^91 giga-elements. Even if each element takes only 1 byte, you need 10^91 GB of memory.
Feng Cheng
Feng Cheng on 4 Aug 2021
Yeah, I have no idea whether I can use mapreduce in this case, but it's indeed a tough task.
Chunru
Chunru on 4 Aug 2021
The answer is definitely no. Think of something else.
Feng Cheng
Feng Cheng on 4 Aug 2021
That's sad, but good to know.
Chunru
Chunru on 4 Aug 2021
Try to appreciate the size of this number:
Let's say you have a 10^100 x 50 array of double. The total size in memory is 10^100 x 50 x 8 = 4x10^102 bytes.
Let's say you save the data on hard disks (not RAM). Assume each hard disk holds 4 TB = 4x10^12 bytes. You will need (4x10^102)/(4x10^12) = 10^90 such hard disks. Assume each hard disk weighs 100 g. The total weight will be 10^89 kg. Compare that with the mass of the Sun, 1.989x10^30 kg: all the hard disks together would be ~5x10^58 times heavier than the Sun.
Does the whole universe weigh more than 10^58 Suns?
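The order-of-magnitude arithmetic can be checked directly in MATLAB, since doubles comfortably hold these magnitudes (the disk size and weight are of course just the illustrative assumptions from above):

```matlab
% Order-of-magnitude check of the storage estimate above.
bytes    = 10^100 * 50 * 8;   % 10^100 rows x 50 doubles x 8 bytes = 4e102 bytes
nDisks   = bytes / 4e12;      % assumed 4 TB per disk           -> 1e90 disks
massKg   = nDisks * 0.1;      % assumed 100 g per disk          -> 1e89 kg
sunRatio = massKg / 1.989e30; % solar masses of hard disks      -> ~5e58
fprintf('%.1e disks, %.1e kg, %.1e Suns\n', nDisks, massKg, sunRatio);
```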
Walter Roberson
Walter Roberson on 4 Aug 2021
Mass of the observable universe is apparently around 10^53 kg.


Peter O
Peter O on 4 Aug 2021

0 votes

With a little programming and a for loop you can handle all 50 lists.
If you want to be able to draw repeated elements from a list (e.g., the new list could include index 10 multiple times), use randi. Otherwise, use randperm.
% Generate some fake lists as a single array, each having 100 entries.
n_lists = 50;
n_entries = 100;
Lists = rand(n_entries, n_lists);
n_pick = 50;
% We know the size of the draw. Preallocate.
NewLists = zeros(n_pick, n_lists);
% Generate the indices to draw at random.
for ix = 1:n_lists
    Indexes = randperm(size(Lists,1), n_pick);
    % or, if you want to allow repeats:
    % Indexes = randi(size(Lists,1), n_pick, 1);
    NewLists(1:n_pick, ix) = Lists(Indexes, ix);
end
% Sort as needed here. Since the lists are stored as a single matrix,
% you can handle them with a single call. sort called without additional
% arguments sorts the individual columns.
NewLists_sorted = sort(NewLists)

3 Comments

Feng Cheng
Feng Cheng on 4 Aug 2021
Edited: Feng Cheng on 4 Aug 2021
Thanks for your response. I'm sorry I did not make this point clear. For each list I select one element out of its 100; the index can repeat across different lists, but the final array should be unique, so there are 100^50 possible final arrays.
This is the problem I need to face. After I get all these new arrays, I need to do some post-processing (say, sorting) to find the array best fitting my target. I have no idea how to address this with a for-loop.
Chunru
Chunru on 4 Aug 2021
Edited: Chunru on 4 Aug 2021
You most likely cannot create the 100^50 array; it is far too big. You need to generate one sample vector at a time (or some manageable number of samples). You will never go through all the possibilities.
Peter O
Peter O on 4 Aug 2021
Yes, @Chunru, I agree. The solution above is for a different problem, which a few of us thought the OP was trying to solve. The OP has updated the question to help us better understand what they are trying to do.


Walter Roberson
Walter Roberson on 4 Aug 2021
Edited: Walter Roberson on 4 Aug 2021

0 votes

Lists = arrayfun(@(idx) randi(99,1,100), 1:50, 'uniform', 0);
one_random_selection = cellfun(@(L) L(randi(100)), Lists)
one_random_selection = 1×50
16 51 65 2 32 74 79 48 82 58 14 42 17 5 8 32 19 33 99 32 64 58 35 96 60 72 5 73 30 27
sort(one_random_selection)
ans = 1×50
2 3 5 5 6 8 12 14 16 17 19 19 21 27 28 30 32 32 32 33 35 37 37 42 43 46 47 48 51 51
If you want to find all of the unique sort()'s, that would be a different matter entirely.

7 Comments

Feng Cheng
Feng Cheng on 4 Aug 2021
Unfortunately, I have to. I just updated my question to make this clear. I know it is a tough question.
Walter Roberson
Walter Roberson on 4 Aug 2021
Lists = mat2cell(randperm(50*100), 1, 100*ones(1,50));
one_random_selection = cellfun(@(L) L(randi(100)), Lists)
one_random_selection = 1×50
3199 1573 4497 3208 4038 486 776 2509 861 3657 3454 3795 1957 1641 2675 1038 4649 4105 714 2991 3460 1900 3872 427 4474 2694 4331 666 1773 1735
sort(one_random_selection)
ans = 1×50
344 427 486 666 706 714 776 861 934 1038 1146 1206 1243 1428 1573 1641 1735 1773 1900 1957 1975 2051 2509 2675 2694 2741 2784 2788 2843 2991
Feng Cheng
Feng Cheng on 4 Aug 2021
Thanks, Walter. If I run your code block in a loop a huge number of times, I guess I might have a chance of approaching the optimal array? Like the method of exhaustion?
Walter Roberson
Walter Roberson on 4 Aug 2021
I can provide you with code that will enumerate every possible team -- I have posted code in the past that can easily be adapted for the purpose.
However..
The Universe is roughly 4.4E+17 seconds old. If you had started your calculation at the time the Universe was born, and there were 4.4E+17 molecules that could be used for computation, and you had managed to complete 4.4E+17 operations every second on each of the molecules, and if there were 4.4E+17 different universes and you were running the same calculation in all of them... and you were to do all of that for 4.4E+17 times as long as the universe has existed...
Then you still would not have finished, by a factor of roughly 1/2 a billion.
So I can provide code, but the Sun is expected to have expanded into a Red Giant and destroyed the Earth before even a tiny fraction of the arrangements could be gone through, even using all the computers in the world together.
Walter Roberson
Walter Roberson on 4 Aug 2021
Sigh. Looks like you still want code that produces all 100^50 possibilities.
I give code at https://www.mathworks.com/matlabcentral/answers/623358-get-a-combination-of-unique-paths-for-given-pair-of-numbers#comment_1082638 that will iterate through all possibilities where in each position the set of values to be considered can be different than in the other positions. The code there does not assume that the elements of the different positions are the same data type; there is some (relatively small) optimization available for the case where they are all the same data type.
The code would have no inherent problem iterating through all 100^50 combinations of something like
StudentNumbers = mat2cell(randperm(5000), 1, 100*ones(50,1));
The code will handle it, given enough time.
... the estimated time requirement being on the order of pi*10^86 years, unimaginably many times the age of the universe.
Feng Cheng
Feng Cheng on 4 Aug 2021
To be honest, your code is useless to me; we're talking about totally different questions. I was just asking for some possible approaches to this kind of problem. I know there may be no answer to this question, but I appreciate everyone's positive discussion. Please make sure that you understand my question before mocking it.
Walter Roberson
Walter Roberson on 4 Aug 2021
You are mistaken. I have provided a link to code that can produce EVERY combination from the arrays, which none of the other people have given you. You run the initialization routine that I provided there, then you run the loop that I provided there. Each time, the next combination is returned; you evaluate the combination according to your scoring system, and if it is the best combination so far, you keep a record of it. You keep looping until the code says there are no more possibilities available; at that point you return the best value.
You wrote: "I can set up 8 nested for-loop to save all combination indexes into multiple files into disk and apply Mapreduce to solve the problem, although it feels stupid."
My code provides a mechanism to have a single for loop that will evaluate every possibility.
You wrote: "But it's hard to extend the workflow for bigger data that is too big to calculation."
My code extends to any dataset that in itself takes less than approximately 1/4 of your available memory. 5000 items (50 classes of 100 students) is not a problem for my code, given enough time.
Regarding "Mapreduce":
Using mapreduce implies that you want to evaluate all possibilities. My code evaluates all possibilities given enough time.
If mapreduce is even conceptually in your mental toolbox to solve this problem, then my code should be a serious contender.
For problems like this, there are two possibilities:
  1. That the results from one combination cannot be used to make predictions about the results from other combinations, so all of the combinations need to be tried; in such a case, my code can handle problems too big for mapreduce given enough time; or
  2. That the results from one combination can be used to make predictions about the results from other combinations; in such a case, mapreduce is not typically a useful strategy, but techniques such as Dynamic Programming could potentially be very effective
You also have to ask whether you need the best combination, or just a "reasonably good" combination. For example, it would be entirely reasonable to ask "If I put a limit of 10^9 combinations evaluated (to keep the computation time limited), then what techniques should I use to get the best combination I am likely to get?" . And typically mapreduce() is not one of the techniques used to get the best combination within a budget.
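The enumerate-and-keep-best pattern Walter describes can be sketched without his linked code by using a mixed-radix counter over the list indexes. This is only an illustration on a deliberately tiny problem (3 lists of 4 elements, so it actually finishes), and `scoreFcn` is a placeholder for the real scoring function:

```matlab
% Sketch: iterate over every combination (one index per list), score each,
% and keep a record of the best one seen so far.
lists    = {rand(1,4), rand(1,4), rand(1,4)};
scoreFcn = @(vals) sum(vals);      % placeholder scoring function
n     = numel(lists);
sizes = cellfun(@numel, lists);
idx   = ones(1, n);                % current combination of indexes
bestScore = -inf;  bestIdx = idx;
done = false;
while ~done
    vals = arrayfun(@(k) lists{k}(idx(k)), 1:n);
    s = scoreFcn(vals);
    if s > bestScore
        bestScore = s;  bestIdx = idx;   % keep the best combination so far
    end
    % Advance the mixed-radix counter (first index moves fastest).
    k = 1;
    while k <= n
        idx(k) = idx(k) + 1;
        if idx(k) <= sizes(k), break; end
        idx(k) = 1;  k = k + 1;
    end
    done = k > n;                  % all counters wrapped: enumeration complete
end
% bestIdx holds, for each list, the index of the element in the
% best-scoring combination (here: simply each list's maximum).
```

For 50 lists of 100 elements the same loop is correct but, as discussed above, would run for an astronomically long time; its only practical use is with a much smaller problem or with an iteration budget and a randomized index source.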

