- Is the first time really 300, then 2, or is that a typo or are there multiple years in the dataset and the times are day of year?
- How do you get 2 for 1.? Looks to me like was 300--there's something reported for first time in the dataset (BAC1).
excel table data analysis
1 回表示 (過去 30 日間)
古いコメントを表示
Hi guys
My data is (are) in an excel table 20000 by 3. I have a list of 358 patients. There are 48 different bacteria that I am looking for. each patent is tested to see if a bacteria is found in their blood. The day each test is conducted is recorded.
1) when was the first day the patient tested positive (regradless of bacteria type) (ex:first time patient A was positive is day 2)
2) List of all the bacterias that have been found for each individual patient.( ex: Patient A shows bactesias:BAC1 BAC2 BAC3)
3) When was the first time each patient tested possitive for each type of bacteria.(ex: First time BAC1 was found is day 4)
4) In what days was each patient exsibiting each bacteria (ex: BAC 1 was forund for patient A in days 300, 4)
The table is in the following format:
patient day they were tested bacteria found
A 300 BAC1
A 2 BAC2
A 4 BAC1
A 8 BAC3
B 66 BAC5
B 55 BAC1
C 208 BAC2
C 77 BAC2
C 51 BAC9
D 90 NAN
I have been struggling with this for a while. I would appriciate any input.
Please let me know if clarifications are needed. Thanks so much for the help in advance!
2 件のコメント
dpb
2019 年 5 月 10 日
With a little clarifcation on what the dataset parameters are, doesn't look terribly complex. Convert the Patient and Bacteria to categorical variables, put in a table and use findgroups to identify patients...the when was the first? questions are simple lookups but again what the time range and definition is may muddle the logic if there's wraparound as looks like might be will need to fix that issue first.
The "how many days?" questions are also simple lookup; with categorical data matches are exact so ismember or similar will answer those trivially.
採用された回答
dpb
2019 年 5 月 10 日
編集済み: dpb
2019 年 5 月 10 日
It's not trivial to learn the tricks, granted...but here's a start for the first...with it and some study of the examples, should get an idea -- it's late; I've got to turn in at this point, though..sorry :)
tBAC=readtable('bacteria.dat','ReadVariableNames',1); % read the data in
tBAC.patient=categorical(tBAC.patient); % fix data types
tBAC.found=categorical(tBAC.found);
[ig,Patient]=findgroups(tBAC.patient); % group index, group names
FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig); % find first infection any kind
FirstInfection(cellfun(@(x) length(x)==0,FirstInfection))={inf}; % clean up missing (no infection)
[table(Patient) cell2table(FirstInfection)] % display result in a table form
results in for the first problem statement...I chose "inf' as the indicator of no infection instead of NaN -- the decimals show up because I'd been doing financial work and had format bank in effect.
ans =
4×2 table
Patient FirstInfection
_______ ______________
A 2.00
B 55.00
C 51.00
D Inf
>>
I slightly modified your data to a file with the header line "patient, testday, found" and made a csv-file of your plain text...I presume you have a file format of your own.
Modifications to the functional for the remainder should suffice I think...altho 3 and 4 need grouping by both patient and bacteria ID.
ADDENDUM:
Added the identifier for which bacteria ID was the first. Having the standalone function means can clean up the return data there instead of afterwards--so there is some payback for the extra code. :)
NB: I got the return arguments from min in wrong order last night; the index is the optional second, not the first. This produces the amplified table:
tBAC=readtable('bacteria.dat','ReadVariableNames',1); % read the data in
tBAC.patient=categorical(tBAC.patient); % fix data types
tBAC.found=categorical(tBAC.found);
% 1. First occurrence of any in each patient
[ig,Patient]=findgroups(tBAC.patient); % group index, group names
%FirstInfection=splitapply(@(t,r) {min(t(r~='NAN'))},tBAC.testday,tBAC.found,ig); % find first infection any kind
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig);
table(Patient,FirstInfection,Infection)
% 2. All occurrences in each patient
[AllInfections]=splitapply(@allinfections,tBAC.found,ig);
[table(Patient) cell2table(AllInfections)]
% 3. First occurrence of each bacterium in each patient
[ig,Patient,Bacterium]=findgroups(tBAC.patient,tBAC.found);
[FirstInfection,Infection]=splitapply(@firstinfected,tBAC.testday,tBAC.found,ig2);
table(Patient,Bacterium,FirstInfection,Infection)
% 4. Ooccurrences of each bacterium in each patient
% EXERCISE FOR STUDENT :)
function [tFirst,bFirst]=firstinfected(t,r)
% return first time, infection
[tFirst,iFirst]=min(t(r~='NAN'));
if isempty(tFirst)
tFirst=nan;
bFirst='NAN';
else
bFirst=r(iFirst);
end
end
function [b]=allinfections(r)
% return all infections for each
b={unique(r(r~='NAN')).'};
if isempty(b)
b='NAN';
end
end
>> table(Patient,FirstInfection,Infection)
ans =
4×3 table
Patient FirstInfection Infection
_______ ______________ _________
A 2.00 BAC2
B 55.00 BAC1
C 51.00 BAC9
D NaN NAN
ans =
4×2 table
Patient AllInfections
_______ _________________
A [1×3 categorical]
B [1×2 categorical]
C [1×2 categorical]
D [1×0 categorical]
ans =
8×4 table
Patient Bacterium FirstInfection Infection
_______ _________ ______________ _________
A BAC1 4.00 BAC1
A BAC2 2.00 BAC2
A BAC3 8.00 BAC3
B BAC1 55.00 BAC1
B BAC5 66.00 BAC5
C BAC2 77.00 BAC2
C BAC9 51.00 BAC9
D NAN NaN NAN
>>
Unfortunately, the builtin table display function won't show the actual categorical variable values for each patient since they're not the same length of each array--and a table has to be regular in number of variable sfor each row/observation so can't create multiple variables without a lot of ugly NAN values scattered around.
The really cute part is the firstinfection function works for any chosen grouping so that don't have to do anything except use the other grouping variables. You could choose to not populate the tble with the second return or not use the second ID variable since they are the same...
Now, your mission, should you choose to accept it, is last item, #4... :)
11 件のコメント
dpb
2019 年 5 月 14 日
OK. that makes sense -- since you are actually counting days independent of the calendar, simply arranging by those numeric days is all that is needed.
I've got to go do some other stuff at the moment, but I'll mull over the idea...the general idea I'm thinking of is that one of the statistics grpstats returns in number elements--if there is a grouping variable for each category, then the answer of logical True is equivalent to number of elements in the group.
However, it seems to me you need a grouping variable for each of the combinations; at least otomh I don't see an easy way to combine results other than just the one case of 1-2 is 1 + 2 but none of the rest are composited uniquely from others so have to group each separately.
その他の回答 (0 件)
参考
カテゴリ
Help Center および File Exchange で Data Preprocessing についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!