Compile data from excel sheets into a dataset with multiple tags

11 ビュー (過去 30 日間)
Charlie Chen
Charlie Chen 2022 年 8 月 17 日
コメント済み: Peter Perkins 2022 年 8 月 24 日
***** The title is edited to better describe the problem*****
Hello everyone,
The purpose of my data analysis is to compile all the data from excel sheets and plot them.
My approach is to read data from all the sheets in the format of table, and dump them in a cell array as a "container", and then retrive them to another table array while keeping their variable names. I want to preallocate a space to my matrix for a higher efficiency. Since each data matrix has different number of rows, I need to read the maximum row number in the dataset and create a table of that height to hold all the data.
Another special thing is that in each container I have 2 columns of data, and I need to transfer them to the new table column by column, while doing a little modification of the the Variable Names.
The attached code is the best I can do. I dumped them in a "cell" container as tables, and then transferred the data. However the tables are dumped in a matrix instead of a table, hence all the headers are lost. Is there a way I can keep them during the transfer, or can I add a header to a matrix somehow and convert them into a table?
Below is the code and attached is the original file.
% Use pop up box to compile data.
[filename, filepath ]= uigetfile(".xlsx","MultiSelect","on"); % note filename is a cell struture, you cannot use dot reference for this structure.
cd (filepath);
% index the sheets
sheetnumber = numel(sheetnames(filename)) % count the number of sheets in workbook,
% assign enough columns to hold the data. (2 columns per sheet)
% Loop all the sheets and read data in a cell array
for i = 1:sheetnumber
% set import options: Fill level start point
opts_startpoint = spreadsheetImportOptions;
opts_startpoint.Sheet = i;
opts_startpoint.VariableNames = 'Fill level (start';
opts_startpoint.DataRange = 'B2';
opts_startpoint.VariableTypes = 'single';
% preview(filename,opts_startpoint) % preview to QC the output
% read data into variable: startpoint
startpoint = readtable(filename,opts_startpoint,"ReadVariableNames",false); % how to surpress the "header warning"?
% set import options: PPR
opts_PPR = spreadsheetImportOptions;
opts_PPR.Sheet = i;
opts_PPR.VariableNames = 'PPR';
opts_PPR.DataRange = 'F2';
opts_PPR.VariableTypes = 'single';
% preview(filename,opts_PPR) % preview to QC the output
% read data into variable: PPR
PPR = readtable(filename,opts_PPR,"ReadRowNames",false);
% concatenate the tables into 1 block
table = cat(2,startpoint,PPR);
datapool{i} = table;
% to retrieve the data in the cell struture, you need to use {} brackets.
% The result is the content of the indexed cell.
end
% determine maximum row numbers
n1 = max(cellfun(@numel, datapool)) % n1 is total elements in the table, which is twices as the needed number.
n1 = n1/2 % actual number of rows.
% reintialize tData and refill it with valid data
%% here is where I have the issues.
tData = nan(n1,sheetnumber*2) % create the matrix with the right size
for i = 1:sheetnumber
j = i*2-1;
t1 = table2array(datapool{i}(:,1));
T1 = [t1;nan(n1-height(t1),1)]; % fill emptycells with NaNs in the end,
% so all columns has the same number of rows.
tData (:,j) =T1; % put data in the target location.
t2 = table2array(datapool{i}(:,2));
T2 = [t2;nan(n1-height(t2),1)]; % fill emptycells with NaNs in the end,
% so all columns has the same number of rows.
tData (:,j+1) =T2; % put data in the target location.
end
%set color order so lines become light to dark
% Red newcolor = {'#dfcccf','#d5bbbf','#caabb0','#bf9ba1','#b48b92','#a97b83','#9f6b75','#935c67','#884d5a','#7d3e4c','#722e3f','#661e33'}
% Green newcolor = {'#cdddcd','#bcd1bd','#acc6ad','#9cbb9d','#8cb08e','#7ba57e',% '#6b9a6f','#5b8f61','#4b8452',% '#3a7944','#276f36','#0e6428'}
% Blue newcolor = {'#ced2dd','#bdc3d1','#adb5c6','#9ea7bb','#8e99b0','#7e8ba5','#6f7e9a','#607190','#506485','#41577b','#304b70','#304b70'}
%newcolor = {'#ff1111','#6db130','#005dae','#ff4444','#8ed053','#0389ff','#ff7777','#b0de87','#57b1ff','#ffaaaa','#d2ecba','#abd8ff'};
%colororder(newcolor)
%plot the dataset
hold on
for i = 1:sheetnumber
j = 2*i-1
plot(tData(:,j),tData(:,j+1),'LineWidth',2) % x = Start fill level, y = PPR
end
PS: During the coding, I realize that I am struggling with the data types in Matlab. Often I wanted to use a function or to refer to the "data", I ran into some syntax issues. Is there some resource that builds a solid understanding to the data structures?
Thanks a lot,
Charlie
  2 件のコメント
Charlie Chen
Charlie Chen 2022 年 8 月 17 日
Thank you dpb for the response. Your insight is much appreciated.
The idea of your example is very clean and efficient. I definetly will incorprate it into my code.
I think all the hussle I had with the code is that all the manipulation can be easily done with matrices, but I HAD to do it with tables because it can retain some context information in variable names.
As background information, the test is to see the consistence of a particular instrument in different conditions. On the sheet name, there are 4 pieces of information(meta data?) : Test Date "0809", Test order "-1", Powder type "SM", Valve type "FL". On each sheet, 2 data are needed, Fill level (input) and PPR (output).
After thinking about your questions, I think a better way to do this is to design a data structure to present all the data together. Something looking like:
Fill level | PPR | Valve type | Powder type | Date | Order |
Double| Double | Character | Character | Date | Double |
When I plot the data, I would group all the data with the same valve type, and sort it by the date and order.
Do you think it's feasible? If yes, what data structure would you suggest to use?
Best regards,
Charlie

サインインしてコメントする。

採用された回答

dpb
dpb 2022 年 8 月 17 日
編集済み: dpb 2022 年 8 月 17 日
Sure it's possible. Just parse the sheet names when reading them and do similarly as I did excepting with Date, Valve, Powder variables. I figured that was likely to be the case with the format of the sheet names. It's ok for spreadsheets, but definitely not the way to build a dataset for analysis.
Definitely put the metadata data in the table as variables even though the same values are repeated; then they're available for grouping variables for the rowfun kind of analysis by whatever (combination of) variables are needed from one-at-a-time to in aggregate total depending upon what you need. Or they're available as factors or levels for ANOVA or such if it comes to testing some hypothesis of equality of means or the like.
The results from addressing a table variable with the "dot" notation are arrays so matrix operations still come naturally; there's no need to do any other machinations to build corollary array variables to use the table -- there are, almost certainly, already all the builtin tools to handle table data that you'll need for anything you have in mind or realize is a new idea when get into it.
ADDENDUM
" it can retain some context information in variable names..."
I didn't see it here, but same thing goes for variable names as for the sheet names -- if there's a variable encoded in a variable name by "X1, X2" where the "1,2" stands for a different test level or controlled variable or whatever, then turn it into just "X" and and the other data as a variable (maybe categorical).

その他の回答 (1 件)

Peter Perkins
Peter Perkins 2022 年 8 月 23 日
Charlie, I'm confused.
I haven't followed all of what you code does, but it seems that you are lining up tables of different heights horizontally, and then NaN padding to make up the different heights. Unless you have a good reason, that seems funny. I would recommend that you stack them up vertically with an indicator variable showing which sheet they came from, like this:
fname = "FS SM+WP Dataset.xlsx";
nsheets = numel(sheetnames(fname));
for i = 1:nsheets
opts = detectImportOptions(fname,"Sheet",i,"Range","B:F","VariableNamingRule","preserve","TextType","string");
t_i = readtable(fname,opts);
t_i.Sheet = repmat(i,height(t_i),1);
c{i} = t_i;
end
t = vertcat(c{:})
which gives me
t =
949×6 table
Fill level (start Fill level (finish) Note Powder dosed(g) PPR Sheet
_________________ ___________________ _________________________________________________________ _______________ ______ _____
3042.1 3011.1 <missing> 31 3.875 1
3011.1 2982.7 <missing> 28.4 3.55 1
2982.7 2954.9 <missing> 27.8 3.475 1
2954.9 2922 <missing> 32.9 4.1125 1
2922 2887.3 <missing> 34.7 4.3375 1
2887.3 2848 <missing> 39.3 4.9125 1
2848 2803.2 <missing> 44.8 5.6 1
2803.2 2757.6 <missing> 45.6 5.7 1
: : : : : :
3944.1 3813.4 <missing> 130.7 3.2675 7
3813.4 3681.8 "Motor stopped working after that, due to a cable issue." 131.6 3.29 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN 40 <missing> NaN NaN 7
Display all 949 rows.
From there, you can do grouped operations on each sheet, including plotting:
hold on
plotEm = @(startFillLevel,PPR) plot(startFillLevel,PPR,'LineWidth',2);
rowfun(plotEm,t,"InputVariables",["Fill level (start" "PPR"],"GroupingVariable","Sheet");
hold off
I may not have captured all of what you are doing, but this comes close with much less code.
  3 件のコメント
Peter Perkins
Peter Perkins 2022 年 8 月 24 日
Glad this was helpful. This example talks more about grouped calculations in tables and timetables. It's not exactly the same as what you have, but it might be helpful.

サインインしてコメントする。

カテゴリ

Find more on Resizing and Reshaping Matrices in Help Center and File Exchange

タグ

製品


リリース

R2022a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by