When applying the Simple Filter Approach (t-test) for feature selection, if all features have p-values of 0, does it mean that all features have strong discrimination power?

Question

Hussein, 20 April 2024


https://jp.mathworks.com/matlabcentral/answers/2109536-when-applying-the-simple-filter-approach-t-test-for-feature-selection-if-all-features-have-p-valu

Commented: Hussein, 9 May 2024

Hello all,

I have a frequency response function (FRF) dataset related to pipeline SHM, stored in a 6500x4000 matrix (6500 samples (signals) with 4000 features each). The dataset corresponds to 11 groups or class labels (pipeline conditions): 1500 samples are labeled 'Fault-free', and 500 samples each are labeled 'BL_C1', 'BL_C2', 'BL_C3', 'BL_C4', 'SD_C1', 'SD_C2', 'SD_C3', 'SC_C1', 'SC_C2', and 'SC_C3'.

I used this code for feature selection with the Simple Filter Approach (t-test):

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

% applying t.test for feature selection
% Define the class labels and sample counts
class_labels = {'Fault-free', 'BL_C1', 'BL_C2', 'BL_C3', 'BL_C4', 'SD_C1', 'SD_C2', 'SD_C3', 'SC_C1', 'SC_C2', 'SC_C3'};
sample_counts = [1500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500];
%sample_counts = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]; % using the average signal for each scenario
% Construct the 'groups' variable vector based on the class labels and sample counts
num_samples = sum(sample_counts);
groups = zeros(num_samples, 1);
start_idx = 1;
for i = 1:length(class_labels)
    end_idx = start_idx + sample_counts(i) - 1;
    groups(start_idx:end_idx) = i;
    start_idx = end_idx + 1;
end
% Applying the Simple Filter Approach (t-test)
t_scores = zeros(1, size(data, 2));
p_values = zeros(1, size(data, 2));
alpha = 0.05;
for feature = 1:size(data, 2)
    [h, p, ci, stats] = ttest2(data(:, feature), groups, 'Vartype', 'unequal');
    t_scores(feature) = stats.tstat;
    p_values(feature) = p;
end
% Select features based on p-values below the significance level
selected_features = find(p_values < alpha);
ecdf(p);
xlabel('P value');
ylabel('CDF value')

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The code returned a p-value of 0 for all features. I also tried using the average signal for each scenario, which reduces the dataset from 6500x4000 to 11x4000 (11 signals representing the 11 conditions, again with 4000 features each), but the p-values were still 0.

Is this acceptable?

Does it mean that all features have strong discrimination power? I doubt it, to be honest!

Can anyone clear up the doubt, correct the code if I'm wrong somewhere, or help me with better code for a technique that works well with my dataset?

Thank you very much in advance!


Answer 1

Ayush Aniket, 7 May 2024


https://jp.mathworks.com/matlabcentral/answers/2109536-when-applying-the-simple-filter-approach-t-test-for-feature-selection-if-all-features-have-p-valu#answer_1453676


Hi Hussein,

The t-test is traditionally used to compare the means between two groups. Your dataset involves 11 groups, which suggests that a one-way ANOVA (Analysis of Variance) might be more appropriate for comparing means across multiple groups.

Additionally, the 'ttest2' function is designed for comparing the means of two independent samples. In your code, you're comparing 'data(:, feature)' against 'groups', which is conceptually incorrect because 'groups' is not a dataset but a vector of class labels. For feature selection in a multi-class scenario, you would typically compare features across pairs of groups or use techniques designed for multi-class discrimination.
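To see why that call gives a p-value of 0 no matter what the feature contains, here is a minimal sketch on synthetic data. The random feature 'x' is an assumption for illustration only (it is not the FRF dataset); the class sizes are taken from the question, and 'ttest2' requires the Statistics and Machine Learning Toolbox.

```matlab
% Sketch on synthetic data (not the FRF dataset).
rng(0);                                        % reproducible noise
sample_counts = [1500, 500*ones(1, 10)];       % class sizes from the question
groups = repelem((1:11)', sample_counts(:));   % label vector, 6500x1
x = randn(numel(groups), 1);                   % one feature of pure noise

% What the question's code does: feature values vs. the label integers 1..11
[~, p_wrong] = ttest2(x, groups, 'Vartype', 'unequal');

% What a two-sample test should compare: the same feature in two classes
[~, p_pair] = ttest2(x(groups == 1), x(groups == 2), 'Vartype', 'unequal');

fprintf('feature vs. labels:  p = %g\n', p_wrong);
fprintf('class 1 vs. class 2: p = %g\n', p_pair);
```

Here 'p_wrong' is vanishingly small even though 'x' carries no class information, because the test only detects that the mean of 'x' (about 0) differs from the mean of the label codes (about 5.2). It says nothing about how well the feature separates the classes.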

The correct approach for comparing two of the 'groups' is as follows:

% Pick the two classes to compare (here, classes 1 and 2 of the 11)
group1_idx = groups == 1; % Indices for class 1
group2_idx = groups == 2; % Indices for class 2
% Preallocate arrays for t-scores and p-values
t_scores = zeros(1, size(data, 2));
p_values = zeros(1, size(data, 2));
% Loop through each feature to perform t-test
for feature = 1:size(data, 2)
    [h, p, ci, stats] = ttest2(data(group1_idx, feature), data(group2_idx, feature), 'Vartype', 'unequal');
    t_scores(feature) = stats.tstat;
    p_values(feature) = p;
end
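If you want to stay with t-tests across all 11 classes, one possible extension is to repeat the comparison for every pair of classes and keep, for each feature, the largest absolute t-statistic over all pairs. This is only a sketch under assumptions: the synthetic 'data', the 4-class setup, and the "max over pairs" scoring rule are illustrative choices, not something prescribed by MATLAB.

```matlab
% Sketch: pairwise Welch t-tests over all class pairs (synthetic data).
rng(2);                                   % reproducible example
num_classes = 4;                          % hypothetical; the question has 11
groups = repelem((1:num_classes)', 30);   % 30 samples per class
data = randn(numel(groups), 6);           % 6 noise features
data(groups == 4, 3) = data(groups == 4, 3) + 3;   % feature 3 separates class 4

pairs = nchoosek(1:num_classes, 2);       % every unordered class pair
max_abs_t = zeros(1, size(data, 2));
for k = 1:size(pairs, 1)
    a = groups == pairs(k, 1);
    b = groups == pairs(k, 2);
    % Given matrices, ttest2 tests each column (feature) separately
    [~, ~, ~, stats] = ttest2(data(a, :), data(b, :), 'Vartype', 'unequal');
    max_abs_t = max(max_abs_t, abs(stats.tstat));
end

% Rank features by their best pairwise separation
[~, ranked] = sort(max_abs_t, 'descend');
disp(ranked);
```

Passing whole matrices to 'ttest2' avoids the inner loop over features, which matters with 4000 columns and 55 class pairs.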

You may refer to the MATLAB documentation for the arguments of the 'ttest2' function and for one-way ANOVA (the 'anova1' function), which should be more suitable for your analysis.
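As a rough sketch of the one-way ANOVA route, each feature can be tested against the class labels with 'anova1' and the features ranked by their F statistic. The synthetic 'data', the name 'f_scores', and the choice to rank by F rather than threshold the p-values are assumptions for illustration; with the real dataset you would use your own 'data' and 'groups'.

```matlab
% Sketch: per-feature one-way ANOVA as a filter (synthetic data).
% Requires the Statistics and Machine Learning Toolbox.
rng(1);                               % reproducible example
groups = repelem((1:3)', 40);         % 3 classes, 40 samples each
data = randn(numel(groups), 5);       % 5 noise features
data(:, 2) = data(:, 2) + groups;     % only feature 2 depends on the class

num_features = size(data, 2);
p_values = zeros(1, num_features);
f_scores = zeros(1, num_features);
for feature = 1:num_features
    % 'off' suppresses the ANOVA table and box plot figures
    [p, tbl] = anova1(data(:, feature), groups, 'off');
    p_values(feature) = p;
    f_scores(feature) = tbl{2, 5};    % F statistic for the 'Groups' row
end

% Rank features from most to least class-dependent
[~, ranked] = sort(f_scores, 'descend');
disp(ranked);
```

With thousands of samples per test, many p-values can still be extremely small, so ranking by F (or keeping only the top few features) is often more informative than a fixed 0.05 cutoff.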

Hope it helps.


Hussein, 9 May 2024

@Ayush Aniket

It really helps. Thank you very much for your clarification and for introducing the ANOVA technique. Greatly appreciated.

