Crosstab by using the same input for both arguments

Question

Sim, 15 August 2024

https://jp.mathworks.com/matlabcentral/answers/2145434-crosstab-by-using-the-same-input-for-both-arguments

Edited: Sim, 25 August 2024

If I run one of the examples of crosstab, I get the same result as indicated in the crosstab webpage:

rng default;  % for reproducibility
x1 = unidrnd(3,50,1);
x2 = unidrnd(3,50,1);
[table,chi2,p] = crosstab(x1,x2)
table = 3x3
     1     6     7
     5     5     2
    11     7     6
chi2 = 7.5449
p = 0.1097

However, if I use the same input for both arguments of crosstab, I get a p-value basically equal to zero:

rng default;  % for reproducibility
x1 = unidrnd(3,50,1);
x2 = unidrnd(3,50,1);
[table,chi2,p] = crosstab(x1,x1)
table = 3x3
    14     0     0
     0    12     0
     0     0    24
chi2 = 100
p = 9.8366e-21

Shouldn't I get a higher p-value if I use the same input for both arguments of crosstab? (I was actually expecting a p-value close to 1.)

4 comments

Rahul, 16 August 2024

Hi @Sim,

The crosstab function creates a contingency table and performs a chi-square test of independence. The chi-square test evaluates whether there is a significant association between the two categorical variables.

The p value is a number, calculated from a statistical test, that describes how likely you are to have found a particular set of observations if the null hypothesis were true. The null hypothesis is that there is no relationship between your variables of interest or that there is no difference among groups.

The p value gets smaller as the test statistic calculated from your data gets further away from the range of test statistics predicted by the null hypothesis, or if there is increasing association between the variables.

The p-value is calculated under the null hypothesis that the two variables are independent. A perfect association (as in our case of identical categorical variables) strongly contradicts this null hypothesis and should therefore result in a very small p-value, close to zero.
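To make the "perfect association" point concrete, here is a minimal sketch that recomputes the Pearson statistic by hand from the table reported for crosstab(x1,x1) in the question. The variable names are illustrative and only base MATLAB functions are used.

```matlab
% Contingency table reported for crosstab(x1,x1) in the question
observed = diag([14 12 24]);

N        = sum(observed(:));                         % 50 observations in total
expected = sum(observed, 2) * sum(observed, 1) / N;  % counts expected under independence
chi2     = sum((observed(:) - expected(:)).^2 ./ expected(:));
df       = (size(observed, 1) - 1) * (size(observed, 2) - 1);

fprintf('chi2 = %g, df = %d\n', chi2, df)   % the question reports chi2 = 100
```

Every off-diagonal cell is expected to be non-zero under independence but is observed as 0, which is what drives the statistic so far from its null distribution.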

Rahul, 16 August 2024

Edited: Rahul, 16 August 2024

Hey,

By 'association of the two categorical variables' (or column vector data), I meant that one variable's values affect the other variable's values, which holds true in our case, since one variable's value is equal to the other's at every row. This means that our initial null hypothesis, which states that the groups are unrelated, doesn't seem to hold. This is evident from the high value of the resulting 'X^2' statistic:

chi2 = 100

which exceeds the standard critical value of 9.488 for df (degrees of freedom) = (3 - 1)*(3 - 1) = 4 and alpha = 0.05 (say). (You can look up this critical value in a chi-squared right-tail probability table.)

This confirms that our data don't support the null hypothesis and hence the p-value should be less than alpha (= 0.05), or close to 0.
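Instead of a printed table, the critical value and the p-value can be checked directly. This is a small sketch, assuming the Statistics and Machine Learning Toolbox is available (chi2inv and chi2cdf); the variable names are illustrative.

```matlab
% Requires Statistics and Machine Learning Toolbox (chi2inv, chi2cdf)
alpha = 0.05;
df    = (3 - 1) * (3 - 1);            % 4 degrees of freedom for a 3x3 table
crit  = chi2inv(1 - alpha, df);       % critical value, about 9.488

chi2  = 100;                          % statistic returned by crosstab(x1,x1)
p     = chi2cdf(chi2, df, 'upper');   % upper-tail probability; the question reports 9.8366e-21

fprintf('critical value = %.3f, p = %.4g\n', crit, p)
```

The 'upper' option asks for the tail probability directly, which avoids the rounding that 1 - chi2cdf(chi2, df) can suffer when the result is this close to zero.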

Sim, 16 August 2024

Edited: Sim, 16 August 2024

Thanks for your comment :-)

Let me summarise to see if I understood correctly, even though I still have many doubts, which I explain below... :-)

"The null hypothesis is that there is no relationship between your variables of interest or that there is no difference among groups.". In other words, we can say that "null hypothesis = no relationship / no difference / no association". And therefore that the "alternative hypothesis = yes relationship / yes difference / yes association".
In my case, where I perform the "crosstab(x1,x1)", we have that "one variable's value is equal to other at every row. This means that our initial null hypothesis, which states that the groups are unrelated, doesn't seem to hold."
Therefore, if the "initial null hypothesis doesn't seem to hold" when using "crosstab(x1,x1)", the alternative hypothesis holds. This means that "x1" has a relationship / a difference / an association with/to "x1", i.e. with/to itself. But how is it possible that "x1" is different from "x1"?
Also, in other statistical tests, as far as I know, a p-value less than alpha (= 0.05) is typically considered to be statistically significant, in which case the null hypothesis should be rejected, in favour of the alternative hypothesis. In other words, a p-value less than alpha (= 0.05), or close to 0, means that "x1 is different from x1", which is a contradiction. Right?
In the "opposite" case, where we have a p-value greater than alpha (= 0.05), the deviation from the null hypothesis is not statistically significant, and the null hypothesis is not rejected. Therefore, we would have a likelihood that "x1 is similar to x1". The extreme case is when the "p-value = 1", leading to a very high likelihood of "x1 = x1". Right? This is supported by other tests, like the two-sample KS test:

x1 = unidrnd(3,50,1);
[h,p,ks2stat] = kstest2(x1,x1)
h = logical
   0
p = 1
ks2stat = 0

So, if what I wrote is correct, it looks like "crosstab(x1,x1)" gives a "p-value = 0", while "kstest2(x1,x1)" gives a "p-value = 1". Isn't it a contradiction if the null hypothesis and the meaning of the p-value are the same for all statistical tests?

Answer 1

dpb, 17 August 2024

https://jp.mathworks.com/matlabcentral/answers/2145434-crosstab-by-using-the-same-input-for-both-arguments#answer_1500109

Edited: dpb, 17 August 2024

"@Sim - if what I wrote is correct..."

Your statements above were written under the null hypothesis that "both KSTEST2() and CROSSTAB() test that two datasets are independent samples". However, that hypothesis is incorrect; one (crosstab) tests for independence whereas the other (kstest2) tests for being samples from the same distribution.

kstest2 - [tests] that the data in vectors x1 and x2 are from the same continuous distribution...

crosstab - crosstab tests that tbl is independent in each dimension.

Ergo, the two tests address different null hypotheses. p-values do have the same interpretation, but only against the specific null hypothesis under which the given test statistic is derived.
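A small experiment may help separate the two questions. The sketch below is only an illustration (the shuffled copy x2 is my own construction, not part of the original question) and assumes the Statistics and Machine Learning Toolbox. x2 holds exactly the same values as x1 in a random order, so the two samples have the same distribution, but the row-by-row pairing that crosstab looks at has been broken.

```matlab
% Requires Statistics and Machine Learning Toolbox (unidrnd, kstest2, crosstab)
rng default
x1 = unidrnd(3, 50, 1);
x2 = x1(randperm(numel(x1)));   % same values as x1, shuffled order

% kstest2 compares the two empirical distributions; they are identical here,
% so the KS statistic is 0 and "same distribution" is not rejected
[h, pKS, ksStat] = kstest2(x1, x2);

% crosstab pairs the values row by row and tests independence of that pairing
[~, ~, pSame]     = crosstab(x1, x1);   % identical pairing: perfectly dependent
[~, ~, pShuffled] = crosstab(x1, x2);   % shuffled pairing: value depends on the shuffle

fprintf('kstest2: stat = %g, p = %g\n', ksStat, pKS)
fprintf('crosstab: p(x1,x1) = %g, p(x1,shuffled) = %g\n', pSame, pShuffled)
```

kstest2 gives the same answer for x1 versus x1 and x1 versus x2, because it ignores the pairing. crosstab gives a tiny p-value for (x1,x1), while for the shuffled pairing there is no built-in reason for the p-value to be small.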

1 comment

Sim, 17 August 2024

Ahhhhh, so the null hypothesis is opposite for those two tests!! I was super confused and I did not understand what @Rahul said because of my false premise....! Thanks a lot to @dpb and to @Rahul!!

So... with crosstab, a p-value = 0 means the two distributions are likely identical (or better to say: we reject the null hypothesis that the two distributions are different, in favour of the alternative hypothesis that the two distributions are identical to each other), while a p-value = 1 means the two distributions are likely different (or better to say: we fail to reject the null hypothesis that the two distributions are different)...... right?

Answer 2

Sim, 25 August 2024

https://jp.mathworks.com/matlabcentral/answers/2145434-crosstab-by-using-the-same-input-for-both-arguments#answer_1504654

Edited: Sim, 25 August 2024

I feel the right way should be to have the equivalent of the following R function (and maybe someone @MathWorks Support Team could confirm it):

chisq.test(cbind(x, y))

where "x" and "y" are two observed frequency datasets, i.e. the binned datasets I want to compare one against the other.

Let's see a couple of examples in R.

(1) We can see that if x=y, in R we get a test statistic equal to 0 and a p-value equal to 1, as expected:

> x = c(1000,100,10,1)
> y = c(1000,100,10,1)
> chisq.test(cbind(x, y))
	Pearson's Chi-squared test
data:  cbind(x, y)
X-squared = 0, df = 3, p-value = 1

(2) In another example performed with R, we get the following result:

> x <- c(287,202,127,78,65,44,37,27,22,20,10,18,11,6,8,4,6,3,5,1)
> y <- c(348,171,124,64,51,49,30,33,19,13,11,12,11,8,8,6,7,4,1,2)
> chisq.test(cbind(x,y))
	Pearson's Chi-squared test
data:  cbind(x, y)
X-squared = 19.959, df = 19, p-value = 0.3971

The equivalent of chisq.test(cbind(x, y)) in Matlab.

Now, let's try the following Matlab code (not mine!) for both example (1) and example (2), and we will see that we get the same results as the "chisq.test(cbind(x, y))" R function.

% Example (1), where x=y
x = [1000,100,10,1];
y = [1000,100,10,1];
observed = [x(:) y(:)];
row_totals = sum(observed, 2);
col_totals = sum(observed, 1);
grand_total = sum(row_totals);
expected = (row_totals * col_totals) / grand_total;
chi2 = sum((observed(:) - expected(:)).^2 ./ expected(:));
df = (size(observed, 1) - 1) * (size(observed, 2) - 1);
p = 1 - chi2cdf(chi2, df);
disp(['Chi-squared statistic: ', num2str(chi2)])
Chi-squared statistic: 0
disp(['Degrees of freedom: ', num2str(df)])
Degrees of freedom: 3
disp(['p-value: ', num2str(p)])
p-value: 1
% Example (2), a generic one
x = [287,202,127,78,65,44,37,27,22,20,10,18,11,6,8,4,6,3,5,1];
y = [348,171,124,64,51,49,30,33,19,13,11,12,11,8,8,6,7,4,1,2];
observed = [x(:) y(:)];
row_totals = sum(observed, 2);
col_totals = sum(observed, 1);
grand_total = sum(row_totals);
expected = (row_totals * col_totals) / grand_total;
chi2 = sum((observed(:) - expected(:)).^2 ./ expected(:));
df = (size(observed, 1) - 1) * (size(observed, 2) - 1);
p = 1 - chi2cdf(chi2, df);
disp(['Chi-squared statistic: ', num2str(chi2)])
Chi-squared statistic: 19.9586
disp(['Degrees of freedom: ', num2str(df)])
Degrees of freedom: 19
disp(['p-value: ', num2str(p)])
p-value: 0.39707

And what about the Matlab function crosstab?

Well, the Matlab function crosstab is equivalent to the following R function without "cbind()":

chisq.test(x, y)

Let's see the same two examples with "chisq.test(x, y)" in R and crosstab in Matlab:

> x = c(1000,100,10,1);
> y = c(1000,100,10,1);
> chisq.test(x, y)
	Pearson's Chi-squared test
data:  x and y
X-squared = 12, df = 9, p-value = 0.2133
Warning message:
In chisq.test(x, y) : Chi-squared approximation may be incorrect
> x <- c(287,202,127,78,65,44,37,27,22,20,10,18,11,6,8,4,6,3,5,1);
> y <- c(348,171,124,64,51,49,30,33,19,13,11,12,11,8,8,6,7,4,1,2);
> chisq.test(x,y)
	Pearson's Chi-squared test
data:  x and y
X-squared = 325, df = 306, p-value = 0.2178

...and in Matlab:

x = [1000,100,10,1];
y = [1000,100,10,1];
[~,chi2,p] = crosstab(x,y)
chi2 =
    12
p =
      0.21331
x = [287,202,127,78,65,44,37,27,22,20,10,18,11,6,8,4,6,3,5,1];
y = [348,171,124,64,51,49,30,33,19,13,11,12,11,8,8,6,7,4,1,2];
[~,chi2,p] = crosstab(x,y)
chi2 =
          325
p =
      0.21784
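The degrees of freedom above hint at what is going on: given two vectors, crosstab (like chisq.test(x, y)) treats every distinct value as a category label and every position as one observation, so the counts are not read as counts. A quick check with base MATLAB only, using example (2); the names nx and ny are illustrative:

```matlab
x = [287,202,127,78,65,44,37,27,22,20,10,18,11,6,8,4,6,3,5,1];
y = [348,171,124,64,51,49,30,33,19,13,11,12,11,8,8,6,7,4,1,2];

% Each distinct value becomes one category of the contingency table
nx = numel(unique(x));        % 19 distinct values (6 appears twice)
ny = numel(unique(y));        % 18 distinct values (8 and 11 appear twice)
df = (nx - 1) * (ny - 1);     % 306, the df reported by chisq.test(x, y)

fprintf('%d x %d table from %d observations, df = %d\n', nx, ny, numel(x), df)
```

So the test is run on a 19-by-18 table filled by only 20 paired observations, which is a different question from comparing the two binned distributions.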

2 comments

dpb, 25 August 2024

You can always suggest an enhancement.

The Statistics module in MATLAB has seemingly "just growed" rather than there being an attempt to reproduce some known package capabilities with a uniform interface and output.

It's a hard nut to crack w/ MATLAB in conventional MATLAB syntax; it would be possible, at least theoretically, to build a complete app in MATLAB code that provides yet another statistics package, but my contention is that if one is doing heavy statistical computing, the better route is to use one of the many available packages that have all the features, particularly the output formatting that is the difficult issue with MATLAB.

Sim, 25 August 2024

Edited: Sim, 25 August 2024

Thanks a lot @dpb for your comment!

I just wrote an additional answer in case someone else was facing the same doubts as me...

I think the MATLAB "crosstab(x,y)" function for creating contingency tables is not wrong per se. Also, it gives the same output as the "chisq.test(x,y)" function of R.

However, I did not find - in Matlab - the exact same functionality of the R function previously mentioned, i.e. the "chisq.test(cbind(x, y))". That feature, i.e. the two-sample chi-squared test on contingency tables, can be very useful to have a first assessment on the comparison of two datasets/distributions... I guess that experts in Mathworks will then judge if suitable enough to implement it among the standard features (unless it already exists and I did not find/understand it in the MATLAB documentation)... :-)
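One possible workaround, offered as a sketch rather than a confirmed equivalent: since crosstab expects one label per observation, the binned counts can be expanded back into per-observation labels (a bin label and a dataset label) with repelem, and crosstab then builds the same bins-by-datasets table as cbind(x, y). The names bins, binLabel and groupLabel are illustrative, and the Statistics and Machine Learning Toolbox is assumed.

```matlab
% Requires Statistics and Machine Learning Toolbox (crosstab)
x = [1000, 100, 10, 1];     % binned counts, dataset 1 (example (1))
y = [1000, 100, 10, 1];     % binned counts, dataset 2
bins = (1:numel(x))';       % one label per bin

% Expand the counts into one row per observation
binLabel   = [repelem(bins, x(:)); repelem(bins, y(:))];
groupLabel = [ones(sum(x), 1); 2 * ones(sum(y), 1)];

% The contingency table is now bins-by-datasets, as in cbind(x, y)
[tbl, chi2, p] = crosstab(binLabel, groupLabel);

disp(tbl)
fprintf('chi2 = %g, p = %g\n', chi2, p)
```

For example (1) this reproduces the table [x(:) y(:)], so the statistic and p-value agree with the manual computation (0 and 1). A bin that is empty in both datasets would simply not appear as a row, and very large counts make the expanded vectors long, so the manual formula may still be the more practical route.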
