Possible Incorrect Documentation on ksdensity

David Gillcrist on 8 Oct 2024
Answered: Umar on 9 Oct 2024
I'm trying to implement a custom version of ksdensity. In the documentation, the default way of calculating the bandwidth is said to be Silverman's Rule-of-Thumb, i.e. a rule that gives the bandwidth h in closed form.
This is according to the Wikipedia article on Kernel Density Estimation. However, upon rooting about in the MATLAB files, I found that the default bandwidth is calculated in the MATLAB function matlab.internal.math.validateOrEstimateBW (run open matlab.internal.math.validateOrEstimateBW if you want to view it in its entirety). Lines 64–74 are shown below; lines 64–68 are the relevant ones.
64  if isequal(bw, 'normal-approx')
65      if all(sigma>0)
66          % Default window parameter is optimal for normal distribution
67          % Scott's rule
68          bw = sigma * (4/((d+2)*N))^(1/(d+4));
69      else
70          ... % Unimportant
71      end
72  else
73      ... % Unimportant
74  end
The 'normal-approx' option is the default setting for bandwidth estimation and it should be Silverman's Rule-of-Thumb; however, it is clearly different and is referred to as "Scott's Rule". It could be that Wikipedia references the wrong bandwidth calculation and that Scott's Rule is, in fact, the same as Silverman's Rule-of-Thumb, but it's been hard to find proper confirmation of this (for example, this presentation from UBC has a different rule labelled as Silverman's Rule-of-Thumb), as I cannot find Silverman's original paper where he purportedly first introduced this rule. If someone could confirm whether this is an error in the code or an error in my understanding of the bandwidth calculation, I would be greatly appreciative.
2 Comments
Torsten on 8 Oct 2024
You should address this question to the MATLAB development team, not to the forum members as poor end users.
the cyclist on 9 Oct 2024
This question triggered a distant memory. I searched and found this question and answer from 8 years ago.
Spoiler: It's not going to help.


Answers (1)

Umar on 9 Oct 2024

Hi @David Gillcrist,

I went through your comments and the documentation at the link below:

https://www.mathworks.com/help/stats/ksdensity.html?s_tid=doc_ta#btpl6_1-1

To clarify how MATLAB's bandwidth estimation for kernel density estimation (KDE) compares with the traditional statistical rules, let me go through each component.

Understanding Silverman’s and Scott’s Rules

Both formulas aim to optimize density estimation under different distributional assumptions.
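
For reference, here is a sketch of two expressions as they are commonly written for a univariate sample of size n with sample standard deviation \hat{\sigma} and interquartile range IQR. Which name attaches to which expression varies between sources, so the labels in the comments are an assumption rather than a settled convention.

```latex
% Form commonly cited as Silverman's rule of thumb (robust spread estimate)
h_{\mathrm{rot}} = 0.9 \,\min\!\left(\hat{\sigma},\ \frac{\mathrm{IQR}}{1.34}\right) n^{-1/5}

% Normal-reference form (optimal when the data are Gaussian)
h_{\mathrm{nr}} = \hat{\sigma}\left(\frac{4}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\, n^{-1/5}
```

The second expression is what sigma * (4/((d+2)*N))^(1/(d+4)) reduces to when d = 1, since (d+2) = 3 and 1/(d+4) = 1/5.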

MATLAB's Bandwidth Calculation

In your provided MATLAB snippet from matlab.internal.math.validateOrEstimateBW, it appears that MATLAB defaults to a bandwidth estimation method labeled as "normal-approx," which aligns more closely with Scott's Rule rather than Silverman's:

bw = sigma * (4/((d+2)*N))^(1/(d+4));

The comment in the code labels this as Scott's rule, and the constant is the one that is optimal under a normal-distribution assumption.
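
To see what the quoted line produces, the following minimal sketch evaluates it for a univariate sample and prints it next to the bandwidth that ksdensity reports through its third output. It assumes sigma is the sample standard deviation; the internal function may estimate sigma differently, so the two numbers are not guaranteed to match.

```matlab
% Sketch only: evaluates the formula quoted from validateOrEstimateBW.
% Assumption: sigma is the sample standard deviation (the internal
% function may use a different spread estimate).
rng(0);                      % reproducible sample
x = randn(500, 1);           % univariate data, so d = 1
N = numel(x);
d = 1;
sigma = std(x);

% Formula from the quoted snippet
bwFormula = sigma * (4 / ((d + 2) * N))^(1 / (d + 4));

% For d = 1 the constant factor is (4/3)^(1/5), roughly 1.06
c = (4 / 3)^(1 / 5);
bwNormalRef = c * sigma * N^(-1 / 5);

% Bandwidth actually chosen by ksdensity (third output)
[~, ~, bwDefault] = ksdensity(x);

fprintf('formula: %.4f  normal-reference: %.4f  ksdensity: %.4f\n', ...
    bwFormula, bwNormalRef, bwDefault);
```

If the printed ksdensity value differs from the other two, the difference comes from how sigma is estimated, not from the constant.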

Clarification on Literature References

The confusion often arises because the rules attributed to Silverman and to Scott are based on similar principles but differ slightly in their constants. Roughly speaking, Silverman's rule of thumb adjusts its constant and spread estimate so that it stays reasonable for non-normal data, while the normal-reference form is optimal specifically for normally distributed data. The UBC reference you mentioned may be conflating these methods or presenting them in a different context.

Practical Implications

Your personal experience resonates with common practice among statisticians. Many practitioners prefer adjusting bandwidth downwards (e.g., using factors like 0.5 or lower) to avoid over-smoothing, especially with smaller sample sizes where finer details are crucial.
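
As a minimal sketch of that adjustment, the default bandwidth can be read from the third output of ksdensity and passed back in, scaled, through the 'Bandwidth' name-value argument. The factor 0.5 and the bimodal test sample are illustrative choices, not recommendations.

```matlab
% Sketch only: compare the default bandwidth with a narrower one.
rng(1);                                          % reproducible sample
x = [randn(100, 1); 4 + 0.5 * randn(100, 1)];    % illustrative bimodal data

% Default estimate; bwDefault is the bandwidth ksdensity selected
[fDefault, xi, bwDefault] = ksdensity(x);

% Re-estimate on the same grid with a scaled-down bandwidth
scale = 0.5;                                     % hypothetical factor
fNarrow = ksdensity(x, xi, 'Bandwidth', scale * bwDefault);

plot(xi, fDefault, xi, fNarrow);
legend('default bandwidth', 'scaled bandwidth');
```

A smaller bandwidth shows more local structure but also more sampling noise, so the factor is worth checking visually for each data set.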

Here are some additional insights I would like to share with you.

Depending on your data distribution characteristics (e.g., skewness or presence of outliers), you might want to explore robust bandwidth selectors beyond Silverman’s or Scott’s rules. For instance, adaptive methods can provide better performance in heterogeneous data contexts. Also, bear in mind that different statistical software packages may implement these rules with slight variations, leading to discrepancies in output. Therefore, when comparing results across platforms (e.g., R vs MATLAB), it's essential to understand these underlying implementations.
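
One simple robust variant can be sketched as follows: replace the standard deviation with the smaller of the standard deviation and a scaled interquartile range, as in the form commonly cited as Silverman's rule of thumb. The sample with outliers is made up for illustration, and this is not a claim about what any particular package implements.

```matlab
% Sketch only: robust-spread rule of thumb vs. normal-reference bandwidth.
rng(2);                               % reproducible sample
x = [randn(200, 1); 10; 12; 15];      % illustrative data with outliers
N = numel(x);
sigma = std(x);                       % inflated by the outliers

% Robust spread: the IQR is far less sensitive to a few extreme points
robustSpread = min(sigma, iqr(x) / 1.34);

bwRobust = 0.9 * robustSpread * N^(-1 / 5);   % rule-of-thumb form
bwNormal = sigma * (4 / (3 * N))^(1 / 5);     % normal-reference form, d = 1

fprintf('robust: %.4f  normal-reference: %.4f\n', bwRobust, bwNormal);

% Either value can be passed to ksdensity explicitly
f = ksdensity(x, 'Bandwidth', bwRobust);
```

Because 0.9 is below (4/3)^(1/5) and the robust spread never exceeds sigma, the robust bandwidth is always the smaller of the two.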

I agree with @Torsten’s comment: “You should address this question to the MATLAB development team, not to the forum members as poor end users.”

Hope this helps.

Release

R2024a
