# ‘Kønne’ formler uden noget nedenunder: en advarende historie ‘Good-looking’ statistics with nothing underneath: a warning tale Jørgen Hilden

Jørgen Hilden, jhil@sund.ku.dk. Notes have been added in the NOTES field.

Statistical counselling:

- What do you really want to know / measure? Meaningless estimand? Nonsense arithmetic?
- 1st floor: 1st-moment issues. Bias?
- 2nd floor: 2nd moment. Standard errors, etc.
- Higher-order refinements.

Ground floor examples:

- “The pH was doubled” – OOSH!
- How do I define / calculate the mean waiting time to liver transplantation in 2012? – Tricky, or impossible.
- Dr. NN, otologist: #(consultations) / #(patients seen) = 2.7 in 2012, = 1.5 in JAN-MAR 2013. – An interpretable change?

Statistical counselling:

- What do you really want to know / measure? Meaningless estimand? Nonsense arithmetic?
- 1st floor: 1st-moment issues. Bias?
- 2nd floor: 2nd moment. Standard errors, etc.
- Higher-order refinements.

…THEORY…

…on the dangers of inventing and popularizing new statistical (epidemiological) measures which are based entirely on ’nice looks’ and have no proper theoretical underpinning

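[Diagram: standard clinical data feed the ‘old’ oracle, which returns risk p; the same data plus the new biochemical marker feed the new oracle, which returns risk q. Is the new oracle better?]

Consider prognostics as to survival vs. death (D):

$$\chi^2 = 2\sum_i \left\{ (\ln q_i - \ln p_i)\,D_i + (\ln \bar q_i - \ln \bar p_i)\,\bar D_i \right\}$$

where a bar denotes the complement ($\bar q = 1 - q$, $\bar D = 1 - D$). The χ² is high; the odds ratio or hazard ratio, etc., is highly significant.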

The statistics IDI = integrated discrimination improvement & its ‘little brother,’ the NRI = net reclassification index, were designed to measure the incremental prognostic impact that a new marker will have when added to a battery of prognostic markers for assessing the risk of a binary outcome. Intuitively plausible? – Yes, they are. But their popularity is undeserved, nonetheless.

[Diagram: ‘old’ oracle (standard clinical data) → risk p; new oracle (standard clinical data plus new biochemical marker) → risk q; χ² as above.]

Proposed ‘measures’ of the superiority of the new oracle:

$$\mathrm{NRI} \approx E\{\operatorname{sign}(q - p) \mid \text{Death}\} + E\{\operatorname{sign}(p - q) \mid \text{Survives}\}$$

$$\mathrm{IDI} \approx E\{q - p \mid \text{Death}\} + E\{p - q \mid \text{Survives}\}$$

Pencina & al. (2008+)
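As a rough illustration of the two definitions above, the following Python sketch computes their empirical counterparts from lists of old risks p, new risks q and outcomes d (1 = death, 0 = survives). The function names `nri` and `idi` and the toy data are assumptions made for illustration only; they are not taken from Pencina & al.

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def _sign(x):
    # +1, 0 or -1
    return (x > 0) - (x < 0)


def nri(p, q, d):
    """Empirical E{sign(q - p) | death} + E{sign(p - q) | survives}."""
    among_deaths = _mean(_sign(qi - pi) for pi, qi, di in zip(p, q, d) if di == 1)
    among_survivors = _mean(_sign(pi - qi) for pi, qi, di in zip(p, q, d) if di == 0)
    return among_deaths + among_survivors


def idi(p, q, d):
    """Empirical E{q - p | death} + E{p - q | survives}."""
    among_deaths = _mean(qi - pi for pi, qi, di in zip(p, q, d) if di == 1)
    among_survivors = _mean(pi - qi for pi, qi, di in zip(p, q, d) if di == 0)
    return among_deaths + among_survivors


# Hypothetical toy data: four patients, two deaths.
p = [0.2, 0.4, 0.6, 0.8]   # 'old' risks
q = [0.1, 0.5, 0.7, 0.9]   # 'new' risks
d = [0, 0, 1, 1]           # 1 = death, 0 = survives

print("NRI:", nri(p, q, d))
print("IDI:", idi(p, q, d))
```

Both statistics are simple averages of risk differences (or their signs) within the two outcome groups, which is part of their intuitive appeal.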

[Diagram: ‘old’ oracle (standard clinical data) → risk p; new oracle (standard clinical data plus new biochemical marker) → risk q; χ² as above.]

Standard measures of prognostic gain:

$$\Delta(\text{logarithmic score}) = \frac{1}{n}\sum_i \left\{ D_i \ln\frac{q_i}{p_i} + (1 - D_i)\ln\frac{1 - q_i}{1 - p_i} \right\} = \frac{\chi^2}{2n};$$

$$2\,\Delta(\text{Harrell's } C) = \frac{\sum_{i,j} \operatorname{sign}(q_i - q_j)\,(D_i - D_j)}{\sum_{i,j} (D_i - D_j)^2} - (\text{ditto with the } p\text{'s}).$$
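The two standard measures can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function names and toy data are hypothetical, outcomes are binary without censoring, and C is computed as the proportion of (death, survivor) pairs in which the death was given the higher risk, with ties counted as one half.

```python
import math


def delta_log_score(p, q, d):
    """Mean gain in logarithmic score of q over p; equals chi2 / (2n)."""
    total = 0.0
    for pi, qi, di in zip(p, q, d):
        total += di * math.log(qi / pi) + (1 - di) * math.log((1 - qi) / (1 - pi))
    return total / len(d)


def c_index(risk, d):
    """Concordance for a binary outcome: P(risk of a death > risk of a survivor)."""
    deaths = [r for r, di in zip(risk, d) if di == 1]
    survivors = [r for r, di in zip(risk, d) if di == 0]
    concordant = 0.0
    for rd in deaths:
        for rs in survivors:
            if rd > rs:
                concordant += 1.0
            elif rd == rs:
                concordant += 0.5   # ties count one half
    return concordant / (len(deaths) * len(survivors))


# Hypothetical toy data: four patients, two deaths.
p = [0.2, 0.4, 0.6, 0.8]
q = [0.1, 0.5, 0.7, 0.9]
d = [0, 0, 1, 1]

gain = delta_log_score(p, q, d)
print("Delta log score:", gain)
print("chi2:", 2 * len(d) * gain)
print("Delta C:", c_index(q, d) - c_index(p, d))
```

In this toy sample both oracles already rank every death above every survivor, so C cannot increase, while the logarithmic score still registers the change in the reported risks.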

IDI and NRI were proposed because the C Index was regarded as the standard measure of prognostic performance, and it turned out to be “insensitive to new information”: “Look, the hazard ratio was as high as 2.5 and strongly significant (P = 0.0001), yet C only increased from 0.777 to 0.790 !”

Main flaws of the NRI/IDI family of statistics … gradually uncovered by various investigators:

- Attic: sampling distributions much farther from Gaussian than originally thought.
- 2nd floor: original SE formulae wrong and seriously off the mark (when training data = evaluation data).
- 1st floor: biased towards attributing prognostic power to uninformative predictors, at least in logistic regression models (Monte Carlo), so they may fool their users; bias otherwise undefined or irrelevant (see **).
- Ground floor: …

Main flaws of the NRI/IDI family of statistics (cont’d)

Ground floor: NRI/IDI do reflect prognostic gain, but:

- ** What do they measure? What optimality ideal do they portray?
- Users may also be deliberately fooled by an opponent who wants to sell the q’s (i.e., sell the new marker equipment) and who already knows the p’s of patients in the sample [dishonesty pays; keyword: non-proper scoring rule].

Essence: they reward overconfidence, i.e., reporting large risks that are too large and small risks that are too small.

Deliberately fooled?? Recall: $p_i$ = patient’s ‘old’ risk of ‘event’, $q_i$ = ‘new’ risk. The IDI (as a parameter) is defined graphically:

$$\mathrm{IDI} = E\{q - p \mid \text{event}\} - E\{q - p \mid \text{no event}\} = \text{sum of arrows}$$

[Figure: risk axis from 0 to 1 = 100%; q and mean p shown for the Event and No event groups.]

Alas! – The IDI is vulnerable to deliberate (or accidental) overconfidence…

The p rule can be “improved on” simply by making its predictions more extreme. For patient i, the cheater may report a fake $q_i$ (let’s call it Z) equal to either 100% or zero: zero to the left, and 100% to the right, of the red line.

[Figure: risk axis with p and q for the Event and No event groups; the red line marks the marginal event frequency, approximately known to the cheater.]

Proof. Consider the IDI. The cheater tries to “optimize” Z. He expects the i’th patient to contribute to the IDI

- $+(Z - p_i)/\#D$ with probability $p_i$ (event), and
- $-(Z - p_i)/(n - \#D)$ with probability $1 - p_i$ (no event),

i.e., in expectation,

$$(Z - p_i)\left\{ \frac{p_i}{\#D} - \frac{1 - p_i}{n - \#D} \right\},$$

a linear function of Z, maximizable by setting Z := 1 (0) for $p_i$ > (<) #D/n = the marginal frequency approximately known to him. If in doubt, he may play safe by setting Z := $p_i$.
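The argument can be checked numerically. The sketch below is a hypothetical simulation, not part of the original talk: the old risks p are drawn uniformly and the outcomes are generated from them, so p is the true, perfectly calibrated model and no real new information exists. The cheater then reports Z = 1 or 0 depending on whether $p_i$ lies above or below his estimate of the marginal event frequency. The names `idi` and `cheat`, the sample size and the seed are all assumptions.

```python
import random


def idi(p, q, d):
    """Empirical E{q - p | event} - E{q - p | no event}."""
    events = [qi - pi for pi, qi, di in zip(p, q, d) if di == 1]
    non_events = [qi - pi for pi, qi, di in zip(p, q, d) if di == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)


def cheat(p_i, event_rate):
    """Fake 'new' risk: 100% above the marginal event frequency, zero below."""
    return 1.0 if p_i > event_rate else 0.0


random.seed(1)
n = 2000

# Honest risks; outcomes are drawn from them, so p is the true model.
p = [random.random() for _ in range(n)]
d = [1 if random.random() < pi else 0 for pi in p]

# The cheater only needs a rough idea of the marginal event frequency.
event_rate = sum(p) / n
z = [cheat(pi, event_rate) for pi in p]

print("IDI of the fake risks over the true risks:", idi(p, z, d))
print("IDI of the true risks over themselves:   ", idi(p, p, d))
```

Under these assumptions the fake risks earn a clearly positive IDI although they contain no information beyond p, whereas the logarithmic score would penalise any reported risk of 0 or 1 that turns out wrong without limit.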

Adoption of IDI → spurious results may arise when risks are overconfident (instead of being well calibrated), as may happen with an unlucky choice of regression program. (Cheaters beat the best probabilistic model, so…)

A supporter of a new lab test may sell it without ever doing it!* Simply by exploiting knowledge of the assessment machinery, a poor prognostician can outperform a good prognostician.

\* cf. The Emperor’s New Clothes

Stepping back – what do we really want?

Ideally, clinical innovations should be rated in human utility terms. In particular, new information sources should be valued in terms of the clinical benefit that is expected to accrue from (optimized use of) the enlarged body of information: Value-of-Information (VOI) statistics.

All VOI-type, (quasi-)utility expectation statistics are Proper Scoring Rules (PSRs). Key properties of a PSR:

- Good performance cannot be faked.
- It pays for a prognostician to strive to fully use the data at hand and to honestly report his assessment.
- He cannot increase his performance score by ‘strategic votes,’ not even by exploiting his knowledge of the scoring machinery.
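The defining property of a proper scoring rule can be illustrated with a small grid search. In the sketch below, a prognostician whose honest risk is `p_true` looks for the report r that maximises his expected score. The logarithmic score is a standard proper rule; the ‘linear’ score is a hypothetical non-proper rule introduced here only for contrast (like the IDI contribution, it is linear in the reported risk). All function names are assumptions.

```python
import math


def expected_log_score(r, p_true):
    """Expected logarithmic score of report r when the event has probability p_true."""
    return p_true * math.log(r) + (1 - p_true) * math.log(1 - r)


def expected_linear_score(r, p_true):
    """Expected score of a (non-proper) rule paying r on event and 1 - r otherwise."""
    return p_true * r + (1 - p_true) * (1 - r)


def best_report(expected_score, p_true, grid):
    """Report on the grid that maximises the expected score."""
    return max(grid, key=lambda r: expected_score(r, p_true))


grid = [i / 100 for i in range(1, 100)]   # candidate reports 0.01 ... 0.99
p_true = 0.3                              # the prognostician's honest risk

print("log score, best report:   ", best_report(expected_log_score, p_true, grid))
print("linear score, best report:", best_report(expected_linear_score, p_true, grid))
```

Under the logarithmic score the best report is the honest one; under the linear rule the best report is the most extreme value available, which is the overconfidence-rewarding behaviour described above.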

Strengthened conclusion

Conversely:

IDI can be faked

↓

IDI is not a PSR

↓

IDI is not a VOI criterion

↓

One cannot construct a decision scenario – not even a ridiculously artificial one – that has the IDI as its utility-expectation criterion.

Even in the absence of cheating, it cannot be claimed that IDI measures something arguably useful or constitutes a dependable yardstick.

Summing up the horror story: What went wrong in the Boston group?

(1) They knew no better than to embrace the C Index as their measure of prognostic power.

(2) C turns out disappointing owing to its unexpected resilience to ‘well supported’ novel prognostic markers [they mix up weight of effect & weight of evidence].

(3) They [undeservedly] discard C as ‘insensitive to new information.’

(4) They propose NRI, IDI and variants as being more sensitive to new information [overlooking that these are also sensitive to null or pseudo information].

(5) They rashly suggest SE formulae and make vague promises of Gaussian distribution in reasonably large samples [both wrong].

Thank you
