Hematology and Pathology Devices Panel Meeting October 22-23, 2009


1 Statistical Considerations in the Evaluation of Digital Pathology Devices
Hematology and Pathology Devices Panel Meeting October 22-23, 2009 Shanti Gomatam, Ph.D. Mathematical Statistician FDA/CDRH/OSB/DBS

2 Outline
Intended Use
Clinical Study Design Issues
Study Design Examples
Assessing Results
Precision Studies

3 Intended Use
The intended use under discussion is primary diagnosis of surgical pathology microscope slides in lieu of optical microscopy (OM).
Broad application: not organ- or disease-specific.
The Intended Use Population (IUP) is the population of subjects on whom the device is intended to be used.

4 Supporting Evidence
Sponsors would be required to provide evidence to support the safety and effectiveness of whole slide imaging (WSI) under its intended use.
Clinical studies assess how well WSI performs with respect to OM under clinical use.
Precision studies characterize imprecision (variability) in WSI results.

5 Supporting Evidence Flowchart
[Flowchart: clinical and precision study evidence → analyze results → establish performance]

6 Bias and Variance
[Target diagrams illustrating: low bias, high variance; large bias, low variance; low bias, low variance]

7 Bias and Variance
Bias is about hitting the right target. Variance (imprecision) is about how close together repeated attempts are.
The right data (the right study design) helps reduce bias; more data does not.
More data can help reduce uncertainty (imprecision).
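The point can be illustrated with a short simulation, a minimal sketch with made-up numbers rather than data from any actual study: a biased sampling scheme that systematically misses harder cases stays off target no matter how large the sample, while random sampling from the full population tightens around the truth as data accumulates.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: a measurement (say, tumor size in mm) from
# two case types; the "hard" cases are missed by the biased scheme.
easy = [random.gauss(10, 2) for _ in range(5000)]
hard = [random.gauss(25, 4) for _ in range(5000)]
population = easy + hard
true_mean = statistics.mean(population)

def estimate(sample_size, biased, n_rep=200):
    """Mean and spread of the estimator over repeated draws;
    the biased scheme draws only from the 'easy' cases."""
    pool = easy if biased else population
    ests = [statistics.mean(random.sample(pool, sample_size))
            for _ in range(n_rep)]
    return statistics.mean(ests), statistics.stdev(ests)

for n in (20, 200):
    b_mean, b_sd = estimate(n, biased=True)
    r_mean, r_sd = estimate(n, biased=False)
    print(f"n={n}: biased={b_mean:.1f} (sd {b_sd:.2f}), "
          f"random={r_mean:.1f} (sd {r_sd:.2f}), truth={true_mean:.1f}")
```

Increasing n shrinks the spread (sd) of both estimators, but only the random design centers on the truth; the biased design's error persists.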

8 Clinical Study Design Issues
Factors to consider:
Diagnostic reference standard
Time of specimen collection
Comparing modalities
Paired design
Reader design
Sample selection

9 Diagnostic Reference Standard (Reference Diagnosis)
Diagnostic accuracy is based on determination of "truth" via a diagnostic reference standard (see the FDA diagnostic guidance [1]).
The diagnostic reference standard allows determination of accuracy (e.g., TP, FP, TN, FN).
The diagnostic reference standard should not be based on the device being evaluated for accuracy.
When the diagnostic reference is based on the control device (OM), the comparison can be biased.
[1] FDA guidance document: Statistical Guidance on Reporting Results from Studies Evaluating Diagnostic Tests.

10 Time of Specimen Collection: Prospective Studies
Prospective studies are those in which specimens (cases/slides) are prospectively collected and assessed by each modality (WSI or OM).
Prospective planning required; a common protocol is used across specimens.
Prospective studies are less likely to be biased.
Study duration is potentially longer.
The final collection of study specimens may not contain all specimens of interest.

11 Time of Specimen Collection: Retrospective Studies
Retrospective studies are based on specimens that were previously collected from the patient.
Easier to enrich.
Potential for bias: selection criteria; hidden missing sample/data issues; variation in pre-analytical processes.
Potential lack of clinical, demographic, and other information for specimens (cases/slides).

12 Comparing Modalities
Best to compare WSI to OM (the "control") on the same samples.
This avoids potential bias due to changes in clinical practice or in other time- or location-dependent factors.
It is difficult to evaluate WSI without comparison to the control device, OM.

13 Paired Designs
When each specimen (case/slide) is tested with both WSI and OM, the study design is paired.
Paired designs have good statistical properties.
Design considerations:
Memory of the first reading can affect the next reading (insufficient washout).
The order of WSI and OM readings should be randomized.

14-20 Paired Designs
[Animated diagram built up over slides 14-20, with axes specimens x time: for each specimen, one modality (OM or WSI) is read first, a washout period follows, then the other modality is read; the read order varies across specimens.]
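The randomized paired schedule sketched in the diagram could be generated along these lines; a hypothetical sketch in which the specimen IDs and the balanced half-and-half allocation of read orders are illustrative assumptions.

```python
import random

random.seed(42)

specimens = [f"S{i:02d}" for i in range(1, 11)]  # hypothetical specimen IDs

# Balanced randomization: half the specimens get OM first, half WSI first,
# with the assignment shuffled across specimens.
orders = [("OM", "WSI")] * 5 + [("WSI", "OM")] * 5
random.shuffle(orders)

schedule = [{"specimen": s, "first": a, "second": b}
            for s, (a, b) in zip(specimens, orders)]

for row in schedule:
    print(f"{row['specimen']}: {row['first']} read -> washout -> "
          f"{row['second']} read")
```

A balanced allocation keeps order effects from confounding the modality comparison; simple per-specimen coin flips would also be defensible but can come out unbalanced in small studies.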

21 Reader Design
Pathologists are the "readers" for this indication.
The reader effect makes a difference to the results obtained.
Reader designs:
Every reader reads every specimen under every modality; or
Each reader reads a different subset of specimens under a single modality.
The first (fully crossed) design is the most efficient.

22 Sample Selection
Non-representative samples may lead to conclusions that are not generalizable to the IUP (bias may be high and variance estimates may be incorrect).
Random selection from the IUP is the preferred statistical choice.
Consecutive (sequential) selection from the IUP may be reasonable under suitable conditions.
Enrichment may be necessary to have rare conditions represented in sufficient numbers.

23 Sample Selection (cont.)
Adequate representation of non-disease and benign disease cases is needed.
Factors to be considered while picking the sample:
Organ/disease for which specimens are collected;
Type of specimen (needle biopsy, resection, etc.);
Potential spectrum effect (level of difficulty; case mix);
Clinical center/site from which samples are obtained.
Ideally, the statistical mechanism for drawing specimens does not introduce bias; pre-specification is preferred.

24 Potential Study Design Examples

25 Common Elements of All Design Examples
Specimens picked from regular clinical practice at multiple sites.
Paired design; specimen order and order of read are randomized.
Diagnostic reference standard available for statistical analysis.
Readers read de-identified specimens.
Results from specimens are compared on diagnoses.

26 Study I: Prospective Clinical Study
Prospective study using consecutive clinical specimens.
R pathologists at each site read all specimens at their site with WSI and OM, with appropriate washout.

27 Study II: Retrospective Enriched Clinical Study
Prospectively planned retrospective study using enriched clinical specimens randomly picked from those available.
R pathologists at each site read all specimens at all sites with WSI and OM.
A non-study pathologist reads specimens to implement enrichment; study pathologists are blinded to the enrichment read.

28 Study III: Retrospective Clinical Study
Prospectively planned retrospective study using consecutive clinical specimens.
R pathologists at each site read all specimens at all sites with WSI and OM.

29 Study I: Prospective Clinical Study
Pros:
Representative of intended use
Ensures planning (prospective)
Common protocol (prospective)
Reduction in bias (prospective)
Cons:
Reader design not as efficient
Potential implementation challenges (prospective)
May take longer (non-enriched, prospective)
Reader behavior could be affected (multiple reads)

30 Study II: Retrospective Enriched Clinical Study
Pros:
Easier to implement (retrospective)
Potentially smaller sample size (enrichment)
Ensures some planning (prospectively planned)
Reader design efficient (all cases read with both modalities)
Cons:
Lack of common protocol (retrospective)
Potential bias (retrospective)
Reader behavior could be affected (enrichment + multiple reads)

31 Study III: Retrospective Clinical Study
Pros:
Ensures some planning (prospectively planned)
Potentially shorter duration (retrospective)
Potentially larger sample size (non-enriched)
Reader design efficient
Cons:
Lack of common protocol (retrospective)
Potential bias (retrospective)
Reader behavior could be affected (multiple reads)

32 Additional Clinical Design Issues: Assessing Results

33 Assessing Results
Attributes/measurements to be evaluated
Hypotheses on attributes
Study success criterion
Study sizing

34 Examples
Two organ systems will be used as examples in the following slides:
Breast: CAP Breast IC protocol checklist
Lung: CAP Lung IC Biopsy protocol checklist

35 CAP Protocol for Breast IC: Macroscopic Elements
Specimen Type
Lymph Node Sampling
Specimen Size
Laterality
Tumor Site

36 CAP Protocol for Breast IC (cont.): Microscopic Elements
Size of invasive component
Histologic Type (check all that apply):
___ Noninvasive carcinoma (NOS)
___ Ductal carcinoma in situ
___ Lobular carcinoma in situ
___ Other(s) (specify): ____________________________
___ Carcinoma, type cannot be determined

37 CAP Protocol for Breast IC (cont.): Microscopic Elements
Histologic Grade: Nottingham Histologic Score (tubule formation; nuclear pleomorphism; mitotic count) OR other grading system + mitotic count
Pathologic Staging
Margins
Venous/Lymphatic Invasion
Microcalcifications
Additional Pathologic Findings

38 CAP Protocol for Lung IC Biopsy: Microscopic Elements
Histologic Type:
___ Carcinoma, non-small cell type
___ Small cell carcinoma
___ Squamous cell carcinoma
___ Other(s) (specify): ____________________________
___ Carcinoma, type cannot be determined

39 CAP Protocol for Lung IC Biopsy: Microscopic Elements (cont.)
Histologic Grade:
___ Not applicable
___ GX: Cannot be assessed
___ G1: Well differentiated
___ G2: Moderately differentiated
___ G3: Poorly differentiated
___ G4: Undifferentiated
___ Other (specify): ______
Visceral Pleura Invasion
Venous Invasion
Lymphatic Invasion
Additional Pathologic Findings

40 Measurements
Measurements vary by tissue type.
Measurements vary by pathological findings.
There are many potential measurements per specimen.
What results/findings should one use to assess device performance?

41 Selecting Measurements
Should one assess on the basis of the case (multiple slides) or a single whole slide?
Should microscopic and/or macroscopic findings be assessed?
A pathology report has multiple "lines" of results, each potentially containing information on type, grade, size, ...
How many "lines" is it sufficient to assess agreement on?

42 Selecting Measurements (cont.)
What fields within each "line" should be compared?
Histologic type
Histologic grade
Histologic determination of size (for the case) using multiple slides
Results are tissue-type/disease dependent.

43 Potential Measurements for Performance Comparison
Disease/non-disease status;
Primary diagnosis only (main diagnosis for the specimen);
Some diagnoses from pathological evaluation;
All diagnoses from pathological evaluation.
Any of the above will have multiple measurements of different kinds: type is nominal, grade is ordinal, size is interval, ...

44 "Primary" and "Secondary" Measurements
Agreement on which measurements is key for regulatory decisions? ("Primary" measurements)
What additional comparisons are useful to report? ("Secondary" measurements)

45 Assessing Accuracy: Scales
Accuracy and comparative performance can be assessed at various levels and for different outcomes:
On a binary scale (e.g., disease/non-disease)
On a nominal scale (e.g., histologic type)
On an ordinal scale (e.g., histologic grade)
On a continuous scale (e.g., tumor size or probability of being diseased)

46 More on Assessing Accuracy
Sensitivity and specificity can be used for assessments on the binary scale.
Agreement on the ordinal scale can be evaluated using sensitivities/specificities conditional on category, and ROC-based methods.
Many methods exist for assessing agreement on a continuous scale.
Nonparametric methods exist for assessing diagnostic accuracy on all scales.*
* Obuchowski (2005), Acad. Radiol.
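As one concrete instance of the binary-scale assessment above, sensitivity and specificity with Wilson score confidence intervals can be computed from a 2 x 2 table. The counts below are made-up illustrations, not data from any study.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical 2x2 counts against the diagnostic reference standard.
tp, fn, fp, tn = 90, 10, 8, 192

sens, spec = tp / (tp + fn), tn / (tn + fp)
lo, hi = wilson_ci(tp, tp + fn)
print(f"sensitivity = {sens:.3f} (95% CI {lo:.3f}-{hi:.3f})")
lo, hi = wilson_ci(tn, tn + fp)
print(f"specificity = {spec:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

The Wilson interval behaves better than the simple normal ("Wald") interval near 0% and 100%, which matters for the high agreement rates typical of these comparisons.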

47 Assessing Nominal Accuracy
Histologic type is an important attribute for performance assessment.
K x K* tables for nominal types: WSI vs. reference and OM vs. reference.
Example using Breast IC histologic types.
* K is the number of types of responses

48 Assessing Nominal Accuracy: K x K Tables
[K x K tables comparing WSI and OM calls against the reference diagnosis]
NIC: non-invasive carcinoma; DCIS: ductal carcinoma in situ; C,ND: carcinoma, not determined.

49 Assessing Nominal Accuracy (cont.)
Can use percent "correct" calls for each of the K types.
If K is large, then a large N is needed to power the estimates.
Can also reduce K by combining categories into subgroups.
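The per-category percent-correct summary described above might be computed as follows. The K x K table is entirely hypothetical (invented counts for illustration); rows are the reference diagnosis and columns the WSI call, so the diagonal holds the "correct" calls.

```python
# Hypothetical K x K table: rows = reference diagnosis, columns = WSI call.
categories = ["NIC", "DCIS", "LCIS", "Other", "C,ND"]
table = [
    [40, 3, 1, 0, 1],
    [4, 55, 2, 1, 0],
    [1, 3, 28, 0, 1],
    [0, 1, 0, 12, 2],
    [2, 0, 1, 1, 9],
]

rates = {}
for i, cat in enumerate(categories):
    row_total = sum(table[i])   # all specimens with this reference diagnosis
    rates[cat] = table[i][i] / row_total
    print(f"{cat}: {table[i][i]}/{row_total} = {rates[cat]:.1%} correct calls")
```

Note how the rarer categories ("Other", "C,ND") have small row totals: exactly the large-K, small-N power problem the slide warns about.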

50 Assessing Nominal Accuracy (cont.)
If ordinal subgroups are possible, ordinal analyses can be used.
It may also be possible to define differences between categories in terms of clinical importance; this could reduce table size and create ordinal categories.
However, the loss of information should be considered when combining categories.

51 Problems with Kappa and Overall Agreement
Not good as primary descriptive measures: they summarize a K x K table by a single number, a severe reduction in information.
They depend on prevalence: agreement between WSI and OM can change by changing the proportion of diseased and non-diseased subjects (column totals).
"Not good for tests whose results are functions of reader variability."*
* Obuchowski (2001), Stat. Med.

52 Problems with Kappa and Overall Agreement (cont.)
[Two example 2 x 2 tables with identical sensitivity and specificity but different prevalence]
Table 1: kappa = [value not recovered]; overall agreement = 91.6%; sensitivity = 40%; specificity = 94.4%
Table 2: kappa = 0.29; overall agreement = 89%; sensitivity = 40%; specificity = 94.4%
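The prevalence dependence of kappa can be demonstrated numerically. The sketch below uses illustrative counts, not the tables from the slide: both hypothetical 2 x 2 tables have sensitivity 40% and specificity 90%, yet kappa shifts when only the prevalence changes.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 agreement table."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                       # observed agreement
    p_truth = (tp + fn) / n                  # marginal: reference positive
    p_test = (tp + fp) / n                   # marginal: test positive
    pe = p_truth * p_test + (1 - p_truth) * (1 - p_test)  # chance agreement
    return (po - pe) / (1 - pe)

# Two hypothetical tables, each with sensitivity 40% and specificity 90%,
# differing only in disease prevalence.
high_prev = dict(tp=200, fn=300, fp=50, tn=450)   # prevalence 50%
low_prev = dict(tp=40, fn=60, fp=90, tn=810)      # prevalence 10%

print(f"kappa at 50% prevalence = {cohen_kappa(**high_prev):.3f}")
print(f"kappa at 10% prevalence = {cohen_kappa(**low_prev):.3f}")
```

Sensitivity and specificity are unchanged between the two tables, so any shift in kappa is driven purely by the case mix, which is the slide's objection to kappa as a primary measure.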

53 Hypotheses and Study Success Criterion
Regulatory decisions on WSI are based on WSI performance in comparison to OM.
What hypotheses are appropriate on "primary" and "secondary" measurements? Superiority? Non-inferiority?
What are acceptable definitions of study success criteria?

54 Study Sizing
The study success criterion must be met.
The study is typically sized to provide adequate power for the hypotheses underlying the study success criterion.
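As a rough illustration of sizing for an agreement endpoint, the normal-approximation sample size for a one-sided, one-proportion test can be sketched as below. The 95% expected agreement and 85% threshold are hypothetical numbers chosen for illustration; an actual submission would use pre-specified hypotheses and possibly exact methods.

```python
import math

Z = {"alpha_0.025": 1.95996, "power_0.80": 0.84162}  # standard normal quantiles

def n_for_agreement(p_expected, p_threshold):
    """Approximate n needed to show the agreement rate exceeds a threshold
    (one-sided one-proportion test, normal approximation,
    one-sided alpha = 0.025, power = 80%)."""
    za, zb = Z["alpha_0.025"], Z["power_0.80"]
    num = (za * math.sqrt(p_threshold * (1 - p_threshold))
           + zb * math.sqrt(p_expected * (1 - p_expected)))
    return math.ceil((num / (p_expected - p_threshold)) ** 2)

# Hypothetical: expect 95% WSI/OM agreement, rule out agreement <= 85%.
print(n_for_agreement(0.95, 0.85))  # -> 79 specimens
```

Halving the margin (threshold 90% instead of 85%) roughly quadruples the required n, which is why the choice of success criterion drives study size.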

55 Precision Studies

56 Precision Studies: Definition
CLSI* definition of precision: "measure of closeness of agreement between independent test/measurement results obtained under stipulated conditions."
Precision studies assess variability in WSI measurements when changes are made to important factors (sources of variability).
Repeatability and reproducibility are considered the extreme measures of precision.
* CLSI: Clinical and Laboratory Standards Institute

57 Precision Studies: Definition (cont.)
Repeatability: imprecision of measurements made under the same conditions (same pathologist, scanner, ...).
Reproducibility: imprecision of measurements made when conditions are varied to the "largest" extent (different pathologists, scanners, laboratories, ...).
Multiple studies (varying or fixing various factors) can be used to cover the range of precision measurements.

58 Precision Studies: Issues
What factors should be included in a precision study?
What specimens should be used for WSI precision studies?
Is representation of all tissue types needed?
Is representation of all potential specimen types (e.g., needle biopsies, resections, ...) needed?

59 Precision Studies: Issues (cont.)
Issues shared with the clinical study:
Sample selection
Measurements to be assessed
Study sizing

60 Precision Studies: Non-continuous Measurements
Methods exist for precision assessment of continuous measurements, but there is no uniform agreement on methods for non-continuous (ordinal, qualitative) measurements.

61 Precision Study Example
A precision study to characterize imprecision of histologic grade measurements by WSI across scanners.
3 scanners, 2 pathologists, single site.
Each pathologist does 60 reads (3 scans of, e.g., 20 slides) with washout.
Order of scans and order of de-identified slides is randomized.
Positive and negative "correct" call rates across scanners can be used to characterize imprecision.
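The per-scanner "correct" call rates mentioned above could be tabulated along these lines. The slide IDs, grades, and calls below are invented for illustration only.

```python
# Hypothetical replicate reads: the reference grade per slide and the grade
# called by a study pathologist on each scanner's image (after washout).
reads = [
    # (slide, reference grade, {scanner: called grade})
    ("S01", "G2", {"A": "G2", "B": "G2", "C": "G3"}),
    ("S02", "G1", {"A": "G1", "B": "G1", "C": "G1"}),
    ("S03", "G3", {"A": "G3", "B": "G2", "C": "G3"}),
    ("S04", "G2", {"A": "G2", "B": "G2", "C": "G2"}),
]

rates = {}
for scanner in ("A", "B", "C"):
    correct = sum(1 for _, ref, calls in reads if calls[scanner] == ref)
    rates[scanner] = correct / len(reads)
    print(f"scanner {scanner}: {correct}/{len(reads)} correct calls "
          f"({rates[scanner]:.0%})")
```

Differences among the per-scanner rates (here, scanner A vs. B and C) are what the precision study would characterize as the scanner component of imprecision.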

62 Summary
Intended use drives the studies needed for approval.
Study design is critical.
Differences in measurements for different tissue types and specimen collection procedures complicate assessment.
A collection of studies would probably be needed to assess different aspects of the device.
Comparative performance on all critical measurements should be evaluated.
Adequate sizing is important.

