# Data quality, combining multiple data sources, and distribution fitting: selected points and illustrations!

Dr Marion Wooldridge, 26th July 2002


Outline of talk

- Data quality
  - what do we need?
- Combining multiple data sources
  - when can we do it?
  - an example explained
- Fitting distributions
  - how do we do it?
  - an example explained

What do we mean by data?

- Experimental results
  - numerical
- Field data
  - numerical
- Information on pathways/processes
  - words and numbers
- Expert opinion
  - words to convert to numbers?!

Why do we need data?

- Model construction
  - risk pathways
    - what actually happens?
    - do batches get mixed?
    - is the product heated? etc.
- Model inputs
  - estimate probabilities
  - estimate uncertainties
  - estimate variability

Which data should we use?

- The best available!
- However…
  - a risk assessment is pragmatic
  - a risk assessment is ‘iterative’
  - there are always data deficiencies
  - these are often ‘crucial’
- So what does this tell us?

What do we mean by pragmatic?

- Purpose of risk assessment…
  - To give useful info to risk manager
  - to aid decisions, usually in short term
- This leads to time constraints
- This generally means using currently available data for decisions ‘now’
  - however incomplete or uncertain that data

What do we mean by ‘iterative’?

- Often several stages, with different purposes
- ‘Is there really a problem?’
  - rapid preliminary assessment
- ‘OK, so give me the answer’
  - refine assessment
- ‘Did we get it right?’
  - revisit assessment if/when new data

The minimum data quality required…

- will be different at each stage
- so - why is that?

‘Is there really a problem?’

- Risk manager needs rapid answer
  - ‘preliminary’ risk assessment
  - identify data rapidly available
    - may be incomplete, anecdotal, old, from a different country, about a different strain, from a different species, etc.
- Still allows decision to be made
  - a problem…
    - is highly likely - do more now!
    - may develop - keep watching brief
    - is highly unlikely - but never zero risk!

Stage 1: Conclusion...

- Sometimes, poor quality data may be useful data
  - is it ‘fit-for-purpose’?
- Don’t just throw it out!

‘OK, so give me the answer’

- refine risk assessment
  - set up model
    - data on risk pathways
    - model input data
  - utilise ‘best’ data found, identify ‘real’ gaps & uncertainties
  - elicit expert opinions
- data should be ‘best’ currently available
  - with time allowed to search all reasonable sources
  - still often incomplete and uncertain
- checked by peer review

‘So the risk is…’

- The best currently available data gives…
  - the best currently available estimate
- Which is…
  - the best the risk manager can ever have ‘now’!
- Even if…
  - it is still based on data gaps and uncertainties
    - which it will be!
  - the ‘best’ data may still be poor data
- allows targeted future data collection - may be lengthy

Stage 2: Conclusion…

- If a choice
  - Need to identify the best data available
  - What makes it the best data?
- What do we do if we can’t decide?
  - Multiple data sources for model!
- What do we do with crucial data gaps?
  - Often need expert opinion
  - How do we turn that into a number?

‘Did we get it right?’

- revisit assessment if/when new data available
- targeted data collection
  - may be years away…
  - but allows quality to be specified
    - not just ‘best’ but ‘good’
    - will minimise uncertainty
    - should describe variability

What makes data ‘good’?

- Having said all that - some data is ‘good’!
- Two aspects
  - Intrinsic features of the data
    - universally essential for high quality
  - Applicability to current situation
    - data selection affects quality of the assessment and output
- General principles…

High quality data should be…

- Complete
  - for assessment in hand
  - required level of detail
- Relevant
  - e.g. right bug, country, management system, date etc.
  - nothing irrelevant
- Succinct and transparent
- Fully referenced
- Presented in a logical sequence

High quality data should have…

- Full information on data source or provenance
  - Full reference if paper/similar
  - Name if pers. comm. or unpublished
  - Copy of web-page or other ephemeral source
  - Date of data collection/elicitation
  - Affiliation and/or funding source of provider

High quality data should have…

- A description of the level of uncertainty and the amount of variability
  - uncertainty = lack of knowledge
  - variability = real life situation
- Units
  - where appropriate
- Raw numerical data
  - where appropriate

High quality study data should have…

- Detailed information on study design: e.g.
  - Experimental or field based
  - Details of sample and sampling frame including:
    - livestock species (inc. scientific name)/product definition/population sub-group definition
    - source (country, region, producer, retailer, etc.)
    - sample selection method (esp. for infection: clinical cases or random selection)
  - Population size
  - Season of collection
  - Portion description or size
  - Method of sample collection

High quality microbiological data should have…

- Information on microbiological methods including:
  - Pathogen species, subspecies, strain
    - may include antibiotic resistance pattern
  - tests used, including any variation from published methods
  - sensitivity and specificity of tests
  - units used
  - precision of measurement

High quality numerical data should have…

- Results as raw data
- Including:
  - Number tested
  - results given for all samples tested
  - for pathogens, number of micro-organisms
    - not just positive/negative

A note on comparability…

- High quality data sets can easily be checked for comparability
  - as they contain sufficient detail
  - are they describing or measuring the same thing?
  - are the levels of uncertainty similar?
- Poor quality data sets are difficult to compare
  - lack detail
  - may never know if describing or measuring same thing!
  - or may be exactly same data!!
  - difficult/unwise to combine!

and on homogeneity…

- homogeneity of input data aids comparability of model output
- homogeneity achieved by, e.g.
  - standardised methods
    - for sampling, testing etc.
  - standard units
  - standard definitions
    - for pathogen, host, product, portion size, etc.

Multiple data sources - why?

- No data specific to the problem - but several studies with some relevance
  - e.g. one old, one a different country etc.
- Several studies directly relevant to the problem - all information needs inclusion
  - e.g. regional prevalence studies available; need national
- Expert opinion - but a number of experts
  - and may need to combine with other data

Multiple data sources… how?

- model all separately
  - ‘what-if’ scenarios
  - several outputs
  - for data inputs and ‘model uncertainty’
- fit single distribution
  - point values
- weight alternatives
  - equally or differentially?
  - expert opinion on weights?
  - sample by weight distribution - single output
  - for data inputs and ‘model uncertainty’

Weighting example

- Overall aim:
  - To estimate probability that a random bird from GB broiler poultry flock will be campylobacter positive at point of slaughter
- Current need:
  - To estimate probability that a random flock is positive
- Data available:
  - Flock positive prevalence data from 4 separate studies
- Ref: Hartnett E, Kelly L, Newell D, Wooldridge M, Gettinby G (2001). The occurrence of campylobacter in broilers at the point of slaughter, a quantitative risk assessment. Epidemiology and Infection 127; 195-206.

Probability flock is positive, $P_{fp}$, per data set

- Raw data used
- $P_{fp}$ per data set
  - ‘standard’ method
  - Beta distribution
    - $r$ = positive flocks
    - $s$ = flocks sampled
    - $P_{fp} = \mathrm{Beta}(r+1,\ s-r+1)$
  - for $P_1$, $P_2$, $P_3$, $P_4$
  - describes uncertainty per data set
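The Beta(r+1, s-r+1) description of uncertainty for a single data set can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names and the flock counts (30 positive out of 50 sampled) are hypothetical and are not taken from the cited study.

```python
import random
import statistics


def sample_flock_prevalence(positive_flocks, flocks_sampled, n_draws=10_000, seed=0):
    """Draw from Beta(r + 1, s - r + 1): uncertainty about P_fp for one data set."""
    if not 0 <= positive_flocks <= flocks_sampled:
        raise ValueError("need 0 <= positive_flocks <= flocks_sampled")
    rng = random.Random(seed)
    alpha = positive_flocks + 1
    beta = flocks_sampled - positive_flocks + 1
    return [rng.betavariate(alpha, beta) for _ in range(n_draws)]


def beta_mean(positive_flocks, flocks_sampled):
    """Analytic mean of Beta(r + 1, s - r + 1), which is (r + 1) / (s + 2)."""
    return (positive_flocks + 1) / (flocks_sampled + 2)


# Hypothetical counts, for illustration only (not from the cited study).
draws = sample_flock_prevalence(positive_flocks=30, flocks_sampled=50)
ordered = sorted(draws)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]
print("simulated mean:", round(statistics.fmean(draws), 3))
print("approximate 95% interval:", round(lower, 3), "to", round(upper, 3))
```

With few flocks sampled the interval is wide, and it narrows as s grows, which is the sense in which the distribution describes uncertainty per data set.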

Probability random GB flock is positive - weighting method used

- Data sets:
  - 2 x poultry companies
  - 1 x colleagues’ (published) epidemiological study
  - 1 x another published study
- Other data available:
  - Market share by total bird numbers for studies 1, 2, 3
- Assumption made:
  - study 4 was ‘rest of market’
  - approximation
- Weighting method:
  - uses all available info
  - better than ‘equal weighting’

So - weights assigned…

- based on market share per study
  - P1+P2: 35% market share; $w_1 + w_2 = 0.35$
    - combined here as confidential data
  - P3: 50% market share; $w_3 = 0.50$
  - P4: 15% market share; $w_4 = 0.15$
- probability random flock positive, $P_{fp}$
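One way to turn these weights into a single output is to sample a data set in proportion to its weight and then draw from that data set's Beta distribution, in the spirit of "sample by weight distribution - single output". The sketch below is an assumption about how this could be done, not a reproduction of the published model. The weights are those on the slide; the flock counts are hypothetical placeholders, and the two poultry companies are pooled into one component because only their combined weight is given.

```python
import random


def sample_combined_prevalence(studies, n_draws=10_000, seed=0):
    """Mixture sampling; studies is a list of (weight, positive_flocks, flocks_sampled)."""
    weights = [w for w, _, _ in studies]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Pick one data set in proportion to its market-share weight...
        _, r, s = rng.choices(studies, weights=weights, k=1)[0]
        # ...then draw P_fp from that data set's Beta(r + 1, s - r + 1).
        draws.append(rng.betavariate(r + 1, s - r + 1))
    return draws


# Weights from the slide; the flock counts are hypothetical placeholders.
STUDIES = [
    (0.35, 40, 100),  # studies 1 and 2 (poultry companies), pooled (assumption)
    (0.50, 60, 100),  # study 3
    (0.15, 45, 100),  # study 4, assumed to represent 'rest of market'
]

combined = sample_combined_prevalence(STUDIES)
# The mixture mean equals the weighted average of the per-study Beta means.
expected_mean = sum(w * (r + 1) / (s + 2) for w, r, s in STUDIES)
print("simulated mean:", round(sum(combined) / len(combined), 3))
print("weighted mean of Beta means:", round(expected_mean, 3))
```

Replacing the weights with equal values gives the ‘equal weighting’ alternative, so the two approaches can be compared as ‘what-if’ scenarios.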

So - when to combine?

- Depends on info needed
  - in example, for random GB flock
    - must combine
  - but - loses info
    - in example, no probability by source
- Depends on info available
  - and assumptions on relevance/comparability
  - e.g. two studies, 10 years old and 5 years old:
    - if prevalence studies - leave separate? Trend info? ‘What-if’ scenarios better?
    - if effects of heat on organism - will this have altered? Check method, detection efficacy, strain etc. - but age of study per se not relevant
- Assessor’s judgement

Fitting distributions

- Could fill a book (has done!)
- Discrete or continuous?
  - Limited values versus any value (within the range); e.g.
    - number of chickens in flock - discrete (1, 2, 3... etc.)
    - chicken bodyweight - continuous (within range min-max)
- Parametric or non-parametric?
  - Theoretically described versus directly from data; e.g.
    - incubation period - theoretically lognormal - parametric
    - percentage free-range flocks by region - direct from data
- Truncated or not?
  - For unbounded distributions - e.g. incubation period > lifespan?
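The truncation question can be made concrete with a small sketch. An unbounded lognormal incubation period will occasionally produce values longer than the animal's lifespan; simple rejection sampling is one way to truncate it. The parameters (log-scale mean 1.5, log-scale standard deviation 0.8, a 42-day cap) and the function name are hypothetical, chosen only for illustration.

```python
import random


def sample_truncated_lognormal(mu, sigma, upper, rng, max_tries=10_000):
    """Rejection sampling: redraw until the value falls at or below the upper bound."""
    for _ in range(max_tries):
        value = rng.lognormvariate(mu, sigma)
        if value <= upper:
            return value
    raise RuntimeError("upper bound lies too far into the lower tail to sample")


rng = random.Random(0)
# Hypothetical parameters: log-scale mean and sd, and a 42-day lifespan cap.
incubation_days = [
    sample_truncated_lognormal(1.5, 0.8, 42.0, rng) for _ in range(5_000)
]
print("longest sampled incubation period:", round(max(incubation_days), 1), "days")
```

Rejection sampling keeps the shape of the distribution below the bound and discards the impossible tail; it becomes inefficient if the bound cuts off most of the probability mass.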

Where do we start?

- Discrete or continuous?
  - Inspect the data
    - whole numbers = limited values = discrete
- Parametric or non-parametric?
  - Use parametric with care! - usually if:
    - theoretical basis known and appropriate
    - has proved to be accurate even if theory undemonstrated
    - parameters define distribution
  - non-parametric generally more useful/appropriate
- For biological process data
  - for selected distribution - biological plausibility (at least!)
    - gross error check…

Distribution fitting example...

- Aim:
  - To estimate, for a random GB broiler flock at slaughter, its probable age
- Data available:
  - Age at slaughter, in weeks, for a large number of GB broiler flocks
- Ref: Hartnett E (2001). Human infection with Campylobacter spp. from chicken consumption: A quantitative risk assessment. PhD Thesis, Department of Statistics and Modelling Science, University of Strathclyde. 322 pages.
  - Funded by, and work undertaken in, The Risk Research Department, The Veterinary Laboratories Agency, UK.

Sequence of steps: 1. Decisions…

- Discrete or continuous?
  - Inspect data… age given in whole weeks - but:
    - no reason why slaughter should be at specific weekly intervals
    - time is a continuous variable
- Parametric or non-parametric?
  - No theoretical basis
  - non-parametric more appropriate
- Other considerations
  - slaughter for meat
  - shouldn’t be too young (too small) or too old (industry economics)
  - suggests bounded distribution

2. What does the distribution look like?

- Scatter plot drawn…
- Suggests triangular ‘appearance’

3. Is this appropriate?

- Triangular distribution is..
  - continuous
  - non-parametric
  - bounded
- Sounds good so far!
- ‘checked’ in BestFit (Palisade)
  - only AFTER logic considered
  - gave Triang as best fit

4. Check: Rank 1 using Kolmogorov-Smirnov test in BestFit
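The Kolmogorov-Smirnov statistic used for this ranking is the largest gap between the empirical cumulative distribution of the data and the candidate distribution's CDF. The sketch below computes it by hand for a triangular candidate. It is not BestFit's procedure, and the ages and the triangle parameters (4, 7, 10 weeks) are hypothetical.

```python
def triangular_cdf(x, low, mode, high):
    """CDF of the triangular distribution (low <= mode <= high, low < high)."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    if x <= mode:
        return (x - low) ** 2 / ((high - low) * (mode - low))
    return 1.0 - (high - x) ** 2 / ((high - low) * (high - mode))


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a candidate CDF."""
    ordered = sorted(data)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        fitted = cdf(x)
        # The empirical CDF steps from i/n to (i + 1)/n at x; check both sides.
        gap = max(gap, abs((i + 1) / n - fitted), abs(fitted - i / n))
    return gap


# Hypothetical ages at slaughter in weeks, for illustration only.
ages_weeks = [5, 6, 6, 7, 7, 7, 8, 8, 9]
d_stat = ks_statistic(ages_weeks, lambda x: triangular_cdf(x, 4.0, 7.0, 10.0))
print("K-S statistic for Triang(4, 7, 10):", round(d_stat, 3))
```

A smaller statistic means a closer fit, so candidate distributions can be ranked by it. Standard K-S critical values are not strictly valid when the parameters are estimated from the same data, which is one more reason to treat the ranking as a check on a choice already made on logical grounds.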

Conclusion…

- Use Triang in model!
  - And it was…
- Note: This was also an example of combining (point value) data from multiple sources by fitting a single distribution
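Using a triangular distribution as a model input amounts to drawing one age per simulated flock. A minimal sketch with the Python standard library follows; the minimum, most likely and maximum ages (4, 6 and 9 weeks) are hypothetical and are not the fitted values from the thesis.

```python
import random
import statistics


def sample_slaughter_age(n_draws, low, mode, high, seed=0):
    """Draw ages at slaughter (weeks) from a triangular distribution."""
    rng = random.Random(seed)
    # Note the argument order of random.triangular: (low, high, mode).
    return [rng.triangular(low, high, mode) for _ in range(n_draws)]


# Hypothetical bounds and mode, for illustration only.
ages = sample_slaughter_age(10_000, low=4.0, mode=6.0, high=9.0)
print("simulated mean age:", round(statistics.fmean(ages), 2), "weeks")
print("analytic mean age:", round((4.0 + 6.0 + 9.0) / 3, 2), "weeks")
```

Because the distribution is bounded, no simulated flock is slaughtered younger than the minimum or older than the maximum, matching the reasoning about size and industry economics.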

Summary…

- Acceptable data quality
  - requires judgement
- Multiple data sets - to combine or not?
  - requires judgement
- Distribution - the most appropriate?
  - requires judgement
- Risk assessment is art plus science
  - assessor frequently uses judgement
  - and makes assumptions
- Only safeguard: transparency!
