Bioinformatics Laboratory. Objectives: understand the approaches by which microarray technology can be used to identify: – Markers.

1 Bioinformatics Laboratory

2 Objectives Understand the approaches by which microarray technology can be used to identify: – Prognostic/diagnostic markers of disease

3 Example We will examine how molecular markers of disease are identified by dissecting the approach presented in: PNAS 2005, 102: PNAS 2007, 104:

4 The biological question
Huntington’s disease (HD) is an autosomal dominant disorder caused by an expansion of glutamine repeats in the ubiquitously distributed huntingtin protein. Mutant huntingtin interferes with the function of widely expressed transcription factors, suggesting that gene expression may be altered in a variety of tissues in HD, including peripheral blood.
Highly quantitative biomarkers of neurodegenerative disease remain an important need in the urgent quest for disease-modifying therapies. For HD, a genetic test is available (trait marker), but the necessary state markers are still in development.
Tested hypothesis:
– Two studies exist:
Borovecki et al.: detecting biomarkers by profiling whole blood from HD patients (hd), pre-HD patients (pre) and normal donors (n).
Runne et al.: detecting biomarkers by profiling lymphocytes from HD patients (hd) and normal donors (n).
– Is it possible to identify disease biomarkers using these data sets?

5 Experimental groups
Borovecki:
– HD group: 12 HD-affected (stage I-II) subjects and 5 early presymptomatic carriers of the gene mutation, as determined by genetic testing
– Normal group: 14 healthy control subjects
– Platform: Affymetrix hgu133a
Runne:
– HD group: 12 moderate-stage HD subjects
– Normal group: 10 healthy control subjects
– Platform: Affymetrix hgu133plus2

6 Experimental design

7 Recognition and statement of the problem The problem should be specified in enough detail, and the conditions under which the experiment will be performed should be understood, so that the appropriate experimental design can be selected.

8 Example We are investigating the effect of a drug, measured by BrdU incorporation, at three concentrations (10 nM, 100 nM, 1 µM) on 3 different tumor cell lines (CL). In this example there are two factors: – CL, a qualitative factor with 3 levels – Drug concentration, a quantitative factor with 3 levels

9 Identify the factors involved in the Borovecki study
The study consists of: – HD patients, pre-HD patients and donors
How many factors are involved? – 1
Which ones? – Patients
Are the factors quantitative or qualitative? – Qualitative
How many levels are there? – 3 (HD, preHD, N)
[Table: factor "patients" with levels HD, Pre HD, N]

10 How can I obtain the experimental data? Recently, for a paper to be accepted by an international journal, the data must be deposited in a public database: – Europe: ArrayExpress – USA: GEO

14 The data can be downloaded: 1. in an Excel-like (tab-delimited) format containing all the information about the experiment 2. as the array images (in this case the Affymetrix .CEL files)

15 Header Matrix series file

16 Affymetrix GeneChips

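17 Probe set (Affymetrix) [Figure: a gene sequence is covered by a probe set; each probe pair consists of a PM cell and an MM cell. Probe sequences shown, labelled PM and MM: ACCAGATCTGTAGTCCATGCGATGC and ACCAGATCTGTAATCCATGCGATGC]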
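The relation between the two probes of a pair can be checked directly on the two sequences printed on the slide. The sketch below is only an illustration: the function name is hypothetical, and it assumes the first sequence is the PM probe and the second the MM probe.

```python
def mismatch_positions(pm, mm):
    """Return the 1-based positions at which two equal-length probes differ."""
    if len(pm) != len(mm):
        raise ValueError("PM and MM probes must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(pm, mm)) if a != b]


# Sequences copied from the slide; which one is PM is an assumption.
PM = "ACCAGATCTGTAGTCCATGCGATGC"
MM = "ACCAGATCTGTAATCCATGCGATGC"
positions = mismatch_positions(PM, MM)
```

For these two 25-mers the only difference is the central base (position 13).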

18 Dedicated software is needed to analyze microarray data Microarray data cannot be analyzed with a simple Excel spreadsheet; they require fairly sophisticated statistical tools. Both commercial and open-source software exist. In this course the exercises will use an open-source package: – Bioconductor

19 Analysis pipeline
Platform-specific devices: sample preparation, array fabrication, hybridization, scanning + image analysis
Bioconductor: quality control, normalization, filtering, statistical analysis, annotation, biological knowledge extraction

20 How do we start analyzing the data? If the .CEL files are available, a thorough quality control is performed. If the .CEL files are missing and only the matrix series file is available, a more limited set of quality controls can be performed.

21 Analysis pipeline: quality control, normalization, filtering, statistical analysis, annotation, biological knowledge extraction

22 Why do we perform quality controls (QC)? QC is a very important step in the analysis of microarray data, because the number of available experiments is usually limited and one or more arrays with a high number of experimental artifacts could invalidate the analysis. QC identifies outlier arrays and lets the researcher decide whether or not they need to be removed.

23 Quality control to identify outlier arrays When only the matrix series file (MSF) is available, outlier arrays are sought by inspecting: box plots of the intensity distributions of the individual arrays.
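A box plot is a drawing of the five-number summary of each array, so the same check can be sketched numerically. The example below is a minimal illustration with hypothetical function names and toy values; the rule "IQR below half of the typical IQR" is an arbitrary assumption, not a threshold taken from the course.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


def flag_narrow_arrays(arrays, ratio=0.5):
    """Names of arrays whose IQR is below `ratio` times the median IQR."""
    iqrs = {}
    for name, values in arrays.items():
        summary = five_number_summary(values)
        iqrs[name] = summary["q3"] - summary["q1"]
    typical = statistics.median(iqrs.values())
    return sorted(name for name, iqr in iqrs.items() if iqr < ratio * typical)


# Toy log2 intensities for three arrays (invented values).
toy_arrays = {
    "s1": [4.0, 6.0, 8.0, 10.0, 12.0],
    "s2": [5.0, 6.0, 8.0, 11.0, 12.0],
    "s3": [7.5, 7.8, 8.0, 8.2, 8.5],   # suspiciously narrow distribution
}
narrow = flag_narrow_arrays(toy_arrays)
```

An array flagged this way is only a candidate outlier; the decision to remove it remains with the researcher.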

26 Quality control to assess the homogeneity of the experimental groups – Principal component analysis – Hierarchical clustering

27 Principal component analysis Principal component analysis (PCA) is a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible. Each succeeding component accounts for as much of the remaining variability as possible. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

28 PCA [Figure: PCA plot with axes PCA1 and PCA2]

29 The 2nd PC is orthogonal to the 1st. In general the first three components account for a large share of the variability. Therefore, the PCA result can reasonably be represented in a 3D space.
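The properties listed for PCA (components ordered by explained variability, second component orthogonal to the first) can be verified on a small example. This is a sketch only: it uses power iteration on the covariance matrix with plain Python lists, which is not how oneChannelGUI or Bioconductor compute PCA, and all function names and data values are invented for illustration.

```python
import math


def center(rows):
    """Subtract each column (variable) mean from every row (sample)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def dominant_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and its unit eigenvector by power iteration."""
    p = len(matrix)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(a * b for a, b in zip(v, mv)), v


def first_two_components(rows):
    """Return (variance fraction, direction) for PC1 and PC2."""
    cov = covariance(center(rows))
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    l1, v1 = dominant_eigenpair(cov)
    # Deflation: remove the PC1 contribution, then search again.
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(p)]
                for i in range(p)]
    l2, v2 = dominant_eigenpair(deflated)
    return (l1 / total, v1), (l2 / total, v2)


# Five toy samples measured on three strongly correlated variables.
toy_samples = [[2.0, 1.9, 0.1], [4.0, 4.1, -0.1], [6.0, 6.2, 0.0],
               [8.0, 7.8, 0.2], [10.0, 10.0, -0.2]]
(frac1, pc1), (frac2, pc2) = first_two_components(toy_samples)
```

In this toy data set almost all the variability lies along one direction, so the first component alone explains most of it.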

31 Hierarchical Clustering (HCL) HCL is an agglomerative/divisive clustering method. The iterative process continues until all groups are connected in a hierarchical tree.

32 Hierarchical Clustering (agglomerative) [Figure: samples s1 to s8 are merged step by step] s1 is most like s8; s4 is most like {s1, s8}. Modified from the TMEV presentation (www.tigr.org)

33 Hierarchical Clustering [Figure: the merging of samples continues] s5 is most like s7; {s5, s7} is most like {s1, s4, s8}. Modified from the TMEV presentation (www.tigr.org)

34 Hierarchical Tree [Figure: final tree with leaf order s6, s1, s8, s4, s5, s7, s2, s3] Modified from the TMEV presentation (www.tigr.org)

35 Hierarchical Clustering During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods.

36 Agglomerative Linkage Methods Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. Three linkage methods that are commonly used are: – Single Linkage – Average Linkage – Complete Linkage Modified from the TMEV presentation (www.tigr.org)
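The three linkage methods differ only in how the pairwise distances between the members of two clusters are summarized. The following sketch assumes Euclidean distance between samples and uses hypothetical names; it shows the linkage rule, not a full clustering implementation.

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters of points under a linkage rule."""
    distances = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":      # closest pair of members
        return min(distances)
    if method == "complete":    # farthest pair of members
        return max(distances)
    if method == "average":     # mean over all pairs of members
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage method: {method}")


# Two toy clusters of 2-D points (invented values).
cluster_1 = [(0.0, 0.0), (1.0, 0.0)]
cluster_2 = [(3.0, 0.0), (5.0, 0.0)]
single = linkage_distance(cluster_1, cluster_2, "single")
complete = linkage_distance(cluster_1, cluster_2, "complete")
average = linkage_distance(cluster_1, cluster_2, "average")
```

At every agglomerative step the pair of clusters with the smallest linkage distance is joined, so the choice of rule can change the shape of the resulting tree.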

37 t4 is clearly an outlier!

38 Exercise Use the target file target.GSE8762.classif.txt and the file esperimental.design.names.gse8762.txt to evaluate with PCA the behaviour of the factors disease status and gender in the data set under study.

39 Exercise Open R. Load oneChannelGUI. Start a new project: – Change the working dir to dataset.huntington – Load the target file – Set the project name to: ronne

40 Exercise Starting from the data set you have loaded: – check the data box plots. Answer the following questions: – Is there any array characterized by a very narrow probe intensity distribution? YES (which? …………………………….) NO – Is there any array which is significantly different from the others? YES (which? …………………………….) NO

41 Exercise Inspect whether the experimental groups of our ronne data set (HD, N) are relatively homogeneous using PCA and hierarchical clustering. Is it easy to discriminate on the basis of disease status? – Yes – No

42 Analysis pipeline: quality control, normalization, filtering, statistical analysis, annotation, biological knowledge extraction

43 Combining the data of the individual probes into a single value for the probe set Analysis steps: – Calculating probe set summaries: RMA, GCRMA – Normalization: quantile method. FLUORESCENCE INTENSITY IS EXPRESSED AS LOG2(INTENSITY)
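The idea behind the quantile method is to force every array to share the same intensity distribution. The sketch below is a naive illustration in plain Python: the function name is hypothetical, ties are not handled specially, and it is not the Bioconductor implementation used in the exercises.

```python
def quantile_normalize(arrays):
    """Give all arrays the same distribution (mean of the sorted values).

    `arrays` is a list of equal-length lists, one list per array.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays at each rank.
    rank_means = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two toy arrays with three probes each (invented log2 values).
toy_normalized = quantile_normalize([[2.0, 4.0, 6.0], [1.0, 9.0, 5.0]])
```

After normalization each probe keeps its rank within its own array, but the sorted values of all arrays are identical.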

44 Brief summary of probe set intensity calculation The RMA methodology (Irizarry et al., 2003) performs background correction, normalization, and summarization in a modular way. RMA does not take non-specific probe hybridization into account in the probe set background calculation. GCRMA is a version of RMA with a background correction component that makes use of probe sequence information (Wu et al., 2004).

45 Why Normalization? To remove systematic biases, which include: – Sample preparation – Variability in hybridization – Spatial effects – Scanner settings – Experimenter bias. Extracted from D. Hyle presentation

46 Analysis pipeline: quality control, normalization, filtering, statistical analysis, annotation, biological knowledge extraction

47 Multiple testing errors When performing multiple statistical tests, two types of errors can occur: – Type I error (false positive) – Type II error (false negative). Reducing type I errors increases the number of type II errors. It is important to identify an approach that reduces false positives with the minimum loss of information (false negatives).
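The size of the multiple testing problem can be made concrete with simple arithmetic. The sketch below assumes 22300 probe sets (the unfiltered count quoted on the filtering slides) and a per-test threshold of 0.05; the function names are hypothetical, and the Bonferroni correction is shown only as the most conservative reference point.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of type I errors if no gene is truly changed."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold keeping the family-wise error rate at alpha."""
    return alpha / n_tests


n_probe_sets = 22300
false_calls = expected_false_positives(n_probe_sets, 0.05)
strict_alpha = bonferroni_threshold(n_probe_sets, 0.05)
```

Testing every probe set at 0.05 would be expected to yield over a thousand false positives, while the Bonferroni threshold is so small that many true changes would be missed. Reducing the number of tests by filtering is one way to ease this trade-off.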

48 Filtering can be performed at various levels: Annotation features: – Specific gene features (e.g. GO term, presence of transcriptional regulatory elements in promoters, etc.) Signal features: – % of intensities greater than a user-defined value – Interquartile range (IQR) greater than a defined value
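The two signal-based filters can be sketched as predicates applied to the log2 intensities of one probe set across all arrays. The functions below are hypothetical stand-ins written in plain Python in the spirit of the genefilter filters; they are not the genefilter API, and the toy values are invented.

```python
import math
import statistics


def passes_intensity_filter(values, fraction, threshold):
    """True if at least `fraction` of the arrays reach `threshold`."""
    over = sum(1 for v in values if v >= threshold)
    return over / len(values) >= fraction


def passes_iqr_filter(values, min_iqr):
    """True if the interquartile range across arrays exceeds `min_iqr`."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) > min_iqr


# Toy log2 intensities for two probe sets over four arrays.
toy_probe_sets = {
    "flat_low": [5.0, 5.1, 5.0, 5.2],   # near background, hardly changing
    "variable": [7.0, 9.0, 7.5, 9.5],   # expressed and changing
}
kept = [name for name, values in toy_probe_sets.items()
        if passes_intensity_filter(values, 0.25, math.log2(100))
        and passes_iqr_filter(values, 0.25)]
```

Both filters look only at the signal, never at the group labels, so they do not bias the later statistical test toward a particular comparison.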

49 Intensity distributions [Figure: RMA and GCRMA intensity distributions, with the background-level probe sets indicated]

50 How can the efficacy of a filtering procedure be defined? The enrichment measure is very similar to the one used to evaluate the fold purification of a protein after a chromatographic step.

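51 Filtering by genefilter pOverA (keep a probe set if ≥ 25% of its arrays have intensities ≥ log2(100)). Unfiltered: 22300 probe sets, enrichment 100%. Filtered: /42 SpikeIn probe sets retained (numerator lost in extraction), enrichment 401%.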

52 Filtering by interquartile range (IQR) [Figure: the IQR spans the distribution between the 25% and 75% quantiles]

53 How does filtering by genefilter IQR work? The distribution of all intensity values in a differential expression experiment is the combination of the distributions of each gene's expression across the experimental conditions.

54 How does filtering by IQR work? The filter removes genes that show little change across the experimental points.

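55 Filtering by genefilter IQR (removing probe sets whose intensity IQR is under the threshold; thresholds 0.25 and 0.5)
– Unfiltered: 22300 probe sets, enrichment 100%
– IQR 0.25: 244 probe sets, 42/42 SpikeIn retained, enrichment 9139%
– IQR 0.5: 68 probe sets, 42/42 SpikeIn retained, enrichment 32794%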
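The enrichment formula itself did not survive in the transcript. The sketch below uses a definition inferred from the numbers quoted on the slides (42 spike-in probe sets among 22300 before filtering, all 42 retained among 68 or 244 probe sets after IQR filtering); this definition reproduces the reported percentages but should be read as an assumption, and the function name is hypothetical.

```python
def enrichment_percent(spikes_kept, probe_sets_kept, spikes_total, probe_sets_total):
    """Spike-in fraction after filtering relative to the fraction before, in %."""
    fraction_before = spikes_total / probe_sets_total
    fraction_after = spikes_kept / probe_sets_kept
    return 100.0 * fraction_after / fraction_before


unfiltered = enrichment_percent(42, 22300, 42, 22300)
iqr_025 = enrichment_percent(42, 244, 42, 22300)
iqr_050 = enrichment_percent(42, 68, 42, 22300)
```

As with fold purification of a protein, the measure rewards a filter that discards most of the uninformative probe sets while retaining the known true signals.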

56 Exercise Load the Borovecki data starting from the matrix series file using: – target.GSE1751.hd.n.txt
Evaluate with PCA/HCL how the samples separate.
Apply an interquartile filter at 0.25 and at 0.5. – How many transcripts remain after each filter? – Are the two groups of data still separable with PCA and HCL?
Apply an interquartile filter at 0.5 and an intensity filter of 50% > 100. – What happens to the data distribution? – Are the data still separable with PCA and HCL?

57 Analysis pipeline: quality control, normalization, filtering, statistical analysis, annotation, biological knowledge extraction

58 Statistical analysis The sensitivity of statistical tests is affected by the number of available replicates. Replicates can be: – Technical – Biological. Biological replicates better summarize the variability of samples belonging to a common group. The minimum number of replicates is an important issue!

59 Fold change filtering The intensity change between experimental groups (e.g. control versus treated) is known as: – Fold change. Frequently an arbitrary threshold is used to define a significant differential expression.

60 Statistical analysis Intensity changes between experimental groups (e.g. control versus treated) are known as: – Fold change. – Ranking genes based on fold change alone implicitly assigns equal variance to every gene. Fold change alone is not sufficient to indicate the significance of the expression changes. Fold change has to be supported by statistical information.
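Why fold change needs statistical support can be seen with two genes that have the same fold change but very different variability. The sketch below works on log2 intensities, where the log2 fold change is a difference of group means, and uses a Welch-type t-statistic purely as an example of "statistical information"; the names and values are invented.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Difference of group means of log2 intensities."""
    return mean(treated) - mean(control)


def welch_t(control, treated):
    """t-statistic that scales the mean difference by its standard error."""
    standard_error = math.sqrt(variance(control) / len(control)
                               + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / standard_error


# Two toy genes with the same fold change and different noise.
stable_control, stable_treated = [7.0, 7.1, 6.9], [8.0, 8.1, 7.9]
noisy_control, noisy_treated = [5.0, 7.0, 9.0], [6.0, 8.0, 10.0]

lfc_stable = log2_fold_change(stable_control, stable_treated)
lfc_noisy = log2_fold_change(noisy_control, noisy_treated)
t_stable = welch_t(stable_control, stable_treated)
t_noisy = welch_t(noisy_control, noisy_treated)
```

Both genes show a two-fold change (log2 fold change of 1), yet only the stable gene has a large t-statistic; ranking by fold change alone would treat them as equally convincing.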

61 Statistical validation Statistical validation can be performed using parametric and non-parametric tests. Parametric tests: – The populations under analysis are assumed to be normally distributed. Non-parametric tests: – No assumption is made about the distribution of the samples. Non-parametric tests are generally less sensitive than parametric tests.

62 Selecting differentially expressed genes [Figure: the set of genes whose differential expression is linked to a specific biological event, partially overlapped by the results of statistical validation methods I, II and III]

63 Selecting differentially expressed genes Each method grasps some true signals but not all. Each method catches some false signals. The trick is to find the best condition to maximize true signals while minimizing fakes.

64 SAM: Significance Analysis of Microarrays

65 SAM (Significance analysis of microarrays) (Tusher et al. 2001) The fudge factor regularizes the t-statistic by inflating the denominator. s(i) is the pooled standard deviation, taking into account differing gene-specific variation across arrays.
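The formula on this slide is not present in the transcript. As described by Tusher et al. (2001), the two-class statistic has the form d(i) = (mean1(i) - mean2(i)) / (s(i) + s0), where s0 is the fudge factor. The sketch below computes it for one gene; the function name is hypothetical, and s0 is passed in as a fixed number for illustration, whereas SAM chooses it from the data.

```python
import math
from statistics import mean


def sam_d_statistic(group1, group2, s0):
    """Regularized t-like statistic for one gene (two-class, unpaired)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    sum_of_squares = (sum((x - m1) ** 2 for x in group1)
                      + sum((x - m2) ** 2 for x in group2))
    a = (1 / n1 + 1 / n2) / (n1 + n2 - 2)
    s = math.sqrt(a * sum_of_squares)   # pooled, gene-specific scatter s(i)
    return (m1 - m2) / (s + s0)         # s0 inflates the denominator


# Toy gene with a tiny difference and almost no variance (invented values).
quiet_1, quiet_2 = [5.02, 5.03, 5.01], [5.00, 5.01, 4.99]
d_plain = sam_d_statistic(quiet_1, quiet_2, s0=0.0)   # ordinary t-statistic
d_sam = sam_d_statistic(quiet_1, quiet_2, s0=0.5)     # regularized
```

Without the fudge factor a gene with a negligible difference but very small variance obtains a large statistic; adding s0 to the denominator pulls such genes back toward zero.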

66 SAM design in oneChannelGUI Two-class unpaired: picks out genes whose mean expression level is significantly different between two groups of samples (analogous to a between-subjects t-test).

67 SAM uses permutations of the data to define a set of significantly differentially expressed genes. [Figure: permutations of the N and T sample labels]

68 FDR is given by p0 * False / Called, where p0 is the prior probability (pi0) that a gene is not differentially expressed.

69 How does SAM calculate the False Discovery Rate for a specific delta? [Figure: table of falsely called genes in permutations 1 to 4 with a "Mean false" row; extracted values: 720]
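The relation FDR = p0 * False / Called can be sketched with the number of genes called in each label permutation. The example follows the slide in averaging the false calls over permutations; the function name and the counts are invented, and pi0 is set to 1 here as a conservative assumption.

```python
from statistics import mean


def estimate_fdr(false_calls_per_permutation, called, pi0=1.0):
    """pi0 * (mean number of false calls) / (genes called in the real data)."""
    if called == 0:
        return 0.0
    return min(1.0, pi0 * mean(false_calls_per_permutation) / called)


# Four permutations of the sample labels at one value of delta (toy counts).
toy_false_calls = [8, 6, 10, 4]
toy_fdr = estimate_fdr(toy_false_calls, called=70)
```

With a mean of 7 false calls against 70 genes called in the real data, the estimated FDR is 10%. Raising delta would typically lower both numbers.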

70 Rank Product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments. It is based on the assumption that, under the null hypothesis that the order of all items is random, the probability of finding a specific item among the top r of n items in a list is p = r/n.

71 Multiplying these probabilities leads to the definition of the rank product: $RP = \prod_i (r_i / n_i)$, where $r_i$ is the rank of the item in the i-th list and $n_i$ is the total number of items in the i-th list. The smaller the RP value, the smaller the probability that the observed placement of the item at the top of the lists is due to chance.

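72 [Figure: rank product for two arrays A and B with 7 genes each, for example (1/7)*(2/7) for a gene ranked 1st in A and 2nd in B]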
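The rank product is a product of rank fractions r/n, one per list. The sketch below uses exact fractions so that the worked example with two arrays of 7 genes can be checked by hand; the function name is hypothetical.

```python
from fractions import Fraction


def rank_product(ranks, list_sizes):
    """Product of r_i / n_i over all lists for one gene."""
    result = Fraction(1)
    for rank, size in zip(ranks, list_sizes):
        result *= Fraction(rank, size)
    return result


# Gene ranked 1st of 7 in array A and 2nd of 7 in array B.
top_gene = rank_product([1, 2], [7, 7])
# Gene ranked last in both arrays, for comparison.
bottom_gene = rank_product([7, 7], [7, 7])
```

The consistently top-ranked gene obtains 2/49 (about 0.04), while a gene ranked last in both arrays obtains 1, the largest possible value.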

74 Permuting the genes in the two arrays [Figure: arrays A and B are each permuted (a1, a2 and b1, b2), giving the combinations a1b1, a1b2, a2b1, a2b2]

76 [Figure: rank products AB computed for the four permuted combinations a1b1, a1b2, a2b1, a2b2]

77 ( )/(4*7) = 0; ( )/(4*7) = 0.10; ( )/(4*7) = 0.14

78 AB: [( )/4]/( ) = 0; [( )/4]/( ) = 0.25; [( )/4]/( ) = 0.25. Significantly differentially expressed genes!

79 Exercise Load the Borovecki data starting from the matrix series file using: – Create the target.GSE8762.gender.txt
Evaluate with PCA/HCL how the samples separate.
Apply an interquartile filter at 0.5.
Use SAM (FDR < 10%) to identify a set of differentially expressed genes.

