Presentation is loading. Please wait.

Presentation is loading. Please wait.

Knowledge Discovery in Microarray Gene Expression Data Insilicogen Junhyung Park.

Similar presentations


Presentation on theme: "Knowledge Discovery in Microarray Gene Expression Data Insilicogen Junhyung Park."— Presentation transcript:

1

2 Knowledge Discovery in Microarray Gene Expression Data Insilicogen Junhyung Park

3 The Complete Microarray Bioinformatics Solution Data Management Databases Statistical Analysis Image ProcessingAutomation Data Mining Cluster Analysis

4 Microarrays 3 Microarrays work by exploiting the ability of a given mRNA molecule (target) to bind specifically to, or hybridize to, the DNA template (probe) from which it originated. This mechanism acts as both an "on/off" switch to control which genes are expressed in a cell as well as a "volume control" that increases or decreases the level of expression of particular genes as necessary.

5 The miracle of microarray experiments 4 “I think you should be more explicit here in step two.” Problems: Experimental design Array fabrication Data analysis Normalization QC Image processing Interpretation Data mining Communication

6 Microarray Experiment Stages Microarray Probe and Layout Design Probe and Layout Design Spotting Array segmentation Array segmentation Image Analysis Hybridization Background intensity extraction Background intensity extraction Target detection Target intensity extraction Target intensity extraction Ratio analysis Data Mining DB Normalization Clustering PCA Hierarchical graph SOM Time series, etc...

7 Expression analysis experiments RNA isolation cDNA production and labeling Hybridization and washing drying Laboratory work Scanning of the microarray Image analysis Normalization Statistical analysis Computer work

8 Cells from condition A Cells from condition B mRNA Label Dye 2 Label Dye 1 cDNA equaloverunder Mix Total or Ratio of expression of genes from two sources

9 GSI Lumonics Emission Spectra : Cy3 and Cy5

10 cDNA Microarrays Glass slides or similar supports containing cDNA sequences that serve as probes for measuring mRNA levels in target samples cDNAs are arrayed on each slide in a grid of spots. Each spot contains thousands of copies of a sequence that matches a segment of a gene’s coding sequence. A sequence and its complement are present in the same spot. Different spots typically represent different genes, but some genes may be represented by multiple spots 9 BBSI - 2007

11 cDNA Microarray Probes Expressed Sequence Tags (ESTs) commonly serve as probes on cDNA microarrays. ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing the end of a cDNA that has been reverse transcribed from mRNA. 10 BBSI - 2007 AAAAAAAAA...AmRNA TTTTTTTTTT...T cDNA EST

12 cDNA Probes on Microarrays Spotting cDNA Probes on Microarrays Solutions containing probes are transferred from a plate to a microarray slide by a robotic arrayer. The robot picks up a small amount of solution containing a probe by dipping a pin into a well on a plate. The robot then deposits a small drop of the solution on the microarray slide by touching the pin onto the slide. The pin is washed and the process is repeated for a different probe. Most arrayers use several pins so that multiple probes are spotted simultaneously on a slide. Most arrayers print multiple slides together so that probes are deposited on several slides prior to washing. 11 BBSI - 2007

13 Clone Tracking 12 source plates array plate 96-channel pipette mapping array plates slide 8-pin arrayer mapping

14 June 11, 2007 The PixSys 5500 Arraying Robot (Cartesian Technologies) Vacuum wash station The print head holds up to 32 pins in a 8x4 format Vacuum hold-down platform (50 slide capacity) Robotic arm Source: Dan Nettleton Course Notes Statistics 416/516X

15 Printing Arrays on 50 slides

16 microarray slide plate with wells holding probes in solution All spots of the same color are made at the same time. All spots in the same sector are made by the same pin. Spotting the Probes on the Microarray 8 X 4 Print Head

17 16 cDNA microarray slide 2cDNA microarray slide 1 TTCCAG...... GATATG...... Each spot contains many copies of a sequence along with its complement (not shown). spot for gene 201 spot for gene 576 TTCCAG...... GATATG...... spot for gene 201 spot for gene 576

18 Using cDNA Microarrays to Measure mRNA Levels RNA is extracted from a target sample of interest. mRNA are reverse transcribed into cDNA. The resulting cDNA are labeled with a fluorescent dye and are placed on a microarray slide. Dyed cDNA sequences hybridize to complementary probes spotted on the array. A laser excites the dye and a scanner records an image of the slide. The image is quantified to obtain measures of fluorescence intensity for each pixel. Pixel values are processed to obtain measures of mRNA abundance for each probe on the array.

19 Difficult to Make Meaningful Comparisons between Genes The measures of mRNA levels are affected by several factors that are partly or completely confounded with genes (e.g., EST source plate, EST well, print pin, slide position, length of mRNA sequence, base composition of mRNA sequence, specificity of probe sequence, etc.). Within-gene comparisons of multiple cell types or across multiple treatment conditions are much more meaningful.

20 Using cDNA Microarrays to Measure mRNA Levels ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G ?????????? Sample 1 Sample 2 Microarray Slide Spots (Probes) Unknown mRNA Sequences (Target)

21 Convert to cDNA and Label with Fluorescent Dyes ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G Sample 1 Sample 2 ?????????? Sample 1 Sample 2

22 Mix Labeled cDNA and Hybridize to the Slide ACCTG...G TTCTG...A GGCTT...C ATCTA...A ACGGG...T CGATA...G Sample 1 Sample 2 ??????????

23 22 Excite Dye with Laser, Scan, and Quantify Signals ACCTG...G7652138 TTCTG...A57084388 GGCTT...C8566765 ATCTA...A120813442 ACGGG...T67849762 CGATA...G67239 Sample 1 Sample 2

24 Oligonucleotides : Simplified Example June 11, 2007 gene 1 gene 2 shared green regions indicate high degree of sequence similarity throughout much of the transcript ATTACTAAGCATAGATTGCCGTATA oligo probe for gene 1 GCGTATGGCATGCCCGGTAAACTGG oligo probe for gene 2...

25 Oligo Microarray Fabrication Oligos can be synthesized and stored in solution for spotting as is done with cDNA microarrays. Oligo sequences can be synthesized on a slide or chip using various commercial technologies. In one approach, sequences are synthesized on a slide using ink-jet technology similar to that used in color printers. Separate cartridges for the four bases (A, C, G, T) are used to build nucleotides on a slide. Affymetrix uses a photolithographic approach.

26 Microarray Experiment Stages Microarray Probe and Layout Design Probe and Layout Design Spotting Array segmentation Array segmentation Image Analysis Hybridization Background intensity extraction Background intensity extraction Target detection Target intensity extraction Target intensity extraction Ratio analysis Data Mining DB Normalization Clustering PCA Hierarchical graph SOM Time series, etc...

27 Image Analysis for Microarray 1. Array segmentation 2. Background intensity extraction 3. Target detection 4. Target intensity extraction 5. Ratio analysis Next, Data Mining !

28 Image Processing 27

29 Processing of images Addressing or gridding –Assigning coordinates to each of the spots Segmentation –Classification of pixels either as foreground or as background Intensity extraction (for each spot) –Foreground fluorescence intensity pairs (R, G) –Background intensities –Quality measures

30 Parameters to address the spots positions  Separation between rows and columns of grids  Individual translation of grids  Separation between rows and columns of spots within each grid  Small individual translation of spots  Overall position of the array in the image Processing of images-Addressing The basic structure of the images is known (determined by the arrayer)

31 Processing of images-Segmentation Classification of pixels as foreground or background -> fluorescence intensities are calculated for each spot as measure of transcript abundance Production of a spot mask : set of foreground pixels for each spot

32 Processing of images-Segmentation

33 Segmentation methods : –Fixed circle segmentation –Adaptive circle segmentation –Adaptive shape segmentation –Histogram segmentation Fixed circleScanAlyze, GenePix, QuantArray Adaptive circleGenePix, Dapple Adaptive shapeSpot, region growing and watershed Histogram methodImaGene, QuantArraym DeArray and adaptive thresholding

34 Processing of images-Fixed circle segmentation Fits a circle with a constant diameter to all spots in the image Easy to implement The spots need to be of the same shape and size Bad example !

35 Processing of images-Adaptive circle segmentation The circle diameter is estimated separately for each spot Dapple finds spots by detecting edges of spots (second derivative) Problematic if spot exhibits oval shapes

36 Processing of images-Adaptive shape segmentation Specification of starting points or seeds Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.

37 Processing of images-Histogram segmentation Uses a target mask chosen to be larger than any other spot Foreground and background intensity are determined from the histogram of pixel values for pixels within the masked area Example : QuantArray –Background : mean between 5th and 20th percentile –Foreground : mean between 80th and 95th percentile Unstable when a large target mask is set to compensate for variation in spot size BkgdForeground

38 Processing of images-Data Quality Irregular size or shape Irregular placement Low intensity Saturation Spot variance Background variance 37 indistinguishablesaturated bad print artifactmiss alignment

39 Intensity Extraction : Spot intensity The total amount of hybridization for a spot is proportional to the total fluorescence at the spot Spot intensity = sum of pixel intensities within the spot mask Since later calculations are based on ratios between cy5 and cy3, we compute the ratio of medians value over the spot mask

40 Intensity Extraction : Background intensity Motivation : spot’s measured intensity includes a contribution of non- specific hybridization and other chemicals on the glass Fluorescence from regions not occupied by DNA should by different from regions occupied by DNA -> could be interesting to use local negative controls (spotted DNA that should not hybridize) Different background methods : Local background, morphological opening, constant background, no adjustment

41 Intensity Extraction : Local background Focusing on small regions surrounding the spot mask. Median of pixel values in this region Most software package implement such an approach ScanAlyzeImaGeneSpot, GenePix By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure

42 Quality measures (-> Flag) How good are foreground and background measurements ? –Variability measures in pixel values within each spot mask –Spot size –Circularity measure –Relative signal to background intensity Based on these measurements, one can flag a spot

43 Quantification of expression For each spot on the slide we calculate Red intensity = Rfg - Rbg fg = foreground, bg = background, and Green intensity = Gfg - Gbg and combine them in the log (base 2) ratio Log 2 ( Red intensity / Green intensity)

44 Spot intensity data

45 Gene Expression Data Gene expression data on p genes for n samples Genes Slides Gene expression level of gene 5 in slide 4 = Log 2 ( Red intensity / Green intensity) slide 1slide 2slide 3slide 4slide 5 … 1 0.46 0.30 0.80 1.51 0.90... 2-0.10 0.49 0.24 0.06 0.46... 3 0.15 0.74 0.04 0.10 0.20... 4-0.45-1.03-0.79-0.56-0.32... 5-0.06 1.06 1.35 1.09-1.09... These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.

46 GPR Data File Format

47 Microarray Experiment Stages Microarray Probe and Layout Design Probe and Layout Design Spotting Array segmentation Array segmentation Image Analysis Hybridization Background intensity extraction Background intensity extraction Target detection Target intensity extraction Target intensity extraction Ratio analysis Data Mining DB Normalization Clustering PCA Hierarchical graph SOM Time series, etc...

48 Knowledge Discovery in Microarrays 47 Image Processing Data Cleaning Quality Control Data Imputation Normalization Gene Selection Differential Expression Feature Selection Outcomes Hypotheses Publications Treatments Pattern Assessment Annotation Expert Analysis Experiments Pattern Recognition Clustering PCA/SVD Associations

49 Uses of DNA microarrays Expression analysis Genotyping Resequencing In toxicology, cancer research, environmental research, population genetics, association studies, etc.

50 First question : What do I want to find out? Screen thousands of(differential) genes to find relevant ones in my process of interest: - Health/diseased-comparison - Marker genes of tumor changes - Genes expressed in particular physiological phase Follow a pattern of expression level changes in time series to find co- expressed genes(pathways, regulatory networks) Study presence of interesting genes(genus-specific ID of the e.g. microbial community) 2 1 3

51 Second question : Why do you use analysis tools? To manipulate the data so that differences due to - Different array - Time of hybridization, or - Sample Will be removed before the results from different arrays can be compared to each other To manipulate the data so that differences due to - Different array - Time of hybridization, or - Sample Will be removed before the results from different arrays can be compared to each other Normalization

52 Second question : Why do you use analysis tools? To find genes that are significantly differentially expressed - for that you need statistics to evaluate your data and to separate true results from noise or error To find genes that are significantly differentially expressed - for that you need statistics to evaluate your data and to separate true results from noise or error

53 Second question : Why do you use analysis tools? To connect the information to other biological data - Co-expressed genes may be involved in the same function / state / condition - Do a group (cluster) of genes with similar expression profile have a common function? To connect the information to other biological data - Co-expressed genes may be involved in the same function / state / condition - Do a group (cluster) of genes with similar expression profile have a common function? Data mining

54 Normalization: Why normalize data? To balance the fluorescence intenities of the two dyes(green Cy3 and red Cy5 dye) To allow the comparison of expression levels across experiments(slides) To adjust scale of the relative gene expression levels(as measured by log ratios) across replicate experiments

55 Source of error Pin geometry Slide heterogeneity mRNA preparation Fluorescent labeling Hybridization conditions Scanning and image analysis

56 Data Normalization 55 Calibrated, red and green equally detectedUncalibrated, red light under detected

57 M vs. A plot

58 Box plot

59 Normalization: Channel biases Before Normalization…

60 Normalization: Channel biases After Normalization…

61 Additional Normalization Dye swap –Combine relative expression levels without explicit normalization –Compute lowess fit for log 2 (RR’/GG’)/2 vs. log 2 (A + A’)/2 –Normalized ratio is log 2 (R/G) - c(A) where c(A) is the lowess prediction 60

62 What is clustering? Clustering methods are used to –find genes from the same biological process –group the experiments to similar conditions Different clustering methods can give different results. The physically motivated ones are more robust. Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions

63 Football Booing Cheering

64 Cluster Analysis Cluster genes based on expression profiles –Gene expression across several treatments Hypothesis: Genes with similar function have similar expression profiles 63

65 tumour 2 tumour 5 tumour 1 tumour 4 tumour 3 healthy 4 healthy 1 healthy 3 healthy 2 healthy 5 -2.0 0 1.0 2.0 pearson correlation log (ratio) pearson correlation Hierarchical clustering Example, two dimensions

66 calculate distance matrix calculate averages of most similar Clustering methods hierarchical clustering

67 calculate distance matrix 1234 calculate averages of most similar Dendrogram Clustering methods hierarchical clustering (avg. linkage)

68 1234 5 12345 Look at branching pattern when assessing similarity, not simply the sample (or gene) order ! 1234512345 Hierarchical clustering Isomorphism

69 Analysis. k-means clustering. Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage) Step 1. genes are randomly divided into k groups. k is assigned by the user. k =4 for this illustration.

70 Analysis. k-means clustering. Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage) Step 2. centroid for each group is determined.

71 Analysis. k-means clustering. Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage) Step 3. genes are reassigned to its closest centroids

72 Analysis. k-means clustering. Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage) Step 4. centroids’ positions are recalculated

73 Analysis. k-means clustering. Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage) Step 5. reiterate assignment/ recalculation steps until all genes are assigned to the immutable groups (100-10000 times)

74 Analysis. Self-organizing map. similar to k-means clustering however, SOM arranges groups in a two-dimensional map in addition to dividing genes into groups based on expression patterns useful for visualizing of distinct expression patterns and determining which of these patterns are variants of one another. Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage)

75 Self-Organizing Maps Situate grid of nodes along a plane where datapoints are distributed Silicon Genetics, 2003

76 Self-Organizing Maps Sample a gene and subject the closest node and neighboring nodes to its ‘gravitational’ influence Silicon Genetics, 2003

77 Self-Organizing Maps Silicon Genetics, 2003

78 Self-Organizing Maps Sample another gene… Silicon Genetics, 2003

79 Self-Organizing Maps …and so on, and so on… Silicon Genetics, 2003

80 Self-Organizing Maps …until all genes have been sampled several times over. Each cluster is defined with reference to a node, specifically comprised by those genes for which it represents the closest node. Silicon Genetics, 2003

81 Analysis. QT clustering. user-defined minimum number of genes in cluster and maximum diameter starts with a random single gene, then adds the gene that is closest to the starting gene therefore increasing diameter of the cluster. Adds new genes until no genes can be added without the diameter growing beyond the allowed cutoff. some genes may not be clustered Condition 2 (e.g. age) Condition 1 (e.g. expression) Cond. 3 (e.g. stage)

82 Expression Group A B Fold change cut-off  global  sliding window T down T up Identifying differential expression Fold change

83 Expression Group A B Comparing means (1-2 groups)  student’s t Comparing means (>2 groups)  1-way ANOVA Comparing means (multifactor)  2-way ANOVA Expression Group A B H 0 :  A =  B,  A -  B = 0 H 1 :  A   B Expression A 1 Identifying differential expression Parametric statistics

84 Challenge : Data analysis across multiple dimensions of biology

85

86 Consult the litterature on about 300 genes?

87 TP53 alone gives you 41025 publications... Too many data

88 Search every gene in PubMed?

89

90 Harmony of text-mining tool, database and visualization for pathway analysis Pubmed Open access Google Entity-based index Semantic Index Automatic reader’s digest Document Summary Indexing the scientific literature Extracting interactions to create databases for systems biology Text-mining Tool Text-mining Tool Software solution for Knowledge management and pathway analysis of the high-throughput data

91 Pathway Analysis based on DEG

92 Companies Affymetrix –Single dye, high density, oligonucleotide arrays Nimblegen –Similar to Affy, but specializing in custom arrays Agilent –cDNA microarrays, scanners, etc. Axon Instruments –Scanners, software, etc. 91

93 Data Sources Several central microarray data databases exist –Stanford Microarray Database http://genome-www.stanford.edu/microarray –GEO (NCBI) http://www.ncbi.nlm.nih.gov/geo –READ (RIKEN) http://read.gsc.riken.go.jp/ Find more at –http://ihome.cuhk.edu.hk/~b400559/arraysoft_public.h tmlhttp://ihome.cuhk.edu.hk/~b400559/arraysoft_public.h tml 92

94 R and Bioconductor R is an open-source clone of S-Plus –http://www.r-project.org Bioconductor is a collection of microarray data analysis tools –http://www.bioconductor.org –Primarily written by research scientists Very wide range of statistical and machine learning analyses Command-line/programming language –More flexible, but more difficult to use and learn 93

95 TM4 TIGR –http://www.tigr.org/softwarehttp://www.tigr.org/software MADAM –Microarray data manager MEV –Multiple experiment viewer Spotfinder –Image processing ArrayViewer

96 Other Software A more comprehensive list –http://ihome.cuhk.edu.hk/~b400559/arraysoft. htmlhttp://ihome.cuhk.edu.hk/~b400559/arraysoft. html Pathway Analysis Data management Annotation … 95

97 Thank you!


Download ppt "Knowledge Discovery in Microarray Gene Expression Data Insilicogen Junhyung Park."

Similar presentations


Ads by Google