Chip arrays and gene expression data. Motivation.


1 Chip arrays and gene expression data

2 Motivation

3 With chip array technology, one can measure the expression of all genes at once (even all exons). This can answer questions such as:
1. Which genes are expressed in a muscle cell?
2. Which genes are expressed during the first week of pregnancy in the mother? In the new baby?
3. Which genes are expressed in cancer?

4 4. If one mutates a transcription factor (TF), which genes are no longer expressed following this change?
5. Which genes are not expressed in the brain of a baby with an intellectual disability?
6. Which genes are expressed when one is asleep versus when the same person is awake?

5 Analyzing Output

6 Output
              w.t    Brain tumor males    Brain tumor females
Gene 1
Gene 2
Gene 3
...
Gene 25,000
Each cell is either an absolute number or a relative one, depending on the technology used.

7 Repeats
              w.t    Brain tumor male1    Brain tumor male2    Brain tumor female1
Gene 1
Gene 2
Gene 3
...
Gene 25,000
A repeat can be either the same sample on a different chip, or a "real" biological repeat, that is, a different sample.

8 Expression profile
      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4   15   16   17   23
g2      7    5    4    6    6    3    7    9
g3      2    3    2    5   25   26   30   60
Genes 1 and 3 show the same trend (both go high under the same conditions). That is, they have the same expression profile.

9 Clustering
      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4   15   16   17   23
g2      7    5    4    6    6    3    7    9
g3      2    3    2    5   25   26   30   60
In general, we want to find all the genes that share the same expression profile, which is suggestive of a functional linkage. There are clustering algorithms that do exactly that.
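One way to make "same expression profile" concrete is to correlate the rows of the table. The sketch below uses the Pearson correlation, which is an assumption here (the slides do not name a similarity measure), and the values are those read from the expression-profile table of slide 8.

```python
from math import sqrt

# Expression values per condition (wt1..wt4, bt1..bt4), as read from the slide's table.
profiles = {
    "g1": [4, 3, 5, 4, 15, 16, 17, 23],
    "g2": [7, 5, 4, 6, 6, 3, 7, 9],
    "g3": [2, 3, 2, 5, 25, 26, 30, 60],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r13 = pearson(profiles["g1"], profiles["g3"])
r12 = pearson(profiles["g1"], profiles["g2"])
print(f"g1 vs g3: {r13:.2f}   g1 vs g2: {r12:.2f}")
```

With these values g1 and g3 correlate much more strongly than g1 and g2, which matches the observation that genes 1 and 3 share a trend.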

10 Clustering
      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5    0   16    6   16
Clustering of the conditions can suggest two types of brain tumor (bt).

11 Clustering
      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5    6    1    7    3
Bi-clustering: clustering both the conditions and the genes.

12 Applications

13 Think of growing E. coli at increasing glucose concentrations and running a chip array at each concentration. One can potentially discover all genes in the glucose pathway. Similarly, knocking out a gene can reveal all genes that interact with it.

14 Applications Analyzing expression of genes can help reveal the gene network of a given organism.

15 Gene network

16 Clinical
Does someone have a brain tumor? Expression measured in the new sample: g1 = 11, g2 = 4, g3 = 0.
      wt1  wt2  wt3  wt4  bt1  bt2  bt3  bt4
g1      4    3    5    4    0   22    0   23
g2      7    5    4    6    0    8    0    9
g3      2    3    2    5    0   16    6   16

17 MammaPrint
Used to assess the risk that a breast tumor will spread to other parts of the body (metastasis). It is based on the well-known 70-gene breast cancer gene signature. In February 2007, the FDA cleared the MammaPrint test for use in the U.S.

18 Sequence by hybridization
It was thought that the following procedure could work for sequencing a genome:
1. Make a chip containing all x-mers (e.g., x = 25).
2. Hybridize a genome to the chip.
3. By analyzing all the hybridizations and their overlaps, assemble the genome.
Problem: it doesn't work.

19 ChIP-on-chip: a method for measuring protein-DNA interaction. Proteins that bind DNA include:
- Those responsible for transcription regulation
- Transcription factors (TFs)
- Replication proteins
- Histones…

20 ChIP-on-chip: the first "ChIP" stands for Chromatin ImmunoPrecipitation and the second "chip" refers to the DNA microarray. The method is used mostly to detect TF binding sites.

21 Tiling arrays
Here the chip array should include not only protein-coding genes but also control regions, or simply the entire genome.

22 Deep sequencing reads Yoder-Himes D.R. et al. PNAS (2009)

23 Machine learning Learning mode on. Bioinfo is great.

24 Clustering

25 Clustering (of expression data) UPGMA is one such direct method, receiving as input a distance matrix and giving as output an ultrametric tree. It was suggested by Sokal and Michener (1958).

26 Clustering (of expression data)
Often, there is a one-to-one transformation between the data and points in space. For example, the expression of all genes under a specific condition is a point:
              Condition 1
Gene 1                  5
Gene 2                  7
Gene 3                  2
...
Gene 20000             54
(5, 7, 2, …, 54) is a point in a space of dimension 20,000.

27 Clustering (of expression data)
As another example, each expression profile is a point in a space whose dimension is the number of conditions:
        Condition 1  Condition 2  Condition 3  Condition 4
Gene 1           50           20            4           33
(50, 20, 4, 33) is a point in a space of dimension 4.

28 In space: each point is a gene
[Plot: axes Condition 1 and Condition 2; one point is labeled g1]

29 Our goal will be to cluster genes
Genes that are in the same cluster (show similar patterns of expression) are likely to be functionally related.
[Plot: axes Condition 1 and Condition 2; genes shown as points]

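30 Distance between two expression profiles
The Euclidean distance between two profiles $x$ and $y$ measured over $n$ conditions is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
        Condition 1  Condition 2  Condition 3  Condition 4
Gene 1           50           20            4           33
Gene 2           30           20            3           31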
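As a worked instance of the Euclidean distance, the short sketch below plugs in the two profiles from this slide, Gene 1 = (50, 20, 4, 33) and Gene 2 = (30, 20, 3, 31). The function name `euclidean` is only illustrative.

```python
from math import sqrt

gene1 = (50, 20, 4, 33)  # expression under conditions 1..4
gene2 = (30, 20, 3, 31)

def euclidean(p, q):
    """Square root of the summed squared differences, coordinate by coordinate."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (50-30)^2 + (20-20)^2 + (4-3)^2 + (33-31)^2 = 400 + 0 + 1 + 4 = 405
d = euclidean(gene1, gene2)
print(round(d, 2))
```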

31 Distance between two expression profiles
We can compute the distance between each pair of expression profiles and obtain a distance table.
        Condition 1  Condition 2  Condition 3  Condition 4
Gene 1           50           20            4           33
Gene 2           30           20            3           31
Gene 3           30           20            3           31
Gene 4           30           20            3           31
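Building the distance table amounts to looping over all pairs of genes. A minimal sketch, assuming the four profiles shown on this slide and using `math.dist` from the standard library (Python 3.8 or later):

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8

profiles = {
    "Gene 1": (50, 20, 4, 33),
    "Gene 2": (30, 20, 3, 31),
    "Gene 3": (30, 20, 3, 31),
    "Gene 4": (30, 20, 3, 31),
}

# One entry per unordered pair: the upper triangle of the distance table.
table = {(a, b): dist(profiles[a], profiles[b]) for a, b in combinations(profiles, 2)}

for (a, b), d in table.items():
    print(f"{a} - {b}: {d:.2f}")
```

Only the upper triangle is stored, because the distance is symmetric and every profile is at distance 0 from itself.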

32 The distance table
       g1   g2   g3   g4   g5   g6   g7   g8
g1      0   32   48   51   50   48   98  148
g2           0   26   34   29   33   84  136
g3                0   42   44   44   92  152
g4                     0   44   38   86  142
g5                          0   24   89  142
g6                               0   90  142
g7                                    0  148
g8                                         0

33 The distance table
The smallest entry in the table is 24, the distance between g5 and g6, so these two genes are joined first.

34 Starting tree
We call the parent node of g5 and g6 "g56".
[Tree: g5 and g6 joined under g56]

35 Removing the g5 and g6 rows and columns, and adding the g56 row and column
       g1   g2   g3   g4  g56   g7   g8
g1      0   32   48   51    ?   98  148
g2           0   26   34    ?   84  136
g3                0   42    ?   92  152
g4                     0    ?   86  142
g56                         0    ?    ?
g7                               0  148
g8                                    0

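36 Computing distances
       g1   g2   g3   g4   g5   g6   g7   g8
g1      0   32   48   51   50   48   98  148
The distance from g1 to g56 is the average of its distances to g5 and g6: (50 + 48) / 2 = 49.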

37 The updated table. Starting the second iteration…
       g1   g2   g3   g4  g56   g7   g8
g1      0   32   48   51   49   98  148
g2           0   26   34   31   84  136
g3                0   42   44   92  152
g4                     0   41   86  142
g56                         0 89.5  142
g7                               0  148
g8                                    0
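The update step can be written as one small function: the distance from a merged cluster to any other cluster is the average of the two old distances, weighted by cluster size. The sketch below applies it to g5 and g6 using the rows of the original distance table. The function name is illustrative, and the g3-g6 entry (44) is an assumption, since it is blank in the transcript and is inferred from the updated table.

```python
def merged_distance(d_a, d_b, size_a=1, size_b=1):
    """UPGMA distance from the merged cluster (a + b) to another cluster."""
    return (size_a * d_a + size_b * d_b) / (size_a + size_b)

# Distances from g5 and from g6 to every other gene (original distance table).
to_g5 = {"g1": 50, "g2": 29, "g3": 44, "g4": 44, "g7": 89, "g8": 142}
to_g6 = {"g1": 48, "g2": 33, "g3": 44, "g4": 38, "g7": 90, "g8": 142}  # g3: inferred

to_g56 = {g: merged_distance(to_g5[g], to_g6[g]) for g in to_g5}
print(to_g56)
```

Note that the g7 entry comes out as (89 + 90) / 2 = 89.5, the value that the later tables (for example 88.75 for g2356 to g7) are consistent with.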

38 Building the tree - continued
The smallest remaining distance is 26, between g2 and g3. We call the parent node of g2 and g3 "g23".
[Tree: g5 and g6 joined under g56; g2 and g3 joined under g23]

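39 Computing distances
       g1   g2   g3   g4  g56   g7   g8
g56    49   31   44   41    0 89.5  142
The distance from g56 to g23 is the average of its distances to g2 and g3: (31 + 44) / 2 = 37.5.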

40 The updated table. Starting a new iteration…
         g1   g23    g4   g56    g7    g8
g1        0    40    51    49    98   148
g23             0    38  37.5    88   144
g4                    0    41    86   142
g56                         0  89.5   142
g7                                0   148
g8                                      0

41 Tree
[Tree: g5 and g6 joined under g56; g2 and g3 joined under g23; g23 and g56 joined under g2356]

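42 Computing distances
         g1   g23    g4   g56    g7    g8
g1        0    40    51    49    98   148
The distance from g1 to g2356 is the size-weighted average of its distances to g23 and g56: (2 × 40 + 2 × 49) / 4 = 44.5.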

43 Starting a new iteration…
            g1  g2356     g4     g7     g8
g1           0   44.5     51     98    148
g2356               0   39.5  88.75    143
g4                         0     86    142
g7                                0    148
g8                                       0

44 Building the tree
[Tree: g5 and g6 joined under g56; g2 and g3 joined under g23; g23 and g56 joined under g2356; g2356 and g4 joined under g23456]

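45 Computing distances
            g1  g2356     g4     g7     g8
g1           0   44.5     51     98    148
The distance from g1 to g23456 is the size-weighted average of its distances to g2356 and g4: (4 × 44.5 + 1 × 51) / 5 = 45.8.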

46 Starting an additional iteration…
              g1  g23456      g7      g8
g1             0    45.8      98     148
g23456                 0    88.2   142.8
g7                             0     148
g8                                     0

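47 Constructing the tree
[Tree: g5 and g6 joined under g56; g2 and g3 joined under g23; g23 and g56 joined under g2356; g2356 and g4 joined under g23456; g1 and g23456 joined under g123456]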

48 One more iteration…
           g123456       g7       g8
g123456          0   89.833  143.667
g7                        0      148
g8                                 0

49 Reconstructing the tree
[Tree: as before, with g123456 and g7 now joined under g1234567]

50 The new table
            g1234567        g8
g1234567           0  144.2857
g8                           0

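51 Resulting tree
[Tree: g5 and g6 joined under g56; g2 and g3 joined under g23; g23 and g56 joined under g2356; g2356 and g4 joined under g23456; g1 and g23456 joined under g123456; g123456 and g7 joined under g1234567; g1234567 and g8 joined at the root]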
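The whole walkthrough can be replayed in a few lines. The following is a minimal UPGMA sketch, not the code behind the slides. It assumes the 8×8 distance table shown earlier, with the blank g3-g6 entry taken to be 44, and names each new node by concatenating the gene numbers, as the slides do.

```python
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
# Upper triangle of the distance table, row by row (diagonal omitted).
upper = [
    [32, 48, 51, 50, 48, 98, 148],
    [26, 34, 29, 33, 84, 136],
    [42, 44, 44, 92, 152],   # g3-g6 = 44 is inferred, see lead-in
    [44, 38, 86, 142],
    [24, 89, 142],
    [90, 142],
    [148],
]

def upgma(names, upper_rows):
    """Average-linkage (UPGMA) clustering; returns the merges in order."""
    dist = {}
    for i, row in enumerate(upper_rows):
        for offset, d in enumerate(row, start=1):
            dist[frozenset((names[i], names[i + offset]))] = d
    size = {name: 1 for name in names}
    merges = []
    while len(size) > 1:
        pair = min(dist, key=dist.get)                # closest pair of clusters
        a, b = sorted(pair)
        height = dist.pop(pair)
        new = "g" + "".join(sorted(a[1:] + b[1:]))    # e.g. g5 + g6 -> g56
        for c in size:
            if c in (a, b):
                continue
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # Size-weighted average of the two old distances.
            dist[frozenset((new, c))] = (size[a] * d_ac + size[b] * d_bc) / (size[a] + size[b])
        size[new] = size.pop(a) + size.pop(b)
        merges.append((a, b, new, height))
    return merges

merges = upgma(genes, upper)
for a, b, new, height in merges:
    print(f"{a} + {b} -> {new} at distance {height:.4f}")
```

The merge order reproduces the tree built on the slides (g56, g23, g2356, g23456, g123456, g1234567, then the root), and the last merge happens at 1010/7 ≈ 144.2857, the value in the final table.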

52 From tree to clusters
[Tree with leaves g5, g6, g2, g3, g4, g1, g7, g8]
If we want two clusters, we cut the tree just below the root and obtain g8 versus g1-7.

53 From tree to clusters
[Tree with leaves g5, g6, g2, g3, g4, g1, g7, g8]
If we want 3 clusters, we cut one level lower and obtain g8, g7, and g1-6.

54 From tree to clusters
[Tree with leaves g5, g6, g2, g3, g4, g1, g7, g8]
The 4 clusters are: g8, g7, g1, and g23456.
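Cutting the tree into k clusters is equivalent to replaying the merges from the bottom up and stopping once k clusters remain (this holds here because each UPGMA merge happens at a larger distance than the one before it). A small sketch, with the merge order copied from the walkthrough; the function name is illustrative:

```python
leaves = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
# Merge order from the UPGMA walkthrough, earliest first.
merge_order = [
    ("g5", "g6"), ("g2", "g3"), ("g23", "g56"), ("g2356", "g4"),
    ("g1", "g23456"), ("g123456", "g7"), ("g1234567", "g8"),
]

def cut_into_clusters(leaves, merge_order, k):
    """Replay merges until k clusters remain; returns sorted lists of genes."""
    clusters = {leaf: {leaf} for leaf in leaves}
    for a, b in merge_order:
        if len(clusters) <= k:
            break
        merged = clusters.pop(a) | clusters.pop(b)
        name = "g" + "".join(sorted(g[1:] for g in merged))  # g5, g6 -> g56
        clusters[name] = merged
    return sorted(sorted(members) for members in clusters.values())

for k in (2, 3, 4):
    print(k, cut_into_clusters(leaves, merge_order, k))
```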

55 Classification
        Condition 1  Condition 2
Gene 1           50           20
Gene 2           30           20
Gene 3           30           20
Gene 4           30           20
[Plot: axes Gene 1 and Gene 2; red and yellow points, and a query point marked "?"]
If red = brain tumor and yellow = healthy, do I have a brain tumor?

56 SVM = support vector machine
In SVM we find a (hyper)plane that divides the space in two.
        Condition 1  Condition 2
Gene 1           50           20
Gene 2           30           20
Gene 3           30           20
Gene 4           30           20
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]

57 SVM - confidence in classification
The further the point is from the separating (hyper)plane, the more confident we are in the classification.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]
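The link between distance and confidence can be sketched without training anything. The code below does not fit an SVM; it assumes a separating line has already been found (the weights `w` and offset `b` are hypothetical values chosen for illustration) and reports on which side of it a point falls and how far away it is.

```python
from math import hypot

# Hypothetical separating line w·x + b = 0 in the (Gene 1, Gene 2) plane.
w = (1.0, -1.0)
b = 0.0

def signed_distance(point):
    """Signed Euclidean distance from the point to the separating line."""
    return (w[0] * point[0] + w[1] * point[1] + b) / hypot(*w)

def classify(point):
    """Return (label, distance); a larger distance means a more confident call."""
    d = signed_distance(point)
    return ("red" if d >= 0 else "yellow"), abs(d)

for sample in [(50, 20), (21, 20), (10, 30)]:
    label, confidence = classify(sample)
    print(sample, label, round(confidence, 2))
```

Both (50, 20) and (21, 20) land on the "red" side, but the first is much further from the line, so that call is the more trustworthy of the two.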

58 SVM - cannot always perfectly classify
Sometimes we cannot perfectly separate the training data. In this case, we will find the best separation.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]

59 KNN = k nearest neighbors
KNN is another method for classification. For each point it looks at its k nearest neighbors.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]
If red = brain tumor and yellow = healthy, do I have a brain tumor?

60 KNN = k nearest neighbors
For each point it looks at its k nearest neighbors. For example, the method with k = 3 looks at the point's 3 nearest neighbors to decide how to classify it. If the majority are "red", it will classify the point as red.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]
If red = brain tumor and yellow = healthy, do I have a brain tumor?
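The k = 3 rule can be sketched in a few lines. The training points below are hypothetical (the coordinates on the slide's plot are not recoverable), and the function name is illustrative.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical training samples: ((Gene 1, Gene 2), label).
training = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((5.0, 5.0), "yellow"), ((6.0, 5.5), "yellow"), ((5.5, 6.5), "yellow"),
]

def knn_classify(query, training, k=3):
    """Label the query by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((2.0, 2.0), training, k=3))
print(knn_classify((5.0, 6.0), training, k=1))
```

Using an odd k with two classes avoids tied votes.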

61 KNN = k nearest neighbors
KNN is better than SVM for the above case.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]
If red = brain tumor and yellow = healthy, do I have a brain tumor?

62 KNN - exercise
In the above example, how will the point be classified by KNN with k = 1? By SVM?
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]

63 Training dataset
The red and yellow points are used to train the classifier. In general, the more training data one has, the better the classifier will perform.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]

64 Test dataset
Usually some points for which we know the answer are not given to the classifier and are used to TEST its performance.
[Plot: axes Gene 1 and Gene 2; the query point is marked "?"]

65 Decision tree
Age       Gene1   Gene2   Smoker   Operation
<20               high             yes
<20               high             yes
<20               low              no
[20,40]   low     high             yes
[20,40]           high    no       yes
[20,40]   high    low     yes      no
>40               low     yes      no
>40       high    low              no
>40       low     high             no

66 Decision tree
Age > 40?
  Yes: Operation = no
  No: Gene 2?
    high: Operation = yes
    low: Operation = no
Decision trees are automatically built from "train data" and are used for classification. They also tell us which features are most important.
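Once built, a decision tree is just nested conditions. The sketch below hard-codes the tree from this slide (it does not learn it) and checks it against the Age, Gene2 and Operation columns of the training table as read from the previous slide; the function name is illustrative.

```python
def predict_operation(age_group, gene2):
    """Apply the slide's tree: age_group is '<20', '[20,40]' or '>40'."""
    if age_group == ">40":
        return "no"
    return "yes" if gene2 == "high" else "no"

# (Age, Gene2, Operation) for the nine training rows, as read from the table.
rows = [
    ("<20", "high", "yes"), ("<20", "high", "yes"), ("<20", "low", "no"),
    ("[20,40]", "high", "yes"), ("[20,40]", "high", "yes"), ("[20,40]", "low", "no"),
    (">40", "low", "no"), (">40", "low", "no"), (">40", "high", "no"),
]

correct = sum(predict_operation(age, gene2) == op for age, gene2, op in rows)
print(f"{correct} of {len(rows)} training rows classified correctly")
```

The tree never consults Gene1 or Smoker, which is the sense in which a decision tree shows which features matter most.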

67 Voting
Voting uses an array of machine learning algorithms and chooses the classification suggested by most classifiers.
[Diagram: training data that need a classification algorithm (Yes/No) are used to train decision trees, KNN and SVM; for a new datum (test) the classifiers answer No, Yes and YES]
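Majority voting itself is a one-liner over the individual answers. In this sketch the assignment of answers to classifiers is hypothetical; only the idea of taking the most common vote comes from the slide.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label suggested by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical answers of three trained classifiers for one new datum.
votes = {"decision tree": "no", "KNN": "yes", "SVM": "yes"}
decision = majority_vote(votes.values())
print(decision)
```

With an odd number of classifiers and two classes, a tie cannot occur.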

68 Classification is also used outside the scope of bioinformatics, for example to recognize handwritten digits
The distance between the query and each point in the dataset is computed. Based on the identity of the k nearest members, the digit is identified.
*More advanced algorithms allow rotation and enlargement of the digit to be classified.

69 UPGMA - exercise
In the example below, how will the points be clustered using UPGMA?
       x1   x2   x3   x4
x1      0    2   12   30
x2           0    8   10
x3                0    4
x4                     0
After joining x1 and x2:
      x12   x3   x4
x12     0   10   20
x3           0    4
x4                0
After joining x3 and x4:
      x12  x34
x12     0   15
x34          0

70 Dataset sizes
A classifier is needed to detect "Pupko disease" based on gene expression. Pupko disease is extremely rare (say, it affects 1 out of 100,000 people). A classifier was trained on a large volume of samples in which all cases are negative. On a test dataset it correctly classified 99.9% of the cases. That is no great feat: the fraction of positive cases in the test data is only ~0.01%, so a classifier that always answers "negative" scores just as well.
Take-home messages:
(1) It is better to train a classifier on roughly equal numbers of "positive" and "negative" cases.
(2) Reporting only the percentage of accurate classifications is not enough. One has to report TP, TN, FP and FN (in this example, every positive case is a false negative, i.e. a false negative rate of 100%).
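The point about accuracy can be checked with a confusion matrix. The sketch below assumes a hypothetical test set of 100,000 samples with 10 positives (0.01%) and a classifier that always answers "negative"; the function name is illustrative.

```python
def confusion(y_true, y_pred):
    """Count true/false positives and negatives (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

# Hypothetical test set: 10 positives among 100,000 samples.
y_true = [1] * 10 + [0] * 99_990
y_pred = [0] * len(y_true)          # the classifier always says "negative"

tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
false_negative_rate = fn / (fn + tp)
print(f"accuracy = {accuracy:.4%}, false negative rate = {false_negative_rate:.0%}")
```

The accuracy looks excellent while every single patient is missed, which is why the four counts should be reported together.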

71 Exercises - examples
In clustering by UPGMA, I merged gene X and gene Y. The distance between gene X and gene T was 7, and the distance between gene Y and gene T was 9. Which of the following statements are correct?
a. The distance between the group that unites genes X and Y and T is 8.
b. The distance between the union of X and Y and T cannot be computed, because the distance between X and Y is not given.
c. The distance between gene X and gene Y is less than 7.
d. a + b.
e. a + c.
f. b + c.
g. a + b + c.
h. None of the answers is correct.


73 Exercises - examples
23. Which of the following statements are correct?
a. In SVM, the smaller the distance between the point to be classified and the separating surface, the smaller the chance that the classification is wrong.
b. In SVM, all points of type A are always on one side and all points of type B are on the other side.
c. An SVM can be developed to classify proteins into transmembrane and non-transmembrane ones.
d. None of the answers is correct.


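75 Exercises - examples
24. Given the following figure, which of the following statements are correct?
a. According to a (linear) SVM, the point with the question mark will be classified as a black point.
b. According to KNN with the number of neighbors equal to one, the point with the question mark will be classified as a white point.
c. a + b.
d. None of the answers is correct.
[Figure: axes Gene 1 and Gene 2; black and white points, and a query point marked "?"]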


77 Legionella pneumophila case-study

78 How did it all begin? Legionella pneumophila

79 Legionnaires' disease nowadays

80 Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology

81 Identifying the effectors

82 The features
- Homology to host proteins
- Regulatory elements
- Genome proximity to other effectors
- Secretion signal
- Abundance in Metazoa / Bacteria
- GC content
- Sequence homology

83 The effectors machine 5 Legionella pneumophila

84 The big picture
Features: similarity to known effectors, regulatory elements, similarity to host proteins, G-C content, secretory signals, abundance in Metazoa/Bacteria, genome arrangement.
Feature selection.
Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting.
Prior knowledge: validated effectors and non-effectors, used to obtain the trained model.
The trained model divides the unclassified genes into predicted effectors and predicted non-effectors.
Experimental validation of the predicted effectors yields newly validated effectors.

85 Does it really work?? Machine learning


