Efficient Probe Selection in Micro-array Design Speaker: Cindy Y. Li Joint work with: Leszek Gąsieniec, Paul Sant, and.


Efficient Probe Selection in Micro-array Design
Speaker: Cindy Y. Li
Joint work with: Leszek Gąsieniec, Paul Sant, and Prudence Wong
Special thanks go to: David Peleg
Algorithmics Group, Dept. of Computer Science, University of Liverpool

Talk Overview
- Background: Microarrays & Hybridization
- Problem Statement
- Our Approach
- Experimental Work
- Conclusion

Hybridization Process
DNA:                 5'... TGTGCTTGACAACATAGTTG ...3'
Short DNA fragments: 3'-CTACGGACCGAT-5'
A single-stranded DNA probe (middle panel) is linked to an enzyme and allowed to base pair (hybridize) with the mRNA. After a series of washes, only fragments that are hybridized with the target mRNA remain.

Tool: DNA Microarrays
- Labeled DNA/RNA mixture flushed over an array of short DNA fragments
- Laser activation of fluorescent labels

Probe concept
- A probe is a substring of a gene, which acts as its fingerprint (a.k.a. signature).
- Probes are relatively short DNA sequences. Usually, a probe is ~ base pairs long.
For example:
DNA:   ... TGTGCTTGGCAACATAGATAGATGC ...
Probe:       TGCTTGGCAACATAGATAGA

Finding unique probes
- We are interested in finding a single unique probe (or a small group of unique probes) for each gene.
- The search process should be both time and space efficient.
[Figure: probes P1 to P5 mapped to genes G1 to G4]

Finding unique probes
- Given a database S of gene sequences.
- For each sequence g in S, try to find a single probe P which hybridizes only with g.
- If P cross-hybridizes with some other sequence in S (i.e., P has a close occurrence elsewhere in S), then find a small set of probes that uniquely identifies g.
- Sometimes multiple probes are required because of the error-prone wet lab environment.

The use of probes
- The uniqueness of probes allows us to identify the genes taking part in a wet lab experiment.
- That is, from the trace (green color) of a number of probes on the microarray we can identify precisely which genes were involved in the experiment.

Finding Unique Probes - Performance Measure
- Each gene in the database S should be uniquely identified by the smallest possible number of probes.
- The search for probes should be time/space efficient.
- The search time should be "fairly" independent of the length of the probes.
- All probes should be far from each other (in Hamming distance).
- Probes should satisfy some extra conditions (e.g., related to the hybridization process).
Naive approach: scan the whole length-n genome for every length-m probe and determine whether the Hamming distance is big enough, which takes O(mn^2) time. For example, this takes 72 hours for the S. pombe genome of length 7.1 x 10^6 bps, and is thus impractical for large genomes.
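
To make the cost of the naive approach concrete, here is a minimal Python sketch of it. It assumes that "far enough" means a Hamming distance strictly greater than a threshold d from every other length-m window; the function names are illustrative and not taken from the talk.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_unique_windows(genome, m, d):
    """Return start positions of length-m windows that are more than d
    mismatches away from every other length-m window of the genome.

    There are about n windows, each compared with about n others at a
    cost of m per comparison, hence O(m * n^2) time overall.
    """
    windows = [genome[i:i + m] for i in range(len(genome) - m + 1)]
    unique = []
    for i, p in enumerate(windows):
        if all(hamming(p, q) > d for j, q in enumerate(windows) if j != i):
            unique.append(i)
    return unique


if __name__ == "__main__":
    # Toy input: the two "AAAA" windows are identical, so neither is unique.
    print(naive_unique_windows("AAAAACGT", 4, 0))
```

The quadratic term is what makes this scan impractical at genome scale, which is the motivation for the faster methods discussed next.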

Previous Work – Approaches based on
- Suffix array and fast pattern matching [Li F. and Stormo G., 2001]
- BLAST to avoid cross-hybridization [Rouillard J. M., Herbert C. J. and Zuker M., 2002]
- Longest common substrings [Rahmann S. 2002]
- Various filtering techniques [Lockhart DJ et al, 1996]
- Methods based on pigeon hole principle [Lee W. H. and Sung W. K., 2003]
- etc.

Previous Work – The probe selection criteria
- No single base exceeds 50% of the probe size
- The length of any contiguous As and Ts or Cs and Gs is less than 25% of the probe size
- (G+C)% is between 40% and 60% of the probe
- Sensitivity: no self-complementarity within the probe sequence
- Homogeneity: melting temperature not too low or too high
- Specificity: probes are unique to each gene
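
The three composition rules above can be checked mechanically. The sketch below is one possible reading of them, not the authors' code: it assumes the second rule refers to the longest run made up only of A/T or only of C/G, and the function name `passes_composition_criteria` is hypothetical. The sensitivity, homogeneity and specificity rules need separate checks and are not covered here.

```python
import re


def passes_composition_criteria(probe):
    """Check the three base-composition rules for a probe over ACGT."""
    n = len(probe)
    if n == 0:
        return False

    # Rule 1: no single base makes up more than 50% of the probe.
    if max(probe.count(base) for base in "ACGT") > 0.5 * n:
        return False

    # Rule 2 (assumed reading): the longest run consisting only of A/T,
    # or only of C/G, must be shorter than 25% of the probe length.
    runs = re.findall(r"[AT]+|[CG]+", probe)
    longest_run = max((len(r) for r in runs), default=0)
    if longest_run >= 0.25 * n:
        return False

    # Rule 3: G+C content between 40% and 60%.
    gc_fraction = (probe.count("G") + probe.count("C")) / n
    return 0.40 <= gc_fraction <= 0.60


if __name__ == "__main__":
    print(passes_composition_criteria("ACGTACGTACGTACGTACGT"))
```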

Previous Work – Test data
Genome name        E. coli      S. pombe      Yeast
Genome length      4,752,411    13,149,651    8,783,280
Number of genes    5,           ,             5,888

Previous Work – Test data
Total length: 8,783,280
Total # of genes: 5,888

Previous Work
                     Li and Stormo,      Rouillard et al.,       Rahmann,           Lee and Sung,
                     BIBE, 2000          Bioinformatics, 2002    WABI, 2002         CSB, 2003
E. coli              23-bps, 1.5 days    -                       -                  50-bps, 31 minutes
Yeast                24-bps, 4 days      50-bps, 1 day           -                  50-bps, 49 minutes
Neurospora crassa    -                   -                       25-bps, 4 hours    50-bps, 3.5 hours
# of probes          Top 10              All probes              Top 50             All probes

Talk Overview
- Background: Microarrays & Hybridization
- Problem Statement
- Our new alternative approach
  - main observations
  - the algorithm
- Experimental work
- Conclusion

Main Observations
In general, randomness helps!
- 80% of "randomly" chosen (based on our algorithm) candidates for probes satisfy the probe selection criteria related to the hybridization process [this suggests that random sequences are more likely to hybridize properly].
- The expected Hamming distance between two randomly chosen sequences of length n over a 4-letter alphabet is ~3n/4 [this suggests that randomly chosen probes will be far from each other].
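
The 3n/4 figure follows from linearity of expectation: two independent, uniformly random bases differ with probability 3/4, and there are n positions. The short Python sketch below confirms this exactly for small n by enumerating all pairs of sequences; the function name is illustrative, and uniform, independent bases are an assumption of the check.

```python
from itertools import product


def exact_mean_hamming(n, alphabet="ACGT"):
    """Average Hamming distance over all ordered pairs of length-n
    sequences, computed by brute-force enumeration (small n only)."""
    seqs = list(product(alphabet, repeat=n))
    total = sum(
        sum(a != b for a, b in zip(s, t))
        for s in seqs
        for t in seqs
    )
    return total / len(seqs) ** 2


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, exact_mean_hamming(n), 3 * n / 4)
```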

An interesting observation
- In general, fragments of DNA sequences representing genes are more deterministic (contain more organized information) compared with the rest of the sequence.
- In contrast, the best probes (signatures) representing genes are very likely to be random or almost random!

The Algorithm
(*) For every gene g in the database S:
a) Generate a random base-pair sequence of length m.
b) Find the closest length-m substring P in gene g.
c) Check P against the good-probe criteria [80% pass this test]. If P does not pass the criteria, go to a).
d) Check P for cross-hybridization [98% pass this test]. For every length-m substring Q in the other sequences S - {g}: if the Hamming distance H(P,Q) > d for every such Q, P is chosen as the probe for g; go to (*) for the next gene. Otherwise, P can possibly cross-hybridize and we must generate another length-m random substring P'; go to a).

The algorithm
(*) For every gene g in the database S:
a) Generate a random base-pair sequence of length m.
b) Find the closest length-m substring P in gene g.
c) Check P against the good-probe criteria; if P does not pass the criteria, go to a).
[Figure: random sequence R and its closest substring P in gene g]

The algorithm
d) Check P for cross-hybridization. For every length-m substring Q in the other sequences (S - {g}): if H(P,Q) > d for every such Q, P is chosen as the probe for g; go to (*). Otherwise, P can possibly cross-hybridize and we must generate another length-m random substring; go to a).
[Figure: probe P from gene g compared against the background sequences g1, g2, ..., gi. P is far from g1 and g2 (pass), but some substring Q of gi has H(P,Q) < d (fail), so another length-m random substring is generated.]
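
Steps a) to d) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: both the closest-substring search and the cross-hybridization check are done by brute-force scanning, the retry cap `max_tries` is added so the loop always terminates, and `passes_criteria` is a hypothetical callback standing in for the good-probe test of step c).

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def closest_substring(gene, pattern):
    """Step b): the length-m window of gene nearest to pattern."""
    m = len(pattern)
    windows = [gene[i:i + m] for i in range(len(gene) - m + 1)]
    if not windows:
        return None  # gene is shorter than m
    return min(windows, key=lambda w: hamming(w, pattern))


def cross_hybridizes(probe, others, d):
    """Step d): True if some window Q of another sequence has H(P,Q) <= d."""
    m = len(probe)
    return any(
        hamming(probe, seq[i:i + m]) <= d
        for seq in others
        for i in range(len(seq) - m + 1)
    )


def select_probe(gene, others, m, d, passes_criteria, rng, max_tries=100):
    """Return a probe for gene, or None if none is found in max_tries."""
    for _ in range(max_tries):
        r = "".join(rng.choice("ACGT") for _ in range(m))    # step a)
        p = closest_substring(gene, r)                        # step b)
        if p is None:
            return None
        if not passes_criteria(p):                            # step c)
            continue
        if not cross_hybridizes(p, others, d):                # step d)
            return p
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    genes = ["ACGTACGTAC", "TTTTTTTTTT"]
    print(select_probe(genes[0], [genes[1]], 4, 1, lambda p: True, rng))
```

Seeding the generator (`random.Random(0)`) keeps runs reproducible, which is convenient when comparing parameter choices for m and d.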

Talk Overview
- Background: Microarrays & Hybridization
- Problem Statement
- Algorithm
- Experimental Work
- Conclusion

Experimental Work
For Yeast:
- 1.80% of genes have no probes (duplicated / very similar / too short).
- Apart from that:
  - 98.0% of genes need only one probe
  - 1.5% of genes need two probes
  - 0.5% of genes need three probes
Similar results were obtained with the E. coli genome.

Conclusion
- Almost all (98%) genes can be uniquely identified by a single probe; the others need at most three probes.
- Our method is:
  - Suitable for large scale probe design
  - Fairly independent of the length of probes
  - Both time and space efficient
  - Useful in the design of fault-tolerant systems of probes

Ongoing Work
- Distinguish multiple targets in a sample
[Figure: probes P1, P2, P1', P2' and genes g1, g2, g3]

Questions?

Thank You! Presented By Cindy Y. Li

Self-complementarity
Probe 5' TTTCAGTAATAAAAGATTTCTGT 3'
         ||||
Probe 3' TGTCTTTAGAAAAATTAGACTTT 5'
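
One simple way to screen for self-complementarity is to look for the longest stretch of the probe that also appears in its own reverse complement, since such a stretch can base pair with another copy of the probe (or fold back on itself). The Python sketch below uses a standard longest-common-substring dynamic program for this. It is a crude proxy under that assumption and not the test used in the talk; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_self_complementary_stretch(probe):
    """Length of the longest substring shared by the probe and its
    reverse complement (longest common substring, O(n^2) time)."""
    rc = reverse_complement(probe)
    n = len(probe)
    best = 0
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if probe[i - 1] == rc[j - 1]:
                # Extend the common stretch ending at (i-1, j-1).
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(longest_self_complementary_stretch("TTTCAGTAATAAAAGATTTCTGT"))
```

A probe would then be rejected when this length exceeds some threshold; the threshold itself is a design choice that the slides do not specify.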

Melting Temperature
- T_M can be used as a parameter to evaluate probe hybridization behavior.
- T_M is calculated for each probe following SantaLucia et al. (1996), in terms of these quantities:
  - ΔH is the sum of the nearest-neighbor enthalpy changes
  - ΔS is the sum of the nearest-neighbor entropy changes
  - R is the gas constant (1.987 cal deg^-1 mol^-1)
  - C_T is the total molar concentration of strands ( )
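
A commonly cited form of the nearest-neighbor melting temperature for a non-self-complementary duplex is T_M = ΔH / (ΔS + R ln(C_T / 4)) - 273.15, in degrees Celsius. Whether the talk used exactly this variant (in particular the factor 4 and the Celsius offset) is an assumption. The Python sketch below evaluates that expression for given totals ΔH and ΔS; it does not include a nearest-neighbor parameter table, and the input values in the example are made up for illustration only.

```python
import math

R = 1.987  # gas constant in cal / (K * mol), as quoted on the slide


def melting_temperature_celsius(delta_h, delta_s, c_t):
    """Nearest-neighbor T_M in degrees Celsius (assumed variant).

    delta_h: total enthalpy change in cal/mol (negative for duplexes)
    delta_s: total entropy change in cal/(K*mol)
    c_t:     total molar strand concentration in mol/L
    """
    return delta_h / (delta_s + R * math.log(c_t / 4.0)) - 273.15


if __name__ == "__main__":
    # Illustrative (made-up) totals: -100 kcal/mol and -280 cal/(K*mol).
    print(round(melting_temperature_celsius(-100_000.0, -280.0, 1e-4), 2))
```

Under this expression a higher strand concentration gives a higher T_M, which is the usual behavior for duplex formation.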

Melting Temperature: thermodynamic stability (nearest neighbour)
[Figure: nearest-neighbour table over A, T, C, G and the stability profile of the example probe TTTCAGTAATTAAAAAGATTTCTGT, in kcal/mol]