Quick Overview of Bioinformatics

Chuong Huynh NIH/NLM/NCBI New Delhi, India September 28, 2004

What is bioinformatics? - Definition
My definition: bringing biological themes to computers.
Peter Elkin, Primer on Medical Genomics, Part V: Bioinformatics: “Bioinformatics is the discipline that develops and applies informatics to the field of molecular biology.”
BISTIC definition of bioinformatics: “Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”
BISTIC definition of computational biology: “The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.”
Bioinformatics applies principles of information sciences and technologies to make the vast, diverse, and complex life sciences data more understandable and useful. Computational biology uses mathematical and computational approaches to address theoretical and experimental questions in biology. Although bioinformatics and computational biology are distinct, there is also significant overlap and activity at their interface.
What is Bioinformatics? Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological, biochemical and biophysical data.

Useful/Necessary Bioinformatics Skills
Strong background in some aspect of molecular biology!!!
Ability to communicate biological questions comprehensibly to computer scientists
Thorough comprehension of the problems in the bioinformatics field
Statistics (association studies, clustering, sampling)
Ability to filter, parse, and munge data and determine the relationships between data sets
Mathematics (e.g. algorithm development)
Engineering (e.g. robotics)
Good knowledge of a few molecular biology software packages (molecular modeling / sequence analysis)
Command-line computing environment (Linux/Unix knowledge)
Data administration (esp. relational database concepts), computer programming skills/experience (C/C++, Sybase, Java, Oracle), and scripting language knowledge (Perl and perhaps Python)

More of a list of industry bioinformatics core requirements:
Molecular evolution, physical chemistry (kinetics, thermodynamics), statistics and probability
Database design and implementation; algorithm development
Molecular biology laboratory methods
Computer science and genomics expertise for analysis of small- and large-scale informatics problems
The ability to filter information and extract possible relationships between data sets
Database administration and programming skills (e.g., Sybase, CORBA, Perl, Java, C++, C)
The ability to frame biological questions in a manner understandable to computer scientists (enabling the design of effective tools)
A thorough understanding of the problems addressed in the bioinformatics field

Bioinformatics Flow Chart (0)
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & protein expression data
7. Drug screening (ab initio drug design OR drug compound screening in database of molecules)
8. Genetic variability

1a. Sequencing: base calling; physical mapping; fragment assembly
1b. Analysis of nucleic acid seq.: gene finding (stretch of DNA coding for protein); analysis of noncoding regions of the genome; multiple seq alignment → evolutionary tree
2. Analysis of protein seq.: sequence relationships
3. Molecular structure prediction: 3D modeling; DNA, RNA, protein, lipid/carbohydrate
4. Molecular interaction: protein-protein interaction; protein-ligand interaction
5. Metabolic and regulatory networks
Notes: This is more of a drug discovery perspective. Gene discovery requires annotating genes. Many of the concepts here will be discussed throughout this course.

6. Gene & protein expression data: EST; DNA chip/microarray
7. Drug screening: ab initio drug design OR drug compound screening in database of molecules
Lead compound: binds tightly to the binding site of the target protein
Lead optimization: lead compound modified to be nontoxic, with few side effects, and deliverable to the target
Drug molecules are designed to be complementary to binding sites, with physicochemical and steric restrictions
8. Genetic variability: now investigated at the genome scale; SNP, SAGE

Genome Sequencing Clone by clone vs whole genome shotgun
Strategy: clone by clone vs whole genome shotgun
Libraries: subcloning; generate small-insert libraries
Sequencing: most genomes will be sequenced and can be sequenced; few problems are unsolvable. The problem lies in understanding what you have: gene prediction/gene finding and annotation
Assembly: process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)
Closure: process of ordering and merging consensus sequences into a single contiguous sequence
Finishing = Assembly + Closure step
Annotation: DNA features (repeats/similarities); gene finding; peptide features; initial role assignment; others, e.g. regulatory regions
Release: release data to the public, e.g. EMBL or GenBank
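
The assembly step defined above (turning raw reads into contiguous consensus sequence) can be illustrated with a toy greedy overlap merger. This is only a sketch under simplifying assumptions: reads are error-free, all come from the same strand, and there are no repeats. It is not how Phred/Phrap work internally, and the function names `overlap` and `greedy_assemble` are hypothetical, chosen for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Toy assembler: repeatedly merge the two sequences with the largest overlap.

    Assumes error-free, same-strand reads (a strong simplification).
    Returns the list of contigs left when no overlaps >= min_len remain.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "ACCTTAG"]))
```

Reads that share no overlap stay as separate contigs, which is the situation the closure step then has to resolve by ordering and joining them.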

Both strands coverage;
Workflow: Genomic DNA → shearing/sonication → small DNA fragments (kb) → clone library (pUC18) → subclone and sequence random clones → shotgun reads → assembly → contigs → finishing reads (both-strand coverage; gaps filled) → complete sequence
Notes:
Take a large piece of (usually genomic) DNA and break it up into smaller pieces. Smaller projects used restriction enzymes at the beginning of sequencing; sequencing projects use shearing or sonication. The longer you sonicate, the smaller the pieces you get.
Put the fragments into a bacterial vector (subclone), choose random clones, then sequence both ends (M13, pUC).
Try to assemble these pieces together and re-form the original sequence; filling in the gaps is called finishing.
How big a piece of DNA do you start with? It depends on how much confidence you have in putting those pieces back together; remember that the average read length is about 600 bp.
Slide provided by Daniel Lawson
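
The notes above give an average read length of about 600 bp, which is enough for a rough estimate of how many shotgun reads a project needs. The sketch below is a back-of-the-envelope calculation, not part of the original slides: the 40 kb target size and 8-fold coverage are hypothetical example values, and the gap estimate uses the standard idealized Poisson model (the fraction of bases left uncovered is about e^-c at coverage c), which ignores cloning bias and repeats.

```python
import math


def reads_needed(target_length_bp: int, coverage: float, read_length_bp: int = 600) -> int:
    """Number of reads needed to reach a given average fold coverage."""
    return math.ceil(coverage * target_length_bp / read_length_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized Poisson estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


if __name__ == "__main__":
    target = 40_000   # hypothetical insert size in bp (assumption)
    fold = 8          # hypothetical target coverage (assumption)
    print(reads_needed(target, fold))          # reads of ~600 bp each
    print(expected_uncovered_fraction(fold))   # expected fraction left as gaps
```

Even at high coverage the uncovered fraction is small but not zero, which is one reason a separate finishing step is needed.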

Annotation of eukaryotic genomes
From gene to function: Genomic DNA → (transcription) → unprocessed RNA → (RNA processing) → mature mRNA (Gm3 ... AAAAAAA) → (translation) → nascent polypeptide → (folding) → active enzyme → function (Reactant A → Product B)
Analyses: ab initio gene prediction; comparative gene prediction; functional identification
Notes:
The prokaryotic world would ignore some of these steps, depending on what you are interested in.
Ab initio: black magic? Whether you trust a gene finder depends on how well you understand what underlies it.
Comparative: comparing another organism's DNA to your known DNA; interesting at the DNA-to-DNA level (mouse/human; C. elegans/C. briggsae; P. falciparum/P. vivax comparisons).
Functional identification: something the community wants.
Annotation: extract ORFs; predict protein; remove errors; compare with database of ‘known function proteins’; provide transitive annotations
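
The first annotation step listed above, extracting ORFs, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a gene finder: it reports only simple ATG-to-stop stretches on both strands, ignores introns (so it suits prokaryote-like sequence better than eukaryotic genes), and uses the standard stop codons. The names `extract_orfs` and `orfs_in_strand` and the `min_codons` threshold are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (start, end, orf) for ATG...stop ORFs in the three frames of one strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop (including the ATG).
    """
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    yield start, pos + 3, seq[start:pos + 3]
                start = None


def extract_orfs(dna: str, min_codons: int = 2):
    """Return (strand, start, end, orf) tuples for both strands.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    found = [("+", s, e, o) for s, e, o in orfs_in_strand(dna, min_codons)]
    found += [("-", s, e, o)
              for s, e, o in orfs_in_strand(reverse_complement(dna), min_codons)]
    return found


if __name__ == "__main__":
    for hit in extract_orfs("CCATGAAATTTTAGGG"):
        print(hit)
```

In real annotation work a much larger `min_codons` would be used, and the resulting candidates would still need the later steps (error removal and comparison against proteins of known function).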


Positional Cloning

Positional Candidate Cloning

The new information is always partial
[Figure: complete, ongoing, and published genome projects, eukaryotic and prokaryotic]
Even a complete genome is only partially understood

Why not use the genome sequence once it's ‘ready’?
Finding exons: 30% overprediction; 20% not found at all
Comparison systems rely on EST sequences, which themselves contain large error rates
Others are looking through partial data
Once the genome is done … when?
Expressed sequences are there in part and represent a very, very powerful key.
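
To make the exon-finding figures above concrete, the sketch below turns them into counts. It rests on one reading of the slide, which is an assumption: "20% not found at all" is taken to mean that 20% of real exons are missed, and "30% overprediction" that 30% of predicted exons are not real. The slide does not define either term, and the 1000-exon total is a hypothetical example value.

```python
def exon_prediction_counts(true_exons: float,
                           missed_fraction: float = 0.20,
                           false_fraction: float = 0.30) -> dict:
    """Translate miss and overprediction rates into exon counts.

    Assumes missed_fraction is the share of real exons not found and
    false_fraction is the share of predictions that are wrong.
    """
    true_positives = true_exons * (1 - missed_fraction)
    # Total predictions such that false_fraction of them are false positives.
    predicted = true_positives / (1 - false_fraction)
    return {
        "true_positives": true_positives,
        "false_negatives": true_exons - true_positives,
        "false_positives": predicted - true_positives,
        "sensitivity": true_positives / true_exons,
        "precision": true_positives / predicted,
    }


if __name__ == "__main__":
    for name, value in exon_prediction_counts(1000).items():
        print(f"{name}: {value:.2f}")
```

Under this reading, a genome with 1000 real exons would yield roughly 1140 predictions, of which about 340 are spurious and 200 real exons are absent, which is why expressed sequences remain valuable as independent evidence.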

Interpreting data from many sources

Genomics and Tropical Diseases
How can genomics contribute to the control of tropical diseases?
Challenges and opportunities
The role of bioinformatics
Strategic emphases for research
WHO/TDR Genomics and World Health Report 2002

Why Pathogen Genomics?
“The power and cost-effectiveness of modern genome sequencing technology mean that complete genome sequences of 25 of the major bacterial and parasitic pathogens could be available within five years. For about 100 million dollars (…), we could buy the sequence of every virulence determinant, every protein antigen and every drug target.”
B. Bloom (1995) A microbial minimalist. Nature 378:236
From a slide by Carlos Morel

Genomics and Drug Development for Tropical Diseases: Challenges
Knowledge limitations:
A large proportion of pathogen genes have unknown function
Heavy investment in genomics is made by the commercial sector, so the results are not widely available
Emphasis and priorities:
Genomes of non-pathogenic model organisms (S. cerevisiae, D. melanogaster, C. elegans, A. thaliana)
Genomes of pathogens that affect individuals in developed countries
Neglected diseases → neglected pathogens

Doing Successful Science in the new millennium
Huge increase in available biological information
The classic paradigm of ‘molecular biology’ is now shifting rapidly to genomics
Understanding the new paradigms concerns more than ‘just bench biology’
Discovery requires large-scale systems and broad collaborations; global problems
Funding comes in large amounts at group level; it is no longer a single-laboratory or single-institution effort
Accountable output

The Bigger Picture (Malaria)

Genomics Approach to Drug Development: Opportunities
Classical laboratory assays aim at targets in which mutation is lethal to the pathogen
Valuable targets can be missed
Sulphonamides: inhibition of the p-aminobenzoic acid pathway is not lethal for growth in the laboratory but severely attenuates the capacity to cause disease

Genomics Approach to Drug Development: Opportunities
New approaches for the identification of gene products specifically involved in the disease process may uncover further drug targets:
Signature tagged mutagenesis (STM)
Transposon site hybridization (TraSH)
Pathogen genomics and data mining for the discovery of new drug targets

Fosmidomycin
September 1999: a basic science breakthrough (data mining through bioinformatics identifies new targets for chemotherapy of malaria)
1st semester 2001: results of Phase I clinical trials

Fosmidomycin example - lesson
A lesson to take home: 1½ years from data mining and laboratory research to phase II, proof-of-principle clinical trials

Bioinformatics: Opportunities in Health Research and Development
New drug research and development
Identification of novel drug/vaccine targets
Structural predictions
Tapping into biodiversity
Reconstruction of metabolic pathways
Systems biology
Identification of vaccine candidates through analysis of surface antigens and epitopes

A Window of Opportunity for Disease Endemic Countries
Bioinformatics is an extremely important tool, with relevance to studying pathogenic organisms
Pathogens of interest to disease endemic countries (DECs) are already being sequenced (e.g. P. falciparum, T. cruzi, T. brucei, Leishmania sp.)
Computational biology is ‘people-intensive’ and less affected by infrastructure, economics, etc. than other areas of biological research
‘Critical mass’ issues are less critical: a world-wide community is within reach

Relatively Modest Hardware Needs and Technical Support
The Linux operating system permits use of the personal computer as a powerful workstation
Vast repository of public domain software for computational biology
Individual accounts for remote access and data processing can be opened at high-performance computer facilities and regional centers: EMBnet nodes, FIOCRUZ (Brazil), SANBI (South Africa), CECALCULA (Venezuela), ICGEB (Trieste and New Delhi)

Relatively Modest Hardware Needs and Technical Support
Powerful searches using public websites: NCBI, EMBnet nodes, Sanger Center, ExPASy/Swiss-Prot, KEGG database
High-speed internet access is becoming more and more available in disease endemic countries through regional and international support, e.g.:
Asia-Pacific Advanced Network Consortium (APAN)
MIMCom Malaria Research Resources

TDR Regional Training Centers & Regional Training Courses on Bioinformatics Applied to Tropical Diseases
International Training Course on Bioinformatics and Computational Biology Applied to Genome Studies (Train-the-trainers Workshop): May 21-June 15, 2001, FIOCRUZ, Brazil
Africa
SANBI, Cape Town, South Africa. Courses: Jan 20-Feb 02, 2002; Mar 19-Apr 4, 2003; Feb 2-15, 2004 (with NBN series)
Univ of Ibadan, Ibadan, Nigeria. Course: May 26-Jun 07, 2003
South America
USP, São Paulo, Brazil. Courses: Feb 18-March 02, 2002; July 17-19, 2003; July 5-16, 2004
Southeast Asia
ICGEB, New Delhi, India. Courses: Apr 26-May 09, 2002; Sep 22-Oct 06, 2003; Sept 28-Oct 11, 2004
Mahidol University, Bangkok, Thailand. Courses: Jul 09-23, 2002; Sep 29-Oct 10, 2003; July 26-Aug 6, 2004

Training Course on Bioinformatics and Functional Genomics Applied to Insect Vectors of Human Diseases
At the Center for Bioinformatics and Applied Genomics (CBAG) and Center for Vector and Vector-Borne Diseases (CVVD), Faculty of Science, Mahidol University, Bangkok, Thailand. January 17-28, 2005
Training Course on Functional Genomics of Insect Vectors of Human Diseases
African Center for Training in Functional Genomics of Insect Vectors of Human Diseases (AFRO VECTGEN), at the Malaria Research and Training Center (MRTC), Bamako, Mali. Dec 1-16, 2004

Beginning Bioinformatics Books
Baxevanis & Ouellette: Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Edition. John Wiley.
Gibas & Jambeck: Developing Bioinformatics Computer Skills. O’Reilly.
Mount: Bioinformatics: Sequence and Genome Analysis, 2001
Claverie & Notredame: Bioinformatics For Dummies, 2003
Pevsner: Bioinformatics and Functional Genomics, 2003
Lesk: Introduction to Bioinformatics, 2002
Krane & Raymer: Fundamental Concepts of Bioinformatics, 2003
Tisdall: Beginning Perl for Bioinformatics, 2002
Gibson & Muse: A Primer of Genome Science, 2002

The Challenge What is expected of you?

Comments and Suggestions
Course schedule: take out your course schedule.
