Introduction to Bioinformatics

1 Introduction to Bioinformatics
IBGP/BMI 730 Introduction to Bioinformatics Director: Prof. Victor Jin

2 Basic Molecular Biology
All living things are made of cells: prokaryote, eukaryote. Cell signaling. What is inside the cell: from DNA, to RNA, to proteins.

3 Cells
Fundamental working units of every living system. Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells. Prokaryotes and eukaryotes are descended from the same primitive cell. All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.

4 Cell Structure
A cell is the smallest structural unit of an organism that is capable of independent functioning. All cells have some common features.

5 Cell Cycle
Born, eat, replicate, and die.

6 The Tree of Life
According to the most recent evidence, there are three main branches to the tree of life. Prokaryotes include Archaea (“ancient ones”) and Bacteria. Eukaryotes form the domain Eukarya, which includes plants, animals, fungi and certain algae.

7 Prokaryotes and Eukaryotes
Prokaryotes: single cell; no nucleus; no organelles; one piece of circular DNA; no mRNA post-transcriptional modification.
Eukaryotes: single or multi cell; nucleus; organelles; chromosomes; exon/intron splicing.

8 Signaling Pathways: Control Gene Activity
Instead of having brains, cells make decisions through complex networks of chemical reactions, called pathways: synthesize new materials; break other materials down for spare parts; signal to eat or die.

9 An Example -- Cell Cycle Signaling

10 Cells Information and Machinery
Cells store all the information needed to replicate themselves. The human genome is around 3 billion base pairs long. Almost every cell in the human body contains the same set of genes, but not all genes are used or expressed by those cells.
Machinery: collect and manufacture components; carry out replication; kick-start its new offspring.

11 Terminology
Genome: an organism’s genetic material.
Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA.
Genotype: the genetic makeup of an organism.
Phenotype: the physically expressed traits of an organism.
Nucleic acid: biological molecules (RNA and DNA) that allow organisms to reproduce.
Amino acid: organic molecules that are the building blocks of proteins.
Protein: a large, complex molecule that is an essential part of organisms, participates in every process within cells, and achieves a particular function.

12 Three critical molecules
DNAs: hold information on how the cell works.
RNAs: act to transfer short pieces of information to different parts of the cell; provide templates to synthesize into protein.
Proteins: form enzymes that send signals to other cells and regulate gene activity; form the body’s major components (e.g. hair, skin, etc.).

13 Overview of DNA to RNA to Protein
A gene is expressed in two steps. Transcription: RNA synthesis. Translation: protein synthesis.

14 DNA: the Genetic Makeup
Genes are inherited and are expressed: genotype (genetic makeup), phenotype (physical expression). The figure on the left shows the eye phenotypes produced by the green and black eye genes.

15 Central Dogmas of Molecular Biology
1) The concept of genes is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes the genetic information. It is a double-stranded helix whose backbone is built from deoxyribose sugars and which carries four bases: adenine (A), cytosine (C), guanine (G) and thymine (T). [Note that only 4 values need to be encoded (A, C, G, T), which can be done using 2 bits. But to allow redundant letter combinations (like N, which means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
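
The note on 2-bit versus 4-bit encoding can be made concrete with a short Python sketch. The particular bit assignments below (A=00, C=01, G=10, T=11, and one flag bit per base in the 4-bit scheme) are illustrative assumptions, not a format defined in the slides.

```python
# Illustrative 2-bit packing of an unambiguous DNA string.
# The code assignment is an arbitrary choice made for this sketch.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTERS = "ACGT"


def pack_2bit(seq: str) -> int:
    """Pack a DNA string into a single integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | TWO_BIT[base]
    return value


def unpack_2bit(value: int, length: int) -> str:
    """Recover the DNA string; the length is needed to keep leading A's."""
    bases = []
    for _ in range(length):
        bases.append(LETTERS[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


# 4-bit alphabet: one flag bit per base, so a symbol can stand for a set of bases.
FOUR_BIT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
FOUR_BIT["R"] = FOUR_BIT["A"] | FOUR_BIT["G"]  # R = purine (A or G)
FOUR_BIT["N"] = 0b1111                         # N = any of the 4 nucleotides


def matches(symbol: str, base: str) -> bool:
    """True if the (possibly ambiguous) symbol is compatible with the base."""
    return bool(FOUR_BIT[symbol] & FOUR_BIT[base])
```

With one flag bit per base, testing whether an ambiguous symbol is compatible with a base is a single bitwise AND, which is one reason a 4-bit alphabet is convenient.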

16 Central Dogmas of Molecular Biology
3) Each base on one side of the double helix faces its complementary base: A pairs with T, and G pairs with C.
4) Biochemical processes that copy the DNA (replication and transcription) always synthesize the new strand in the 5' to 3' direction, so sequences are conventionally read from the 5' side towards the 3' side.
5) A gene can be located on either the 'plus strand' or the 'minus strand'. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base. E.g., if the sequence on the + strand is ACGTGATCGATGCTA, the - strand must be read off by reading the complement of this sequence going 'backwards': TAGCATCGATCACGT.
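
The plus/minus strand example on this slide is the reverse complement operation. A minimal Python sketch follows; the function name is an assumption made for illustration, and only the four unambiguous bases are handled.

```python
# Complement each base (A<->T, C<->G), then reverse so the result reads 5' to 3'.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, written in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGTGATCGATGCTA"
print(reverse_complement(plus_strand))  # TAGCATCGATCACGT
```

Applying the function twice returns the original sequence, which is a quick sanity check when handling strand orientation.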

17 Central Dogmas of Molecular Biology
6) DNA information is copied over to mRNA, which acts as a template to produce proteins. We often concentrate on protein-coding genes, because proteins are the building blocks of cells and the majority of bioactive molecules (but let's not forget the various RNA genes).
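
The DNA to mRNA to protein flow can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the input is the coding (plus) strand, the standard genetic code is used, translation starts at the first base, and splicing and other processing are ignored. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_strand: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))  # AUGGCCUAA MA
```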

18 Bioinformatics
Bioinformatics (computational biology) solves biological problems on the molecular level with the use of techniques including applied mathematics, statistics, computer science, and artificial intelligence.

19 Bioinformatics
Biological Data + Computer Calculations

20 Molecular Biology as an Information Science
Central Paradigm for Bioinformatics: Genomic Sequence -> Transcript -> Protein Structure -> Protein Function
Large Amounts of Information: data management; computer algorithms; statistical methods
Central Dogma of Molecular Biology: DNA -> RNA -> Protein -> Phenotype
Molecules: sequence, structure, function
Processes: mechanism, specificity, regulation

21 Major research efforts
Sequence alignment; gene finding; genome assembly; RNA structure prediction; protein structure prediction; analysis of gene regulation; prediction of protein-protein interactions; modeling of evolution

22 Major research areas
Sequence analysis; genome annotation; computational evolutionary biology; measuring biodiversity; analysis of gene expression; analysis of regulation; analysis of protein expression; analysis of mutations in cancer; analysis of epigenetics in cancer; high-throughput in vivo binding analysis; prediction of protein structure; comparative genomics; modeling biological systems; high-throughput image analysis; protein-protein docking; software and tools; databases; web services in bioinformatics

23 Data types
DNA sequences; RNA sequences; protein sequences.
Gene expression: cDNA and mRNA microarray data; now tiling array technology (50 M data points to tile the human genome at ~50 bp resolution). A genome can only be sequenced once, but an infinite variety of array experiments can be done.
Protein-DNA interactions: ChIP-chip, ChIP-seq, ChIP-PET and so on.
Phenotype experiments: KOs (knockouts).
Protein interactions: yeast two-hybrid; proteomics.

24 Other Integrative Data
Information to understand genomes: metabolic pathways; regulatory networks; signaling networks; whole organisms phylogeny; the literature (MEDLINE)

25 GenBank Growth

26 Exponential Growth of Data Matched by Development of Computer Technology
Driving force in bioinformatics: CPU vs disk and Internet. [Figure; series labeled: Internet Hosts; No. Protein Domain Structures]

27 Types of Relational databases
The Internet can be thought of as one enormous relational database, in which the “links” (URLs) are the primary keys.
SQL (Structured Query Language).
Database systems: Sybase; Oracle; Access. Sybase is used at NCBI.
SRS: one type of database querying system used in biology.
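
To illustrate primary keys and an SQL query, here is a minimal sketch using Python's built-in sqlite3 module rather than the Sybase, Oracle, or Access systems named on the slide. The table names, column names, and rows are hypothetical placeholders, not real database records.

```python
import sqlite3

# In-memory database; the schema and data are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies a row
    symbol  TEXT NOT NULL
);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    gene_id    INTEGER NOT NULL REFERENCES gene(gene_id),  -- link to gene
    name       TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [(10, 1, "protein A1"), (11, 1, "protein A2"), (12, 2, "protein B1")],
)

# Follow the key from protein back to gene, much like following a link.
rows = conn.execute(
    """
    SELECT g.symbol, p.name
    FROM gene AS g
    JOIN protein AS p ON p.gene_id = g.gene_id
    WHERE g.symbol = ?
    ORDER BY p.protein_id
    """,
    ("geneA",),
).fetchall()
print(rows)  # [('geneA', 'protein A1'), ('geneA', 'protein A2')]
```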

28 XML Database and vocabularies for life science
HTML: Hypertext Markup Language.
XML: a general-purpose specification for creating custom markup languages. It is classified as an extensible language because it allows the user to define the mark-up elements.
BSML (Bioinformatic Sequence Markup Language): an extensible language specification and container for bioinformatic data. BSML was developed under a 1997 grant from the National Human Genome Research Institute (NHGRI) as an evolving public domain standard for the bioinformatics community.

29 Examples of XML
<?xml version="1.0" encoding="UTF-8"?>
<element_name attribute_name="attribute_value">Element Content</element_name>
<book>This is a book... </book>
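
As a hedged illustration of how such markup is consumed, the snippet below parses the two examples with Python's standard xml.etree.ElementTree module. Each example is parsed as its own document, because a well-formed XML document has exactly one root element. The variable names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# First example: XML declaration followed by a single element with an attribute.
generic_doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<element_name attribute_name="attribute_value">Element Content</element_name>"""

root = ET.fromstring(generic_doc)
print(root.tag)                       # element_name
print(root.attrib["attribute_name"])  # attribute_value
print(root.text)                      # Element Content

# Second example: a user-defined <book> element.
book = ET.fromstring("<book>This is a book... </book>")
print(book.tag, repr(book.text))
```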

30 Primary Databases
A primary database is a repository of data derived from experiments or from research knowledge.
GenBank (nucleotide repository), Protein DB, Swiss-Prot, and PDB (MMDB) are primary databases.
PubMed (literature). Genome mapping databases. KEGG database (pathways).

31 Secondary Databases
A secondary database contains information derived from other sources.
RefSeq (curated collection of GenBank at NCBI); UniGene (clustering of ESTs at NCBI); GeneID (unique ID for each gene at NCBI).
Organism-specific databases are often a mix between primary and secondary.

32 Biological Databases
Nucleotide databases:
GenBank: international collaboration of NCBI (USA), EMBL (Europe), and DDBJ (Japan and Asia). A “bank”: no curation. Submission to these databases is required for publication in a journal.
Organism-specific databases (quick quiz: find URLs using search engines): FlyBase; ChickGBASE; pigbase; wormpep; YPD (Yeast Protein Database); SGD (Saccharomyces Genome Database).

33 Protein Databases:
NCBI: more next week.
Swiss-Prot: free for academic use, otherwise commercial, with licensing restrictions on discoveries made using the DB (there is a version free of any licensing and a pay version). NCBI has the latest free version.
Translated proteins from GenBank submissions.
EMBL: TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT.
PIR

34 Genome mapping information:
Structure databases: PDB: protein structure database. MMDB: NCBI’s version of PDB with Entrez links.
Genome mapping information: NCBI (Human). Genome centers: Stanford, Washington University, UC Berkeley. Research centers and universities.

35 Literature databases:
NCBI: PubMed: all biomedical literature. Abstracts and links to publisher sites for full-text retrieval/ordering; journal browsing. Publisher web sites. Biomednet: commercial site for literature search.
Pathways database: KEGG (Kyoto Encyclopedia of Genes and Genomes).
Genome search and visualization database: UCSC Genome Browser (genome.ucsc.edu/)

36 Information techniques
Geometry: robotics; graphics (surfaces, volumes); comparison and 3D matching (vision, recognition).
Physical Simulation: Newtonian mechanics; electrostatics; numerical algorithms; simulation.
Databases: building, querying; complex data.
Text String Comparison: text search; 1D alignment; significance statistics; Alta Vista, grep.
Finding Patterns: machine learning; clustering; data mining.

37 Bioinformatics as New Paradigm for Scientific Computing
Physics: prediction based on physical principles. Example: exact determination of rocket trajectory. Emphasizes: supercomputer, CPU.
Biology: classifying information and discovering unexpected relationships. Example: gene expression network. Emphasizes: networks, “federated” database.

38 Topics -- Genome Sequence
Finding genes in genomic DNA: introns, exons, promoters.
Characterizing repeats in genomic DNA: statistics, patterns.
Duplications in the genome: large-scale genomic alignment.
Whole-genome comparisons.
Finding structural RNAs.

39 Topics -- Protein Sequence
Sequence Alignment: how to align two strings optimally via dynamic programming; local vs global alignment; suboptimal alignment; hashing to increase speed (BLAST, FASTA); amino acid substitution scoring matrices.
Multiple Alignment and Consensus Patterns: how to align more than one sequence and then fuse the result in a consensus representation; HMMs, profiles; motifs.
Scoring Schemes and Matching Statistics: how to tell if a given alignment or match is statistically significant (a P-value or an E-value?); score distributions; low-complexity sequences.
Evolutionary Issues: rates of mutation and change.
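
Aligning two strings optimally via dynamic programming can be sketched as follows. This is a minimal Needleman-Wunsch-style global alignment. The scoring values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions, not taken from the slides. Practical tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_alignment("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

A local (Smith-Waterman-style) variant differs mainly in clamping cell scores at zero and starting the traceback from the highest-scoring cell.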

40 Topics -- Structures
Secondary Structure “Prediction”: via propensities; neural networks, genetic algorithms; simple statistics; TM-helix finding; assessing secondary structure prediction.
Structure Prediction: protein vs RNA.
Tertiary Structure Prediction: fold recognition; threading; ab initio.
Direct Function Prediction: active site identification.
Relation of sequence similarity to structural similarity.

41 Topics -- Structures Structural Alignment
Structure Comparison
Basic Protein Geometry and Least-Squares Fitting: distances, angles, axes, rotations; calculating a helix axis in 3D via fitting a line; LSQ fit of 2 structures; molecular graphics.
Calculation of Volume and Surface: how to represent a plane; how to represent a solid; how to calculate an area; hinge prediction; packing measurement.
Structural Alignment: aligning sequences on the basis of 3D structure; unlike for sequences, DP does not converge, so what to do?; other approaches: distance matrices, hashing; fold library.
Docking and Drug Design as Surface Matching.

42 Topics – Function Genomics
Genome Comparisons: ortholog families, pathways; large-scale censuses; frequent words analysis; genome annotation; identification of interacting proteins.
Networks: global structure and local motifs.
Structural Genomics: folds in genomes, shared and common folds; bulk structure prediction; genome trees.
Expression Analysis: time courses, clustering; measuring differences; identifying regulatory regions.
Large-scale cross-referencing of information.
Function classification and orthologs.
The genomic vs. single-molecule perspective.
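
As one small example of these topics, a frequent words analysis can be approximated by counting every overlapping substring of length k (a k-mer) in a sequence. The sketch below is an assumption about what such an analysis looks like in its simplest form; the function name is hypothetical.

```python
from collections import Counter


def most_frequent_kmers(seq: str, k: int):
    """Return (sorted list of the most frequent k-mers, their count)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        return [], 0
    top = max(counts.values())
    return sorted(kmer for kmer, c in counts.items() if c == top), top


print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
# (['CATG', 'GCAT'], 3)
```

On real genomes, raw counts are usually compared against an expected background frequency before a word is called over-represented.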

43 Bioinformatics tools
Sequence comparison (pairwise and multiple alignments, e.g. ClustalW, Blastz)
Phylogenetic reconstruction (e.g. Phylip, IQPNNI, SplitsTree)
Database search (e.g. BLAST, HMMer)
Comparative sequence assembly (e.g. OSLay)
Gene finding (e.g. genscan, FirstEF)
Motif discovery (e.g. MEME, Weeder)
Protein structure (e.g. CE)

44 Bioinformatics algorithms
Dynamic programming; EM algorithms; neural networks; hidden Markov models; support vector machines; phylogenetic trees; clustering

45 Bioinformatics Topics?
(YES?) Digital Libraries: automated bibliographic search and textual comparison; knowledge bases for biological literature.
(YES) Motif discovery using Gibbs sampling.
(YES) Metabolic pathway simulation.
(YES) Gene identification by sequence inspection: prediction of splice sites.
(YES) Linkage analysis: linking specific genes to various traits.
(YES) RNA structure prediction: identification in sequences.
(YES) Homology modeling.

