Introduction to Bioinformatics
Dr. Lokesh Gambhir
Department of Life Sciences
Shri Guru Ram Rai Institute of Technology & Sciences (SGRRITS)
What is bioinformatics?
There are many different answers to this. One basic definition is that it is the use of computational methods to analyse biological data.
By the end of this course, you will
• have knowledge of the many data resources available at the NCBI and EBI,
• understand some of the basic principles behind aligning sequences,
• understand some key points about different sequence alignment programs,
• have experience running some web-based bioinformatics programs,
• understand the information returned by some sequence database searching programs,
• appreciate some of the practical approaches available for automating bioinformatics.
Dr. Lokesh Gambhir, Assistant professor, Department of Life Sciences, SGRRITS, Dehradun
Databases
There are many freely available data resources. Many are hosted by large national and international institutions, such as the National Center for Biotechnology Information (NCBI) in the United States and the European Bioinformatics Institute (EBI).
Few concepts to remember
• DNA
• Protein/Structure
• Pattern recognition
• Folding problem and structure prediction
• The Twilight Zone
• Orthologs
• Paralogs
• DNA sequencing
What can be discovered about a gene by a database search?
A little or a lot, depending on the gene.
• Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc.
• Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc.
• Structural information: associated protein structures, fold types, structural domains
• Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc.
• Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
Searching sequence databases
Start from a sequence, find information about it.
Many kinds of input sequences:
• Could be an amino acid or a nucleotide sequence
• Genomic, mRNA/cDNA, or protein sequence
• Complete or fragmentary sequences
Exact matches are rare (even uninteresting in many cases), so the goal is often to retrieve a set of similar sequences.
Both small (mutations) and large (required for function) differences within “similar” sequences can be interesting.
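One simple way to put a number on "similar" is percent identity between two aligned sequences. The sketch below is only an illustration: the function name, the choice to skip gapped columns, and the two short sequences are assumptions made here, and real search programs use more elaborate scoring and several different definitions of identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Both inputs must already be aligned (same length, '-' marks a gap).
    Skipping gapped columns is one convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # columns where the two residues are identical
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Made-up aligned sequences, for illustration only.
    query = "ATGCT-GA"
    hit = "ATGATCGA"
    print(f"{percent_identity(query, hit):.1f}% identity")
```

Here 7 columns are compared (one is skipped because of the gap) and 6 of them match, so the value is 6/7, about 85.7%.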
What might we want to know about a sequence?
• Is this sequence similar to any known genes? How close is the best match? Is it significant?
• What do we know about that gene?
  • Genomic (chromosomal location, allelic information, regulatory regions, etc.)
  • Structural (known structure? structural domains? etc.)
  • Functional (molecular, cellular & disease)
• Evolutionary information: Is this gene found in other organisms? What is its taxonomic tree?
NCBI
To carry out its diverse responsibilities, NCBI:
• Conducts research on fundamental biomedical problems at the molecular level using mathematical and computational methods
• Maintains collaborations with several NIH institutes, academia, industry, and other governmental agencies
• Fosters scientific communication by sponsoring meetings, workshops, and lecture series
• Supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
• Engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
• Develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
• Develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
Entrez
The Entrez Global Query Cross-Database Search System is a federated search engine, or web portal, that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website.
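Entrez databases can also be queried from a program through NCBI's E-utilities web service. The sketch below only builds an ESearch request URL and does not contact the server. The helper name `build_esearch_url`, the choice of the `nucleotide` database, and the example search term are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 5) -> str:
    """Build an ESearch URL that asks one Entrez database for matching record IDs.

    db     : Entrez database name, e.g. "nucleotide", "protein", "pubmed"
    term   : query text, optionally with field tags such as [Organism]
    retmax : maximum number of IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and brackets so the query is a valid URL.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative query only: the gene and organism are arbitrary choices.
    url = build_esearch_url(
        "nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]"
    )
    print(url)
```

Opening the printed URL in a browser should return a list of matching record identifiers, which is the same kind of search the Entrez web portal performs across its databases.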
Classification of biological databases
Characteristics of entries in the primary nucleotide repositories
• The large nucleotide databases are not hand-curated: the quality of the information depends largely on the people submitting the sequence.
• Records can be updated by the original submitter, or by a third party if the submitter has granted them permission and notified the relevant institute (not common).
• There are redundant entries in these databases.
• Entries can contradict one another.
• Predicted or known proteins encoded by the sequence are linked via their accession numbers in the UniProt Knowledgebase.
• Information from any species, including sequences of unknown origin, can be deposited in the database.
GenBank
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
• Collaboration between NCBI (National Center for Biotechnology Information), EMBL (The European Molecular Biology Laboratory), EBI (European Bioinformatics Institute), and DDBJ (DNA Data Bank of Japan).
• Each record in GenBank is in the “GenBank flat file format”.
• Each record contains information about:
  • sequence type (DNA/protein/RNA……)
  • source/organism, reference, ……
  • features: functions of a region on the sequence
  • the sequence
GenBank http://www.ncbi.nlm.nih.gov/genbank/
Flat File Format of GenBank
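A GenBank flat file is plain text: each line begins with a keyword (LOCUS, DEFINITION, ACCESSION, FEATURES, ORIGIN, ...), the sequence follows ORIGIN, and `//` ends the record. The sketch below reads a few of these fields with plain Python. The record is a made-up, shortened example written for this illustration and is not a real GenBank entry. The parser is deliberately minimal, ignoring multi-line fields and the feature table; a library such as Biopython would be the usual choice for real work.

```python
# Made-up, shortened record for illustration only (not a real GenBank entry).
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE01
SOURCE      synthetic construct
  ORGANISM  synthetic construct
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank_record(text: str) -> dict:
    """Extract a few header fields and the sequence from one flat-file record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of bases.
            record["sequence"] += "".join(line.split()[1:])
            continue
        parts = line.split(None, 1)  # keyword, then the rest of the line
        if not parts:
            continue
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword in ("LOCUS", "DEFINITION", "ACCESSION", "ORGANISM"):
            record[keyword.lower()] = value
    return record


if __name__ == "__main__":
    rec = parse_genbank_record(EXAMPLE_RECORD)
    print(rec["accession"], rec["organism"], len(rec["sequence"]), "bp")
```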
Abraxane is a chemotherapeutic drug. How will you determine the molecular target of the drug?
EMBL-EBI
The roots of the EMBL-EBI lie in the world's first nucleotide sequence database, the EMBL Nucleotide Sequence Data Library (now EMBL-Bank, part of the European Nucleotide Archive), which was established in 1980 at the European Molecular Biology Laboratory in Heidelberg, Germany.
The original goal was to establish a central database of DNA sequences, rather than have scientists submit sequences to journals.
Data retrieval is done with the Sequence Retrieval System (SRS), which connects the primary DNA and protein databases with secondary and specialised databases.
MEDLINE is used for literature references.
UniProt/SWISS-Prot
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
UniProt comprises four components, each optimised for different uses, three of which are listed here:
• The UniProt Knowledgebase (UniProtKB) is the central access point for extensive curated protein information, including function, classification, and cross-references. UniProtKB comprises two sections:
  • UniProtKB/Swiss-Prot, which is manually annotated and reviewed
  • UniProtKB/TrEMBL, which is automatically annotated and not reviewed
• The UniProt Reference Clusters (UniRef) databases provide clustered sets of sequences from the UniProtKB and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences.
• The UniProt Archive (UniParc) is a comprehensive repository, used to keep track of sequences and their identifiers.
UniProt/SWISS-Prot
Flat File Format of UniProt/SWISS-Prot
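In the UniProtKB flat file format, every line starts with a two-letter code (ID for the entry name and review status, AC for accession numbers, OS for the organism, SQ for the sequence header), and `//` ends the entry. The sketch below reads those few line types. The entry is a made-up example written for this illustration: its accession does not follow the real UniProt accession pattern and it is not a database record. Many line types and details of the real format are ignored.

```python
# Made-up entry for illustration only (not a real UniProtKB record).
EXAMPLE_ENTRY = """\
ID   EXAMPLE_SYNTH           Reviewed;          12 AA.
AC   X00000;
DE   RecName: Full=Made-up protein for illustration;
OS   synthetic construct.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def parse_uniprot_entry(text: str) -> dict:
    """Extract entry name, review status, accessions, organism and sequence."""
    entry = {"accessions": [], "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code = line[:2]              # two-letter line code
        content = line[2:].strip()   # everything after the code
        if in_sequence:
            # Sequence lines are indented and split into blocks of residues.
            entry["sequence"] += content.replace(" ", "")
        elif code == "ID":
            fields = content.split()
            entry["entry_name"] = fields[0]
            # "Reviewed;" marks Swiss-Prot, "Unreviewed;" marks TrEMBL.
            entry["reviewed"] = "Reviewed;" in fields
        elif code == "AC":
            entry["accessions"] += [a.strip() for a in content.split(";") if a.strip()]
        elif code == "OS":
            entry["organism"] = content.rstrip(".")
        elif code == "SQ":
            in_sequence = True
    return entry


if __name__ == "__main__":
    e = parse_uniprot_entry(EXAMPLE_ENTRY)
    section = "Swiss-Prot" if e["reviewed"] else "TrEMBL"
    print(e["entry_name"], e["accessions"], section, len(e["sequence"]), "aa")
```

The review status on the ID line is what separates the two UniProtKB sections: manually annotated Swiss-Prot entries are marked as reviewed, automatically annotated TrEMBL entries as unreviewed.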