Published by Geoffrey Conley. Modified over 6 years ago.
1
COT6930 High Performance Computing and Bioinformatics Course overview, Introduction
Instructors: Xingquan (Hill) Zhu and Imad Mahgoub 11/26/2018
2
About COT 6930 Meeting time: T R 11:00AM-12:20PM Room: GS109
Course home page: Course lectures, homework, and solutions will be posted online. Make sure you check the course website regularly.
3
About COT 6930 Instructors: Prof. Imad Mahgoub and Prof. Xingquan Zhu
Offices: S&E 406 (Dr. Mahgoub), S&E 366 (Dr. Zhu)
Office hours:
  Dr. Zhu: T & R 1:00 – 4:00 pm or by appointment
  Dr. Mahgoub: T & R 12:25 – 3:25 pm or by appointment
4
Introduce Yourself Your name Your major and class Your background
Your research interests Why you study HPC & Bioinformatics Your expectations from this course Other
5
Expected Background Data Structures and Programming
Statistics: good if you’ve had at least one course, but not required. We will cover the necessary statistics background.
Molecular biology: no knowledge assumed, but you should be interested in learning some basic molecular biology concepts.
6
Course Objective
Deals with high performance computing in Bioinformatics research
Bioinformatics basic concepts
  Cell biology and molecular biology review
  DNA to RNA to proteins, protein structure and function prediction
  Biological networks and DNA microarrays
  Bioinformatics databases and tools
Bioinformatics algorithms
  Pairwise sequence alignment, multiple sequence alignment
  Bioinformatics classification algorithms
Parallel architectures and parallel programming paradigms
Bioinformatics programs analysis & parallelization
  Parallel computation in biological classification and sequence analysis
FLOPS: floating-point operations per second
7
Textbook
No required textbook. Supplementary recommended reading:
  Bioinformatics and Functional Genomics, by Jonathan Pevsner (Wiley, 2003). This has 1100 URLs, organized by chapter.
  An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel A. Pevzner (MIT Press, 2004)
  Developing Bioinformatics Computer Skills, by C. Gibas and P. Jambeck (O’Reilly, 2001)
  Parallel Computing for Bioinformatics and Computational Biology, by Albert Zomaya (Wiley, 2006)
Some reading assignments may be in the form of papers.
8
Grading Policy Homework: 20% Course projects: 30% Presentation: 15%
Participation: 10% Final: 25%
9
Grading Policy Cutoffs for grades (roughly) A: 85 – 100 B: 70 – 84
10
Lecture 1: Introduction to Bioinformatics
11
Bioinformatics - Definition
NIH Definition: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Definition varies and is somewhat vague
  Typically taken to include algorithms, databases, and data analysis
  Mathematical modeling or simulation of biological systems is typically excluded
A simpler definition: Computers + Biology = Bioinformatics (was an O’Reilly book)
12
The need for bioinformatics
The number of entries in biological databases is increasing exponentially. Bioinformatics is needed to understand and use this information.
GenBank growth: as of August 2004, GenBank contained 41.8 billion nucleotide bases from 37.3 million sequences. In 2004 alone, more than 7.9 million new sequences were added to GenBank.
13
What is Bioinformatics
Representation, storage, retrieval, and analysis of biological data
Concerning: sequences, structures, functions
Sometimes used synonymously with computational biology or computational molecular biology
Highly interdisciplinary in nature: biology, mathematics, statistics, computer science, biochemistry, physics, chemistry, medicine, …
14
Bioinformatics Requires Interdisciplinary Research
From the “Bioinformatics: Building Bridges” Symposium (April 13, 2006)
15
Promises of Bioinformatics
SmartMoney ranked bioinformatics as #1 among the next hot jobs (June 2002). “The fusion of biology and computer science is the hottest of the hot in science right now, and it's going to heat up even more. Bioinformaticians, …, use computer modeling to predict which drugs will work on which illnesses, shaving the time and cost of getting new products to market.”
16
Promises of Bioinformatics
Medicine
Knowledge of protein structure facilitates drug design
  How do drugs work? (Animation)
  Drugs generally work by interacting with receptors on the surface of cells, or with enzymes (which regulate the rate of chemical reactions) within cells
  Blocking, inhibiting, or stimulating protein functions
Understanding of genomic variation allows the tailoring of medical treatment to the individual’s genetic make-up
Genome analysis allows the targeting of genetic diseases
The effect of a disease or of a therapeutic on RNA and protein levels can be elucidated
The same techniques can be applied to biotechnology, crop and livestock improvement, etc.
17
Challenges in bioinformatics
Explosion of information
  Need for faster, automated analysis to process large amounts of data
  Need for integration between different types of information: sequences, literature, annotations, protein levels, RNA levels, etc.
  Need for “smarter” software to identify interesting relationships in very large data sets
Lack of “bioinformaticians”
  Software needs to be easier to access, use and understand
  Biologists need to learn about the software, its limitations, and how to interpret its results
18
Bioinformatics Topics
Sequence alignments: find similarity between DNA / protein (amino acid) sequences
Genome assembly: combine genomic fragments to form a whole genome
Gene identification & annotation: identify and classify genes on the genome (the functions of 90% of human genes are unknown, although their sequence information is available)
Microarrays & gene expression analysis: use DNA microarrays (gene chips) to measure mRNA
Protein folding: compute 3D protein structure ↔ protein sequence
Phylogenetic analysis: find genetic relationships between sequences / species
19
Topics Covered in this class
Introduction to Bioinformatics & Molecular Biology
  We will present a very brief introduction to molecular biology
  DNA, RNA, proteins; gene expression: from DNA to protein; central dogma of molecular biology & bioinformatics
Bioinformatics databases and tools
  BLAST, GenBank, protein bank, etc.
Sequence alignment
  Multiple sequence alignment
Protein structure
  Protein structure prediction
Gene expression & data analysis
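The "from DNA to protein" step of the central dogma listed above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not course material: the names `transcribe` and `translate` are hypothetical, the input is assumed to be the coding strand read in frame from its first base, and the table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Turn a coding-strand DNA string into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTAA")
    print(mrna, translate(mrna))
```

Real genes add complications this sketch ignores, such as introns, alternative reading frames, and the reverse strand.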
20
Molecular biology databases
Genomic sequence database
Gene expression database
Protein sequence database
Protein structure database
Protein family database
21
Sequence Alignment
Pairwise sequence alignment is the most fundamental operation of bioinformatics
Compare two (pairwise) or more (multiple) sequences
DNA – 4 letters; Protein – 20 letters
Useful for discovering functional, structural, and evolutionary information in biological sequences
Assumptions: similar sequences may have the same function; two similar sequences from different organisms may have a common ancestor sequence (homologous)
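One common way to compute a pairwise alignment is dynamic programming. The sketch below uses the Needleman-Wunsch global alignment algorithm as an example; the slide does not name a specific method, and the scoring values (match +1, mismatch -1, gap -2) and the function name `needleman_wunsch` are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and row: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, so time and memory grow with the product of the sequence lengths, which is one reason large-scale alignment benefits from high performance computing.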
22
Sequence alignment: DNA sequences can be aligned to see similarities between genes from different sources.

768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
|| || || | | ||| | |||| ||||| ||| |||
87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
| | | | |||||| | |||| | || | |
136 AAGGATC TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
||| | ||| || || ||| | ||||||||| || |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216

Figure labels: mismatch, match, gap
23
Database similarity searching: The BLAST program has been written to allow rapid comparison of a new gene sequence with the hundreds of thousands of gene sequences in databases.

Sequences producing significant alignments: (bits) Value
gnl|PID|e (Z74911) ORF YOR003w [Saccharomyces cerevisiae] e-26
gi| (U18795) Prb1p: vacuolar protease B [Saccharomyces ce e-24
gnl|PID|e (X59720) YCR045c, len:491 [Saccharomyces cerevi e-13
gnl|PID|e (Z71514) ORF YNL238w [Saccharomyces cerevisiae]
gnl|PID|e (Z71603) ORF YNL327w [Saccharomyces cerevisiae]
gnl|PID|e (Z71554) ORF YNL278w [Saccharomyces cerevisiae]

gnl|PID|e (Z74911) ORF YOR003w [Saccharomyces cerevisiae] Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query: 2 QSVPWGISRVQAPAAHNRG LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
+ PWG+ RV G G GV VLDTGI T H D R
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query: 51 PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
P D NGHGTH AG I GVA G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
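Part of what makes database search fast is avoiding a full alignment against every sequence. The sketch below illustrates one such idea, indexing short words (k-mers) so that only sequences sharing words with the query are considered. It is a simplified illustration and not BLAST's actual implementation, which also extends seed matches and scores them statistically; the function names and the toy database are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database, k):
    """Map each k-mer to the (sequence_id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def find_seed_hits(query, index, k):
    """Count k-mer matches per database sequence, most matches first."""
    counts = defaultdict(int)
    for offset in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[offset:offset + k], ()):
            counts[seq_id] += 1
    # Sort by descending match count, then by id for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    toy_database = {"s1": "ACGTACGT", "s2": "TTTTGGGG"}  # made-up sequences
    index = build_kmer_index(toy_database, k=3)
    print(find_seed_hits("CGTAC", index, k=3))
```

The index is built once and reused for every query, so the per-query cost depends mostly on the query length and the number of seed matches rather than on the total database size.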
24
Multiple sequence alignment: Sequences of proteins from different organisms can be aligned to see similarities and differences
25
Protein structure Proteins perform various functions in cells.
The 3-D structure of a protein determines its function. One of the major goals of bioinformatics is to understand the relationship between amino acid sequence and 3-D structure in proteins. In theory, the structure of a protein could be reliably predicted from the amino acid sequence.
26
Protein Structure/Function
> 1NLG:_ NADP-LINKED GLYCERALDEHYDE-3-PHOSPHATE
EKKIRVAINGFGRIGRNFLRCWHGRQNTLLDVVAINDSGGVKQASHLLKYDSTLGTFAAD
VKIVDDSHISVDGKQIKIVSSRDPLQLPWKEMNIDLVIEGTGVFIDKVGAGKHIQAGASK
VLITAPAKDKDIPTFVVGVNEGDYKHEYPIISNASCTTNCLAPFVKVLEQKFGIVKGTMT
TTHSYTGDQRLLDASHRDLRRARAAALNIVPTTTGAAKAVSLVLPSLKGKLNGIALRVPT
PTVSVVDLVVQVEKKTFAEEVNAAFREAANGPMKGVLHVEDAPLVSIDFKCTDQSTSIDA
SLTMVMGDDMVKVVAWYDNEWGYSQRVVDLAEVTAKKWVA

Figure labels: Amino Acid Sequence, 3-D Structure, Protein Function
Classification: Gene Transfer
EC Number:
Computational Challenges:
  Determine structure from sequence
  Determine function from sequence/3D structure
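Sequence records like the one above are usually stored in FASTA format: a header line starting with ">" followed by lines of residues. A minimal Python sketch of reading such a record and computing its amino acid composition follows; the function names `parse_fasta` and `composition` are hypothetical, and the example input is only the first eight residues of the slide's sequence.

```python
from collections import Counter


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:  # emit the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:  # emit the last record
        yield header, "".join(chunks)


def composition(sequence):
    """Return the fraction of the sequence made up by each residue."""
    total = len(sequence)
    if total == 0:
        return {}
    return {residue: n / total for residue, n in Counter(sequence).items()}


if __name__ == "__main__":
    example = ">1NLG:_ first residues only\nEKKI\nRVAI\n"
    for header, sequence in parse_fasta(example):
        print(header, sequence, composition(sequence))
```

Composition is only a simple sequence-derived feature; determining structure or function from sequence, the challenges named on the slide, requires far more involved methods.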
27
Gene expression and data analysis
Microarray
High-throughput approaches based on the hybridization principle, developed recently.
Generate terabytes of information that overwhelm conventional methods of biological analysis; the analysis differs from sequence analysis.
Microarray technology allows biologists to study genome-wide patterns of gene expression in any given cell type, at any given time, and under any given set of conditions (for example, for cancer classification).
Microarray data clustering and classification
28
Gene expression and data analysis
Microarray analysis Clustering Classification
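As one example of clustering expression data, the sketch below groups genes with similar expression profiles using k-means. The slide does not name a particular algorithm, so k-means is an assumption here, as are the function names, the toy profiles, and the deterministic initialization from the first k points (practical implementations usually pick starting centroids randomly or with a seeding heuristic).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; return (labels, centroids)."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Made-up expression profiles: one row per gene, one column per condition.
    profiles = [[0.1, 0.2, 0.1], [5.0, 5.1, 4.9], [0.0, 0.3, 0.2], [5.2, 4.8, 5.0]]
    print(kmeans(profiles, k=2))
```

Each iteration computes a distance from every gene to every centroid, and those computations are independent of one another, which makes the assignment step a natural candidate for parallelization.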
29
High Performance Computing
Increase available computation power
Exploit parallelism
Today’s supercomputers will become our desktops in the next 10 years
Use multiple processors in parallel
Application must be parallelized
Exploit locality
  Processors are faster than memory and the network
  In cache → avoid memory latency
  On processor → avoid network latency
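To make "application must be parallelized" concrete, the sketch below splits a per-sequence computation into independent tasks and hands them to a pool of workers. It is a minimal illustration: the names `gc_content` and `parallel_gc` are hypothetical, and threads are used only to keep the example portable. In CPython, CPU-bound work generally needs processes instead (for example `ProcessPoolExecutor`, which offers the same `map` interface) to actually run on multiple cores.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def parallel_gc(sequences, workers=4):
    """Compute GC content for each sequence using a pool of workers."""
    # Each sequence is an independent task, so no coordination is needed;
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))


if __name__ == "__main__":
    print(parallel_gc(["GGCC", "ATAT", "ACGT"]))
```

This pattern works because the tasks share no state; computations with dependencies between tasks, such as filling an alignment table, need a more careful decomposition.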
30
HPC Topics Architectures Shared-memory multiprocessors
Distributed-memory multiprocessors Cluster
31
HPC Topics Parallel Programming Shared memory paradigm
Distributed memory paradigm Languages
32
HPC Topics Compilers Run-time systems Program analysis
Program transformations Locality optimizations Parallelism optimizations Run-time systems
33
Course’s Main Point
Gain a Bioinformatics knowledge set, as well as parallel programming experience for Bioinformatics research
Timeline:
  Month 1: Bioinformatics knowledge set
  Month 2: Parallel programming
  Month 3: HPC and Bioinformatics
  Remaining time: Student presentations
34
Reading assignment L. Hunter, Molecular Biology for Computer Scientists, Artificial Intelligence for Molecular Biology, L. Hunter Ed., pp. 1-46, AAAI Press, (online download: