Presentation is loading. Please wait.

Presentation is loading. Please wait.

 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.

Similar presentations


Presentation on theme: " The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students."— Presentation transcript:

1  The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include:  Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University.  Dr. Ricardo Gonzalez-Mendez, University of Puerto Rico Medical Sciences Campus.  Dr. Alade Tokuta, North Carolina Central University.  Dr. Jaime Seguel and Dr. Bienvenido Velez, University of Puerto Rico at Mayaguez.  Dr. Satish Bhalla, Johnson C. Smith University.  Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center 1

2  This material is targeted towards students with a general background in Biology. It was developed to introduce biology students to the computational mathematical and biological issues surrounding bioinformatics. This specific lesson deals with the following fundamental topics:  Essential computing for bioinformatics  Computer Science track  This material has been developed by: Dr. Hugh B. Nicholas, Jr. National Center for Biomedical Supercomputing Pittsburgh Supercomputing Center Carnegie Mellon University These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center 2

3 Essential Computing for Bioinformatics Course Development Overview Bienvenido Vélez UPR Mayaguez July, 2008

4 Outline  Essential Computing for Bioinformatics  Course Description  Educational Objectives  Major Course Modules  Module Descriptions  Bioinformatics Data Management  Course Description  Educational Objectives  Major Course Modules  Module Descriptions  Status 4 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

5 Essential Computing for Bioinformatics Course Description This course provides a broad introductory discussion of essential computer science concepts that have wide applicability in the natural sciences. Particular emphasis will be placed on applications to Bioinformatics. The concepts will be motivated by practical problems arising from the use of bioinformatics research tools such as genetic sequence databases. Concepts will be discussed in a weekly lecture and will be practiced via simple programming exercises using Python, an easy to learn and widely available scripting language. 5 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

6 Educational Objectives  Awareness of the mathematical models of computation and their fundamental limits  Basic understanding of the inner workings of a computer system  Ability to extract useful information from various bioinformatics data sources  Ability to design computer programs in a modern high level language to analyze bioinformatics data.  Experience with commonly used software development environments and operating systems  Experience applying computer programming to solve bioinformatics problems 6 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

7 Major Course Modules ModuleLecture MARC Lecture First Steps in Computing: Course Overview1 Using Bioinformatics Data Sources2 Mathematical Computing Models35 High-level Programming (Python): Basics51 High-level Programming (Python): Flow Control62 High-level Programming (Python): Container Objects73 High-level Programming (Python): Files84 High-level Programming (Python): BioPython9 7 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

8 Main Advantages of Python  Familiar to C/C++/C#/Java Programmers  Very High Level  Interpreted and Multi-platform  Dynamic  Object-Oriented  Modular  Strong string manipulation  Lots of libraries available  Runs everywhere  Free and Open Source  Track record in bioInformatics (BioPython) 8 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

9 Example A Codon -> AminoAcid Dictionary >> code = { ’ttt’: ’F’, ’tct’: ’S’, ’tat’: ’Y’, ’tgt’: ’C’,... ’ttc’: ’F’, ’tcc’: ’S’, ’tac’: ’Y’, ’tgc’: ’C’,... ’tta’: ’L’, ’tca’: ’S’, ’taa’: ’*’, ’tga’: ’*’,... ’ttg’: ’L’, ’tcg’: ’S’, ’tag’: ’*’, ’tgg’: ’W’,... ’ctt’: ’L’, ’cct’: ’P’, ’cat’: ’H’, ’cgt’: ’R’,... ’ctc’: ’L’, ’ccc’: ’P’, ’cac’: ’H’, ’cgc’: ’R’,... ’cta’: ’L’, ’cca’: ’P’, ’caa’: ’Q’, ’cga’: ’R’,... ’ctg’: ’L’, ’ccg’: ’P’, ’cag’: ’Q’, ’cgg’: ’R’,... ’att’: ’I’, ’act’: ’T’, ’aat’: ’N’, ’agt’: ’S’,... ’atc’: ’I’, ’acc’: ’T’, ’aac’: ’N’, ’agc’: ’S’,... ’ata’: ’I’, ’aca’: ’T’, ’aaa’: ’K’, ’aga’: ’R’,... ’atg’: ’M’, ’acg’: ’T’, ’aag’: ’K’, ’agg’: ’R’,... ’gtt’: ’V’, ’gct’: ’A’, ’gat’: ’D’, ’ggt’: ’G’,... ’gtc’: ’V’, ’gcc’: ’A’, ’gac’: ’D’, ’ggc’: ’G’,... ’gta’: ’V’, ’gca’: ’A’, ’gaa’: ’E’, ’gga’: ’G’,... ’gtg’: ’V’, ’gcg’: ’A’, ’gag’: ’E’, ’ggg’: ’G’...... } >> 9 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

10 Example A DNA Sequence >>> cds = "atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac cgttttatcgggcggggtaa" >>> 10 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

11 Example CDS Sequence -> Protein Sequence >>> def translate(cds, code):... prot = ""... for i in range(0,len(cds),3):... codon = cds[i:i+3]... prot = prot + code[codon]... return prot >>> translate(cds, code) ’MSERLSITPLGPYIGAQ*’ 11 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

12 Using BioInformatics Data Sources Goal:Basic Experience  Searching Nuceotide Sequence Databases  Searching Amino Acid Sequence Database  Performing BLAST Searches  Using Specialized Data Sources IDEA: How can we expedite data collection and analysis?... writing programs to automate parts of the process. 12 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center Reference: Bioinformatics for Dummies (Ch 1-4)

13 Mathematical Computing Goal:General Awareness  What is Computing?  Mathematical Models of Computing  Finite Automata  Turing Machines  The Limits of Computation  Church/Turing Thesis  What is an Algorithm?  Big O Notation 13 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

14 High-Level Programming (Python) Goal:Knowledge and Experience  Downloading and Installing the Interpreter  Command-line versus Batch Mode  Values, Expressions and Naming  Designing your own Functional Building Blocks  Controlling the Flow of your Program  String Manipulation (Sequence Processing)  Container Data Structures  File Manipulation 14 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center Reference: How to Think Like A Computer Scientist: Learning with Python

15 CS Fundamentals will be Interleaved Throughout the Course  Information Representation and Encoding  Computer Architecture  Programming Language Translation Methods  The Software Development Cycle  Fundamental Principles of Software Engineering  Basic Data Structures for Bioinformatics  Design and Analysis of Bioinformatics Algorithms 15 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

16 Outline Essential Computing for Bioinformatics Course Description Educational Objectives Major Course Modules Module Descriptions Status  Bioinformatics Data Management  Course Description  Educational Objectives  Major Course Modules  Module Descriptions  Status 16 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

17 Essential Computing for Bioinformatics Course Description An introduction to the concepts of Data Base Management and Information Retrieval most commonly applicable to Bio-informatics database systems. The course is intended for Computer Science graduate students wishing to focus their academic careers in Bio-Informatics. The first half of the course discusses relational databases and the second half covers Information Retrieval systems. 17 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

18 Educational Objectives  Understanding of the pros and cons of the different data models available to manage bioinformatics data  Experience with the most widely used query models of bioinformatics data retrieval  Understanding of the relational data base model and its query language  Understanding of the vector and boolean query models for unstructured text retrieval  Experience programming tools to move information across daa models in order to efficiently process it  Experience with widely used database management systems and tools 18 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

19 Major Course Modules ModuleLecture Overview Data Models used in Bioinformatics1 Using Bioinformatics Data Sources2 Fundamentals of Unstructured Text Retrieval3 Fundamentals of The Relational Data Model4 Structure Query Language (SQL)5 Querying DNA/Protein sequence databases6 Migrating Information Across Databases and Tools7 Highly-Specialized biological databases8 19 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

20 Relational Databases and SQL  The Relational Model  The SQL Relational Query Language  Downloading and Installing MySQL  Creating databases in MySQL + Access  Importing data into MySQL + Access  Creating Analysis Reports with iReports  Exporting Data into Spreadsheets 20 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

21 Migrating Information Across Bioinformatics Databases and Tools  Manipulating sequence files  Bioinformatics database file formats  Parsing Bioinformatics database files to focus on information of interest  Programmatic access to Bioinformatics Data Sources  Importing data into relational databases and other analysis and reporting tools 21 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

22 Status  Essential Computing for Bioinformatics  Slides available  Programming examples available  Lab experiences under construction  Bioinformatics Data Management  Mostly under construction  Available Summer 2009 22 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center

23 23 Questions?


Download ppt " The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students."

Similar presentations


Ads by Google