These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center 1 

Slides:



Advertisements
Similar presentations
For loops Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
I210 review Fall 2011, IUB. Python is High-level programming –High-level versus machine language Interpreted Language –Interpreted versus compiled 2.
An Introduction to Python – Part III Dr. Nancy Warter-Perez.
CPS120: Introduction to Computer Science Lecture 15 B Data Files.
Introduction to Perl Bioinformatics. What is Perl? Practical Extraction and Report Language A scripting language Components an interpreter scripts: text.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Essential Computing for Bioinformatics Lecture.
Topics This week: File input and output Python Programming, 2/e 1.
Working with Files CSC 161: The Art of Programming Prof. Henry Kautz 11/9/2009.
Introductory Overview
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Introduction to Functions Intro to Computer Science CS1510 Dr. Sarah Diesburg.
Computer Programming for Biologists Class 5 Nov 20 st, 2014 Karsten Hokamp
November 15, 2005ICP: Chapter 7: Files and Exceptions 1 Introduction to Computer Programming Chapter 7: Files and Exceptions Michael Scherger Department.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Chapter 6 Functions -- QuickStart. "The Practice of Computing Using Python", Punch & Enbody, Copyright © 2013 Pearson Education, Inc. What is a function?
Subroutines and Files Bioinformatics Ellen Walker Hiram College.
Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Lecture 06 – Reading and Writing Text Files.  At the end of this lecture, students should be able to:  Read text files  Write text files  Example.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Fundamental Programming: Fundamental Programming K.Chinnasarn, Ph.D.
Organizing information in the post-genomic era The rise of bioinformatics.
Computing Science 1P Lecture 15: Friday 9 th February Simon Gay Department of Computing Science University of Glasgow 2006/07.
Keyword Searching Weighted Federated Search with Key Word in Context Date: 10/2/2008 Dan McCreary President Dan McCreary & Associates
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
NCBI Literature Databases: PubMed
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Essential Computing for Bioinformatics Lecture.
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Guide to Programming with Python Chapter Seven Files and Exceptions: The Trivia Challenge Game.
Text Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
FILES. open() The open() function takes a filename and path as input and returns a file object. file object = open(file_name [, access_mode][, buffering])
Trinity College Dublin, The University of Dublin GE3M25: Computer Programming for Biologists Python, Class 2 Karsten Hokamp, PhD Genetics TCD, 17/11/2015.
1 CSC103: Introduction to Computer and Programming Lecture No 27.
1 Essential Computing for Bioinformatics Bienvenido Vélez UPR Mayaguez Lecture 3 High-level Programming with Python Part III: Files and Directories Reference:
Dept. of Animal Breeding and Genetics Programming basics & introduction to PERL Mats Pettersson.
BING 6004: Intro to Computational BioEngineering Spring 2016 Lecture 1: Using Python Expressions and Variables Bienvenido Vélez UPR Mayaguez Reference:
Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning with Python 1 Introduction to Python programming for Bioinformatics.
Alex Ropelewski Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing Bienvenido Vélez
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
Bienvenido Vélez UPR Mayaguez Essential Computing for Bioinformatics 1 BING 6004: Intro to Computational BioEngineering Spring 2016 Lecture 4: Basic File.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Using Molecular Biology to Teach Computer Science High-level Programming with Python Finding Patterns.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Essential BioPython Manipulating Sequences with Seq 1.
High-level Programming with Python Expressions and Variables Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
FILES AND EXCEPTIONS Topics Introduction to File Input and Output Using Loops to Process Files Processing Records Exceptions.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Essential Computing for Bioinformatics 1 Lecture.
File Processing CMSC 120: Visualizing Information Lecture 4/15/08.
MARC: Developing Bioinformatics Programs June 2014 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Essential Computing for Bioinformatics 1 Lecture.
1 A Short Introduction to Analyzing Biological Data Using Relational Databases Part II: Creating a Relational Database to Model Biological Data Alex Ropelewski.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Using Molecular Biology to Teach Computer Science
Topic: Python’s building blocks -> Variables, Values, and Types
Introduction to Bioinformatics and Functional Genomics
Using Molecular Biology to Teach Computer Science
Essential Computing for Bioinformatics
Topics Introduction to File Input and Output
Reading and Writing Files
Introduction to Programming
Notes about Homework #4 Professor Hugh C. Lauer CS-1004 — Introduction to Programming for Non-Majors (Slides include materials from Python Programming:
Lesson 3 Bioinformatics Laboratory
15-110: Principles of Computing
Topics Introduction to File Input and Output
Introduction to Programming
Topics Introduction to File Input and Output
Introduction to Computer Science
Presentation transcript:

These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 1  The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include:  Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University.  Dr. Ricardo Gonzalez-Mendez, University of Puerto Rico Medical Sciences Campus.  Dr. Alade Tokuta, North Carolina Central University.  Dr. Jaime Seguel and Dr. Bienvenido Velez, University of Puerto Rico at Mayaguez.  Dr. Satish Bhalla, Johnson C. Smith University.  Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes.

These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 2  This material is targeted towards students with a general background in Biology. It was developed to introduce biology students to the computational mathematical and biological issues surrounding bioinformatics. This specific lesson deals with the following fundamental topics:  Computing for biologists  High level programming with Python  Computer science track  This material has been developed by: Dr. Hugh B. Nicholas, Jr. National Center for Biomedical Supercomputing Pittsburgh Supercomputing Center Carnegie Mellon University

Essential Computing for Bioinformatics Bienvenido Vélez UPR Mayaguez Lecture 4 High-level Programming with Python Part III: Manipulating Files Reference: How to Think Like a Computer Scientist: Learning with Python (Ch 11)

Outline  Text Files  Reading from Text Files  Writing to Text Files  Examples 4 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

What are Text Files?  Persistent (non-volatile) storage of data  Needed when:  data must outlive the execution of your program  data does not fit in memory (external algorithms)  data is supplied in batch form (non-interactive)  Files are stored in your hard drive  Files are maintained by your computer’s Operating System (e.g. Linux, Windows, MacOS) 5 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Examples of Text Files  Word documents  Html documents retrieved from the web  XML documents  FASTA files  GENBANK file  Multiple Sequence Alignement Files (PFAM) Text files contain a sequence of numbers that must be decoded using some standard in order to be converted to string form Examples of encodings: ASCII, LATIN1, EBCDIC, Unicode Check for more infohttp://en.wikipedia.org/wiki/Character_encoding 6 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

EOLcccccccccccc Top-level Anatomy of a Text File EOLccccccccccccccccccc ccccccccccccccccc cccccccccccccccccc ccccccccc EOF EOLccccccccccccccccccc cccccccccccccccc ccccccccccccccccc c Any Character EOL End-of-line Character EOF End-of-file Character Invisible

Examples of Text Files: HTML NCBI Sequence Viewer v2.0 …

Examples of Text Files: XML 1 1 Xanthomonas campestris pv. campestris taxon 340 …

Examples of Text Files: GENEBANK LOCUS NC_ bp DNA circular BCT 17-JUL-2008 DEFINITION Xanthomonas campestris pv. campestris, complete genome. ACCESSION NC_ VERSION NC_ GI: PROJECT GenomeProject:29801 KEYWORDS complete genome. SOURCE Xanthomonas campestris pv. campestris ORGANISM Xanthomonas campestris pv. campestris Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xanthomonas. REFERENCE 1 AUTHORS Vorholter,F.J., Schneiker,S., Goesmann,A., Krause,L., Bekel,T., Kaiser,O., Linke,B., Patschkowski,T., Ruckert,C., Schmid,J., Sidhu,V.K., Sieber,V., Tauch,A., Watt,S.A., Weisshaar,B., Becker,A., Niehaus,K. and Puhler,A. TITLE The genome of Xanthomonas campestris pv. campestris B100 and its use for the reconstruction of metabolic pathways involved in xanthan biosynthesis JOURNAL J. Biotechnol. 134 (1-2), (2008) PUBMED REFERENCE 2 (bases 1 to ) CONSRTM NCBI Genome Project TITLE Direct Submission JOURNAL Submitted (22-MAY-2008) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA REFERENCE 3 (bases 1 to ) AUTHORS Linke,B. TITLE Direct Submission JOURNAL Submitted (03-DEC-2007) Linke B., Center For Biotechnology, Bielefeld University, Universitaetsstrasse 25, Bielefeld, GERMANY COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final …

Examples of Text Files: FASTA >gi| |ref|NC_ | Xanthomonas campestris pv. campestris, complete genome ATGGATGCTTGGCCCCGCTGTCTGGAACGTCTCGAAGCTGAATTCCCGCCCGAAGATGTCCACACCTGGT TGAAACCCCTGCAGGCCGAAGATCGCGGCGACAGCATCGTGCTGTACGCGCCGAACGCCTTCATTGTCGA GCAGGTTCGCGAGCGATACCTGCCGCGCATCCGCGAGTTGCTGGCGTATTTCGCCGGCAATGGCGAGGTG GCGCTGGCGGTCGGCTCCCGTCCGCGTGCGCCGGAGCCGCTGCCGGCACCGCAAGCCGTCGCCAGTGCGC CGGCGGCCGCGCCGATCGTGCCCTTCGCCGGCAACCTGGATTCGCATTACACCTTTGCCAACTTCGTGGA AGGCCGCAGCAACCAGCTCGGTCTGGCCGCGGCGATCCAGGCCGCACAGAAGCCTGGCGACCGGGCGCAC AACCCGTTGCTGCTGTACGGCAGCACCGGGCTGGGCAAGACCCACCTGATGTTCGCGGCCGGCAACGCGC TGCGCCAGGCCAATCCGGCCGCCAAGGTGATGTACCTGCGCTCGGAACAGTTCTTCAGCGCGATGATCCG CGCGTTGCAGGACAAGGCAATGGACCAGTTCAAGCGCCAGTTCCAGCAGATCGATGCGCTGCTGATCGAC GACATCCAGTTTTTTGCCGGCAAGGACCGCACGCAGGAGGAGTTTTTCCACACCTTCAACGCGCTGTTCG ACGGCCGCCAGCAGATCATCCTGACCTGCGACCGCTATCCGCGCGAAGTCGAGGGCCTGGAGCCGCGGCT GAAGTCGCGCCTGGCCTGGGGCCTGTCGGTGGCGATCGACCCGCCGGATTTCGAGACGCGTGCGGCAATC GTGCTGGCCAAGGCGCGCGAGCGCGGCGCCGAGATTCCCGACGACGTGGCGTTTCTGATCGCCAAGAAGA TGCGCTCGAACGTGCGCGACCTGGAAGGGGCGCTCAACACGTTGGTGGCCCGCGCCAACTTCACTGGCCG TTCGATCACCGTGGAGTTTGCGCAGGAGACGCTGCGTGACCTGTTGCGTGCGCAGCAGCAGGCGATCGGC ATTCCCAACATCCAGAAGACCGTGGCCGACTACTACGGCCTGCAGATGAAGGACCTGCTTTCCAAGCGCC GCACCCGCTCATTGGCGCGCCCGCGCCAGGTGGCGATGGCGCTCGCCAAGGAGTTGACCGAGCACAGCCT TCCCGAGATCGGCGATGCGTTTGCCGGCCGCGACCACACCACCGTGCTGCACGCCTGCCGGCAGATCCGC ACGCTGATGGAGGCCGACGGCAAGCTGCGCGAGGACTGGGAAAAGCTGATTCGCAAGCTCAGCGAGTGAG CGTTTGGCGCCCTTCTCGCCCGGCTTGGAAAAAAGCGTGGATGAGTTGGGGCCAAATGCGGGAGAATCTG TGGATAAATGCGGCTGCAGCGTTGAGGCGGAAACTATCCACAGGTTTTCCCACCGCTCCAGGGGCACTAG TCCATAGACTTTGAAGCTCAAAATGGTCTTTGGAAACAATACGTTAGTGTGGTTGTCCGCCGAAAACGGC CCTACCATCACCACCAAGCTTTTGATTTATTCCACATCTTTAAAGCATAGGGGCACGGAACCACATGCGT TTCACACTGCAGCGCGAAGCCTTCCTCAAGCCGTTGGCCCAGGTGGTCAATGTCGTCGAACGGCGTCAGA CATTGCCGGTACTGGCGAACTTGCTGGTGCAGGTGAACAACGGCCAGCTGTCGCTGACGGGGACCGACCT GGAAGTCGAAATGATCTCGCGCACCATGGTCGAGGACGCCCAGGACGGCGAAACCACGATCCCGGCGCGC AAGCTGTTCGACATCCTGCGGGCCCTGCCTGACGGCAGCCGTGTCACCGTCTCGCAAACCGGAGACAAGG TCACGGTGCAGGCCGGGCGCAGCCGCTTTACGCTCGCCACGCTACCGGCCAACGACTTCCCGTCGGTGGA CGAAGTCGAGGCCACCGAGCGTGTGGCGGTGCCGGAAGCCGGGCTGAAGGAGCTGATGGAGCGCACGGCG TTCGCCATGGCCCAGCAGGACGTGCGTTATTACCTCAACGGCCTGCTGTTCGACCTGCGCGATGGCCTGC …

Examples of Text Files: ClustalW clustalw 0xOLI RIILVTGASDGIGREAAMTYARY--GATVILLGRN EEKLRQV-- 1xMAN KKVIVTGASKGIGREMAYHLAKM-GA-HVVVTARS KETLQKV-- 2xAST KVILITGASRGIGLQLVKTVIEEDDECIVYGVART EAGLQSL-- 3xAST KVILVTGVSRGIGKSIVDVLFSLDKDTVVYGVARS EAPL-- 4xAST KIAVVTGASGGIGYEVTKELARN--GYLVYACARR LEPMAQL-- 5xMAN HVALVTGGNKGIGLAIVRDLCRL-FSGDVVLTARD VTRGQAAV-- 6xCSU NTVLITGGSAGIGLELAKRLLEL--GNEVIICGRS EARLAEAK-- 7xEX KTVIITGGARGLGAEAARQAVAA-GARVVLADVLD E-EGAATA-- 8xASP KTVLLTGASRGLGVYIARALAKE--QATVVCVSRS QSGLAQT-- 9xCSU KTALITGGGRGIGRATALALAKE--GVNIGLIGRT SANVEKV-- 10xOBR KIALVTGAMGGLGTAICQALAKD-GCIVAANCLPN FEPAAAWL-- 0xOLI ---ASHIN--EETG-RQPQWFILDLLTCTSENC-QQLAQRIAVNY----P-RLDGVLHNA 1xMAN ---VSHCL---ELG-AASAHYIA-GT---MEDM-TFAEQFVAQAG--KLMGGLDMLILNH 2xAST ---QREYG ADKFVYRVLD---ITDR-SRMEALVEEIR--QKHGKLDGIVANA 3xAST ----KKLK--EKYG-DRFFYVVG--D---ITED-SVLKQLVNAAVK--GHGKIDSLVANA 4xAST -----AIQ----FG-NDSIKPYK-LD---ISKP-EEIVTFSGFLRANLPDGKLDLLYNNA 5xMAN ----QQLQ---AEG--LSPRFHQ-LD---IDDL-QSIRALRDFLR--KEYGGLDVLVNNA 6xCSU ----QQLP N-IHTKQ-CD---VADR-SQREALYEWALK--EYPNLNVLVNNA 7xEX R---ELG--DAARYQH-LD---VTIE-EDWQRVVAYAR--EEFGSVDGLVNNA 8xASP ---CNAVK---AAG--GKAIAIP-FD---VRNT-SQLSALVQQAQ--DIVGPIDVLINNA 9xCSU ---AEEVK---ALG--VKAAFAA-AD---VKDA-DQVNQAVAQVK--EQLGDIDILINNA 10xOBR ----GQQE---ALG--FKFYVAE-GD---VSDF-ESCKAMVAKIEA--DLGPVDILVNNA …

Loading Fasta Sequences: Step 0 # ex. 1 # # open a file for reading # read a single line of the file and # bind the line to variable 'line' # file = open( ”test.txt","r") # Open the file line = file.readline() # Returns next line in a string print line file.close()# Close the file How can we scan the entire sequence of lines?

Loading Fasta Sequences: Step 1 # ex. 3 # # read all the lines of the file and store them in a list # print the 3rd line # file = open(pathname,"r") line = file.readlines() # Reads all lines into a sequence print line[2] file.close() Why is this NOT a great idea? How can we do better?

Loading Fasta Sequences: Step 2 # ex. 4 # # read 1 line at a time # print the line number and the line # def example4(pathname): file = open(pathname,"r") count = 0 for line in file: print count,":",line count = count + 1 file.close() Now lets try to extract info from the Fasta file?

A Fasta file with three sequences >gi| |ref|NC_ | Xanthomonas campestris pv. campestris, complete genome ATGGATGCTTGGCCCCGCTGTCTGGAACGTCTCGAAGCTGAATTCCCGCCCGAAGATGTCCACACCTGGT TGAAACCCCTGCAGGCCGAAGATCGCGGCGACAGCATCGTGCTGTACGCGCCGAACGCCTTCATTGTCGA GCAGGTTCGCGAGCGATACCTGCCGCGCATCCGCGAGTTGCTGGCGTATTTCGCCGGCAATGGCGAGGTG >gi| |ref|XM_ | PREDICTED: Equus caballus electron-transferring-flavoprotein … GATACGTGACCCGGAAGCCTCTGCTGGCCATGGCGTCATGCGTGGCGCCGGCGCAGAGCGAGAGAGAGTC GGGAGCGCTGTGAAGACAGAGCGGTCGGCTGATCAGAGACGAACTTCAGTGGAGGTGATGGCGCCCCCCG CGGCCTAGAGGTCCAGAGCGTGCCGCGAGCTGCAGACAGTACGCCTCCCATTGTATCCGACGGAGACTCC TCGTTGCAGGGAACATGTTGCTGCCGCTAGCCAAGCTGTCCTGTCCGGCATATCAGTGCTTTCATGCCTT AAAAATTAAGAAAGATTATCTACCTCTGTGTGCTACAAGATGGTCTTCAACTTCTGTTGTACCTCGAATT >gi| |gb|CP | Chloroherpeton thalassium ATCC 35110, complete genome ATGCTTATGAGCGAAGGACATGACCAGCCAGGCGCTCCTGTTTCTCATTTATCGGAACAGGCTTTAGCAC AAATTGCGTGGAAAAAATGCCTCGATATCATTCGTGATGGCCTTCATAACCTGCAAAGCTTTAAGACTTG GTTTGAGCCCATCGTGCCATTAAAACTTTCTGGCCAGGAATTGACCATTCAGGTGCCCAGCCAGTTTTTT TATGAAATGATCGAGGAAAATTATTACTCGCTTTTAAAGCGCGCCTTGATGGAAGTGATGGGAACGGGAG CCAAGCTGCGATATTCTGTTTTGGTTCAGGCGGCACAAGCTGAAACACCCGTCGTCGCTCGTATCCCCGA GAAAAAAGGCAAGCACACTCCGGCGGCGCTTCCGAAAGTCATTTCTCATGCGAATGGCAAGCAGACAAAA GAAGCTCAAGACTTATTTGCAAACAATGTACACCGCTTCGAAAGCTACCTAAATCCAAAATATCGCTTCG

Loading Fasta Sequences: Step 3 # ex. 5 # # read 1 line at a time # print only the fasta sequence header # def example5(pathname): file = open(pathname,"r") count = 0 for line in file: if line[0] == '>': print count,":",line count = count + 1 file.close() Now lets try to extract info from the Fasta file?

Loading Fasta Sequences: Step 4 # ex. 6 # # read 1 line at a time # identify the fasta sequence header and the sequence data # def example6(pathname): file = open(pathname,"r") for line in file: if line[0] == '>': print "header",line else: print "data",line file.close() How about blank lines and trailing newlines?

Loading Fasta Sequences: Step 5 # ex. 8 # # read 1 line at a time # identify the fasta sequence header and the sequence data # account for blank lines and trailing space/newlines # collect sequence data def example8(pathname): file = open(pathname,"r") data = "" for line in file: # remove the trailing '\n' and trailing spaces line = line.rstrip('\n ') # if the line length is < 1, there nothing to do for this line # so move to the next line if len( line ) < 1: continue if line[0] == '>': print "data",data data = "" print "header",line else: data = data + line file.close() How about blank lines and trailing newlines?

Loading Fasta Sequences: Step 6 def readFastaFile(filename): file = open(filename,"r") sequence_data = [] for line in file: # remove the trailing '\n' and trailing spaces line = line.rstrip('\n ') # if the line length is < 1, do nothing # so skip rest of iteration if len( line ) < 1: continue if line[0] == '>': sequence_data.append('') else: k = len(sequence_data) - 1 sequence_data[k] = sequence_data[k] + line file.close() return(sequence_data) How about blank lines and trailing newlines?

Summary of File Operations 21 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center MethodAction Read([n])Reads at most n bytes; if no n is specified, reads the entire file Readline([n])Reads a line of input, if n is specified reads at most n bytes Readlines()Reads all lines and returns the in a list Xreadlines()Reads all lines but handles them as a XRangeType Write(s)Writes string s Writelines(l)Writes all strings in list I as lines Close()Closes the file Seek(offset[, mode])Changes to a new file position=start + offset. Start is specified by the mode argument: mode=0 (default), start = start of the file, mode=1, start = current file position and mode=2, start = end of the file.

Importing ClustalW Files clustalw 0xOLI RIILVTGASDGIGREAAMTYARY--GATVILLGRN EEKLRQV-- 1xMAN KKVIVTGASKGIGREMAYHLAKM-GA-HVVVTARS KETLQKV-- 2xAST KVILITGASRGIGLQLVKTVIEEDDECIVYGVART EAGLQSL-- 3xAST KVILVTGVSRGIGKSIVDVLFSLDKDTVVYGVARS EAPL-- 4xAST KIAVVTGASGGIGYEVTKELARN--GYLVYACARR LEPMAQL-- 5xMAN HVALVTGGNKGIGLAIVRDLCRL-FSGDVVLTARD VTRGQAAV-- 6xCSU NTVLITGGSAGIGLELAKRLLEL--GNEVIICGRS EARLAEAK-- 7xEX KTVIITGGARGLGAEAARQAVAA-GARVVLADVLD E-EGAATA-- 8xASP KTVLLTGASRGLGVYIARALAKE--QATVVCVSRS QSGLAQT-- 9xCSU KTALITGGGRGIGRATALALAKE--GVNIGLIGRT SANVEKV-- 10xOBR KIALVTGAMGGLGTAICQALAKD-GCIVAANCLPN FEPAAAWL-- 0xOLI ---ASHIN--EETG-RQPQWFILDLLTCTSENC-QQLAQRIAVNY----P-RLDGVLHNA 1xMAN ---VSHCL---ELG-AASAHYIA-GT---MEDM-TFAEQFVAQAG--KLMGGLDMLILNH 2xAST ---QREYG ADKFVYRVLD---ITDR-SRMEALVEEIR--QKHGKLDGIVANA 3xAST ----KKLK--EKYG-DRFFYVVG--D---ITED-SVLKQLVNAAVK--GHGKIDSLVANA 4xAST -----AIQ----FG-NDSIKPYK-LD---ISKP-EEIVTFSGFLRANLPDGKLDLLYNNA 5xMAN ----QQLQ---AEG--LSPRFHQ-LD---IDDL-QSIRALRDFLR--KEYGGLDVLVNNA 6xCSU ----QQLP N-IHTKQ-CD---VADR-SQREALYEWALK--EYPNLNVLVNNA 7xEX R---ELG--DAARYQH-LD---VTIE-EDWQRVVAYAR--EEFGSVDGLVNNA 8xASP ---CNAVK---AAG--GKAIAIP-FD---VRNT-SQLSALVQQAQ--DIVGPIDVLINNA 9xCSU ---AEEVK---ALG--VKAAFAA-AD---VKDA-DQVNQAVAQVK--EQLGDIDILINNA 10xOBR ----GQQE---ALG--FKFYVAE-GD---VSDF-ESCKAMVAKIEA--DLGPVDILVNNA …

Reading from ClustalW Files 23 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center def loadClustalwAlignment(pathname): sequences = [] id2SequenceMap = dict() # Keep id -> seq dictionary to collect data f=open(pathname, 'r') f.readline() for line in f: line = line.replace('\n','') if (len(line)>1): sequenceID = extractID(line) if (sequenceID in id2SequenceMap): prevSequence = id2SequenceMap[sequenceID] else: prevSequence = ' ' id2SequenceMap[sequenceID]= prevSequence + extractSequence(line) f.close() sequences = [] for key in id2SequenceMap.keys(): sequences = sequences + [key + ":"+id2SequenceMap[key]] return sequences

Writing to Text Files def translateFastaFile(infilename, outfilename): infile = open(filename,"r") outfile = open(filename, "w") sequence_data = [] for line in infile: # remove the trailing '\n' and trailing spaces line = line.rstrip('\n ') # if the line length is < 1, do nothing # so skip rest of iteration if len( line ) < 1: continue if line[0] == '>': sequence_data.append('') else: k = len(sequence_data) - 1 sequence_data[k] = sequence_data[k] + line file.close() return(sequence_data) 24 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Homework 25 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center