Week 6 Topics: Internet interaction, APIs, REST services, XML


Internet Protocols
A protocol defines the communication in a channel, primarily the format of messages
  FTP – File Transfer Protocol
  HTTP – Hypertext Transfer Protocol
  HTTPS – HTTP Secure
  SCP – secure copy
  SMTP – Simple Mail Transfer Protocol
  BitTorrent
The master protocols on which everything is built
  TCP/IP – Transmission Control Protocol / Internet Protocol
  UDP – User Datagram Protocol


Internet FTP
FTP is a simple protocol that was used to transfer files before HTTP
Steps of an FTP interaction
  client sends a message to the server
  server responds and asks for authentication (username, password)
    username=anonymous password= is a common convention
  client sends authentication
  client sends commands via its ftp program
    cd – change directory
    ls, dir – list files in directory
    ascii, binary – set the transfer mode for ASCII (text) or binary file types
    get, mget – download files
    put, mput – upload files
FTP is a stateful protocol: the server knows at all times who it is communicating with, and the history of the communication

Internet FTP: a client-server interaction
  client is a program running on your computer
  server is a program running on a remote computer
    listens on TCP port 21 (used to establish the connection)
    TCP port 20 is used for data transfer to/from the client
A port is essentially a label on the TCP message that directs an incoming message to a service
  21: File Transfer Protocol (FTP)
  22: Secure Shell (SSH)
  23: Telnet remote login service
  25: Simple Mail Transfer Protocol (SMTP)
  53: Domain Name System (DNS) service
  80: Hypertext Transfer Protocol (HTTP)
  110: Post Office Protocol (POP3)
  119: Network News Transfer Protocol (NNTP)
  123: Network Time Protocol (NTP)
  143: Internet Message Access Protocol (IMAP)
  161: Simple Network Management Protocol (SNMP)
  194: Internet Relay Chat (IRC)
  443: HTTP Secure (HTTPS)

Internet Python ftplib package (standard package)
retrieving the orange tree genome from ftp.ncbi.nlm.nih.gov/genomes/Citrus_sinensis

    from ftplib import FTP

    ncbi_ftp = 'ftp.ncbi.nlm.nih.gov'
    # with user='anonymous' and no password, ftplib sends 'anonymous@' as the password
    ftp = FTP(ncbi_ftp, user='anonymous')
    filelist = []
    ftp.dir(filelist.append)    # dir() returns None; it passes each listing line to the callback
    print('\n'.join(filelist))

output
    dr-xr-xr-x 4 ftp anonymous Feb 9 03: genomes
    -r--r--r ftp anonymous Nov 17 15:45 100GB
    -r--r--r ftp anonymous Nov 17 15:45 10GB
    -r--r--r ftp anonymous Nov 17 15:44 1GB
    -r--r--r ftp anonymous Nov 17 15:46 50GB
    -r--r--r ftp anonymous Nov 17 15:46 5GB
    -r--r--r ftp anonymous Oct 12 15:41 README.ftp
    dr-xr-xr-x 8 ftp anonymous Feb SampleData
    lr--r--r ftp anonymous Oct 12 15:43 asn1-converters -> toolbox/ncbi_tools/converters
    . . .
    dr-xr-xr-x 13 ftp anonymous Dec 19 23:30 genbank
    dr-xr-xr-x 6 ftp anonymous Feb gene
    dr-xr-xr-x 459 ftp anonymous Feb 9 03:50 genomes
    dr-xr-xr-x ftp anonymous Feb 9 05:11 geo
    dr-xr-xr-x 4 ftp anonymous Feb 9 03:50 giab
    dr-xr-xr-x 25 ftp anonymous Sep hapmap

Internet ftplib
retrieving the orange tree genome from ftp.ncbi.nlm.nih.gov/genomes/Citrus_sinensis

    from ftplib import FTP

    ncbi_ftp = 'ftp.ncbi.nlm.nih.gov'
    orange = 'genomes/Citrus_sinensis'
    ftp = FTP(ncbi_ftp, user='anonymous')
    ftp.cwd(orange)     # change working directory
    ftp.dir()           # with no callback, dir() prints the listing itself and returns None

output
    dr-xr-xr-x 2 ftp anonymous Feb CHR_01
    dr-xr-xr-x 2 ftp anonymous Feb CHR_02
    dr-xr-xr-x 2 ftp anonymous Feb CHR_03
    dr-xr-xr-x 2 ftp anonymous Feb CHR_04
    dr-xr-xr-x 2 ftp anonymous Feb CHR_05
    dr-xr-xr-x 2 ftp anonymous Feb CHR_06
    dr-xr-xr-x 2 ftp anonymous Feb CHR_07
    dr-xr-xr-x 2 ftp anonymous Feb CHR_08
    dr-xr-xr-x 2 ftp anonymous Feb CHR_09
    dr-xr-xr-x 2 ftp anonymous Feb CHR_Pltd
    dr-xr-xr-x 2 ftp anonymous Feb CHR_Un

CHR_01 directory
    -r--r--r ftp anonymous Feb csi_ref_Csi_valencia_1.0_chr1.asn.gz
    -r--r--r ftp anonymous Feb csi_ref_Csi_valencia_1.0_chr1.fa.gz
    -r--r--r ftp anonymous Feb csi_ref_Csi_valencia_1.0_chr1.gbk.gz
    -r--r--r ftp anonymous Feb csi_ref_Csi_valencia_1.0_chr1.gbs.gz
    -r--r--r ftp anonymous Feb csi_ref_Csi_valencia_1.0_chr1.mfa.gz

Internet ftplib
retrieving the orange tree genome from ftp.ncbi.nlm.nih.gov/genomes/Citrus_sinensis
  we want the .fa.gz file (e.g. csi_ref_Csi_valencia_1.0_chr1.fa.gz in CHR_01) from every chromosome directory
  gz is a compressed (binary) format

Internet ftplib
FTP.retrbinary(cmd, callback)
  Retrieve a file in binary transfer mode. cmd should be an appropriate RETR command: 'RETR filename'. The callback function is called for each block of data received
we want to save every file we retrieve to a local file
  open the file for binary write (mode='wb') since it is a binary file
  use the file's write method as the callback
  in Python people often wrap this up into one line

    import ftplib
    from ftplib import FTP

    ncbi_ftp = 'ftp.ncbi.nlm.nih.gov'
    orange = 'genomes/Citrus_sinensis'
    fasta = 'csi_ref_Csi_valencia_1.0_chr1.fa.gz'

    ftp = FTP(ncbi_ftp, user='anonymous')
    ftp.cwd(orange + '/CHR_01')     # change working directory

    print('retrieving {}...'.format(fasta), end=' ')
    try:
        ftp.retrbinary('RETR ' + fasta, open(fasta, 'wb').write)
        print('done')               # only report success if no error was raised
    except ftplib.all_errors as err:
        # the exception classes live in the ftplib module, not on the ftp object
        print('Error retrieving {}: {}'.format(fasta, err))

output
    retrieving csi_ref_Csi_valencia_1.0_chr1.fa.gz... done
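The one-line form leaves closing the output file to the interpreter. A minimal sketch of the same callback idea with the file closed explicitly is shown here; the helper name download and the FakeFTP stand-in are hypothetical and exist only so the example runs without a network connection.

```python
import os
import tempfile


def download(ftp, remote_name, local_path):
    """Save remote_name to local_path using ftp.retrbinary, then close the file."""
    with open(local_path, 'wb') as out:
        # out.write is called once for every block of data received
        ftp.retrbinary('RETR ' + remote_name, out.write)
    return os.path.getsize(local_path)


class FakeFTP:
    """Hypothetical stand-in for ftplib.FTP that delivers two fixed blocks."""

    def retrbinary(self, cmd, callback, blocksize=8192):
        for block in (b'abc', b'defg'):
            callback(block)


tmpdir = tempfile.mkdtemp()
size = download(FakeFTP(), 'example.fa.gz', os.path.join(tmpdir, 'example.fa.gz'))
print(size)
```

With a real connection, the same download function would be called with an ftplib.FTP object in place of FakeFTP.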

Internet ftplib: automating a regular update of the orange genome
need a list of files to download; three ways to get it
  provide a list of directories and files
    chrlist = ['CHR_01', 'CHR_02', … ]
    files = ['chr1', 'chr2', … ]
  construct filenames with Python code
  read names using the ftp interface
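The "construct filenames" option can be sketched as follows. The naming pattern is inferred from the CHR_01 listing; that CHR_02 to CHR_09 follow the same pattern is an assumption, and CHR_Pltd and CHR_Un are left out because their file names are not shown. The function name chromosome_paths is hypothetical.

```python
PREFIX = 'csi_ref_Csi_valencia_1.0_'   # prefix seen in the CHR_01 listing
SUFFIX = '.fa.gz'


def chromosome_paths(numbers):
    """Build 'directory/filename' strings for the numbered chromosomes."""
    paths = []
    for n in numbers:
        directory = 'CHR_{:02d}'.format(n)                 # CHR_01, CHR_02, ...
        filename = '{}chr{}{}'.format(PREFIX, n, SUFFIX)   # ..._chr1.fa.gz, ...
        paths.append(directory + '/' + filename)
    return paths


paths = chromosome_paths(range(1, 10))
for path in paths:
    print(path)
```

Reading the names through the ftp interface is more robust, because it does not depend on guessing the pattern.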

Internet ftplib

    import ftplib
    from ftplib import FTP

    ncbi_ftp = 'ftp.ncbi.nlm.nih.gov'
    orange = 'genomes/Citrus_sinensis'

    ftp = FTP(ncbi_ftp, user='anonymous')
    ftp.cwd(orange)     # change working directory

    for file in ftp.nlst():
        if file.startswith('CHR'):
            print('\nstarting', file)
            ftp.cwd(file)
            for datafile in ftp.nlst():
                print(datafile, end=' ')
                if datafile.endswith('.fa.gz'):
                    print('--> retrieving {} ... '.format(datafile), end=' ')
                    try:
                        ftp.retrbinary('RETR ' + datafile, open(datafile, 'wb').write)
                        print('done')
                    except ftplib.all_errors:
                        # report the file that actually failed
                        print('Error retrieving {}'.format(datafile))
                else:
                    print(' skip')
            ftp.cwd('..')

    ftp.quit()          # close the connection politely

output
    starting CHR_01
    csi_ref_Csi_valencia_1.0_chr1.asn.gz skip
    csi_ref_Csi_valencia_1.0_chr1.fa.gz --> retrieving csi_ref_Csi_valencia_1.0_chr1.fa.gz ... done
    csi_ref_Csi_valencia_1.0_chr1.gbk.gz skip
    csi_ref_Csi_valencia_1.0_chr1.gbs.gz skip
    csi_ref_Csi_valencia_1.0_chr1.mfa.gz skip

    starting CHR_02
    csi_ref_Csi_valencia_1.0_chr2.asn.gz skip
    csi_ref_Csi_valencia_1.0_chr2.fa.gz --> retrieving csi_ref_Csi_valencia_1.0_chr2.fa.gz ... done

Internet HTTP Transactions
[Figure: a browser (e.g. Chrome) on the local computer sends a URL and form parameters to a web server (e.g. Apache) on a remote computer; the server runs a script and returns the results as a web page (HTML).]

Internet HTTP/HTTPS: a stateless protocol (in principle)
  every interaction between browser and server is new
  there is no memory of the history of the exchange
  stateless interaction makes the job of the server much simpler
but… many times you need some state information
  authentication
  history
therefore cookies were invented

Internet Web applications usually have a form interface
  the information you type in a form is on your local computer, in your web browser application
  when you submit, the web browser packages the information in HTTP format and sends it to the server, on port 80 (HTTP) or 443 (HTTPS)
Request types
  GET – all information is sent in the URL
    a URL has a limited number of characters
    best for short messages (REST queries)
  POST – information is sent in a separate, invisible block (the request body)
    often used by forms
    necessary for large messages
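The difference between the two request types can be seen without contacting any server. This sketch uses only the standard library to build one GET and one POST request from the same parameters; the esearch address is the one listed for eutils in these notes, with the https:// scheme added as an assumption. Nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
params = {'db': 'protein', 'term': 'asthma', 'usehistory': 'y'}

encoded = urlencode(params)     # 'db=protein&term=asthma&usehistory=y'

# GET: the parameters travel in the URL, after the '?'
get_request = Request(base + '?' + encoded)

# POST: the URL stays short and the parameters travel in the request body
post_request = Request(base, data=encoded.encode('ascii'))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```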

Internet Internet Transaction
  server "listens" on a specific port (usually port 80)
  browser on the client sends an HTTP (Hypertext Transfer Protocol) header followed by a MIME (Multipurpose Internet Mail Extensions) formatted message (next page)
  User-Agent identifies your browser
    some sites don't like robots

    GET / HTTP/1.1[CRLF]
    Host: plantsp.genomics.purdue.edu[CRLF]
    Connection: close[CRLF]
    Accept-Encoding: gzip[CRLF]
    Accept:text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[CRLF]
    Accept-Language: en-us,en;q=0.5[CRLF]
    Accept-Charset: ISO ,utf-8;q=0.7,*;q=0.7[CRLF]
    User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/ Web-Sniffer/1.0.24[CRLF]
    Referer:

Internet Internet Transaction
MIME (Multipurpose Internet Mail Extensions) message – you should never have to see this

    Mime-Version: 1.0
    Content-type: text/plain

    This is the content

    Content-type: multipart/mixed; boundary="frontier"
    MIME-version: 1.0

    --frontier
    This is the body of the message.
    Content-type: application/octet-stream
    Content-transfer-encoding: base64

    gajwO4+n2Fy4FV3V7zD9awd7uG8/TITP/vIocxXnnf/5mjgQjcipBUL1b3uyLwAVtBLOP4nV
    LdIAhSzlZnyLAF8na0n7g6OSeej7aqIl3NIXCfxDsPsY6NQjSvV77j4hWEjlF/aglS6ghfju
    FgRr+OX8QZMI1OmR4rUJUS7xgoknalqj3HJvaOpeb3CFlNI9VGZYz6H6zuQBOWZzNB8glwpC
    --frontier--

Internet Internet Services
Many services on the internet can be accessed by Python programs
Types
  REST web service
    based on simple HTTP access
    simplest to implement (usually)
  non-REST Service API (application programming interface)
    formerly common, now fairly rare (e.g. SOAP)
    usually stable
    often reliable
  Direct database connection
    fairly rare
    very powerful querying capability
  Screen scraping
    pretend your program is a browser
    dig the information out of HTML pages
    unstable (authors change their pages)
    unreliable (URLs are often transient)

Internet Web Services (programmatic access)
Reliable, documented API
A few sites with web services
  NCBI
    search and retrieve from PubMed
    search and retrieve sequences
    eutils:
  EBI
    many applications, including clustalOmega, interproscan
  PDB
  Uniprot
  KEGG
  many others

Internet General
Web servers provide a service to the community. Do not overburden the service by
  sending too many queries
  sending queries too quickly
  trying to download the entire database (contact the resource for this)
Include identifying information such as name and email if requested
If you exceed the recommended usage, your IP address will be banned
  if you are running on a Purdue server, the entire server may be banned
NCBI
  no more than three URL requests per second
  limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time on weekdays
EMBL-EBI Web Services
  submit tool jobs in batches of no more than 30 at a time
  do not submit more until the results and processing are complete
  ensure that a valid email address is provided

Internet General
Pay close attention to the required syntax
  most REST servers use simple URLs – queries can be tested through the web page
  query begins with ?
  parameters are separated by &
  special characters are converted to numeric form (URL encoded)
    spaces are special characters
NCBI, for example: no spaces
  Incorrect: &id=352, 25125, 234
  Correct: &id=352,25125,234
  Incorrect: &term=biomol mrna[properties] AND mouse[organism]
  Correct: &term=biomol+mrna[properties]+AND+mouse[organism]
If you repeatedly send incorrect queries, your IP address may be banned
  if you are running on a Purdue server, the entire server may be banned
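The "Correct" forms above can be produced by code instead of by hand. This is a small sketch using the standard library's quote_plus; the helper names encode_term and encode_ids are hypothetical, and leaving the square brackets unescaped (safe='[]') is a choice made here so the result matches the example as written.

```python
from urllib.parse import quote_plus


def encode_term(term):
    """URL-encode an Entrez query: spaces become '+', brackets stay readable."""
    return quote_plus(term, safe='[]')


def encode_ids(ids):
    """Join UIDs with commas and no spaces."""
    return ','.join(str(uid) for uid in ids)


term = encode_term('biomol mrna[properties] AND mouse[organism]')
ids = encode_ids([352, 25125, 234])
print('&id=' + ids)
print('&term=' + term)
```

Libraries such as requests do this encoding automatically when the parameters are passed as a dictionary.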

Internet NCBI Entrez Utilities (eutils)
Eutils provide almost complete access to data at NCBI.
base URL
Most commonly used
  ESearch - eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
    UIDs matching a text query
    Results/downloads of a search on the History server
    Combines or limits UID datasets stored on the History server
  EFetch - eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
    Returns formatted data records for a list of input UIDs
    Returns formatted data records for a set of UIDs stored on the Entrez History server
  EPost - eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi
    Retrieve many UIDs
    Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset

Internet NCBI Entrez Utilities (eutils)
Less used
  EInfo
    List of the names of all valid Entrez databases
    Statistics for a single database, including lists of indexing fields and available link names
  EGQuery
    Find which databases have information. Returns the number of records contained in each Entrez database
  ESummary
    Returns document summaries (DocSums) for a list of input UIDs
    Returns DocSums for a set of UIDs stored on the Entrez History server
  ECitMatch
    Retrieves PubMed IDs (PMIDs) that correspond to a set of input citation strings
  ELink
    Returns UIDs linked to an input set of UIDs in either the same or a different Entrez database, or UIDs that match an Entrez query
  ESpell
    Provides spelling suggestions for terms within a single text query in a given database

history – saving and reusing searches
  all searches are based on UIDs (unique identifiers)
  NCBI supports a mechanism where large sets of UIDs can be stored on their server and then used for successive downloads (or further searches)
  first search using the usehistory=y option
    esearch.fcgi?db=<db>&term=<query>&usehistory=y
  esearch returns a WebEnv value and a QueryKey value, <WE> and <QK> below
  using WebEnv and QueryKey, retrieve blocks of records by UID
    efetch.fcgi?db=<db>&query_key=<QK>&WebEnv=<WE>&retmode=xml
  use options retstart=<n> and retmax=<n> to get blocks of records
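Retrieving "blocks of records" amounts to stepping retstart through the result set in increments of retmax. The sketch below only builds the efetch URLs and sends nothing. The function name efetch_block_urls, the WebEnv value EXAMPLE_WEBENV and the block size of 500 are made up for illustration, and the https:// scheme is an assumption.

```python
from urllib.parse import urlencode

EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'


def efetch_block_urls(db, query_key, webenv, count, block_size):
    """Return one efetch URL per block of at most block_size records."""
    urls = []
    for start in range(0, count, block_size):
        params = {'db': db, 'query_key': query_key, 'WebEnv': webenv,
                  'retmode': 'xml', 'retstart': start, 'retmax': block_size}
        urls.append(EFETCH + '?' + urlencode(params))
    return urls


# 3563 is the <Count> value in the esearch result shown in these notes
urls = efetch_block_urls('protein', '1', 'EXAMPLE_WEBENV', count=3563, block_size=500)
print(len(urls))
print(urls[0])
print(urls[-1])
```

When the URLs are actually fetched, a pause between requests keeps the script within the NCBI usage limits.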

<eSearchResult>
  <Count>3563</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv> NCID_1_ _ _9001_ _ _0MetA0_S_MegaStore_F_1 </WebEnv>
  <IdList>
    <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id>
    <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id>
    <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id>
    <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id>
  </IdList>
  <TranslationSet/>
  <TranslationStack>
    <TermSet>
      <Term>asthma[All Fields]</Term>
      <Field>All Fields</Field>
      <Explode>N</Explode>
    </TermSet>
    <OP>GROUP</OP>
  </TranslationStack>
  <QueryTranslation>asthma[All Fields]</QueryTranslation>
</eSearchResult>
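The values needed for the follow-up efetch call (Count, QueryKey, WebEnv) can be pulled out of a result like this with an XML parser. This sketch uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs; the WebEnv and Id values in the sample are invented, and parse_esearch is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

sample = """<eSearchResult>
  <Count>3563</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv> EXAMPLE_WEBENV </WebEnv>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Extract the fields needed for later history-based requests."""
    root = ET.fromstring(xml_text)
    return {
        'count': int(root.findtext('Count')),
        'query_key': root.findtext('QueryKey'),
        'webenv': root.findtext('WebEnv').strip(),   # drop surrounding whitespace
        'ids': [e.text for e in root.findall('IdList/Id')],
    }


result = parse_esearch(sample)
print(result)
```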

Internet NCBI Entrez Utilities (eutils)
request (end of the efetch URL)
    &WebEnv=NCID_1_ _ _9001_ _ _0MetA0_S_MegaStore_F_1
    &retmode=xml&retstart=10&retmax=5

part of response
    <?xml version="1.0" ?>
    <!DOCTYPE GBSet PUBLIC "-//NCBI//NCBI GBSeq/EN" "
    <GBSet>
      <GBSeq>
        <GBSeq_locus>NP_ </GBSeq_locus>
        <GBSeq_length>652</GBSeq_length>
        <GBSeq_moltype>AA</GBSeq_moltype>
        <GBSeq_topology>linear</GBSeq_topology>
        <GBSeq_division>PRI</GBSeq_division>
        <GBSeq_update-date>06-FEB-2018</GBSeq_update-date>
        <GBSeq_create-date>09-APR-2016</GBSeq_create-date>
        <GBSeq_definition>zinc finger protein 432 [Homo sapiens]</GBSeq_definition>
        <GBSeq_primary-accession>NP_ </GBSeq_primary-accession>
        <GBSeq_accession-version>NP_ </GBSeq_accession-version>
        <GBSeq_other-seqids>
          <GBSeqid>ref|NP_ |</GBSeqid>
          <GBSeqid>gi| </GBSeqid>
        </GBSeq_other-seqids>
        …

Internet Python requests package
  must install, not part of the standard distribution
    PyCharm->File->Settings->Project->Project Interpreter->+
  HTTP response code 200 means OK (i.e., success)
    success means a page was returned, not necessarily the result you wanted
  content (all of these belong to the response object returned by requests.get)
    response.content attribute
    response.text attribute
    response.iter_content() – automatically decodes gzip and deflate
    response.json()
    response.raw – no decoding
    response.raw.read()

    import requests

    # base address as listed for ESearch and EFetch; the https:// scheme is assumed
    ncbi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
    esearch = 'esearch.fcgi?'
    query = 'db=protein&term=asthma&usehistory=y'
    response = requests.get(ncbi + esearch + query)
    print(response)

output
    <Response [200]>

Internet Python – requests
print(response.content) – produces a byte string (newlines added below) no interpretation of characters – more for graphics than text print(response.text) probably what you want b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch //EN" " x>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_ _ _9001_ 995_ _0MetA0_S_MegaStore_F_1</WebEnv><IdList>\n<Id> </Id>\n<Id> </Id>\n<Id> 587</Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </I d>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n<Id> </Id>\n</IdList><TranslationSet/><TranslationStack> <TermSet> <Term>asthma [All Fields]</Term> <Field>All Fields</Field> <Count>3563</Count> <Explode>N</Explode> </TermSet> <OP>GROUP</OP> </TranslationStack><QueryTranslation>asthma[All Fields]</QueryTranslation></eSearchResult> \n' <?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch //EN" " <eSearchResult><Count>3563</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_ _ _9001_ _ _0MetA0_S_MegaStore_F_1</WebEnv><IdList> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id> <Id> </Id>

Internet Python – requests POST requests
  some servers require POST, some may use either GET or POST
  large inputs must be done as a POST
  the information is stored in a dictionary
    keys are parameter names

    import requests

    # base address as listed for ESearch and EFetch; the https:// scheme is assumed
    ncbi = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
    esearch = 'esearch.fcgi'    # no ? in URL
    params = {'db': 'protein', 'term': 'asthma', 'usehistory': 'y'}
    response = requests.post(ncbi + esearch, data=params)
    print('\n', response.text)

Internet Python – requests: Multipart Encoded Files
  multipart-encoded files (multipart MIME messages) are used for uploading files as part of a request

    import requests

    url = '<upload URL>'    # placeholder: fill in the address of the receiving service
    files = {'file': open('report.xls', 'rb')}
    r = requests.post(url, files=files)
    r.text

output
    { ... "files": { "file": "<censored...binary...data>" }, }

Internet Blast API (see also CloudBlast)
for information
Guidelines
  Do not contact the server more often than once every 10 seconds.
  Do not poll for any single RID more often than once a minute.
  Use the URL parameters email and tool, so that the NCBI can contact you if there is a problem.
  Run scripts on weekends or between 9 pm and 5 am Eastern time on weekdays if more than 50 searches will be submitted.
  BLAST runs more efficiently if multiple queries are sent as one. If your queries are short (less than a few hundred bases), merge them into one search of up to 1,000 bases

Internet Blast API (see also CloudBlast)
Parameters
  QUERY -- Search query: Accession, GI, or FASTA
  DATABASE -- BLAST database, a few listed below
    Nucleotide: nr, refseq_mrna, refseq_genomic
    Protein: nr, refseq, pdb
  PROGRAM -- BLAST program: blastn, megablast, blastp, blastx, tblastn, tblastx
  HITLIST_SIZE -- Number of database sequences to keep, Integer
  DESCRIPTIONS -- Number of descriptions to print (applies to HTML and Text), Integer
  ALIGNMENTS -- Number of alignments to print (applies to HTML and Text), Integer
  RID -- BLAST search request identifier, a string returned when the search was submitted
Blast returns an RID (run ID); we need to check it until the search is finished (this is called polling)

Internet Python -- Blast API
  RTOE (Request Time of Execution) is the estimated time the search will take

    import requests

    blast = '<BLAST URL API address>'   # placeholder: fill in the address of the BLAST URL API
    program = 'blastp'
    database = 'pdb'
    query = '''>AAG ARV1 [Homo sapiens]
    AMGNGGRSGCQYRCIECNQEAKELYRDYNHGVLKITICKSCQKPVDKYIEYDPVIILINAILCKAQAYRHILFNTQINIHGKLYLRWWQLQDSNQNTAPDDLIRYAKEWDF'''

    # command = 'Put&PROGRAM={}&DATABASE={}&QUERY={}'.format(program, database, query)
    command = {'CMD': 'Put',
               'PROGRAM': program,
               'DATABASE': database,
               'QUERY': query}
    print('command:', command)
    response = requests.post(blast, data=command)
    print(response.url)
    print('\n', response.text)
    print('response:', response)

we need to find this small bit of output
    <!--QBlastInfoBegin
        RID = 7Y28Y0M9014
        RTOE = 10
    QBlastInfoEnd

Internet Python – Blast API
Getting the RID, kind of clunky but works
  looking for
    <!--QBlastInfoBegin
        RID = 7Y28Y0M9014
        RTOE = 10
    QBlastInfoEnd

    # the whitespace in info_key must match the response exactly
    info_key = 'QBlastInfoBegin\n    RID = '
    info_begin = response.text.find(info_key) + len(info_key)
    info_end = response.text.find('\n', info_begin)     # the RID runs to the end of its line
    rid = response.text[info_begin:info_end]
    print(info_begin, info_end, rid)

Now I have to poll until the result is done; now I'm looking for
    QBlastInfoBegin
        Status=READY
    QBlastInfoEnd
Possible statuses, should deal with all cases
  WAITING
  FAILED
  UNKNOWN
  READY
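A less position-dependent way to dig the values out of the QBlastInfo block is a regular expression that does not care about the exact indentation. This is a sketch only: parse_qblast_info is a hypothetical helper, and the sample text is copied from the block shown above.

```python
import re

sample = """<!--QBlastInfoBegin
    RID = 7Y28Y0M9014
    RTOE = 10
QBlastInfoEnd
-->"""


def parse_qblast_info(text):
    """Return the key/value pairs found between QBlastInfoBegin and QBlastInfoEnd."""
    block = re.search(r'QBlastInfoBegin(.*?)QBlastInfoEnd', text, re.DOTALL)
    if block is None:
        return {}
    # one 'KEY = value' or 'KEY=value' pair per line, with any leading spaces or tabs
    pairs = re.findall(r'^[ \t]*(\w+)[ \t]*=[ \t]*(\S+)', block.group(1), re.MULTILINE)
    return dict(pairs)


info = parse_qblast_info(sample)
print(info['RID'], info['RTOE'])
```

The same function also handles the status block, since Status=READY has the same KEY=value shape.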

Internet NCBI Entrez Utilities (eutils)
polling loop (uses requests, blast and rid from the previous slides)

    import time

    maxtries = 10
    tries = 0
    status = 'UNKNOWN'

    # ask for the status of the search, not for a new search
    command = {'CMD': 'Get', 'FORMAT_OBJECT': 'SearchInfo', 'RID': rid}

    # Polling loop
    while tries < maxtries:
        response = requests.post(blast, data=command)
        # print('\n', response.text)
        status_key = 'QBlastInfoBegin\n\tStatus='
        status_begin = response.text.find(status_key) + len(status_key)
        status_end = response.text.find('\n', status_begin)
        status = response.text[status_begin:status_end]
        print(status)
        if status == 'READY':
            break
        tries += 1
        # don't poll too often, ncbi requests no more than 1/min
        time.sleep(60)

    if status != 'READY':
        # polling reached limit
        print('unable to find result ({}) in {} tries'.format(rid, tries))
    else:
        # get the final result
        command = {'CMD': 'Get', 'FORMAT_TYPE': 'XML', 'RID': rid}
        response = requests.post(blast, data=command)
        print(response.text)
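The polling logic can be separated from the network call, which makes it possible to try it out without contacting NCBI and to handle all four statuses. In this sketch get_status is any function that returns the current status string; the function name poll_until_ready and the extra 'TIMEOUT' label are inventions for illustration, not part of the BLAST API.

```python
import time


def poll_until_ready(get_status, max_tries=10, delay=60, sleep=time.sleep):
    """Call get_status() until it stops returning WAITING or max_tries is reached.

    Returns (status, number_of_tries). 'TIMEOUT' means the limit was reached.
    """
    for attempt in range(1, max_tries + 1):
        status = get_status()
        if status in ('READY', 'FAILED', 'UNKNOWN'):
            return status, attempt      # nothing more to wait for
        sleep(delay)                    # WAITING: pause before asking again
    return 'TIMEOUT', max_tries


# Dry run with canned replies and a fake sleep that only records the delays
replies = iter(['WAITING', 'WAITING', 'READY'])
waits = []
result = poll_until_ready(lambda: next(replies), delay=60, sleep=waits.append)
print(result, waits)
```

In real use, get_status would post the status request and parse the reply, and sleep would be left at its default.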

Internet NCBI Entrez Utilities (eutils) Tabular Output Text Output
QBlastInfoBegin Status=READY QBlastInfoEnd --></p> <PRE> # blastp # Iteration: 0 # Query: AAG ARV1 [Homo sapiens] # RID: 7Y4XKJJS015 # Database: pdb # Fields: query id, subject ids, % identity, % positives, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score # 1 hits found AAG gi| |pdb|4NV4|A;gi| |pdb|4NV4|B </PRE> Tabular Output <p><p> <PRE> BLASTP Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25: Reference for compositional score matrix adjustment: Stephen F. Altschul, John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis, Alejandro A. Schaffer, and Yi-Kuo Yu (2005) "Protein database searches using compositionally adjusted substitution matrices", FEBS J. 272: RID: 7Y4XKJJS015 Database: PDB protein database 96,611 sequences; 24,370,354 total letters Query= AAG ARV1 [Homo sapiens] Length=111 Score E Sequences producing significant alignments: (Bits) Value 4NV4_A Chain A, 1.8 Angstrom Crystal Structure of Signal Pept ALIGNMENTS >4NV4_A Chain A, 1.8 Angstrom Crystal Structure of Signal Peptidase I from Bacillus anthracis. 4NV4_B Chain B, 1.8 Angstrom Crystal Structure of Signal Peptidase I Length=176 Score = 26.2 bits (56), Expect = 9.8, Method: Compositional matrix adjust. 
Identities = 11/30 (37%), Positives = 16/30 (53%), Gaps = 0/30 (0%) Query RHILFNTQINIHGKLYLRWWQLQDSNQNTA 98 RH F GK+ LR+W +QD N + Sbjct RHFGFVKADTVVGKVDLRYWPIQDVQTNFS 174 Posted date: Feb 8, :03 PM Number of letters in database: 24,370,354 Number of sequences in database: 96,611 Lambda K H Gapped Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Sequences: 96611 Number of Hits to DB: Number of extensions: Number of successful extensions: 485 Number of sequences better than 100: 124 Number of HSP's better than 100 without gapping: 0 Number of HSP's gapped: 485 Number of HSP's successfully gapped: 124 Length of query: 111 Length of database: Length adjustment: 76 Effective length of query: 35 Effective length of database: Effective search space: Effective search space used: T: 11 A: 40 X1: 15 (7.0 bits) X2: 38 (14.6 bits) X3: 64 (24.7 bits) S1: 38 (19.2 bits) S2: 47 (22.7 bits) ka-blk-alpha gapped: 1.9 ka-blk-alpha ungapped: ka-blk-alpha_v gapped: ka-blk-alpha_v ungapped: ka-blk-sigma gapped: Text Output

Internet NCBI Entrez Utilities (eutils) XML Output
<?xml version="1.0"?> <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" " <BlastOutput> <BlastOutput_program>blastp</BlastOutput_program> <BlastOutput_version>BLASTP </BlastOutput_version> <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25: </BlastOutput_reference> <BlastOutput_db>pdb</BlastOutput_db> <BlastOutput_query-ID>Query_46535</BlastOutput_query-ID> <BlastOutput_query-def>AAG ARV1 [Homo sapiens]</BlastOutput_query-def> <BlastOutput_query-len>111</BlastOutput_query-len> <BlastOutput_param> <Parameters> <Parameters_matrix>BLOSUM62</Parameters_matrix> <Parameters_expect>10</Parameters_expect> <Parameters_gap-open>11</Parameters_gap-open> <Parameters_gap-extend>1</Parameters_gap-extend> <Parameters_filter>F</Parameters_filter> </Parameters> </BlastOutput_param> <BlastOutput_iterations> <Iteration> <Iteration_iter-num>1</Iteration_iter-num> <Iteration_query-ID>Query_46535</Iteration_query-ID> <Iteration_query-def>AAG ARV1 [Homo sapiens]</Iteration_query-def> <Iteration_query-len>111</Iteration_query-len> <Iteration_hits> <Hit> <Hit_num>1</Hit_num> <Hit_id>gi| |pdb|4NV4|A</Hit_id> <Hit_def>Chain A, 1.8 Angstrom Crystal Structure of Signal Peptidase I from Bacillus anthracis. 
>gi| |pdb|4NV4|B Chain B, 1.8 Angstrom Crystal Structure of Signal Peptidase I from Bacillus anthracis.</Hit_def> <Hit_accession>4NV4_A</Hit_accession> <Hit_len>176</Hit_len> <Hit_hsps> <Hsp> <Hsp_num>1</Hsp_num> <Hsp_bit-score> </Hsp_bit-score> <Hsp_score>56</Hsp_score> <Hsp_evalue> </Hsp_evalue> <Hsp_query-from>69</Hsp_query-from> <Hsp_query-to>98</Hsp_query-to> <Hsp_hit-from>145</Hsp_hit-from> <Hsp_hit-to>174</Hsp_hit-to> <Hsp_query-frame>0</Hsp_query-frame> <Hsp_hit-frame>0</Hsp_hit-frame> <Hsp_identity>11</Hsp_identity> <Hsp_positive>16</Hsp_positive> <Hsp_gaps>0</Hsp_gaps> <Hsp_align-len>30</Hsp_align-len> <Hsp_qseq>RHILFNTQINIHGKLYLRWWQLQDSNQNTA</Hsp_qseq> <Hsp_hseq>RHFGFVKADTVVGKVDLRYWPIQDVQTNFS</Hsp_hseq> <Hsp_midline>RH F GK+ LR+W +QD N +</Hsp_midline> </Hsp> </Hit_hsps> </Hit> </Iteration_hits> <Iteration_stat> <Statistics> <Statistics_db-num>96611</Statistics_db-num> <Statistics_db-len> </Statistics_db-len> <Statistics_hsp-len>0</Statistics_hsp-len> <Statistics_eff-space>0</Statistics_eff-space> <Statistics_kappa>0.041</Statistics_kappa> <Statistics_lambda>0.267</Statistics_lambda> <Statistics_entropy>0.14</Statistics_entropy> </Statistics> </Iteration_stat> </Iteration> </BlastOutput_iterations> </BlastOutput> Process finished with exit code 0 XML Output

Internet XML
Web services frequently deliver their results in XML format
XML in brief
  XML uses tags in <> to label the parts of a file
  Much easier to find the various parts of the content
  More reliable
    Structure of content can be validated to improve reliability
Main Features
  All tags must have opening AND closing tags; they must be properly nested
  Tags are case sensitive
  Tags can have attributes (values must be quoted)
    <sequence name="lacZ">
  XML documents have a single root element
  Can be defined in a DTD (document type definition)
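Each of the main features above is something a parser will enforce. The sketch below feeds one good and four broken fragments to the standard library parser (xml.etree.ElementTree, used here instead of lxml so that it runs without extra installs); the fragments and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


examples = {
    'good':        '<sequence name="lacZ"><residues>ATG</residues></sequence>',
    'bad nesting': '<a><b></a></b>',
    'wrong case':  '<Sequence></sequence>',
    'two roots':   '<a/><b/>',
    'unquoted':    '<sequence name=lacZ/>',
}

for label, text in examples.items():
    print('{:12s} {}'.format(label, is_well_formed(text)))
```

This checks only well-formedness; validating against a DTD is a separate step.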

Internet XML
Learning XML
  W3Schools tutorial - Introduction
Many internet data standards are defined in XML
  XHTML the latest version of HTML
  WSDL for describing available web services
  WAP and WML as markup languages for handheld devices
  RSS languages for news feeds
  RDF and OWL for describing resources and ontology
  SMIL for describing multimedia for the web
For web services, many projects define their own XML
  XML documents are usually defined in a DTD (Document Type Definition)

Internet XML
XML is tree structured

    <bookstore>
      <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <price>29.99</price>
      </book>
      <book category="WEB">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>

Internet XML
XML is not really designed to be read by humans
  XML is often very verbose
  most browsers have a built-in XML viewer
    open file://C:\<some path here>/file.xml
[Figure: Blast XML output in Chrome]

Internet XML Iteration loops over queries
Iteration_hits loops over the hits

Internet Python – lxml
Many Python XML packages exist; lxml is widely used
  reads and writes XML
  supports XPath queries
  considers XML documents to be a tree of XML elements
  lxml is written in C, so the Python debugger cannot step into it

Internet Python – lxml
Create element tree from a file

    from lxml import etree

    blast = etree.parse('blast.xml')
    # or, equivalently
    blast = etree.ElementTree(file='blast.xml')

Create element tree from a string
  etree.parse must read a "file like" object

    from io import StringIO

    # read xml file as a string; don't need this if it is an http response
    with open('blast.xml', 'r') as blastxml:
        xml = blastxml.read()

    blast = etree.parse(StringIO(xml))

Internet Python – lxml
XPath: a query language for XML
  in depth: XML->XPath tutorial at W3schools.com
  selects nodes from an XML document tree
    element, attribute, text, namespace, processing-instruction, comment, and document nodes
  selecting nodes is done by matching an XPath expression with your XML tree
    nodename – selects all matching nodes
    / – selects from root node
    // – selects from nodes anywhere in the current document
    . – current node
    .. – parent node
    @ – selects attributes

Internet Python – lxml
XPath (example from W3schools)
sample expressions
  bookstore - selects one node with the name bookstore
  /bookstore - same, because bookstore is the root node
  bookstore/book - selects only book nodes that are children of bookstore (3)
  //author - selects all author nodes
    <author>Giada De Laurentiis</author>
    <author>J K. Rowling</author>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <author>Erik T. Ray</author>
  book//author - selects the authors of books only
    <author>Giada De Laurentiis</author>
    <author>J K. Rowling</author>
    <author>Erik T. Ray</author>
  //@lang - selects all attributes named lang, no matter where they are

    <?xml version="1.0" encoding="UTF-8"?>
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <price>29.99</price>
      </book>
      <video category="web">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
      </video>
      <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <price>39.95</price>
      </book>
    </bookstore>
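The sample expressions can be tried on a cut-down bookstore. This sketch uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, instead of lxml so that it runs without extra installs. Paths are written relative to the root element, and the attribute query is approximated by selecting the elements that carry the attribute.

```python
import xml.etree.ElementTree as ET

bookstore = ET.fromstring("""<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
  </book>
  <video category="web">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
  </video>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
  </book>
</bookstore>""")

# bookstore/book: book elements that are children of the root
books = bookstore.findall('book')

# //author: author elements anywhere below the root
all_authors = [a.text for a in bookstore.findall('.//author')]

# book//author: authors of books only (the video authors are skipped)
book_authors = [a.text for a in bookstore.findall('book//author')]

# //@lang: ElementTree cannot return attributes directly, so select
# every element that has a lang attribute and read the value from it
langs = [e.get('lang') for e in bookstore.findall('.//*[@lang]')]

print(len(books), all_authors, book_authors, langs)
```

With lxml the expressions from the slide, including //@lang, can be passed to xpath() as written.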

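Internet Python – lxml: blast example
let's say I want a list of hits and e-values
  I know there is only one query, so I don't care about iterations
  hits - //Hit finds Hit elements anywhere
  for each hit
    Hit_id (child of <Hit>)
    Hsp_evalue (anywhere below the hit; from a hit element the path should be the relative .//Hsp_evalue, because //Hsp_evalue starts again at the document root)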

Internet Python – lxml

    from lxml import etree

    blast = etree.parse('hemoglobin_blast.xml')
    hits = blast.xpath('//Hit')
    print(hits)
    for hit in hits:
        print(hit)

output
    [<Element Hit at 0x832878>, <Element Hit at 0x832850>, <Element Hit at 0x832828>, <Element Hit at 0x832800>, <Element Hit at 0x8325f8>, <Element Hit at 0x8324e0>, <Element Hit at 0x832490>, <Element Hit at 0x8324b8>, <Element Hit at 0x832468>, <Element Hit at 0x832288>, <Element Hit at 0x832260>, <Element Hit at 0x8320f8>]
    <Element Hit at 0x832878>
    <Element Hit at 0x832850>
    <Element Hit at 0x832828>
    <Element Hit at 0x832800>
    <Element Hit at 0x8325f8>
    <Element Hit at 0x8324e0>
    <Element Hit at 0x832490>
    <Element Hit at 0x8324b8>
    <Element Hit at 0x832468>
    <Element Hit at 0x832288>
    <Element Hit at 0x832260>
    <Element Hit at 0x8320f8>

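Internet Python – lxml
  from lxml import etree

  blast = etree.parse('hemoglobin_blast.xml')
  hits = blast.xpath('//Hit')
  print(hits)
  for hit in hits:
      hit_id = hit.xpath('Hit_id')
      print('\n{}'.format(hit_id[0].text))
      # './/Hsp_evalue' is relative to this hit; a bare '//Hsp_evalue'
      # starts at the document root and returns every Hsp_evalue in the
      # file for every hit
      hsps = hit.xpath('.//Hsp_evalue')
      for hsp in hsps:
          print(' {}'.format(hsp.text))

output as captured on the slide (identifiers and most mantissas were lost in extraction)
  gi| |ref|XP_ |
   e-79
   e-68
   e-59
   e-57
   e-54
   e-27
   e-26
   e-25
   1.9171e-25
   e-25
   e-20
   e-20
  gi| |ref|XP_ |
  …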
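The difference between searching from the document root and searching below one hit is easy to check on a tiny document. The sketch below is an illustration only: the XML is a hypothetical, heavily trimmed stand-in for BLAST output (the real format nests hits more deeply), the ids and E-values are made up, and the standard-library ElementTree module is assumed in place of lxml so the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed BLAST-like document; ids and E-values are made up
BLAST = """<BlastOutput>
  <Hit>
    <Hit_id>hit_A</Hit_id>
    <Hit_hsps>
      <Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp>
      <Hsp><Hsp_evalue>2e-10</Hsp_evalue></Hsp>
    </Hit_hsps>
  </Hit>
  <Hit>
    <Hit_id>hit_B</Hit_id>
    <Hit_hsps>
      <Hsp><Hsp_evalue>3e-05</Hsp_evalue></Hsp>
    </Hit_hsps>
  </Hit>
</BlastOutput>"""


def evalues_by_hit(root):
    """Map each Hit_id to the E-values found below that hit only."""
    result = {}
    for hit in root.findall('.//Hit'):
        hit_id = hit.findtext('Hit_id')
        # the search starts at this hit, so other hits are not included
        result[hit_id] = [float(e.text) for e in hit.findall('.//Hsp_evalue')]
    return result


root = ET.fromstring(BLAST)
per_hit = evalues_by_hit(root)

# searching from the root returns the E-values of every hit at once
whole_document = [e.text for e in root.findall('.//Hsp_evalue')]

for hit_id, evalues in per_hit.items():
    print(hit_id, evalues)
print('all E-values in the document:', whole_document)
```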

ESearch
ESearch provides all of the abilities of the web interface

parameters
  db – the NCBI database to search; must be a valid Entrez database. Here are a few:

      Entrez Database   UID          E-utility Database Name
      Gene              Gene ID      gene
      Genome            Genome ID    genome
      Nucleotide        GI number    nuccore
      Protein           GI number    protein
      PubMed            PMID         pubmed
      Taxonomy          TaxID        taxonomy

  term – Entrez text query (must be URL encoded)
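Because the term must be URL encoded, it is usually easier to let a library build the query string. The sketch below assumes the public E-utilities base URL (it is not shown on the slide) and uses a made-up search term; esearch_url is a hypothetical helper name. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed E-utilities endpoint; not given on the slide
ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def esearch_url(db, term, **optional):
    """Build an ESearch URL; urlencode takes care of the URL encoding."""
    params = {'db': db, 'term': term}
    params.update(optional)
    return ESEARCH + '?' + urlencode(params)


# illustrative query: spaces become '+', brackets become %5B and %5D
url = esearch_url('protein', 'hemoglobin AND human[organism]', retmax=5)
print(url)
```

The resulting string could be passed to requests.get(); alternatively, requests.get(ESEARCH, params={...}) performs the same encoding.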

Internet ESearch Optional parameters
  usehistory='y' – save the results on the NCBI server for further use; returns <webenv_string>
  WebEnv=<webenv string> – use a previously saved search; requires usehistory='y'
  retstart – sequential index of the first UID in the retrieved set to be shown in the XML output (default=0)
  retmax – total number of UIDs from the retrieved set to be shown in the XML output (default=20)
  rettype – retrieval type: 'uilist' (default) displays the standard XML output; 'count' displays only the <Count> tag
  retmode – retrieval mode: format of the returned output, 'xml' (default) or 'json'
  sort – specifies the method used to sort UIDs in the ESearch output. The available values may be found in the Display Settings menu on an Entrez search results page. If usehistory is set to 'y', the UIDs are loaded onto the History Server in the specified sort order and will be retrieved in that order by ESummary or EFetch. Example values: 'relevance' and 'name' for Gene, 'first+author' and 'pub+date' for PubMed.
  field – if used, the entire search term will be limited to the specified Entrez field. The following two URLs are equivalent:
      esearch.fcgi?db=pubmed&term=asthma&field=title
      esearch.fcgi?db=pubmed&term=asthma[title]
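A short sketch of how retstart and retmax work together with the count reported by ESearch. The XML below is a hypothetical, shortened response with made-up values; its element names (Count, RetMax, RetStart, QueryKey, WebEnv, IdList/Id) are assumed from the usual ESearch XML output, and parse_esearch and page_starts are illustrative helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical ESearch response (usehistory='y'); all values are made up
SAMPLE = """<eSearchResult>
  <Count>45</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Pull out the fields needed for follow-up requests."""
    root = ET.fromstring(xml_text)
    return {
        'count': int(root.findtext('Count')),
        'webenv': root.findtext('WebEnv'),
        'query_key': root.findtext('QueryKey'),
        'ids': [e.text for e in root.findall('IdList/Id')],
    }


def page_starts(count, retmax=20):
    """retstart values needed to walk through count UIDs, retmax at a time."""
    return list(range(0, count, retmax))


info = parse_esearch(SAMPLE)
starts = page_starts(info['count'], retmax=20)
print(info)
print('retstart values:', starts)
```

Each retstart value would be sent together with the saved WebEnv string, so the search itself only has to run once.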

Internet NCBI API keys
  On May 1, 2018, NCBI will begin using API keys that offer enhanced levels of supported access to the E-utilities.
  After May 1, any site (IP address) posting more than 3 requests per second to the E-utilities without an API key will receive an error message.
  By including an API key, a site can post up to 10 requests per second by default. Higher rates are available by request.
  Obtain an API key now from the Settings page of your NCBI account.
  Include the API key in each E-utility request using the new api_key parameter.
  Example request including an API key:
      esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE
  Example error message if rates are exceeded:
      {"error":"API rate limit exceeded","count":"11"}
  Only one API key is allowed per NCBI account.
  A user may request a new key at any time. The new key will invalidate any old API key associated with that NCBI account.
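One way to stay under these limits is to space requests out on the client side. The sketch below is an assumption about how that could be done, not something NCBI prescribes: min_interval, Throttle and add_api_key are hypothetical helpers, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time


def min_interval(api_key=None):
    """Seconds between requests: 3 per second without a key, 10 with one."""
    per_second = 10 if api_key else 3
    return 1.0 / per_second


def add_api_key(params, api_key=None):
    """Return a copy of the request parameters with api_key added if given."""
    out = dict(params)
    if api_key:
        out['api_key'] = api_key
    return out


class Throttle:
    """Block in wait() until at least `interval` seconds since the last call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self.last = None

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


# usage: call throttle.wait() immediately before each E-utility request
throttle = Throttle(min_interval(api_key=None))
params = add_api_key({'db': 'pubmed', 'id': '123456'}, api_key='ABCDE')
print(params)
```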

Internet EMBL-EBI Web Services
EMBL-EBI services are usually much better documented than NCBI's. Use them if you have a choice.

Internet REST web services
  Query sent as a simple GET or POST
  Response returned as an XML document
  Usually provide methods for discovering services and parameters through queries

  InterProScan at EBI (finds protein motifs)
  Services
    parameters – returns the list of parameter names
    parameterdetails – get detailed information about a parameter
    resulttypes – get the available result types for a finished job
    status – get the status of a submitted job, given the jobID
    result – get the job result of the specified type
    run – submit a job with the specified parameters; returns the jobID

Internet Interproscan
  import sys
  import time
  import requests

  # The endpoint URLs were lost when the slide was extracted; '...' marks each gap.
  run = '...'       # URL of the run service
  status = '...'    # URL of the status service (the job id is appended to it)
  result = '...'    # URL of the result service (the job id is appended to it)

  title = 'globin'
  sequence = '''>sp|P |HBA_HUMAN RecName: Full=Hemoglobin subunit alpha; AltName: Full=Alpha-globin; AltName: Full=Hemoglobin alpha chain
  MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNA
  VAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSK
  YR'''

  # send the initial query
  # (the original dictionary held one more key/value pair, also lost in extraction)
  command = {'title': title, 'sequence': sequence}
  response = requests.post(run, data=command)
  job_id = response.text
  print('job {} submitted'.format(job_id))

  # poll for job completion
  maxtries = 10
  finished = False
  for tries in range(1, maxtries + 1):
      response = requests.get(status + job_id)
      print(' polling... response->{}'.format(response.text))
      if 'FINISHED' in response.text:
          finished = True
          break
      # don't poll too often
      time.sleep(20)

  if not finished:
      # polling reached limit
      print('unable to find result {} in {} tries'.format(job_id, tries))
      sys.exit(1)

  print('interproscan {} finished'.format(job_id))

  # get the final result
  result += job_id + '/{}'.format('xml')
  response = requests.get(result)
  print('\n', response.text)
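The submit-then-poll pattern can be separated from the network calls, which makes the retry logic easy to check on its own. This is a sketch under stated assumptions: poll is a hypothetical helper, the status function is passed in as an argument, and a fake service that answers RUNNING twice stands in for the real one.

```python
import time


def poll(get_status, job_id, maxtries=10, delay=20, sleep=time.sleep):
    """Ask get_status(job_id) until it reports FINISHED.

    Returns the number of attempts used, or None if maxtries was reached.
    """
    for attempt in range(1, maxtries + 1):
        state = get_status(job_id)
        if 'FINISHED' in state:
            return attempt
        if attempt < maxtries:
            sleep(delay)  # don't poll too often
    return None


# fake service for illustration: RUNNING twice, then FINISHED
replies = iter(['RUNNING', 'RUNNING', 'FINISHED'])
tries = poll(lambda job_id: next(replies), 'job-1', sleep=lambda seconds: None)
print('finished after', tries, 'tries')

# a job that never finishes gives up after maxtries
gave_up = poll(lambda job_id: 'RUNNING', 'job-2', maxtries=2,
               sleep=lambda seconds: None)
print('gave up:', gave_up is None)
```

Against a real service, get_status could be a small function that calls requests.get() on the status URL plus the job id and returns the response text.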

Internet Interproscan beginning of output
  C:\Users\michael\PycharmProjects\work\venv\Scripts\python.exe C:/Users/michael/PycharmProjects/work/interproscan.py
  job iprscan5-R p1m submitted
   polling... response->RUNNING
   polling... response->FINISHED
  interproscan iprscan5-R p1m finished

  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <protein-matches xmlns=" interproscan-version=" ">
    <protein>
      <sequence md5="6077c452d1dc b2b179e2294c7">MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR</sequence>
      <xref desc="HBA_HUMAN RecName: Full=Hemoglobin subunit alpha; AltName: Full=Alpha-globin; AltName: Full=Hemoglobin alpha chain" db="sp" id="P "/>
      <matches>
        <fingerprints-match evalue="1.1E-8" graphscan="IIII">
          <signature ac="PR00815" desc="Pi haemoglobin signature" name="PIHAEM">
            <entry ac="IPR002339" desc="Haemoglobin, pi" name="Haemoglobin_pi" type="FAMILY">
              <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO: " name="oxygen binding"/>
              <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO: " name="heme binding"/>
              <go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO: " name="oxygen transport"/>
              <go-xref category="CELLULAR_COMPONENT" db="GO" id="GO: " name="hemoglobin complex"/>
              <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO: " name="iron ion binding"/>
            </entry>
            <models>
              <model ac="PR00815" desc="Pi haemoglobin signature" name="PIHAEM"/>
            </models>
            <signature-library-release library="PRINTS" version="42.0"/>
          </signature>
          <locations>
            <fingerprints-location motifNumber="2" pvalue="0.0" score="41.67" start="105" end="116"/>
            <fingerprints-location motifNumber="3" pvalue="0.0" score="34.62" start="4" end="29"/>
            <fingerprints-location motifNumber="1" pvalue="0.0" score="55.0" start="49" end="58"/>
            <fingerprints-location motifNumber="4" pvalue="0.0" score="36.36" start="73" end="83"/>
          </locations>
        </fingerprints-match>
        <fingerprints-match evalue="1.1E-34" graphscan="IIIII">
          <signature ac="PR00612" desc="Alpha haemoglobin signature" name="ALPHAHAEM">
            <entry ac="IPR002338" desc="Haemoglobin, alpha-type" name="Haemoglobin_a-typ" type="FAMILY">

Internet Screen scraping
  Dig the information you want out of HTML
  Unreliable – vendors change their HTML all the time

  Polling
  In the InterPro example we must
    Screen scrape the results ID (iprscan5-S es)
    Repeatedly send a query to retrieve the results
    Check for success
    Repeat until the results are available
    Finally, screen scrape the motif information from the file

Internet Find the form
Wrong! This form is the search-by-accession form (upper right corner)
  <form id="local-search" name="local-search" action="/interpro/search" method="get">
    <fieldset>
      <div class="left">
        <label>
          <input type="text" name="q" id="local-searchbox" onblur="displaySearchInterPro(this);" onfocus="hideSearchInterPro(this);" value=""/>
        </label>
        <span class="examples">Examples:
          <a href="/interpro/search?q=IPR020405" title="InterPro accession or a number (e.g. IPR or 20405)">IPR020405</a>,
          <a href="/interpro/search?q=kinase" title="Free text search (e.g. Kinase)">kinase</a>,
          <a href="/interpro/search?q=P51587" title="UniProtKB Protein accession (e.g. P51587)">P51587</a>,
          <a href="/interpro/search?q=PF02932" title="Member database signature accession (e.g. PF02932)">PF02932</a>,
          <a href="/interpro/search?q=GO: " title="GO term search (e.g. GO: )">GO: </a>
        </span>
      </div>
      <div class="right">
        <input class="submit" id="searchsubmit" title="Search InterPro" type="submit" value="Search"/>
    </fieldset>
  </form>

Internet Find the correct form
  <h2>InterProScan sequence search</h2>
  <p>This form allows you to scan your sequence for matches against the InterPro protein signature databases, using maximum length of 40,000 amino acid long.<br/>
  Please note that you can only scan one sequence at a time. </p>
  <form id="sequence_box_form" action="/interpro/sequence-search" method="post">
    <fieldset>
      <legend>Analyse your protein sequence</legend>
      <div class="form_row">
        <textarea id="sequenceBoxId" name="queryString" rows="6"></textarea>
      </div>
      <div id="advancedOptions">
        deleted huge section here
      <input id="showAdvancedOptions" name="showAdvancedOptions" type="hidden" value="false"/>
      <script defer="defer" type="text/javascript">
        $("#advancedOptions").accordion({
          collapsible:true,
          heightStyle: "content",/*height will adjust to content*/
          active: false,
          activate:function (event, ui) {
            var isActive = $("#advancedOptions").accordion("option", "active");
            $('input[name=showAdvancedOptions]').val(isActive);
          }
        });
      </script>
      <input id="leaveIt" name="leaveIt" type="hidden" value=""/>
      <div style="clear:both;/*For IE6*/">
        <input name="submit" value="Search" class="submit" type="submit"/> |
        <a href="" onclick="document.getElementById('sequenceBoxId').reset();" class="reset">Clear</a>
        <a href="#" id="eg_p52647" class="reset">Example protein sequence</a>
    </fieldset>
  </form>

Internet Form information
  chrome – web developer extension + Form Tools
  firefox – firebug, web developer
  safari – web inspector

Internet Chrome web developer + form tools

Internet Forms – web developer tools No Yes

Internet
  <div class="grid_24">
  <div class="error_msg">
    <span class="ico_loading"></span>
    <h2>Your job is currently running... please be patient</h2>
    <p> The result of your job will appear in this browser window.
      <span id="isJavascriptOn" style="display: none;"> This page refreshes automatically every 20 seconds. </span>
      <noscript> Please <a href="/interpro/sequencesearch/iprscan5-S es">refresh</a> this page from time to time to see if your results are ready. </noscript>
    </p>
    <p>You may bookmark this page to view your results later if you wish. Results are stored for 7 days.</p>
    <p>Job ID: <a href="/interpro/sequencesearch/iprscan5-S es">iprscan5-S es</a></p>
  </div>
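Scraping the job ID out of a page like this can be sketched with the standard-library HTML parser. The page text below is a shortened copy of the slide's HTML with a made-up job id (iprscan5-EXAMPLE), because the real id was garbled in extraction; JobPage and scrape are hypothetical names. As the screen scraping slide warns, this breaks as soon as the site changes its markup.

```python
from html.parser import HTMLParser

# Shortened "job running" page; the job id is made up for illustration
PAGE = """<div class="error_msg">
<h2>Your job is currently running... please be patient</h2>
<p>Job ID: <a href="/interpro/sequencesearch/iprscan5-EXAMPLE">iprscan5-EXAMPLE</a></p>
</div>"""


class JobPage(HTMLParser):
    """Collect result links and <h2> headings from the page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            if '/sequencesearch/' in href:
                self.links.append(href)
        elif tag == 'h2':
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == 'h2':
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.headings.append(data.strip())


def scrape(page):
    """Return (job_id, still_running) for a results page."""
    parser = JobPage()
    parser.feed(page)
    # the job id is the last path component of the result link
    job_id = parser.links[0].rsplit('/', 1)[-1] if parser.links else None
    running = any('currently running' in h for h in parser.headings)
    return job_id, running


job_id, running = scrape(PAGE)
print(job_id, running)
```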

Internet Phobius (http://phobius.sbc.su.se/)
  <form method="post" enctype="multipart/form-data" action="/cgi-bin/predict.pl">
    <p>Paste your protein sequence here in Fasta format:</p>
    <textarea name="protseq" rows="6" cols="80"></textarea><br>
    <b style="font-family: helvetica,arial,sans-serif;"> Or: </b>
    <span style="font-family: helvetica,arial,sans-serif;"> Select the sequencefile you wish to use </span>
    <input style="font-family: helvetica,arial,sans-serif;" type="file" name="protfile" size="15"><br>
    <p>Select output format:</p>
    <input name="format" type="radio" value="short">Short<br>
    <input name="format" type="radio" value="nog">Long without Graphics<br>
    <input name="format" type="radio" value="plp" checked="checked">Long with Graphics
    <p> <input type="submit"> <input type="reset"> </p>
  </form>
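Without a browser extension, the same form information (action, method and field names) can be pulled out with a few lines of Python. This is a minimal sketch using the standard-library HTML parser on a shortened copy of the Phobius form; FormFields is a hypothetical class name, and the parser only records what is needed to build a POST request.

```python
from html.parser import HTMLParser

# Shortened copy of the Phobius form from the slide
FORM = """<form method="post" enctype="multipart/form-data" action="/cgi-bin/predict.pl">
<textarea name="protseq" rows="6" cols="80"></textarea>
<input type="file" name="protfile" size="15">
<input name="format" type="radio" value="short">
<input name="format" type="radio" value="nog">
<input name="format" type="radio" value="plp" checked="checked">
<input type="submit"> <input type="reset">
</form>"""


class FormFields(HTMLParser):
    """Record the form's action/method and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.form = {}
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.form = {'action': attrs.get('action'),
                         'method': attrs.get('method')}
        elif tag in ('input', 'textarea', 'select') and attrs.get('name'):
            # keep every value offered for a field (e.g. radio buttons)
            values = self.fields.setdefault(attrs['name'], [])
            if attrs.get('value') is not None:
                values.append(attrs['value'])


parser = FormFields()
parser.feed(FORM)
print(parser.form)
print(parser.fields)
```

The field names found this way (protseq, format) are the keys to use in the dictionary that is posted to the form's action URL.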

Internet Phobius
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "
  <html xmlns=" lang="en-US" xml:lang="en-US">
  <head>
  <title>Phobius prediction</title>
  <meta http-equiv="Content-Type" content="text/html; charset=iso " />
  </head>
  <body>
  <h2>Phobius prediction</h2><hr /><h4>Prediction of AEE </h4><pre>ID AEE
  FT SIGNAL
  FT REGION N-REGION.
  FT REGION H-REGION.
  FT REGION C-REGION.
  FT TOPO_DOM NON CYTOPLASMIC.
  FT TRANSMEM
  FT TOPO_DOM CYTOPLASMIC.
  //
  </pre><img src="../tmp/aaac.png" alt="Posterior label probability plot"></img><p><font size="-1">The probability data used in the plot is found <a href="../tmp/aaac.plp">here</a>, and the gnuplot script is <a href="../tmp/aaac.gnuplot">here</a>.</font>
  <hr><script src=" type="text/javascript"></script><script type="text/javascript">_uacct = "UA ";urchinTracker();</script>
  </body>
  </html>

  Process finished with exit code 0

Internet Phobius – get the plot and save
  import requests

  phobius = 'http://phobius.sbc.su.se/'   # base URL, from the Phobius slide title
  submit = '/cgi-bin/predict.pl'
  bri1 = '''>AEE Leucine-rich receptor-like protein kinase family protein [Arabidopsis thaliana]
  MKTFSSFFLSVTTLFFFSFFSLSFQASPSQSLYREIHQLISFKDVLPDKNLLPDWSSNKNPCTFDGVTCR
  DDKVTSIDLSSKPLNVGFSAVSSSLLSLTGLESLFLSNSHINGSVSGFKCSASLTSLDLSRNSLSGPVTT
  LTSLGSCSGLKFLNVSSNTLDFPGKVSGGLKLNSLEVLDLSANSISGANVVGWVLSDGCGELKHLAISGN
  KISGDVDVSRCVNLEFLDVSSNNFSTGIPFLGDCSALQHLDISGNKLSGDFSRAISTCTELKLLNISSNQ
  FVGPIPPLPLKSLQYLSLAENKFTGEIPDFLSGACDTLTGLDLSGNHFYGAVPPFFGSCSLLESLALSSN
  NFSGELPMDTLLKMRGLKVLDLSFNEFSGELPESLTNLSASLLTLDLSSNNFSGPILPNLCQNPKNTLQE
  LYLQNNGFTGKIPPTLSNCSELVSLHLSFNYLSGTIPSSLGSLSKLRDLKLWLNMLEGEIPQELMYVKTL
  ETLILDFNDLTGEIPSGLSNCTNLNWISLSNNRLTGEIPKWIGRLENLAILKLSNNSFSGNIPAELGDCR
  SLIWLDLNTNLFNGTIPAAMFKQSGKIAANFIAGKRYVYIKNDGMKKECHGAGNLLEFQGIRSEQLNRLS
  TRNPCNITSRVYGGHTSPTFDNNGSMMFLDMSYNMLSGYIPKEIGSMPYLFILNLGHNDISGSIPDEVGD
  LRGLNILDLSSNKLDGRIPQAMSALTMLTEIDLSNNNLSGPIPEMGQFETFPPAKFLNNPGLCGYPLPRC
  DPSNADGYAHHQRSHGRRPASLAGSVAMGLLFSFVCIFGLILVGREMRKRRRKKEAELEMYAEGHGNSGD
  RTANNTNWKLTGVKEALSINLAAFEKPLRKLTFADLLQATNGFHNDSLIGSGGFGDVYKAILKDGSAVAI
  KKLIHVSGQGDREFMAEMETIGKIKHRNLVPLLGYCKVGDERLLVYEFMKYGSLEDVLHDPKKAGVKLNW
  STRRKIAIGSARGLAFLHHNCSPHIIHRDMKSSNVLLDENLEARVSDFGMARLMSAMDTHLSVSTLAGTP
  GYVPPEYYQSFRCSTKGDVYSYGVVLLELLTGKRPTDSPDFGDNNLVGWVKQHAKLRISDVFDPELMKED
  PALEIELLQHLKVAVACLDDRAWRRPTMVQVMAMFKEIQAGSGIDSQSTIRSIEDGGFSTIEMVDMSIKE
  VPEGKL'''

  # the submission step is commented out here; this run only fetches a plot
  # whose address is already known from an earlier response
  # command = {'protseq':bri1, 'format':'plp'}
  # response = requests.post(phobius+submit, command)
  # print(response.text)

  plot = requests.get(phobius+'tmp/aaac.png', stream=True)
  print(plot)

  # the image is binary data, so write the raw bytes
  with open('../pb_work/examples/phobius.png', 'wb') as image:
      image.write(plot.content)
  exit(0)

Internet Phobius get the image tag with beautiful soup
attributes of a tag are accessed with dictionary-style indexing, e.g. img['src']
  import requests
  from bs4 import BeautifulSoup

  phobius = 'http://phobius.sbc.su.se/'   # base URL, from the Phobius slide title
  submit = '/cgi-bin/predict.pl'
  bri1 = '''>AEE Leucine-rich receptor-like protein kinase family protein [Arabidopsis thaliana]
  MKTFSSFFLSVTTLFFFSFFSLSFQASPSQSLYREIHQLISFKDVLPDKNLLPDWSSNKNPCTFDGVTCR
  ...
  PALEIELLQHLKVAVACLDDRAWRRPTMVQVMAMFKEIQAGSGIDSQSTIRSIEDGGFSTIEMVDMSIKE
  VPEGKL'''

  command = {'protseq':bri1, 'format':'plp'}
  response = requests.post(phobius+submit, command)

  # find the first <img> tag in the returned page
  soup = BeautifulSoup(response.content, 'html.parser')
  img = soup.find('img')
  # print(img['src'])

  # the plot address is in the image tag, beginning at character 3
  # (this skips the leading '../' of the relative path)
  plot = requests.get(phobius+img['src'][3:], stream=True)
  print(plot)
  with open('../pb_work/examples/phobius.png', 'wb') as image:
      image.write(plot.content)
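Slicing off the first three characters works only while the src attribute starts with '../'. A slightly more robust alternative, sketched here as a suggestion rather than the slide's method, is to resolve the relative address against the URL of the page that contained it using the standard-library urljoin. The page URL is assembled from the Phobius address and form action shown on the slides, and the src value is the one from the sample response.

```python
from urllib.parse import urljoin

# URL of the page that returned the HTML (site address + form action)
page_url = 'http://phobius.sbc.su.se/cgi-bin/predict.pl'

# value of the src attribute in the returned <img> tag
src = '../tmp/aaac.png'

# '..' is resolved relative to the /cgi-bin/ directory of the page
plot_url = urljoin(page_url, src)
print(plot_url)   # http://phobius.sbc.su.se/tmp/aaac.png
```

The result can be passed to requests.get() in place of phobius+img['src'][3:].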
