Sequence Formats Suchat Udomsopagit.


1 Sequence Formats Suchat Udomsopagit

2 Sequences
DNA and protein sequences can be read and written in a variety of formats.
Sequence formats are ASCII text.
Each format is a required arrangement of characters, symbols and keywords that specify things such as the sequence, ID name and comments.
A program relies on this arrangement to find each item in a sequence entry.

3 Sequences
No standard sequence format contains hidden, unprintable 'control' characters.
All standard sequence formats can be printed out or viewed simply by displaying the file.

4 MS word
Microsoft WORD format is not a sequence format.
If using a word processor to type a sequence, save the sequence to a file as ASCII text. Try selecting: File, Save As, Save as type: Text.
Preferably, do not use word processors to write sequences. Simple text editors should be used instead:
Notepad
Wordpad
For UNIX: Pico, nedit

5 Some common formats
Single sequence per file: gcg, staden
Multiple sequences per file: Multiple sequence format (msf), clustal, phylip
Either single or multiple sequences per file: fasta, embl
Plus some others, e.g. MacVector, GeneWorks, DNA Strider etc.

6 Each Sequence analysis program has its own format for storing sequence data!!!

7 FASTA format
The first line is a single-line description with a greater-than (">") symbol in the first column.
It is followed by lines of sequence data.
It is recommended that all lines of text be shorter than 80 characters in length.

8 FASTA format
Description line:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
Sequence:
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK

9 Standard IUB/IUPAC amino acid and nucleic acid codes
Lower-case letters are accepted and are mapped into upper-case.
No numerical digits are allowed in the sequence.
The nucleic acid codes supported are:
A  adenosine            M  A C (amino)
C  cytidine             S  G C (strong)
G  guanine              W  A T (weak)
T  thymidine            B  G T C
U  uridine              D  G A T
R  G A (purine)         H  A C T
Y  T C (pyrimidine)     V  G C A
K  G T (keto)           N  A G C T (any)
-  gap of indeterminate length
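The rules on this slide (map lower case to upper case, reject digits, accept only the listed codes) can be sketched in a few lines of Python. This is an illustration only: the names `IUPAC_NUCLEOTIDES`, `normalize_nucleotides` and `expand_code` are invented for the example, and dropping whitespace before checking is an assumption.

```python
# Ambiguity table copied from the nucleic acid code list above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT", "-": "-",
}


def normalize_nucleotides(seq):
    """Upper-case a sequence and reject digits or unknown symbols."""
    # Assumption of this sketch: whitespace is layout, so it is dropped.
    cleaned = "".join(seq.split()).upper()
    bad = sorted(set(cleaned) - set(IUPAC_NUCLEOTIDES))
    if bad:
        raise ValueError("unsupported characters: " + "".join(bad))
    return cleaned


def expand_code(code):
    """Return the bases that a single (possibly ambiguous) code stands for."""
    return IUPAC_NUCLEOTIDES[code.upper()]
```

For example, `normalize_nucleotides("acgt nry-")` gives the cleaned upper-case string, while a sequence containing a digit raises `ValueError`.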

10 Standard IUB/IUPAC amino acid and nucleic acid codes
The accepted amino acid codes are:
A  alanine                    P  proline
B  aspartate or asparagine    Q  glutamine
C  cysteine                   R  arginine
D  aspartate                  S  serine
E  glutamate                  T  threonine
F  phenylalanine              U  selenocysteine
G  glycine                    V  valine
H  histidine                  W  tryptophan
I  isoleucine                 Y  tyrosine
K  lysine                     Z  glutamate or glutamine
L  leucine                    X  any
M  methionine                 *  translation stop
N  asparagine                 -  gap of indeterminate length

11 FASTA format
Multiple sequences
Blank lines inserted

> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT

> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA

> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
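A multi-sequence FASTA file like the one above can be read with a short loop. The following is a minimal sketch, not a full parser: `parse_fasta` is a name made up for this example, and it assumes that blank lines only separate records and that sequence letters should be upper-cased.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """> mysequence
ACGTCGATCG
ATCGATGCAT

> mysequence2
ACCGTACGAT
"""
records = parse_fasta(example)
```

Each description line starts a new record, and all following lines up to the next ">" are joined into one sequence string.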

12 Genbank File Format File Header
The first line in the file must have "GENETIC SEQUENCE DATA BANK" in spaces 20 through 46. The next 8 lines may contain arbitrary text. They are ignored but are required to maintain the GenBank format.

13 Genbank File Format
Sequence Data Entries
First line: begins with 'LOCUS' in the first 5 spaces, followed by the genetic locus name or identifier, the length of the sequence and the type of sequence.
Second line: DEFINITION in the first 10 spaces, followed by free-form text to identify the sequence.
Third line: ACCESSION in the first 9 spaces, followed by the primary accession number.
Fourth line: ORIGIN in the first 6 spaces. Nothing else is required on this line; it indicates that the nucleic acid sequence begins on the next line.

14 Genbank File Format
Fifth line: begins the nucleotide sequence. The first 9 spaces of each sequence line may either be blank or may contain the position in the sequence of the first nucleotide on the line. The next 66 spaces hold the nucleotide sequence in six blocks of ten nucleotides. Each of the six blocks begins with a blank space followed by ten nucleotides. Thus the first nucleotide is in space 11 of the line while the last is in space 75.
Last line: must have // in the first 2 spaces to indicate termination of the sequence.
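The line-by-line layout described above (LOCUS, DEFINITION, ACCESSION, ORIGIN, sequence lines, //) is enough to drive a small reader. The sketch below is illustrative only: `parse_genbank_entries` is a hypothetical helper, it ignores every other GenBank keyword, and it treats any letter on a sequence line as part of the sequence.

```python
def parse_genbank_entries(text):
    """Parse the minimal LOCUS/DEFINITION/ACCESSION/ORIGIN/'//' layout."""
    entries, entry, in_sequence = [], None, False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            entry = {"locus": fields[1] if len(fields) > 1 else "",
                     "definition": "", "accession": "", "sequence": ""}
            in_sequence = False
        elif entry is None:
            continue  # file header text before the first LOCUS line
        elif line.startswith("//"):
            entries.append(entry)
            entry, in_sequence = None, False
        elif in_sequence:
            # Drop the optional position number and the block separators.
            entry["sequence"] += "".join(c for c in line if c.isalpha())
        elif line.startswith("DEFINITION"):
            entry["definition"] = line[10:].strip()
        elif line.startswith("ACCESSION"):
            entry["accession"] = line[9:].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True
    return entries


sample = """LOCUS       NM_079846    1190 bp    mRNA
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac
//
"""
entries = parse_genbank_entries(sample)
```

Because the position number and the blanks between blocks are simply filtered out, the reader does not depend on the exact column positions.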

15 Genbank File Format
LOCUS name size bp type date
name: Genbank locus name
size: total base count
type: DNA, RNA, PROTEIN, MASK, or TEXT
date: dd-MON-yyyy

16 Genbank Example
LOCUS       NM_079846    1190 bp    mRNA    linear   INV 15-DEC-2001
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
VERSION     NM_ GI:
KEYWORDS    .
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 1190)
  AUTHORS   Shaw-Lee,R.L., Lissemore,J.L. and Sullivan,D.T.
  TITLE     Structure and expression of the triose phosphate isomerase (Tpi)
            gene of Drosophila melanogaster
  JOURNAL   Mol. Gen. Genet. 230 (1-2), (1991)
  MEDLINE
   PUBMED
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AE
FEATURES             Location/Qualifiers
     source
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="99E1-99E2"
     gene
                     /gene="Tpi"
                     /note="TPI; TPIS; CG2171; CT6334"
                     /db_xref="FLYBASE:FBgn "
                     /db_xref="LocusID:43582"

17 Genbank Example
     CDS             181..924
                     /gene="Tpi"
                     /EC_number="5.3.1.1"
                     /note="Nucleotide sequence of the Celera sequence differs
                     from the published sequence for this transcript."
                     /codon_start=1
                     /db_xref="FLYBASE:FBgn "
                     /db_xref="LocusID:43582"
                     /product="Triose phosphate isomerase"
                     /protein_id="NP_ "
                     /db_xref="GI: "
                     /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPA
                     IYLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFG
                     ESDALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVV
                     VAYEPVWAIGTGQTATPDQAQEVHAFLRQWLSDNISKEVSASLRIQYGGSVTAANAKE
                     LAKKPDIDGFLVGGASLKPEFVDIINARQ"
     misc_feature
                     /note="TIM; Region: Triosephosphate isomerase"
BASE COUNT      279 a    368 c    323 g    220 t
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac ttgaaattat cagttccaaa cactctaata gcagtcccct tgttttgtcc
      121 cccgatccgc agttctacgc caatttcagc accgattgca ccgacagcaa cagcaacaac
      181 atgagccgaa agttctgcgt gggaggcaac tggaagatga acggcgacca gaagtccatc
      241 gccgagatcg ccaagaccct gagctcggcc gccctcgacc ccaacacgga ggtggtcatc
      301 ggctgcccgg ccatctacct gatgtacgcc cgcaacctgc tgccctgcga gctgggtctg
      361 gccggccaga atgcctacaa ggtggccaag ggcgcattca ccggcgagat ctcccctgcg
      421 atgctgaagg
//

18 EMBL File Format
European Molecular Biology Laboratory
First line: begins with the two letters ID, followed by the EMBL identifier.
Second line: AC, followed by the accession number.
Third line: DE, followed by a free-form text definition.
Fourth line: SQ, followed by the length of the sequence. After the sequence length there is a blank space and the two letters BP.

19 EMBL File Format
Fifth line: the nucleotide sequence begins. Each line of sequence begins with four blank spaces. The next 66 spaces hold the nucleotide sequence in 6 blocks of 10 nucleotides. Each of the six blocks begins with a blank space followed by ten nucleotides. Thus the first nucleotide is in space 6 of the line while the last is in space 70.
The last line (terminator line): two characters // in the first two spaces.
Multiple sequences may appear in each file.

20 EMBL Example (1)
LINE 1 :ID ID_name
LINE 2 :AC Accession number
LINE 3 :DE Describe the sequence any way you want
LINE 4 :SQ Length BP
LINE 5 : ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTA...
LINE 6 : ACGT...
LINE 7 ://
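The two-letter line codes make the minimal EMBL layout easy to read programmatically. The sketch below handles only ID, AC, DE, SQ and the // terminator; `parse_embl_entries` is a name invented for this example, and all other line codes (XX, FT, and so on) are silently skipped.

```python
def parse_embl_entries(text):
    """Parse the minimal ID/AC/DE/SQ/'//' layout of EMBL entries."""
    entries, entry, in_sequence = [], None, False
    for line in text.splitlines():
        code = line[:2]
        if code == "ID" and not in_sequence:
            name = (line[2:].split() or [""])[0]
            entry = {"id": name, "accession": "",
                     "description": "", "sequence": ""}
        elif entry is None:
            continue  # nothing to fill in before the first ID line
        elif code == "//":
            entries.append(entry)
            entry, in_sequence = None, False
        elif in_sequence:
            # Keep letters only: drops blanks and any position numbers.
            entry["sequence"] += "".join(c for c in line if c.isalpha())
        elif code == "AC":
            first = (line[2:].split() or [""])[0]
            entry["accession"] = first.rstrip(";")
        elif code == "DE":
            entry["description"] = (entry["description"] + " "
                                    + line[2:].strip()).strip()
        elif code == "SQ":
            in_sequence = True
    return entries


sample = """ID   DMTPIG
AC   X57576;
DE   D.melanogaster Tpi gene
SQ   Sequence 26 BP;
     gatctcgagc gagaaatgtg gaacat        26
//
"""
entries = parse_embl_entries(sample)
```

Since several entries may appear in one file, the function returns a list and starts a fresh record at every ID line.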

21 EMBL Example (2)

22 EMBL Example (3)
ID   DMTPIG standard; DNA; INV; 3419 BP.
XX
AC   X57576; S70377;
SV   X
DT   20-JAN-1992 (Rel. 30, Created)
DT   19-AUG-1996 (Rel. 49, Last updated, Version 10)
DE   D.melanogaster Tpi gene for Triosephosphate isomerase
KW   glycolytic enzyme; tpi gene; triosephosphate isomerase.
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila.
RN   [1]
RP
RA   Sullivan D.T.;
RT   ;
RL   Submitted (07-FEB-1991) to the EMBL/GenBank/DDBJ databases.
RL   D.T. Sullivan, Biological Research Laboratories, 130 College Pl, Syracuse
RL   University, Syracuse, NY 13244, USA

23
RN   [3]
RX   MEDLINE;
RA   Shaw-Lee R.L., Lissemore J.L., Sullivan D.T.;
RT   "Structure and expression of the triose phosphate isomerase (Tpi) gene of
RT   Drosophila melanogaster.";
RL   Mol. Gen. Genet. 230: (1991).
XX
DR   FLYBASE; FBgn ; Tpi.
DR   SWISS-PROT; P29613; TPIS_DROME.
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /db_xref="taxon:7227"
FT                   /germline
FT                   /organism="Drosophila melanogaster"
FT                   /strain="Oregon-R"
FT                   /clone_lib="EMBL-4"
FT   CDS             join( , )
FT                   /db_xref="FLYBASE:FBgn "
FT                   /db_xref="SWISS-PROT:P29613"
FT                   /gene="Tpi"
FT                   /EC_number=" "
FT                   /product="triosephosphate isomerase"
FT                   /protein_id="CAA "
FT                   /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPAI
FT                   YLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFGES
FT                   DALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVVVAY
FT                   EPVWAIGTGKTATPDQAQEVHASLRQWLSDNISKEVSASLRIQYGGSVTAANAKELAKK
FT                   PDIDGFLVGGASLKPEFLDIINARQ"
FT   mRNA            join( , , )
FT   prim_transcript

24
FT   exon
FT                   /number=1
FT   exon
FT                   /number=2
FT   exon
FT                   /number=3
FT   intron
FT   intron
FT   misc_feature
FT                   /note="intron 1 lariat sequence"
FT   misc_feature
FT                   /note="intron 2 lariat sequence"
FT   polyA_signal
XX
SQ   Sequence 3419 BP; 855 A; 933 C; 849 G; 778 T; 4 other;
     gatctcgagc gagaaatgtg gaacatagtg gaggcctcca gtggcgccga gctgggtgaa
     accagctacg agttcccttc ccccgctccg gttcccagcg cagcagtgaa cgaaatagca
     gttccacagt cccaccagct cctcctgctc ctgcgaagcc ctcagttccg tccgcctcct
     atgacaacca caactacagt ttcagccagg atgaggacga agatgatgat gatctggagt
     ttgaggacgt attcgtgccg gccagctctg ttccaaatcc cgttcagcct ggcatagatc
     ccgtggaact gcgtcgctcc ctggctttgg tcatgaggga gaaattgcga tcggatgaca
     cggactccag gccaatgggc aacaatcagg atcttcccat agatgaacag tccagggaga
     gaccgctctc cactcaaaca tctcccacaa atggcccact tccggctctt ctgagggcca
     aactgcttgc tgggcaactc nnnncaatag cgctcactgc ctgccaggat ccacggcgag
     tcctgctccc caggagcaat ccggtatctt tgtgatcgat agtgaggcga gtcccggctc
     aaatgggcac aagcctaagt atcgaaaggg cacggcattc actcggagtt cgctgaagaa
     gagccgatcc tgcaactgta gctccatcgc taagggacga ggggtccacg acgagcccag
     cagtaatctc tgcagggatc aggagtcctc tgtacttcca cagcatccgc agccagccaa
     ccatcccaca gagaactttt
//

25 National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format
First line
Begins with a greater-than symbol (>)
Immediately followed by a 2-character sequence type specifier:
Specifier   Sequence type
P1          protein, complete
F1          protein, fragment
DL          DNA, linear
DC          DNA, circular
RL          RNA, linear
RC          RNA, circular
N1          functional RNA, other than tRNA
N3          tRNA
Then a semicolon (;)
Followed by the sequence name or identification code for the NBRF database (four to six letters and numbers)
>P1;CBRT

26 National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format
Second line contains two kinds of information
First: the sequence name
Followed by 3 characters: blank space, hyphen, blank space (" - ")
Second: the organism or organelle name
>P1;CBRT
Cytochrome b - Rat mitochondrion (SGC1)

27 National Biomedical Research Foundation (NBRF) format
Protein Information Resource (PIR) format
Amino acid or nucleic acid sequence begins on line three
Free format
May be interrupted by blanks for ease of reading
Protein sequences may contain special punctuation to indicate various indeterminacies in the sequence
The last character in the sequence must be an asterisk (*).

28 NBRF/PIR Example
LINE 1 :>P1;CBRT
LINE 2 :Cytochrome b - Rat mitochondrion (SGC1)
LINE 3 :M T N I R K S H P L F K I I N H S F I D L P A P S
LINE 4 : VTHICRDVN Y GWL IRY
LINE 5 :TWIGGQPVEHPFIIIGQLASISYFSIILILMPISGIVEDKMLKWN*
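The three-part NBRF/PIR layout (type and name line, title line, free-format sequence ending in an asterisk) can be read roughly as follows. This is a sketch under simplifying assumptions: `parse_pir` is a hypothetical helper, blank lines are ignored, and the special punctuation for indeterminacies is kept in the sequence unchanged.

```python
def parse_pir(text):
    """Return a list of records from NBRF/PIR formatted text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    records, i = [], 0
    while i < len(lines):
        first = lines[i].strip()
        if not first.startswith(">") or ";" not in first:
            i += 1
            continue
        seq_type, name = first[1:].split(";", 1)
        title = lines[i + 1].strip() if i + 1 < len(lines) else ""
        i += 2
        chunks = []
        while i < len(lines):
            chunk = "".join(lines[i].split())  # blanks are only for reading
            i += 1
            if chunk.endswith("*"):
                chunks.append(chunk[:-1])  # the asterisk ends the sequence
                break
            chunks.append(chunk)
        records.append({"type": seq_type, "name": name.strip(),
                        "title": title, "sequence": "".join(chunks)})
    return records


sample = """>P1;CBRT
Cytochrome b - Rat mitochondrion (SGC1)
M T N I R K S
 VTHICRDVN Y GWL IRY
TWIGG*
"""
records = parse_pir(sample)
```

The title line could be split further on " - " to separate the sequence name from the organism; that step is left out here to keep the sketch short.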

29 MolGen/Stanford File Format
Molecular Genetics Group at Stanford U.
First line (comment line): begins with a semicolon (;), followed by descriptive text. There may be as many comment lines as desired, and they need not be present.
Second line: must be present; contains an identifier or name for the sequence.

30 MolGen/Stanford File Format
Third line: the sequence begins. Each line occupies up to 80 spaces, and spaces may be included in the sequence for ease of reading.
The sequence is terminated with 1 or 2: 1 indicates a linear sequence, 2 marks a circular sequence.

31 MolGen/Stanford Example
LINE 1 :; Describe the sequence any way you want
LINE 2 :ECTRNAGLY2
LINE 3 :ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT
LINE 4 : GCTTA GG G C T A1
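The MolGen/Stanford rules (optional comment lines, a name line, then sequence lines ending in 1 or 2) can be sketched as a small state machine. `parse_stanford` is a name invented for this example, and it assumes the terminating digit is the last non-blank character of a sequence line.

```python
def parse_stanford(text):
    """Return records from MolGen/Stanford (IG) formatted text."""
    records, comments, name, chunks = [], [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue
        if name is None:
            if line.startswith(";"):
                comments.append(line[1:].strip())
            else:
                name = line.strip()  # first non-comment line is the name
            continue
        chunk = "".join(line.split())
        if chunk[-1] in "12":
            chunks.append(chunk[:-1])
            records.append({
                "name": name,
                "comments": comments,
                "sequence": "".join(chunks),
                "topology": "linear" if chunk[-1] == "1" else "circular",
            })
            comments, name, chunks = [], None, []
        else:
            chunks.append(chunk)
    return records


sample = """; Describe the sequence any way you want
ECTRNAGLY2
ACGCACGTAC ACGTACGTAC A C G T C C G T ACG TAC GTA CGT
 GCTTA GG G C T A1
"""
records = parse_stanford(sample)
```

Note that the name in the example ends in "2" but is not mistaken for a terminator, because the name line is consumed before any sequence line is examined.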

32 PHYLIP File Format Interleaved and Sequential formats
Created and used by several phylogenetics programs

33 PHYLIP File Format
Interleaved format
Similar to the output of alignment programs.
The first part of the file contains the first part of each of the sequences.
Only the first parts of the sequences should be preceded by names.
Then comes the second part of each sequence, and so on.

a         MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
a         MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1kauf    MNTTDCFIAL VQAIREIKAL FLSRTTG-KM ELTLYNGEKK TFYSRPNNHD
ken       MNTTDCFIAL LRAFREIKTL FLSRVRG-KM EFTLYNGEKK TFYSRPNNHD
ken       MNTTDCFIAL VRAIREFKIL FSLRPLARKM EFTLYNGIKK TFYSRPNKHD
ken       MNTTDCFIAL VQAIREIKLL FKG--IR-KM KLTLYNGEKK TFYSRPNSHD
uga       MNTTDCFIAL VQAIREIKSL FRS--SR-KM EFTLYNGEKK TFYSRPNNHD
bec       MKTTDCFNVL FEIFHRFGQT FKA--DR-KM EFTLYNGEKK TFYSRPNTHG
zim       MKTTDCFDVL LEIFHRFRQT FKT--DR-KM EFTLYNGEKK TFYSRPNTHG
knp       MKTTDCFNVL LETFHRFRNV FKT--DR-KM EFTLYNGDKK TFYSRPNTHG
zim       MKTTGCFDVL IEIAHRLRQL NKT--DR-KM EFTLYNGEKK TFYSRPNTHG
zim       MKTTDCFNVL LEIIYRFRHT FKT--DR-KM EFTLYNGEKK TFYSRPNKHG
knp       MKTTDCFSVL FEIFHRLRHT LKT--ER-KM EFTLYNGERK TFYSRPNKHG
zam       MKTTDCFDAL LEAFHRLRQT FKT--DR-KM EFTLYNGEKK TFYSRPNRHG

          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSSPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFE WVYDSPENLT VEAIRQLEEL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFD WVYESPENLT IQAIGQLEEL TGLDLREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LRAIEQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LQAIEQLEEL TGLELHEGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKRLSDY TKLDLSDGGP

          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
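Reassembling an interleaved file means appending the k-th line of every block to the k-th sequence. The sketch below assumes two things the slide text does not show: a first line holding the number of sequences and the alignment length, and names occupying the first 10 characters of each line in the first block, which is the usual PHYLIP convention. `parse_phylip_interleaved` is a hypothetical name.

```python
def parse_phylip_interleaved(text):
    """Return a list of (name, sequence) pairs from interleaved PHYLIP text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Assumed header line: number of sequences, then alignment length.
    ntax, nchar = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    for i, line in enumerate(lines[1:]):
        if i < ntax:
            # First block: 10-character name field, then sequence.
            names.append(line[:10].strip())
            seqs.append("".join(line[10:].split()))
        else:
            # Later blocks: sequence only, same order as the first block.
            seqs[i % ntax] += "".join(line.split())
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError("%s: expected %d characters, found %d"
                             % (name, nchar, len(seq)))
    return list(zip(names, seqs))


sample = (
    "2 20\n"
    "o1brazl   MNTTDCFIAL\n"
    "o1campos  MNTTDCFIAL\n"
    "\n"
    "          VQAIREIKAL\n"
    "          VQAIREIKAL\n"
)
alignment = parse_phylip_interleaved(sample)
```

Pairs are returned as a list rather than a dictionary because names are not guaranteed to be unique once truncated to 10 characters.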

34 PHYLIP File Format
Sequential format
All of one sequence is given, possibly on multiple lines, before the next starts.

YF a      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GGWKANVQRK
          LK----
a         MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDDDFYP WTPDPSDVLV FVPYDQEPLN GEWKTKVQQK
c-s8c     MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKASVQRK
          LKGAGQ
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKANVQRK
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GEWKAKVQRK
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          VFAC…

35 Protein Data Bank (PDB) File Format
Each line is 80 columns wide and is terminated by an end-of-line indicator. The first 6 columns of every line contain a "record name". The list of ATOM records in each polymer chain must be terminated by a TER record. ATOM records for polymer atoms must include non-blank chain ID fields. To use the automatic validation check, the coordinate file must include a complete CRYST1 record defining the unit cell and space group information. Each file should terminate with a line containing only the word END.

36 Protein Data Bank (PDB) File Format
COLUMNS   DATA TYPE      FIELD        DEFINITION
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
73 - 76   LString(4)     segID        Segment identifier, left-justified.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.
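Because ATOM records are fixed-width, they are read by slicing columns rather than by splitting on whitespace. The sketch below uses the standard 1-based ATOM column ranges (serial 7-11, name 13-16, altLoc 17, resName 18-20, chainID 22, resSeq 23-26, iCode 27, x 31-38, y 39-46, z 47-54, occupancy 55-60, tempFactor 61-66, segID 73-76, element 77-78, charge 79-80). `parse_atom_record` is a name invented here, and the coordinates in the sample line are made-up values for illustration.

```python
def parse_atom_record(line):
    """Slice one fixed-width PDB ATOM line into named fields."""
    line = line.ljust(80)  # short lines are padded to the full 80 columns
    if line[0:6] != "ATOM  ":
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "altLoc": line[16].strip(),
        "resName": line[17:20].strip(),
        "chainID": line[21].strip(),
        "resSeq": int(line[22:26]),
        "iCode": line[26].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "tempFactor": float(line[60:66]),
        "segID": line[72:76].strip(),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }


# Build a sample line with a format string so every field lands in its column.
# The numeric values are illustrative only, not taken from a real entry.
sample = ("ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f"
          "          %2s%2s") % (1, " N", "", "ASP", "A", 1, "",
                                 4.060, 7.307, 5.186, 1.00, 51.58, "N", "")
atom = parse_atom_record(sample)
```

Slicing by column keeps the parser correct even when adjacent numeric fields touch, which would break a whitespace split.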

37 Protein Data Bank (PDB) File Format
Pattern: RTyp Num Atm Res Ch ResN X Y Z Occ Temp PDB Line
ATOM N ASP L FDL 93
ATOM CA ASP L FDL 94
Where:
RTyp: Record Type
Num: Serial number of the atom. Each atom has a unique serial number.
Atm: Atom name (IUPAC format).
Res: Residue name (IUPAC format).
Ch: Chain to which the atom belongs (in this case, L for light chain of an antibody).
ResN: Residue sequence number.
X, Y, Z: Cartesian coordinates specifying atomic position in space.
Occ: Occupancy factor
Temp: Temperature factor (atoms disordered in the crystal have high temperature factors).
PDB: The PDB data file unique identifier.
Line: Line (record) number in the data file.

38 PDB Example
HEADER LYASE JUL QU
TITLE CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE
TITLE 2 DECARBOXYLASE
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ORNITHINE DECARBOXYLASE;
COMPND 3 CHAIN: A, B, C, D;
COMPND 4 EC: ;
COMPND 5 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI;
SOURCE 3 EXPRESSION_SYSTEM: ESCHERICHIA COLI;
SOURCE 4 EXPRESSION_SYSTEM_COMMON: BACTERIA;
SOURCE 5 EXPRESSION_SYSTEM_STRAIN: B21/DG3;
SOURCE 6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID
KEYWDS POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA
KEYWDS 2 BARREL, LYASE
EXPDTA X-RAY DIFFRACTION
AUTHOR N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
AUTHOR 2 E.J.GOLDSMITH
REVDAT DEC-99 1QU JRNL COMPND REMARK
REVDAT NOV-99 1QU
JRNL AUTH N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS,
JRNL AUTH 2 E.J.GOLDSMITH
JRNL TITL X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM
JRNL TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE
JRNL TITL 3 STRUCTURE IN COMPLEX WITH
JRNL TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE
JRNL REF BIOCHEMISTRY V
JRNL REFN ASTM BICHAW US ISSN
REMARK
REMARK
REMARK 2 RESOLUTION ANGSTROMS.
REMARK …

39
DBREF 1QU4 A SWS P DCOR_TRYBB
DBREF 1QU4 B SWS P DCOR_TRYBB
DBREF 1QU4 C SWS P DCOR_TRYBB
DBREF 1QU4 D SWS P DCOR_TRYBB
SEQRES 1 A GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS
SEQRES 2 A ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS
SEQRES 3 A LYS LYS ILE SER MET ASN THR CYS ASP GLU GLY ASP PRO
SEQRES 4 A PHE PHE VAL ALA ASP LEU GLY ASP ILE VAL ARG LYS HIS
SEQRES 5 A GLU THR TRP LYS LYS CYS LEU PRO ARG VAL THR PRO PHE
SEQRES 6 A TYR ALA VAL LYS CYS ASN ASP ASP TRP ARG VAL LEU GLY
SEQRES 7 A THR LEU ALA ALA LEU GLY THR GLY PHE ASP CYS ALA SER
SEQRES 8 A ASN THR GLU ILE GLN ARG VAL ARG GLY ILE GLY VAL PRO
SEQRES 9 A PRO GLU LYS ILE ILE TYR ALA ASN PRO CYS LYS GLN ILE
SEQRES 10 A SER HIS ILE ARG TYR ALA ARG ASP SER GLY VAL ASP VAL
SEQRES 11 A MET THR PHE ASP CYS VAL ASP GLU LEU GLU LYS VAL ALA
SEQRES 12 A LYS THR HIS PRO LYS ALA LYS MET VAL LEU ARG ILE SER
SEQRES 13 A THR ASP ASP SER LEU ALA ARG CYS ARG LEU SER VAL LYS
SEQRES 14 A PHE GLY ALA LYS VAL GLU ASP CYS ARG PHE ILE LEU GLU
SEQRES 15 A GLN ALA LYS LYS LEU ASN ILE ASP VAL THR GLY VAL SER
SEQRES 16 A PHE HIS VAL GLY SER GLY SER THR ASP ALA SER THR PHE
SEQRES 17 A ALA GLN ALA ILE SER ASP SER ARG PHE VAL PHE ASP MET
SEQRES 18 A GLY THR GLU LEU GLY PHE ASN MET HIS ILE LEU ASP ILE
SEQRES 19 A GLY GLY GLY PHE PRO GLY THR ARG ASP ALA PRO LEU LYS
SEQRES 20 A PHE GLU GLU ILE ALA GLY VAL ILE ASN ASN ALA LEU GLU
SEQRES 21 A LYS HIS PHE PRO PRO ASP LEU LYS LEU THR ILE VAL ALA
SEQRES 22 A GLU PRO GLY ARG TYR TYR VAL ALA SER ALA PHE THR LEU
SEQRES 23 A ALA VAL ASN VAL ILE ALA LYS LYS VAL THR PRO GLY VAL
SEQRES 24 A GLN THR ASP VAL GLY ALA HIS ALA GLU SER ASN ALA GLN
SEQRES 25 A SER PHE MET TYR TYR VAL ASN ASP GLY VAL TYR GLY SER
SEQRES 26 A PHE ASN CYS ILE LEU TYR ASP HIS ALA VAL VAL ARG PRO
SEQRES 27 A LEU PRO GLN ARG GLU PRO ILE PRO ASN GLU LYS LEU TYR
SEQRES 28 A PRO SER SER VAL TRP GLY PRO THR CYS ASP GLY LEU ASP
SEQRES 29 A GLN ILE VAL GLU ARG TYR TYR LEU PRO GLU MET GLN VAL
SEQRES 30 A GLY GLU TRP LEU LEU PHE GLU ASP MET GLY ALA TYR THR
SEQRES 31 A VAL VAL GLY THR SER SER PHE ASN GLY PHE GLN SER PRO
SEQRES 32 A THR ILE TYR TYR VAL VAL SER GLY LEU PRO ASP HIS VAL
SEQRES 33 A VAL ARG GLU LEU LYS SER GLN LYS SER

40
HET PLP A
HET PLP B
HET PLP C
HET PLP D
HETNAM PLP PYRIDOXAL-5'-PHOSPHATE
HETSYN PLP VITAMIN B6 COMPLEX
FORMUL 5 PLP 4(C8 H10 N1 O6 P1)
HELIX LEU A LEU A
HELIX LYS A ASN A
HELIX ASP A GLY A
HELIX SER A ILE A
HELIX PRO A GLU A
HELIX GLN A SER A
HELIX CYS A HIS A
HELIX LYS A GLU A
HELIX ASP A LEU A
HELIX ALA A LEU A
HELIX LYS A PHE A
HELIX GLY A ALA A
HELIX PHE A HIS A
HELIX THR A THR A
HELIX SER A PHE A
SHEET A 6 GLN A 365 PRO A
SHEET A 6 LEU A 350 TRP A N TYR A O LEU A
SHEET A 6 SER A 313 VAL A O PHE A N SER A
SHEET A 6 PHE A 284 THR A N ILE A O TYR A
SHEET A 6 PHE A 40 ASP A O PHE A N ALA A
SHEET A 6 THR A 404 VAL A O THR A N PHE A
SHEET A1 6 GLN A 365 PRO A
SHEET A1 6 LEU A 350 TRP A N TYR A O LEU A
SHEET A1 6 SER A 313 VAL A O PHE A N SER A
SHEET A1 6 PHE A 284 THR A N ILE A O TYR A
SHEET A1 6 TRP A 380 PHE A N LEU A O VAL A
SHEET A1 6 PRO A 338 PRO A O LEU A N LEU A 382

41
CRYST P
ORIGX
ORIGX
ORIGX
SCALE
SCALE
SCALE
ATOM N ASP A N
ATOM CA ASP A C
ATOM C ASP A C
ATOM O ASP A O
ATOM CB ASP A C
ATOM CG ASP A C
ATOM OD1 ASP A O
ATOM OD2 ASP A O
ATOM N GLU A N
ATOM CA GLU A C
ATOM C GLU A C
ATOM O GLU A O
ATOM CB GLU A C
ATOM CG GLU A C
ATOM CD GLU A C
ATOM OE1 GLU A O
ATOM OE2 GLU A O
ATOM N GLY A N
ATOM CA GLY A C
ATOM C GLY A C
ATOM O GLY A O
ATOM N ASP A N
ATOM CA ASP A C
ATOM C ASP A C
ATOM O ASP A O
ATOM CB ASP A C
ATOM CG ASP A C
ATOM OD1 ASP A O
ATOM CA PHE A C
...
CONECT
CONECT
MASTER
END

42 Conversion of Sequence Formats
readseq (all flavors of UNIX)
1. IG/Stanford          10. Olsen (in-only)
2. GenBank/GB           11. Phylip3.2 (Sequential)
3. NBRF                 12. Phylip (Interleaved)
4. EMBL                 13. Plain/Raw
5. GCG                  14. PIR/CODATA
6. DNAStrider           15. MSF
7. Fitch                16. ASN.1
8. Pearson/Fasta        17. PAUP/NEXUS
9. Zuker (in-only)      18. Pretty (out-only)

43 Conversion of Sequence Formats
seqret (EMBOSS)
gcg: GCG 9.x and 10.x format
embl
swiss
fasta
genbank
nbrf, pir: NBRF (PIR)
codata: CODATA format.
strider: DNA strider format
clustal
phylip: PHYLIP non-interleaved multiple alignment format.
acedb: ACeDB format
msf: Wisconsin Package GCG's MSF multiple sequence format.
hennig86: Hennig86 format
jackknifer: Jackknifer format
jackknifernon: Jackknifernon format
nexus, paup: Nexus/PAUP format
treecon: Treecon format
mega: Mega format
ig: IntelliGenetics format.
staden
text

44 Conversion of Sequence Formats
Using Perl
Downloadable from Biomolecular Engineering Research Center (BMERC)
pdb-to-seq.pl: pdb to several standard formats
fa2tbl.pl: fasta to sequence table (.tbl) file
tbl2fa.pl: .tbl file to fasta
etc.
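The idea behind a fasta-to-table converter and its inverse can be shown in Python as well. This sketch is not a port of the BMERC Perl scripts: the function names are invented, and the table layout (one record per line, name and sequence separated by a tab) is an assumption made for illustration.

```python
def fasta_to_table(text):
    """Convert FASTA text to one 'name<TAB>sequence' line per record."""
    rows, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                rows.append(name + "\t" + "".join(chunks))
            # Keep only the first word of the description as the name.
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        rows.append(name + "\t" + "".join(chunks))
    return "".join(row + "\n" for row in rows)


def table_to_fasta(text, width=60):
    """Convert 'name<TAB>sequence' lines back to FASTA, wrapping lines."""
    out = []
    for row in text.splitlines():
        if not row.strip():
            continue
        name, seq = row.split("\t", 1)
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


fasta = ">seq1 demo\nACGT\nACGT\n>seq2\nTTTT\n"
table = fasta_to_table(fasta)
```

A one-record-per-line table is convenient for tools such as sort or grep, which is presumably why such an intermediate format is useful; the round trip loses only the original line wrapping and the rest of the description line.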

45 Conversion of Sequence Formats
Web-based
WWW READSEQ Sequence Conversion at NIH
WWW READSEQ at Human Genome Mapping Project (HGMP) Center

46

47 Conversion of Sequence Formats
Biological software at Institut Pasteur
READSEQ
EMBOSS Abiview: trace files (ABI) to fasta
EMBOSS:
cutseq: removes a specified section from a sequence.
pasteseq: inserts one sequence into another.
nthseq: writes one sequence from a multiple set of sequences.
extractseq: extracts regions from a sequence.

48 Conversion of Sequence Formats
Windows-based program: SeqVerter
Downloadable from GeneStudio, Inc. (free)
Read: ABI traces, Clustal, DCSE, DNASIS, DNAStar, DNAStrider (including binary), EMBL, FASTA, GDE, GenBank, IBI/Pustell, Macaw, MSF, Nexus/PAUP, PHYLIP Interleaved, PIR/NBRF, SCF 2.0 and SCF 3.0 traces, Swiss-Prot, and TreeCon.
Write: Clustal, DNASIS, DNAStar, FASTA and FASTA-SequIn, GenBank, IBI/Pustell, MSF, Nexus/PAUP, PHYLIP Interleaved, and TreeCon.
Tutorial also available


