Alignment Sequence, Structure, Network


1 Alignment Sequence, Structure, Network
Jong Bhak

2 Alignment is the key in bioinformatics
Alignment is the best method for comparing things in the whole universe.
The universe is a gigantic sequence.

3 Amino Acids Representation
Ala   alanine                   Met   methionine
Asp   aspartate                 Phe   phenylalanine
Arg   arginine                  Pro   proline
Asn   asparagine                Ser   serine
Cys   cysteine                  Thr   threonine
Glu   glutamate                 Trp   tryptophan
Gln   glutamine                 Tyr   tyrosine
Gly   glycine                   Val   valine
Glx   glutamate or glutamine    ***   any
His   histidine                 ---   gap of indeterminate length
Ile   isoleucine                TGA   translation stop
Lys   lysine                    TAG   translation stop
Leu   leucine                   TAA   translation stop

4 Single Sequence representations
There are several commonly used pure sequence representation formats in “flat files”:
  FASTA (most commonly used for raw sequence data)
  PIR
Representations in databases (such as MySQL):
  As columns and rows
Representations in programs or objects:
  @codons = $myCodonTable->revtranslate('A');
Flat file FASTA format:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
CCTCTCGGAGCTGGAAATGCAGCTATTGAGATCTTCGAATGCTGC
AGCTGGAGGCGGAGGCAGCTGGGGAGGTCCGAGCGATGTGACC
GGCCGCCATCGCTCGTCTCTTCCTCTCTCCTGCCGCCTCCTGTGT
CGAAAATAACTTTTTTAGTCTAAAGAAAGAAAG
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTLLL
SYSENRTAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXX
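
The FASTA layout above (a ">" header line followed by one or more wrapped sequence lines) can be read with a few lines of code. The following is a minimal sketch in Python, not a reference parser; the function name `parse_fasta` and the example records are hypothetical and chosen only for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    Assumption: every record starts with a '>' header line, and all lines up
    to the next header belong to that record's sequence.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # wrapped sequence line
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example (not taken from a real database entry).
example = """>seq1 hypothetical record
ACGT
ACGT
>seq2
MKV
"""
records = parse_fasta(example)
```

Joining the wrapped lines means downstream code sees one string per record, regardless of how the file was line-wrapped.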

5 Accessing Bioperl CodonTable (from object oriented module)
use Bio::Tools::CodonTable;
# defaults to ID 1 "Standard"
$myCodonTable = Bio::Tools::CodonTable->new();
$myCodonTable2 = Bio::Tools::CodonTable->new( -id => 3 );
# change codon table
$myCodonTable->id(5);
# examine codon table
print join (' ', "The name of the codon table no.", $myCodonTable->id(4),
            "is:", $myCodonTable->name(), "\n");
# translate a codon
$aa = $myCodonTable->translate('ACU');
$aa = $myCodonTable->translate('act');
$aa = $myCodonTable->translate('ytr');
# reverse translate an amino acid
@codons = $myCodonTable->revtranslate('A');
@codons = $myCodonTable->revtranslate('Ser');
@codons = $myCodonTable->revtranslate('Glx');
@codons = $myCodonTable->revtranslate('cYS', 'rna');
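
To show what a codon table lookup does underneath, here is a small standalone sketch in Python. It is not BioPerl's implementation: it covers only the standard genetic code (table 1), and unlike the calls above it does not handle ambiguity codes such as 'ytr' or three-letter names such as 'Ser'. The names `CODON_TABLE`, `translate`, and `revtranslate` are defined here for illustration only.

```python
from itertools import product

# Standard genetic code (table 1), listed in T, C, A, G order for each codon
# position, with the third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(codon):
    """Translate one unambiguous codon (DNA or RNA, any case) to one letter."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def revtranslate(amino_acid):
    """Return all DNA codons that encode a one-letter amino acid code."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid.upper())


threonine = translate("ACU")      # RNA input is mapped to DNA before lookup
alanine_codons = revtranslate("A")
```

Storing the table as a 64-character string keeps it compact; a different genetic code would only need a different `AMINO_ACIDS` string.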

6 FASTA (flat file)
Sequences are expected to be represented in the standard IUB/IUPAC* amino acid and nucleic acid codes.
  * International Union of Pure and Applied Chemistry
Lower-case letters are accepted.
A single hyphen or dash can be used to represent a gap of indeterminate length.
In amino acid sequences, U and * are acceptable letters.
Numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).

7 Nucleic Acids’ FASTA
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
X --> unknown             - --> gap of indeterminate length
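
The ambiguity codes in the table above map each letter to a set of possible bases. As a hedged illustration, the Python sketch below enumerates every concrete sequence that an ambiguous one stands for. The dictionary name `IUPAC` and the function `expand` are hypothetical, and treating U as T is an assumption made here to keep the output in a single DNA alphabet.

```python
from itertools import product

# Ambiguity codes from the table above; U is folded into T (an assumption).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(seq):
    """List every unambiguous DNA sequence matched by an IUPAC-coded one."""
    pools = [IUPAC[ch] for ch in seq.upper()]
    return ["".join(bases) for bases in product(*pools)]


# "AYR" means A, then C or T, then A or G: four concrete sequences.
variants = expand("AYR")
```

The number of expansions grows multiplicatively with each ambiguous position, so this is only practical for short sequences such as primers or single codons.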

8 Protein sequences in FASTA
A   alanine                   P   proline
B   aspartate or asparagine   Q   glutamine
C   cysteine                  R   arginine
D   aspartate                 S   serine
E   glutamate                 T   threonine
F   phenylalanine             U   selenocysteine
G   glycine                   V   valine
H   histidine                 W   tryptophan
I   isoleucine                Y   tyrosine
K   lysine                    Z   glutamate or glutamine
L   leucine                   X   any
M   methionine                *   translation stop
N   asparagine                -   gap of indeterminate length

9 PIR (NBRF) sequence format
>P1;CRAB_ANAPL
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASP LSPFLMRSPIFRMPSWL
ETGLSEMRLE KDKFSVNLDV KHFSPEELKVKVLGDMVEIHGKHEERQDEHGFIAREFNR
KYRIPADVDPL TITSSLSLDG VLTVSAPRKQ SDVPERSIP TREEKPAIAG AQRK*

10 PIR format
A sequence in PIR format consists of:
1. One line starting with
   a. a ">" (greater-than) sign, followed by
   b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
   c. a semicolon, followed by
   d. the sequence identification code (the database ID-code).
2. One line containing a textual description of the sequence.
3. One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.
A file in PIR format may comprise more than one sequence.
The PIR format is also often referred to as the NBRF format.
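
The three-part layout described above translates fairly directly into code. The following Python sketch assumes well-formed input (a ">" line, then a description line, then sequence lines ending in "*"); the function name `parse_pir` and the example record are hypothetical.

```python
def parse_pir(text):
    """Parse PIR/NBRF text into a list of dicts, one per sequence.

    Assumption: each record follows the layout described above exactly.
    """
    lines = [ln.rstrip() for ln in text.splitlines()]
    records = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.startswith(">"):
            i += 1
            continue
        # Line 1: '>' + two-letter type code + ';' + identification code.
        seq_type, _, seq_id = line[1:].partition(";")
        # Line 2: free-text description.
        description = lines[i + 1].strip() if i + 1 < len(lines) else ""
        i += 2
        # Line 3 onward: sequence, terminated by '*'.
        chunks = []
        while i < len(lines):
            chunk = "".join(lines[i].split())  # drop spaces between blocks
            i += 1
            if chunk.endswith("*"):
                chunks.append(chunk[:-1])
                break
            chunks.append(chunk)
        records.append({
            "type": seq_type,
            "id": seq_id.strip(),
            "description": description,
            "sequence": "".join(chunks),
        })
    return records


# Hypothetical record, used only to exercise the parser.
example = """>P1;EXAMPLE_ID
Hypothetical example protein.
MDITIHNPLI RRPLFSWLAP
SRIFDQ*
"""
pir_records = parse_pir(example)
```

Because the "*" terminator ends each record explicitly, several sequences can follow one another in the same file without blank lines between them.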

11 GenBank style (flat file)
LOCUS       ABCAARAA_1
DEFINITION  A.aceti acetic acid resistance protein (aarA) gene, complete cds; acetic acid resistance protein (aarA).
DATE        SEP-1990
ACCESSION   M34830
ORGANISM    Acetobacter aceti
            Eubacteria; Proteobacteria; alpha subdivision; Acetobacteraceae; Acetobacter.
COMMENT     CDS /db_xref="PID:g141730"
WEIGHT
LENGTH
ORIGIN      Translated using phase 1
        1 MSASQKEGKL STATISVDGK SAEMPVLSGT LGPDVIDIRK LPAQLGVFTF DPGYGETAAC
       61 NSKITFIDGD KGVLLHRGYP IAQLDENASY EEVIYLLLNG ELPNKVQYDT FTNTLTNHTL
      121 LHEQIRNFFN GFRRDAHPMA ILCGTVGALS AFYPDANDIA IPANRDLAAM RLIAKIPTIA
      181 AWAYKYTQGE AFIYPRNDLN YAENFLSMMF ARMSEPYKVN PVLARAMNRI LILHADHEQN
      241 ASTSTVRLAG STGANPFACI AAGIAALWGP AHGGANEAVL KMLARIGKKE NIPAFIAQVK
      301 DKNSGVKLMG FGHRVYKNFD PRAKIMQQTC HEVLTELGIK DDPLLDLAVE LEKIALSDDY
      361 FVQRKLYPNV DFYSGIILKA MGIPTSMFTV LFAVARTTGW VSQWKEMIEE PGQRISRPRQ
      421 LYIGAPQRDY VPLAKR
//

12 EMBL style
ID   CM23SRIBR converted; DNA; UNC; 805 BP.
XX
AC   X80636;
DT   22-MAR-1995
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
OS   Campylobacter mucosalis
CC   SEQIO retrieval from EMBL-format entry Feb-1996
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc
     cgagtaaacg gccgccgtaa ctata
//

13 Swissprot style
ID   104K_THEPA CONVERTED; PRT; 924 AA.
AC   P15711;
DT   01-AUG-1992
DE   KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC
CC   SEQIO retrieval from Swiss-Prot database entry Feb-1996
SQ   SEQUENCE AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//
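
Both the EMBL and Swissprot examples use two-letter line codes (ID, AC, DE, SQ and so on) and end with "//". The Python sketch below reads that shared layout under simplifying assumptions: one entry per input, "XX" lines treated as spacers, and everything after the SQ line treated as sequence. The function name `parse_line_coded_entry` and the example entry are hypothetical.

```python
def parse_line_coded_entry(text):
    """Split one EMBL/Swiss-Prot style entry into (fields, sequence).

    fields maps each two-letter line code to a list of its line contents.
    """
    fields = {}
    seq_chunks = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if in_sequence:
            # Keep letters only: drops spacing and any position counters.
            seq_chunks.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, _, value = line.partition(" ")
        if not code or code == "XX":
            continue  # blank line or spacer line
        fields.setdefault(code, []).append(value.strip())
        if code == "SQ":
            in_sequence = True
    return fields, "".join(seq_chunks)


# Hypothetical entry, shaped like the examples above but not a real record.
example = """ID   EXAMPLE1 converted; DNA; UNC; 20 BP.
XX
AC   X00000;
DE   hypothetical example entry
SQ   Sequence 20 BP;
     gattctgcgc ggaaaatata
//
"""
fields, sequence = parse_line_coded_entry(example)
```

Keeping each line code as a list preserves repeated lines such as the multiple CC comment lines in the Swissprot entry.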

14 Sequence profile/model representations
Models: Hidden Markov Models
Profiles: A propensity mapping of multiple sequences.
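
One simple reading of "a propensity mapping of multiple sequences" is a table of per-column residue frequencies taken from an alignment. The Python sketch below builds such a table under that assumption. Practical profiles usually add refinements such as pseudocounts or log-odds scores, which are omitted here. The function name `build_profile` and the input rows are hypothetical.

```python
from collections import Counter


def build_profile(aligned):
    """Return one {residue: frequency} dict per alignment column.

    Assumption: all rows are already aligned and have the same length.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):  # iterate column by column
        counts = Counter(column)
        profile.append({res: n / len(aligned) for res, n in counts.items()})
    return profile


# Four hypothetical aligned rows; only the middle column varies.
profile = build_profile(["AAT", "AGT", "AAT", "ACT"])
```

A column where one residue has frequency 1.0 is fully conserved, while a column with several residues at lower frequencies marks a variable position.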

15 Alignment
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAA

16 AAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAA
AAAAAAAGGGGGGGAAAAAAAAAAAAAAAAAAAAA AAAA

17 Gapped alignment
AAAAAA_____ATAAAAAAAAAAAAAAAAAAAAAA AAAA
AAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAA
AAAAAAATAAAAAAAAAAA___AAAAAAAAAAAAA GAAA
AAAA__AGGGGG____AAAAAAAAAAA_____AAA AAAA

18 Sequence identity?
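
One common way to put a number on identity is to count matching columns between two aligned rows. The Python sketch below uses "_" as the gap character, as in the gapped alignment slide, and excludes any column containing a gap from the denominator. That denominator choice is an assumption, since other conventions divide by the full alignment length or by the shorter sequence. The function name `percent_identity` is hypothetical.

```python
def percent_identity(row_a, row_b, gap="_"):
    """Percent of gap-free aligned columns in which both rows agree."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip columns where either row has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned pair: 8 gap-free columns, 5 of them identical.
identity = percent_identity("AAAA__AGGG", "AAAAAAATAA")
```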

19 Sequence Homology?

20 Genetic Distance?

21 Distance matrix
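
A distance matrix holds one distance for every pair of sequences. As a sketch only, the Python code below fills it with the p-distance (the fraction of differing gap-free columns), which is one of the simplest choices and not necessarily the measure intended on this slide. The function names `p_distance` and `distance_matrix` and the input rows are hypothetical.

```python
def p_distance(row_a, row_b, gap="_"):
    """Fraction of gap-free aligned columns in which the two rows differ."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(rows):
    """Symmetric matrix of pairwise p-distances between aligned rows."""
    return [[p_distance(a, b) for b in rows] for a in rows]


# Three hypothetical aligned rows of equal length.
matrix = distance_matrix(["AAAA", "AATA", "GGTA"])
```

The diagonal is always zero and the matrix is symmetric, so only the upper triangle carries independent information.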

22 Exchange matrix A->G ? A->T ? K->M ?

23 HMM

