What is sequencing? Video: https://www.youtube.com/watch?v=womKfikWlxM (Illumina video)


1 What is sequencing? Video: https://www.youtube.com/watch?v=womKfikWlxM (Illumina video)

2 Sequences: what's out there? NCBI: total nucleotide repository, "reference" sequences (more later), tools, features, annotations, etc. https://www.youtube.com/watch?v=Phxkg5H5Q6E

3 Sequences: what's out there? EBI: EU counterpart, functionally equivalent (tends to have a bit less data, a bit better tools)

4 Sequences: what's out there? DNA:
- Complete genome sequences.
- Draft genome sequences: lower accuracy, partially assembled; useful, but annotation often improves dramatically.
- Miscellany: GenBank accepts any identified sequence.
Figure: Eric Altermann et al. PNAS 2005;102:3906-3912

5 Sequences: what's out there? Raw short reads: European Nucleotide Archive (ENA), Sequence Read Archive (SRA)

7 Sequences: what's out there? There are recent plans to join SRA and ENA under the INSDC, standardizing deposition protocols, storage formats, access patterns, etc.

8 Sequences: what's out there? RNA: cDNA/ESTs
- Not used so much anymore: single-pass, high quality sequences from reverse-transcribed mRNAs.
- Can be used to catalog portions of genomes that are actively transcribed.
- Great for organisms without high quality sequenced genomes or annotations.
- ESTs are often 300-800 bp.
- Early efforts resulted in the identification of many hundreds of genes that were novel at the time.
- dbEST is a division of GenBank.

9 Sequences: what's out there? RNA: RNA-seq (US and EU repositories). Underutilized.

10 Sequences: what's out there? Amino acids: won't discuss today, but AA seqs. are typically handled very differently and in different databases. Features: annotations, from location to function. Loci are referred to as "features", which can be anything: genes, introns/exons, polymorphisms, regulatory elements, conserved regions, islands, etc.

11 Sequences: what's out there? Alignments: pairwise alignment is the process of lining up two sequences to achieve maximal levels of identity. Figure: Fig 3.5, Pevsner. Pairwise alignment of human beta globin (query) and myoglobin (subject).
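To make "lining up two sequences" concrete, here is a minimal sketch of a global pairwise alignment (Needleman-Wunsch) in Python. It is not what BLAST does internally (BLAST is a local, heuristic search), and the match, mismatch and gap values are illustrative assumptions rather than anything taken from the slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score). The scoring values are
    illustrative defaults, not parameters from any real tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


def identity(aligned_a, aligned_b):
    """Fraction of alignment columns where both sequences carry the same residue."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return same / len(aligned_a)


top, bottom, s = global_align("ACGT", "AGT")
print(top)
print(bottom)
print(s, identity(top, bottom))
```

For proteins, the fixed match/mismatch values would normally be replaced by a lookup in a scoring matrix such as BLOSUM62.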

12 Basic Local Alignment Search Tool (BLAST) An algorithm that lets the user select one sequence (the query), perform pairwise sequence alignments between the query and every sequence in a database, and identify the database sequences that resemble the query.

13 Basic Local Alignment Search Tool (BLAST) We can assess the relatedness of any two proteins by performing a pairwise alignment using the NCBI pairwise BLAST tool. Perform the following steps:
1. Choose the protein BLAST program and select “BLAST 2 sequences” for our comparison of two proteins. An alternative is to select blastn (for “BLAST nucleotides”) for DNA–DNA comparison.
2. Enter the sequences or their accession numbers. Here we use the sequence of human beta globin in the fasta format, and for myoglobin we use the accession number (Fig. 3.4).
3. Select any optional parameters: scoring matrices (BLOSUM#, PAM#), gap extension penalty, reward and penalty values.

15 Basic Local Alignment Search Tool (BLAST) We can assess the relatedness of any two proteins by performing a pairwise alignment using the NCBI pairwise BLAST tool. Perform the following steps:
1. Choose the protein BLAST program and select “BLAST 2 sequences” for our comparison of two proteins. An alternative is to select blastn (for “BLAST nucleotides”) for DNA–DNA comparison.
2. Enter the sequences or their accession numbers. Here we use the sequence of human beta globin in the fasta format, and for myoglobin we use the accession number (Fig. 3.4).
3. Select any optional parameters: scoring matrices (BLOSUM#, PAM#), gap extension penalty, reward and penalty values.
4. Click BLAST. Output includes a pairwise alignment using the single letter amino acid code.
NOTE: Similar pairs of residues are structurally or functionally related. The residues differ, but they share similar biochemical properties.

16 BLAST algorithm
1) Make a k-letter word list of the query sequence. For example, k=3.
2) List the possible matching words. BLAST cares only about the high-scoring words. We use a scoring matrix to compare each word in the list from 1) with all the 3-letter words. For example, if we have PQG, different scores are obtained when it is compared to PEG and to PQA. Only keep words that surpass a threshold T.
3) Organize the remaining high-scoring words into an efficient search tree.
4) Repeat steps 2) and 3) for each k-letter word in the query sequence.

18 BLAST algorithm
1) Make a k-letter word list of the query sequence. For example, k=3.
2) List the possible matching words. BLAST cares only about the high-scoring words. We use a scoring matrix to compare each word in the list from 1) with all the 3-letter words. For example, if we have PQG, different scores are obtained when it is compared to PEG and to PQA. Only keep words that surpass a threshold T.
3) Organize the remaining high-scoring words into an efficient search tree.
4) Repeat steps 2) and 3) for each k-letter word in the query sequence.
5) Scan the database sequences for exact matches with the remaining high-scoring words. These matches are the seeds that BLAST extends into high-scoring segment pairs (HSPs).
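Steps 1) to 5) can be sketched in a few lines of Python. This is a toy illustration under several assumptions, not the real BLAST implementation: only the handful of BLOSUM62 entries needed for the PQG example are included, the candidate words are passed in by hand rather than enumerated over the whole alphabet, and a plain Python set stands in for the search tree. All function names are hypothetical.

```python
# A few BLOSUM62 entries, just enough for the PQG / PEG / PQA example.
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(x, y):
    """Look up a residue pair in either order (the matrix is symmetric).

    Raises KeyError for pairs that are not in the toy subset.
    """
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def query_words(seq, k=3):
    """Step 1: every overlapping k-letter word of the query."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def word_score(w1, w2):
    """Score two equal-length words position by position."""
    return sum(pair_score(x, y) for x, y in zip(w1, w2))


def high_scoring_words(query_word, candidates, threshold):
    """Step 2: keep candidates scoring at least `threshold` against the query word.

    Assumption: 'at least T' is used here; the slide says 'surpass'.
    """
    return [c for c in candidates if word_score(query_word, c) >= threshold]


def find_seeds(db_seq, words, k=3):
    """Step 5: exact matches of the kept words in a database sequence.

    A set lookup replaces the search tree of step 3.
    """
    index = set(words)
    return [(i, db_seq[i:i + k])
            for i in range(len(db_seq) - k + 1)
            if db_seq[i:i + k] in index]


print(query_words("PQGEFG"))
print(word_score("PQG", "PEG"), word_score("PQG", "PQA"))
kept = high_scoring_words("PQG", ["PQG", "PEG", "PQA"], threshold=13)
print(kept)
print(find_seeds("MKPEGLLPQGA", kept))
```

With a threshold of 13, PEG is kept as a neighbor of PQG while PQA is dropped, so a database sequence containing PEG still produces a seed even though it never contains the query word itself.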

19 BLAST algorithm
1) Make a k-letter word list of the query sequence. For example, k=3.
2) List the possible matching words. BLAST cares only about the high-scoring words. We use a scoring matrix to compare each word in the list from 1) with all the 3-letter words. For example, if we have PQG, different scores are obtained when it is compared to PEG and to PQA. Only keep words that surpass a threshold T.
3) Organize the remaining high-scoring words into an efficient search tree.
4) Repeat steps 2) and 3) for each k-letter word in the query sequence.
5) Scan the database sequences for exact matches with the remaining high-scoring words.
6) List all the HSPs in the database whose score is high enough to be considered.
7) Evaluate the significance of the HSP score (E-value).
8) Report every match whose expect score is lower than a threshold parameter E.
E-value: the expected number of times that an unrelated database sequence would obtain a score S higher than x by chance.
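Steps 7) and 8) can be illustrated with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The slides do not state this formula, so treat the sketch below as an assumption about how the E-value is computed; the lambda and K values used in the example are placeholders, not parameters of any real scoring system, and the function names are hypothetical.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, K):
    """Karlin-Altschul expected number of chance hits with score >= raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def report_hits(hsps, query_len, db_len, lam, K, e_threshold):
    """Keep HSPs whose E-value is below the threshold, most significant first.

    `hsps` is a list of (name, raw_score) pairs.
    """
    reported = []
    for name, raw_score in hsps:
        e = expect_value(raw_score, query_len, db_len, lam, K)
        if e < e_threshold:
            reported.append((name, raw_score, e))
    return sorted(reported, key=lambda hit: hit[2])


# Placeholder statistics, chosen only to show the behaviour.
hits = report_hits([("weak", 20), ("strong", 80)],
                   query_len=100, db_len=1_000_000,
                   lam=0.3, K=0.1, e_threshold=0.01)
for name, score, e in hits:
    print(name, score, f"{e:.2e}")
```

The E-value shrinks exponentially as the score grows and grows linearly with the size of the search space, which is why the same alignment looks less significant against a larger database.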

20 BLAST scoring matrices: PAM#, BLOSUM#. Figure: BLOSUM62 scoring matrix of Henikoff and Henikoff (1992).

21 Sequences: what's out there? A note on gene IDs. How do you put a standard identifier on _anything_ in genomics? Partially overlapping ID systems: GenBank, RefSeq, EMBL-Bank, EMBL, UniGene, UniRef, HomoloGene, KO, every array platform, Entrez, HGNC, KEGG, UCSC, every model organism DB... And these just cover genes! There are different competing systems for proteins, functions, diseases, physiology, you name it. A large part of the reason we learn Python is to automate things like gene ID conversion.
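As a minimal sketch of what automated gene ID conversion can look like, the Python below reads a tab-separated mapping table and translates IDs from one system to another. The table contents, column names and identifiers are made up for illustration; in practice the table would be a mapping file downloaded from one of the databases above.

```python
import csv
import io

# Made-up mapping table; every ID here is a placeholder, not a real identifier.
MAPPING_TSV = (
    "symbol\tentrez_id\tensembl_id\n"
    "GENE_A\t1001\tENSG_FAKE_0001\n"
    "GENE_B\t1002\tENSG_FAKE_0002\n"
    "GENE_C\t1003\t\n"  # no known ID in the third system
)


def load_mapping(handle, source, target):
    """Build a {source_id: target_id} dict, skipping rows with a missing ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[source]: row[target]
            for row in reader
            if row[source] and row[target]}


def convert(ids, mapping):
    """Translate IDs; return the converted ones and the ones with no mapping."""
    converted, missing = {}, []
    for gene_id in ids:
        if gene_id in mapping:
            converted[gene_id] = mapping[gene_id]
        else:
            missing.append(gene_id)
    return converted, missing


mapping = load_mapping(io.StringIO(MAPPING_TSV), "symbol", "ensembl_id")
converted, missing = convert(["GENE_A", "GENE_C", "GENE_X"], mapping)
print(converted)
print(missing)
```

Reporting the IDs that failed to convert is deliberate: because the ID systems only partially overlap, silent drops are a common source of errors downstream.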

22 What can you get where? http://www.ncbi.nlm.nih.gov/genbank/ http://www.ncbi.nlm.nih.gov/refseq/ Tutorial: https://www.youtube.com/watch?v=g5a__okj5Zs

23 What can you get where? http://www.ensembl.org/index.html Tutorial: http://www.ensembl.org/Multi/Help/Movie?db=core;id=188

24 What can you get where? UCSC Genome Browser https://genome.ucsc.edu http://jbrowse.org

