Lecture 1 BNFO 601 Usman Roshan. Course overview Perl programming language (and some Unix basics) –Unix basics –Intro Perl exercises –Sequence alignment.


1 Lecture 1 BNFO 601 Usman Roshan

2 Course overview
Perl programming language (and some Unix basics)
–Unix basics
–Intro Perl exercises
–Sequence alignment in Perl
Sequence analysis
–Algorithms for exact and heuristic pairwise alignment
–Substitution matrices and gap penalty training
–Heuristics for pairwise alignment and BLAST, FASTA database search
–Profile alignment and multiple sequence alignment
–Computational complexity

3 Overview (contd)
Tuesdays: meet in GITC 2305
Grade: 50% mid-term and 50% final exam
Exams will cover Perl and bioinformatics algorithms
Recommended texts:
–Introduction to Bioinformatics by Arthur Lesk
–Beginning Perl for Bioinformatics by James Tisdall
–Introduction to Bioinformatics Algorithms by Pavel Pevzner

4 Nothing in biology makes sense, except in the light of evolution
[Figure: an evolutionary tree with a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). The ancestral sequence is AAGACTT; intermediate sequences are T_GACTT and AAGGCTT, then _GGGCTT, TAGACCTT, and A_CACTT. Today's sequences are ACCTT (Cat), ACACTTC (Lion), TAGCCCTTA (Monkey), TAGGCCTT (Human), and GGCTT (Mouse). Later frames show the gapped forms A_C_CTT (Cat), A_CACTTC (Lion), and _G_GCTT (Mouse).]

5 Representing DNA in a format that computers can manipulate
DNA is a double-helix molecule made up of four nucleotides, identified by their bases:
–Adenine (A)
–Cytosine (C)
–Thymine (T)
–Guanine (G)
Since A (adenine) always pairs with T (thymine) and C (cytosine) always pairs with G (guanine), knowing only one side of the ladder is enough.
We represent DNA as a sequence of letters where each letter is A, C, G, or T. For example, we would represent the helix shown here as CAGT.
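The pairing rule above means the second strand can always be recomputed from the first. A minimal Python sketch follows; the function names are hypothetical, and the input is assumed to contain only the uppercase letters A, C, G, and T.

```python
# Map each base to its pairing partner (A-T, C-G), as stated on the slide.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base-by-base complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand):
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]

print(complement("CAGT"))          # GTCA
print(reverse_complement("CAGT"))  # ACTG
```

The reversed form is included because the two strands of the helix run in opposite directions, so the partner strand is conventionally written reversed.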

6 Transcription and translation

7 Amino acids Proteins are chains of amino acids. There are twenty different amino acids that chain in different ways to form different proteins. For example, FLLVALCCRFGH (this is how we could store it in a file). This sequence of amino acids folds to form a 3-D structure.

8 Protein folding

9 The protein folding problem is to determine the 3-D protein structure from the sequence. Experimental techniques are very expensive. Computational techniques are cheap, but the problem is difficult to solve. By comparing sequences we can deduce the evolutionarily conserved portions, which are also functional (most of the time).

10 Protein structure Primary structure: sequence of amino acids. Secondary structure: parts of the chain organize themselves into alpha helices, beta sheets, and coils. Helices and sheets are usually evolutionarily conserved and can aid sequence alignment. Tertiary structure: 3-D structure of the entire chain. Quaternary structure: complex of several chains.

11 Key points DNA can be represented as strings over four letters: A, C, G, and T. These strings can be very long, e.g. thousands or even millions of letters. Proteins are also represented as strings, over an alphabet of 20 letters (each letter is an amino acid). Their 3-D structure determines their function to a large extent.
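Since both kinds of sequence are plain strings over a fixed alphabet, checking an input is a simple set comparison. The sketch below is illustrative only. The helper name is hypothetical, and the 20-letter alphabet is assumed to be the standard one-letter amino acid codes.

```python
DNA_ALPHABET = set("ACGT")
# One-letter codes of the 20 standard amino acids (assumed alphabet).
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid(sequence, alphabet):
    """True if the sequence is non-empty and uses only letters from alphabet."""
    return len(sequence) > 0 and set(sequence) <= alphabet

print(is_valid("CAGT", DNA_ALPHABET))              # True
print(is_valid("FLLVALCCRFGH", PROTEIN_ALPHABET))  # True
print(is_valid("FLLVALCCRFGH", DNA_ALPHABET))      # False
```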

12 Pairwise sequence alignment How to align two sequences?

13 Pairwise alignment

14

15 Dynamic programming Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)

16 Dynamic programming Time and space complexity is O(mn). Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)
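The recurrence itself did not survive in this transcript, so the following is only a sketch under stated assumptions. It uses the standard global alignment recurrence with a linear gap penalty, and the scores (match +1, mismatch -1, gap -1) are placeholder values, not the lecture's. It fills the (m+1) x (n+1) table V, which is where the O(mn) time and space come from.

```python
def alignment_score_table(s, t, match=1, mismatch=-1, gap=-1):
    """Fill V, where V[i][j] is the best score for aligning s[:i] with t[:j].

    Scoring values are illustrative assumptions, not taken from the lecture.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix against the empty string costs one gap per letter.
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                V[i - 1][j] + gap,      # s[i-1] against a gap
                V[i][j - 1] + gap,      # t[j-1] against a gap
            )
    return V

V = alignment_score_table("ACGT", "AGT")
print(V[4][3])  # 2: three matches and one gap
```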

17 Tabular computation of scores

18 Traceback to get alignment
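The slide's figure is not available here, so the sketch below shows one common way to do the traceback. It assumes the same standard recurrence and placeholder scores (match +1, mismatch -1, gap -1), with "_" as the gap character. Starting at V[m][n], it repeatedly steps to a neighbouring cell that could have produced the current score.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment.

    Scoring values and the "_" gap symbol are illustrative assumptions.
    """
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the top-left cell.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sub:
                a.append(s[i - 1])
                b.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1])   # letter of s against a gap
            b.append("_")
            i -= 1
        else:
            a.append("_")        # letter of t against a gap
            b.append(t[j - 1])
            j -= 1
    return V[m][n], "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = global_align("ACGT", "AGT")
print(score)   # 2
print(top)     # ACGT
print(bottom)  # A_GT
```

When several cells tie, this version prefers the diagonal move, so it reports only one of the possibly many optimal alignments.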

19 How do we understand this dynamic programming algorithm? Let’s first look at some example alignments. Let’s look at gaps: how do we know where to insert gaps? Let’s look at the structure of an optimal alignment of two sequences x and y, and how it relates to optimal alignments of subsequences of x and y.

20 Dynamic programming Animation slides by Elizabeth Thomas in Cold Spring Harbor Labs (CSHL) http://meetings.cshl.org/tgac/tgac/flash/DynamicProgramming.swf

21 How do we pick gap parameters?

22 Structural alignments Recall that proteins have 3-D structure.

23 Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

24 Structural alignment - example 2 [Figure panels: unaligned proteins; computer-generated aligned proteins.] 2bbm and 1top are proteins from fly and chicken respectively. Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html

25 Structural alignments We can produce high-quality alignments by hand if the structure is available. These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

26 Benchmark alignments
Protein alignment benchmarks
–BAliBASE, SABMARK, PREFAB, and HOMSTRAD are frequently used in protein alignment studies.
–Protein benchmarks are generally large and have been in use in the research community for some time now.
–BAliBASE 3.0

27 Biologically realistic scoring matrices PAM and BLOSUM are the most popular. PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins. BLOSUM is more recent and is computed from blocks of sequences with sufficient similarity.

28 PAM We need to compute the probability transition matrix M, which defines the probability of amino acid i converting to j. Examine a set of closely related sequences which are easy to align; for PAM, this was 1572 mutations between 71 families. Compute probabilities of change and background probabilities by simple counting.

29 PAM In this model the unit of evolution is the amount of evolution that will change 1 in 100 amino acids on average. The scoring matrix S_ab is based on the ratio of M_ab to p_b; in practice the logarithm of this ratio is used, so that scores add along an alignment.
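To make the ratio concrete, here is a small sketch that turns a transition matrix M and background probabilities p into scores. Everything in it is an assumption for illustration: the two-letter alphabet X/Y and its probabilities are made up, and the scaling (10 times log base 10, rounded to an integer) follows the convention commonly used for published PAM matrices rather than anything stated on the slide.

```python
import math

def log_odds_matrix(M, p, scale=10):
    """Return S where S[a][b] = round(scale * log10(M[a][b] / p[b])).

    M[a][b] is the probability of a changing to b; p[b] is the background
    probability of b. The scale factor is an assumed convention.
    """
    return {
        a: {b: round(scale * math.log10(M[a][b] / p[b])) for b in M[a]}
        for a in M
    }

# Toy two-letter alphabet with made-up probabilities (not real PAM data).
M = {"X": {"X": 0.99, "Y": 0.01},
     "Y": {"X": 0.02, "Y": 0.98}}
p = {"X": 0.5, "Y": 0.5}

S = log_odds_matrix(M, p)
print(S["X"]["X"])  # 3   (staying X is more likely than chance)
print(S["X"]["Y"])  # -17 (changing to Y is much less likely than chance)
```

A positive score means the pair occurs more often than the background predicts, and a negative score means less often.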

30 PAM M ij matrix (x10000)

31 Next week
Basics of Unix
Perl programming
–Basics
–Exercises
–Dynamic programming alignment solution in Perl

