Introduction to Bioinformatics Research Project The Feb 28 entry on the calendar (and the Research Project topic page) links to advice on how to choose a research group and how to begin a protein-centered research project, including how to find useful articles. Here I present an example of a DNA-centered research project, beginning mostly after the useful-article stage.

How to choose your research group Mueser TC et al (2010) Virol J 7:359 In this alternate universe, I’m in the DNA replication group. I’m particularly interested in the DNA sequences that determine the initiation of DNA replication. I’ve even read an article or two about them, discovering...

Origin of DNA replication Circular, dsDNA genome Origin...that DNA in prokaryotes and their phages is primarily circular. To replicate it, the circle has to be opened at some point. That point is called the origin of replication.

Origin of DNA replication Circular, dsDNA genome Origin Bidirectional initiation Opening the circle at the origin exposes two single-strands. Both are replicated, with the replication fork moving in both directions, away from the origin.

Origin of DNA replication Circular, dsDNA genome Bidirectional initiation Origin Elongation Separation Eventually, two separate daughter circles are formed... But enough chatting. The issue is: how is the starting point chosen?

Origin of DNA replication Origin Zooming in on the origin, we see the two intertwined strands at oriC (i.e., the Origin of the Chromosome)

Origin of DNA replication Origin + What makes the origin special is that it binds proteins essential for initiating replication. The picture shows green DnaA protein binding to the origin – also a protein called FIS (more on this in a moment).

Origin of DNA replication Origin + + DnaA proteins bind not only to DNA but also to each other. With the help of a second DNA-binding protein, IHF (keep waiting), the bound DnaA proteins form a blob that distorts the DNA. The two strands of DNA separate at a nearby AT-rich region (you may recall that AT-rich regions are less stable than GC-rich regions).

Origin of DNA replication Origin + + FIS Factor for Inversion Stimulation in Phage Mu That’s the general idea. For the rest of this project, I’m going to focus on DnaA, but before leaving the other protein behind... (I hate throwing around undefined acronyms...) FIS was first discovered as a protein important in gene regulation by a phage.

Origin of DNA replication Origin + + IHF Integration Host Factor for lysogeny of Phage Lambda Same with IHF. It was first found as a protein used by a phage to integrate its genome into the bacterial genome. It’s amazing how many things were first found in phages.

Origin of DNA replication Origin + + How to recognize origin of replication? But back to the main question at hand. I want to learn how to recognize origins of replication. If I build a tool that can find known bacterial origins, maybe I can use the tool to search for origins in bacteriophages. Do phages have the same sorts of origins? Don’t know.

Origin of DNA replication Origin + + How to recognize origin of replication? But how to tell? One thing that distinguishes origins is their ability to bind DnaA protein. If DnaA binds to a specific sequence, then origins must have multiple copies of it in close proximity. Does DnaA bind to a specific sequence?

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60:351-371. DnaA binding site Is DnaA binding to DNA specific? I found an article that says the answer is yes. The E. coli origin of replication, pictured above, has five specific binding sites for DnaA. I need to learn more about that sequence. Orange colored boxes are nice, but at this point, I need to get closer to the truth, closer to the sequence.

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. DnaA binding site Here’s the sequence of the E. coli origin region. R1-R4 represent the sequences protected by DnaA when it binds. Are they all the same sequence?

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. DnaA binding site For example R1 and R2... Are they the same sequence? Why are there two sets of nucleotides in each box?

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. DnaA binding site If you notice that both strands of the DNA are shown, then you can make more sense of the boxes.

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60:351-371. Fuller et al (1984) Cell 38:889-900. DnaA binding site Putting all the boxes together (choosing one of the two strands arbitrarily), I begin to see a pattern. Kaguni said there was also R5 (M). Where’s that?
R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA
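One way to make "I begin to see a pattern" concrete is to line up the four boxes and record, column by column, which nucleotides occur. The sketch below does that in Python (not BioBIKE); the function name `consensus_pattern` is made up for illustration. Note that because it includes R3, the pattern it produces is slightly broader than the TTAT[CA]CACA pattern used in the algorithm summary.

```python
# The four DnaA boxes as listed on the slide (one strand chosen arbitrarily).
boxes = {
    "R1": "TTATCCACA",
    "R2": "TTATACACA",
    "R3": "TTATCCAAA",
    "R4": "TTATCCACA",
}

def consensus_pattern(sequences):
    """Hypothetical helper: one letter where all sequences agree,
    a bracketed set of letters where they differ."""
    pattern = []
    for column in zip(*sequences):          # walk the aligned columns
        letters = sorted(set(column))
        if len(letters) == 1:
            pattern.append(letters[0])
        else:
            pattern.append("[" + "".join(letters) + "]")
    return "".join(pattern)

pattern = consensus_pattern(boxes.values())
print(pattern)
```

The boxes differ at only two of the nine positions, which is why a short bracketed pattern can stand in for all of them.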

Origin of DNA replication Fuller et al (1984) Cell 38:889-900. Enough orange boxes! Even enough paper sequences! If I’m going to make an origin-finding tool, I need to test it on a known case – Why not this case? Can I find the E. coli origin by DnaA-binding sequences?
R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA

My goal is to make a general origin-finding tool, using the E. coli origin as a test case. I therefore need to find the coordinates of the E. coli origin, so I can tell if my tool is working. Since I'm going to build the tool in BioBIKE, I need the coordinates known to BioBIKE. There's no point finding the origin in GenBank or anywhere else. PhAnToMe is where you’ll find E. coli and phage sequences.

How do I find the E. coli origin in E. coli? My general origin-finding tool will look for DnaA-binding sites. I think that will work to find the E. coli origin, but I don't know that it will. I need the coordinates of the E. coli origin so I can test my unproven tool with a known case. So, how can I find the E. coli origin with absolute certainty? What do I have in hand to enable me to find it?

What do I have in hand to enable me to find the origin? Of course I have the sequence. That's essentially foolproof, so long as I have the E. coli genome sequence available to search through. Looking for the sequence is much more certain than looking for DnaA boxes or some region annotated as “the origin”.

One strategy is to display the sequence of E. coli K12 (which is the standard laboratory strain).

Searching for some portion of the published origin sequence should get me to the right place in the genome. It doesn’t matter much which part of the origin I choose.

How could that be?!? I recheck the sequence... No problem.

When some strategy fails for no apparent reason and defies your best efforts to understand why, it is generally a good idea to try something completely different, even though the different strategy may not sound any more promising. It is the worm that wiggles that gets off the hook. So I try searching the E. coli genome for the same sequence, using a high threshold (expect value of 10, which would allow even rare random matches to sneak through).

That was informative! The first match goes from the beginning to end (Q-start=1, Q-end=30) of the 30-nucleotide sequence I gave it, but the match was only 96.67%. There must be a mismatch somewhere! The other matches are very partial with poor E-values. I’ll ignore them.
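The step from "96.67%" to "there must be a mismatch somewhere" is simple arithmetic, sketched here in Python. It assumes the percentage is computed over all 30 query positions and that a gap counts as a non-identical position.

```python
query_length = 30          # nucleotides in the query (Q-start=1, Q-end=30)
reported_identity = 96.67  # percent identity reported for the top match

# Percent identity if 0, 1, or 2 of the 30 positions fail to match.
for mismatches in range(3):
    identity = 100 * (query_length - mismatches) / query_length
    print(mismatches, round(identity, 2))

# Working backwards from the reported value to a number of positions.
mismatches_implied = round(query_length * (1 - reported_identity / 100))
print("positions that do not match:", mismatches_implied)
```

Only one discrepancy out of 30 gives 96.67%, so a single difference is enough to explain why an exact search came up empty.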

Where is the mismatch? The ALIGNMENT-OF function allows me to compare the 30-nucleotide query sequence with the actual sequence from E. coli. I used the coordinates provided by SEQUENCE-SIMILAR-TO to pick out the relevant portion of the genome.
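Outside BioBIKE, the same kind of comparison can be roughed out with Python's standard `difflib` module. This is only an illustration of the idea, not what ALIGNMENT-OF does internally, and the two sequences below are invented toy strings, not the real origin sequence.

```python
from difflib import SequenceMatcher

# Hypothetical toy sequences, NOT the real oriC region: the "published"
# copy carries one extra G relative to the "genome" copy.
published = "GGATCCTGGGTATTAAAA"
genome = "GGATCCTGGTATTAAAA"

matcher = SequenceMatcher(None, published, genome, autojunk=False)

# Keep only the stretches where the two sequences disagree.
differences = [op for op in matcher.get_opcodes() if op[0] != "equal"]

for tag, p_start, p_end, g_start, g_end in differences:
    print(tag, repr(published[p_start:p_end]), "->", repr(genome[g_start:g_end]))
```

With a run of identical letters, as here, any one of the Gs could be called the extra one; the comparison can report that there is one too many but not which copy is at fault.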

Ah! The original article from which I got the origin sequence had an error in it, an extra G! This is not so surprising. In 1984 (the year of the article), all sequencing was done by hand with little redundancy. In any event, I think I found the origin, around coordinate 3923300.

Note how I got to this region: Clearing the Search field, entering the coordinate in the Go To field, and clicking Go. Don’t be concerned about the blank lines on the top and the mayhem on the right. The E. coli genome happens to have lots of sequence features that people have annotated, and the Sequence Viewer doesn’t handle them very well.

First to confirm: Is this the right sequence? The first 30 nucleotides should match, of course (except for one). What about the rest? I’ll check the first 80... Check!

Does the region have the DnaA-binding motifs? I could search for each individual sequence, but it’s more efficient to search for the pattern that encompasses all of them... Why only two? What happened to the other two? (you might want to look several slides back at the sequence)
R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA

I can't depend on my own eyes. I need to automate the process. MATCHES-OF-PATTERN will search for the same DnaA-binding pattern but return all the results at once. There’s no preference which of the two strands a DnaA protein will bind to, so I specify BOTH-STRANDS.
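To show what searching both strands involves, here is a rough Python analogue. It is a sketch, not BioBIKE's MATCHES-OF-PATTERN: the function names are invented, the genome is a made-up toy string, and the pattern is the TTAT[CA]CACA pattern from the algorithm summary.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def matches_on_both_strands(pattern, genome):
    """Hypothetical helper: return (start, strand, matched_text) tuples,
    with 1-based start coordinates given on the forward strand."""
    regex = re.compile(pattern)
    hits = [(m.start() + 1, "+", m.group()) for m in regex.finditer(genome)]
    n = len(genome)
    for m in regex.finditer(reverse_complement(genome)):
        # Convert a position on the reverse strand back to forward coordinates.
        hits.append((n - m.end() + 1, "-", m.group()))
    return sorted(hits)

# Toy genome (invented): one site on the forward strand, and the reverse
# complement of TTATACACA (that is, TGTGTATAA) further along.
toy_genome = "GG" + "TTATCCACA" + "GGCC" + "TGTGTATAA" + "CC"

hits = matches_on_both_strands("TTAT[CA]CACA", toy_genome)
for hit in hits:
    print(hit)
```

Searching the forward strand alone would report only the first site, which is one way a site that is plainly in the figure can go missing from the results.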

Note that the results are shown formatted in a popup window for immediate gratification and also in the result pane for further use. There are a lot of sequences matching the pattern! How many? And how many would you expect by chance?

How many? That’s the easy one. I just counted the list (using * to indicate the previous result). How many would be expected by chance? Not much worse. You’ve done this sort of calculation many times in the past and will do so many times in the future. You should reach the conclusion that most of the matches are garbage.
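For reference, one version of the chance calculation is sketched below in Python. It rests on assumptions that are mine, not the slides': the pattern is TTAT[CA]CACA, all four nucleotides are equally frequent, both strands are searched, and the E. coli K12 genome is taken as roughly 4.6 million nucleotides.

```python
def match_probability(pattern_choices):
    """Probability that a random position matches, given the number of
    allowed letters at each pattern position (equal base frequencies assumed)."""
    p = 1.0
    for allowed in pattern_choices:
        p *= allowed / 4
    return p

# TTAT[CA]CACA: eight fixed positions and one position with two choices.
choices = [1, 1, 1, 1, 2, 1, 1, 1, 1]
genome_length = 4_600_000   # approximate E. coli K12 genome size (assumption)
strands = 2                 # both strands are searched

p = match_probability(choices)
expected = strands * genome_length * p
print(f"p = 1/{round(1 / p)}, expected matches = {expected:.0f}")
```

Under these assumptions, dozens of matches are expected from chance alone, so a single match says very little about where an origin is.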

If a mere match to a DnaA-binding sequence is not informative, then how can we recognize an origin? What’s distinctive about the origin is that it contains a cluster of DnaA-binding sites. Unfortunately, it is difficult to recognize clusters of sites because the sites’ coordinates are not sorted. That’s the next step. (And then to clean up the screen)

That’s much better! With the sorted list, I can see the cluster of four DnaA-binding sites at the known origin of E. coli (at coordinate ~3923000). Maybe there are other clusters? I’m not sure I’m up to peering through the entire list. However, I can see how I’d do it, examining each line with respect to its neighbors and keeping only those sites that are close to other sites. I need to automate this process to create the tool that can scan hundreds of genomes looking for origins of replication.
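The neighbor-by-neighbor inspection described here could be automated along the following lines. This Python sketch is only one possible approach, not the tool itself: the function name, the 500-nucleotide gap, the minimum of three sites, and the coordinates are all hypothetical values chosen for illustration.

```python
def find_clusters(coordinates, max_gap=500, min_sites=3):
    """Hypothetical helper: sort the sites, chain together neighbors that
    lie within max_gap of each other, and keep chains of min_sites or more."""
    clusters = []
    current = []
    for pos in sorted(coordinates):
        if current and pos - current[-1] > max_gap:
            # The chain is broken; keep it only if it is big enough.
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

# Invented, unsorted coordinates: four near 3923000 plus scattered singletons.
sites = [3923400, 120500, 3923150, 2750000, 3923290, 881200, 3923210]

clusters = find_clusters(sites)
print(clusters)
```

The sketch treats the genome as linear, so a cluster that straddles coordinate 1 of a circular genome would be split in two; a real tool would need to handle the wrap-around.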

Automation of this sort of thing will come later. Can't do everything at once. For now, I'll package the progress I've made to enable me to experiment easily. I'll take the steps I've developed and put them into a function.

My function consists of no more than what I did step by step. Now it has a name. Also, I generalized it to work with any genome, not just E. coli. Does it work?

Yes! Executing the function (now on my FUNCTION button) with E. coli as the argument gives exactly the same result as I got before. Will it work with other organisms?

Maybe! I tried it on Yersinia pestis (causative agent of the plague) and got a very provocative result. What are the odds that five DnaA sites would come up in the first 2000 nucleotides by chance? (do the calculation)
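One rough way to approach this calculation is a Poisson approximation, sketched in Python below. The assumptions are mine: the pattern is TTAT[CA]CACA, nucleotides are equally frequent and independent, both strands are searched, and match counts in a window follow a Poisson distribution.

```python
from math import exp, factorial

def poisson_tail(k, lam, terms=50):
    """Approximate P(X >= k) for X ~ Poisson(lam) by summing `terms` terms."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k, k + terms))

p_match = (1 / 4) ** 8 * (2 / 4)   # TTAT[CA]CACA at one position on one strand
window = 2000                       # nucleotides examined
strands = 2

lam = strands * window * p_match    # expected number of chance matches
p_five_or_more = poisson_tail(5, lam)

print(f"expected matches in window: {lam:.4f}")
print(f"P(at least 5 by chance):    {p_five_or_more:.2e}")
```

This is the probability for one fixed window. A whole genome contains thousands of such windows, so the chance of seeing a cluster somewhere is correspondingly larger, though still small under these assumptions.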

With this function in hand, I can experiment, checking whether my method is any good. I will undoubtedly find that it could be improved in lots of ways. The ability to do quick experiments and gain rapid feedback enables my ideas to evolve.

Origin of DNA replication
Algorithm (where it stands)
* Search genome sequence for DnaA-binding sites
  - TTAT[CA]CACA
  - (not perfect – allow one mismatch?)
  - Use MATCHES-OF-PATTERN
* Sort sites by coordinate
  - Use SORT
* Look for clusters of sites
  - (How???)
(Eventually) Apply to all phage genomes
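The "allow one mismatch?" idea can be tried out with a brute-force scan that counts mismatches against each sequence the pattern covers. The Python sketch below is an illustration only: the function names and the toy genome are invented, and it scans a single strand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def approximate_matches(genome, targets, max_mismatches=1):
    """Hypothetical helper: return (1-based start, window) for every window
    within max_mismatches of at least one target sequence."""
    length = len(targets[0])
    hits = []
    for i in range(len(genome) - length + 1):
        window = genome[i:i + length]
        if any(hamming(window, t) <= max_mismatches for t in targets):
            hits.append((i + 1, window))
    return hits

# The two sequences covered by TTAT[CA]CACA.
targets = ["TTATCCACA", "TTATACACA"]

# Toy genome (invented): R3 (TTATCCAAA, one mismatch) and an exact R2 site.
toy_genome = "GG" + "TTATCCAAA" + "GGCC" + "TTATACACA" + "GG"

hits = approximate_matches(toy_genome, targets)
print(hits)
```

Allowing a mismatch picks up R3, which the exact pattern misses, but it also makes chance matches far more likely, so the cluster step matters even more.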

Morals of the Story
* Make problem tangible
  - Abstractions can give you a comforting big picture, but you won't make any progress unless you can connect the abstractions to reality.
* Test ideas by experimentation
  - Develop your methods using cases where the answer is already known.
* Package your insights into functions
  - Start with an imperfect function and let it evolve as you gain more experience.
* Test the limits of your method
  - Try weird cases. Figure out why the method fails (if it fails) and what would make it not work (if it works). Do lots of experiments.
* When things don't work (inevitable), cope
  - Try something different. Try lots of somethings different.
* When things continue not to work, talk with others
  - Sometimes pooled confusion can lead to light.

TATTCAAAATGAATTATATCGGTAA ATATCTGCAACTTTAAACCTGAATGA GGATTTAGTATTGCTGGGCCAGCCCAAA GTTTAGAATTTTCATCAACTTTGCACAATG A TGGAAAACGTGAATTCAAAAGGATTGCTAT AT ATTATTAAGAAAACATTTGGAATTCGAGAAC CGG AATATGGCATTCCGCAAATTAGAGAACGGAAT AGGTA TTCCTAAAAAAACACATTCTCTGCAATTTTTAAG ATGAGT ATTATACCTGCACTAACTTTGTGGGACGCAATATCA GAGCAACC CTATCATTTAAAACCTCAAAATACTTATCAGACTTGG GGAACATTCT GACCGTTTAGTAGAACGTTTCCGGCATATAAAATGGGGTGA AGTGGTAATG GTGAATTATCAAACAAATCATATGATCAGAATAATCGCCGTTTAAA TCCATCCTTTTC AACATCGAAATTTAACAGCCCGTGAAGGAGCTAGAATCCAATCTTTTCCAG GAAGAAAGATTTG ATGAAAAATTTCTTTGTCAATATAATCAAATCGGTAATGCTGTACCCCCTCTTCTCGCTA GTGCATGGATCAAATC TTGAACAAAAAGAGAATCATCGTACAAAATACAGAGATACTGAAAGCAGGACTTTCCTTAGAGAAATCA AGATGATTCAATTATTACTCAAAGAGTGGAACTTCTCACTAAATATAAAGATTTTTTAGATCAGCAGCATTATGCAGAAAAATTTGATTCAAGATCCAACC GCTCATAATCCTTACTGAGACGACGGTACTGGTTTAACCAGCCAAATGTTCTTTCTACTACCCACCGTTTGGGCAAAACCTGAAATTCTTGATTAGTACGCCGGATTACCTCAACATGAGCTTGAAT CTA GGCGGCAAGTAATCTTTCTCCAGCATTTGCTTCACTTACAACCACTTTTAACAAAAGTCCCAGACTATCAACCAAAGTTTGCCGCTTTCGTCCTTTTACCTTCTTGCCACCATCAAAACCGTACACATCCCCCTTTTTTCAGTCGTTTTTACCGACTGGCTGTCTGCC GCGATCGCCGTGGGTTGAGTTGACTTCCCCATTTTTTGACGAACTTGATCGCGCAAAGTATGATTCATTTCAGTTGAACTAGGAGGAAAATCCCCTGGAAGCATATCCCACTGAATTCGAATTCGAATTCGAATTCGAATTCGAC AACCTGTTTTCAGATGGTAGTAGATAGCGTTGCATAC TTCTCGCATATCAGTTGTTCGGGGATGCCCACCGCATTTAGCGGGTGGAATCAAAGGAGCTAAAATTGCCCATTCTGAGTCATTAAGGTCTGTAGAATAAGACTTTCGTCTCATTGTTTCCTATGTAAATACACTCTACAAACAGTATCTTATCGCTGCCTTTTTATCTTAGCTCTCCTTTAGATTTACTTTATAAATAGCCTCTTAGAAGAATTTCTTTATTATTTATTTAAAGATTTAGTACAAGAT TTCGGGCAGAACGCTCTTATTGGTAAGTCACACACGTTCAAAGATATTTTCTTCGTACCACCAAAATATTCTGAAATGCTCAAGCGACCTTATGCGCGAATTGAGAGAAAAGATCATGATTTCGTAATTGGTGCAACTGTTCAAGCATCGCTTGAAGCAGCACCTCCTCCAGAACAAAACCATGCTTGAGGGATCTTCACGCGCAGCAGAGGATTTAAAAGCGAGAAATCCTAACAGTTTATACC TTGTGGTTATGGAATGGATAAAACTGACCAATGATGTAAATTTACGAAAATATAAAGTTGATCAAATTTATGTACTACGTCAGCAAAAAAATACTGATAGAGAGTTTAGGTATGAGTCAACTTACATAAAAAAT
