Introduction to Bioinformatics Research Project


The Feb 28 entry on the calendar (and the Research Project topic page) links to advice on how to choose a research group and how to begin a protein-centered research project, including how to find useful articles. Here I present an example of a DNA-centered research project, beginning mostly after the useful-article stage.

How to choose your research group Mueser TC et al (2010) Virol J 7:359 In this alternate universe, I’m in the DNA replication group. I’m particularly interested in the DNA sequences that determine the initiation of DNA replication. I’ve even read an article or two about them, discovering...

Origin of DNA replication Circular, dsDNA genome Origin...that DNA in prokaryotes and their phages is primarily circular. To replicate it, the circle has to be opened at some point. That point is called the origin of replication.

Origin of DNA replication Circular, dsDNA genome Origin Bidirectional initiation Opening the circle at the origin exposes two single-strands. Both are replicated, with the replication fork moving in both directions, away from the origin.

Origin of DNA replication Circular, dsDNA genome Bidirectional initiation Origin Elongation Separation Eventually, two separate daughter circles are formed... But enough chatting. The issue is: how is the starting point chosen?

Origin of DNA replication Origin Zooming in on the origin, we see the two intertwined strands at oriC (i.e., the Origin of the Chromosome)

Origin of DNA replication Origin + What makes the origin special is that it binds proteins essential for initiating replication. The picture shows green DnaA protein binding to the origin – also a protein called FIS (more on this in a moment).

Origin of DNA replication Origin + + DnaA proteins bind not only to DNA but also to each other. With the help of a second DNA-binding protein, IHF (keep waiting), the bound DnaA proteins form a blob that distorts the DNA. The two strands of DNA separate at a nearby AT-rich region (you may recall that AT-rich regions are less stable than GC-rich regions).

Origin of DNA replication Origin + + FIS Factor for Inversion Stimulation in Phage Mu That’s the general idea. For the rest of this project, I’m going to focus on DnaA, but before leaving the other protein behind... (I hate throwing around undefined acronyms...) FIS was first discovered as a protein important in gene regulation by a phage.

Origin of DNA replication Origin + + IHF Integration Host Factor for lysogeny of Phage Lambda Same with IHF. It was first found as a protein used by a phage to integrate its genome into the bacterial genome. It’s amazing how many things were first found in phages.

Origin of DNA replication Origin + + How to recognize origin of replication? But back to the main question at hand. I want to learn how to recognize origins of replication. If I build a tool that can find known bacterial origins, maybe I can use the tool to search for origins in bacteriophages. Do phages have the same sorts of origins? Don’t know.

Origin of DNA replication Origin + + How to recognize origin of replication? But how to tell? One thing that distinguishes origins is their ability to bind DnaA protein -- if DnaA binds to a specific sequence, then origins must have multiple copies of it in close proximity. Does DnaA bind to a specific sequence?

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60: DnaA binding site Is DnaA binding to DNA specific? I found an article that says the answer is yes. The E. coli origin of replication, pictured above, has five specific binding sites for DnaA. I need to learn more about that sequence. Orange colored boxes are nice, but at this point, I need to get closer to the truth, closer to the sequence.

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60: Fuller et al (1984) Cell 38: DnaA binding site Here’s the sequence of the E. coli origin region. R1-R4 represent the sequences protected by DnaA when it binds. Are they all the same sequence?

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60: Fuller et al (1984) Cell 38: DnaA binding site For example R1 and R2... Are they the same sequence? Why are there two sets of nucleotides in each box?

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60: Fuller et al (1984) Cell 38: DnaA binding site If you notice that both strands of the DNA are shown, then you can make more sense of the boxes.
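The two sets of nucleotides in a box are the two strands of the same stretch of DNA: one is the reverse complement of the other. A minimal Python sketch of that relationship follows. It is only an illustration, not part of BioBIKE; the function name is my own, and the example sequence is the 9-nucleotide R1 box as it is written out later in this presentation.

```python
# Complement of each DNA base (uppercase A, C, G, T only, for simplicity).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

r1_one_strand = "TTATCCACA"
r1_other_strand = reverse_complement(r1_one_strand)
print(r1_one_strand, r1_other_strand)
```

Applying the function twice returns the original sequence, which is a quick sanity check that both strands carry the same information.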

Origin of DNA replication Kaguni (2006) Annu Rev Microbiol 60: Fuller et al (1984) Cell 38: DnaA binding site Putting all the boxes together (choosing one of the two strands arbitrarily), I begin to see a pattern. Kaguni said there was also R5 (M). Where’s that?
R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA
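One way to make the pattern explicit is to line the four boxes up and record, column by column, which nucleotides occur. The Python sketch below does that and writes the result as a bracketed pattern. The helper name is hypothetical and the approach is just one simple way to summarize aligned sites.

```python
import re

# The four DnaA boxes as listed on the slide.
boxes = {
    "R1": "TTATCCACA",
    "R2": "TTATACACA",
    "R3": "TTATCCAAA",
    "R4": "TTATCCACA",
}

def column_pattern(seqs):
    """Build a pattern with one element per column of equal-length sequences."""
    parts = []
    for column in zip(*seqs):
        letters = sorted(set(column))
        # A single letter stands for itself; alternatives go in brackets.
        parts.append(letters[0] if len(letters) == 1 else "[" + "".join(letters) + "]")
    return "".join(parts)

pattern = column_pattern(boxes.values())
print(pattern)

# Every box should match the pattern derived from all of them.
all_match = all(re.fullmatch(pattern, seq) for seq in boxes.values())
```

The four boxes differ at two positions (the fifth and the eighth), so a pattern that covers all of them is slightly broader than TTAT[CA]CACA, the pattern listed in the algorithm summary at the end of the presentation.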

Origin of DNA replication Fuller et al (1984) Cell 38: Enough orange boxes! Even enough paper sequences! If I’m going to make an origin-finding tool, I need to test it on a known case – Why not this case? Can I find the E. coli origin by DnaA-binding sequences?
R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA

My goal is to make a general origin-finding tool, using the E. coli origin as a test case. I therefore need to find the coordinates of the E. coli origin, so I can tell if my tool is working. Since I'm going to build the tool in BioBIKE, I need the coordinates known to BioBIKE. There's no point finding the origin in GenBank or anywhere else. PhAnToMe is where you’ll find E. coli and phage sequences.

How do I find the E. coli origin in E. coli? My general origin-finding tool will look for DnaA-binding sites. I think that will work to find the E. coli origin, but I don't know that it will work. I need the coordinates of the E. coli origin so I can test my unproven tool with a known case. So, how can I find the E. coli origin with absolute certainty? What do I have in hand to enable me to find it?

What do I have in hand to enable me to find the origin? Of course I have the sequence. That's essentially foolproof, so long as I have available the E. coli genome sequence to search through. Looking for the sequence is much more certain than looking for DnaA boxes or some region annotated as “the origin”.

One strategy is to display the sequence of E. coli K12 (which is the standard laboratory strain).

Searching for some portion of the published origin sequence should get me to the right place in the genome. It doesn’t matter much which part of the origin I choose.

How could that be?!? I recheck the sequence... No problem.

When some strategy fails for no apparent reason and defies your best efforts to understand why, it is generally a good idea to try something completely different, even though the different strategy may not sound any more promising. It is the worm that wiggles that gets off the hook. So I try searching the E. coli genome for the same sequence, using a high threshold (expect value of 10, which would allow even rare random matches to sneak through).

That was informative! The first match goes from the beginning to end (Q-start=1, Q-end=30) of the 30-nucleotide sequence I gave it, but the match was only 96.67%. There must be a mismatch somewhere! The other matches are very partial with poor E-values. I’ll ignore them.

Where is the mismatch? The ALIGNMENT-OF function allows me to compare the 30-nucleotide query sequence with the actual sequence from E. coli. I used the coordinates provided by SEQUENCE-SIMILAR-TO to pick out the relevant portion of the genome.
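For readers without BioBIKE at hand, the same kind of comparison can be sketched in Python with the standard library's difflib, used here as a stand-in for ALIGNMENT-OF. The two short sequences are invented for illustration (they are not the real origin sequence), and the function name is my own. The last line checks the arithmetic behind a 96.67% match: 29 identical positions out of 30.

```python
from difflib import SequenceMatcher

def differences(query: str, target: str):
    """List (operation, query_segment, target_segment) wherever the two sequences differ."""
    matcher = SequenceMatcher(None, query, target, autojunk=False)
    return [
        (tag, query[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Made-up example: the "published" sequence carries one extra G.
published = "GGATCCTGGGTATTAAAA"
genomic = "GGATCCTGGTATTAAAA"
diffs = differences(published, genomic)
print(diffs)

# One difference in a 30-position alignment gives the identity reported by the search.
percent_identity = round(29 / 30 * 100, 2)
```

A "delete" operation means the segment is present in the first sequence but absent from the second, which is how an extra nucleotide in the published sequence would show up.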

Ah! The original article from which I got the origin sequence had an error in it, an extra G! This is not so surprising. In 1984 (the year of the article), all sequencing was done by hand with little redundancy. In any event, I think I found the origin – around coordinate

Note how I got to this region: Clearing the Search field, entering the coordinate in the Go To field, and clicking Go. Don’t be concerned about the blank lines on the top and the mayhem on the right. The E. coli genome happens to have lots of sequence features that people have annotated, and the Sequence Viewer doesn’t handle them very well.

First to confirm: Is this the right sequence? The first 30 nucleotides should match, of course (except for one). What about the rest? I’ll check the first Check!

Does the region have the DnaA-binding motifs? I could search for each individual sequence, but it’s more efficient to search for the pattern that encompasses all of them... Why only two? What happened to the other two? (You might want to look several slides back at the sequence.)
R1 TTATCCACA
R2 TTATACACA
R3 TTATCCAAA
R4 TTATCCACA

I can't depend on my own eyes. I need to automate the process. MATCHES-OF-PATTERN will search for the same DnaA-binding pattern but return all the results at once. There’s no preference which of the two strands a DnaA protein will bind to, so I specify BOTH-STRANDS.
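The idea of searching both strands can be sketched outside BioBIKE as well. The Python below is an assumption-laden illustration, not a description of how MATCHES-OF-PATTERN works internally: it searches the forward strand with a regular expression, then searches the reverse complement and converts those hits back to forward-strand coordinates. The function names, the "+"/"-" strand labels, and the toy sequence are all my own.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def matches_on_both_strands(genome, pattern):
    """Return (start, end, strand, matched_text) with 1-based forward-strand coordinates."""
    regex = re.compile(pattern)
    n = len(genome)
    found = []
    # Forward strand: regex positions are 0-based, so shift the start by one.
    for m in regex.finditer(genome):
        found.append((m.start() + 1, m.end(), "+", m.group()))
    # Reverse strand: search the reverse complement, then map coordinates back.
    for m in regex.finditer(reverse_complement(genome)):
        found.append((n - m.end() + 1, n - m.start(), "-", m.group()))
    return found

# Toy sequence: one site on the forward strand, one on the reverse strand.
toy = "GG" + "TTATCCACA" + "GG" + "TGTGTATAA" + "CC"
hits = matches_on_both_strands(toy, "TTAT[CA]CACA")
print(hits)
```

Note that `finditer` reports non-overlapping matches only. That is harmless for this particular pattern, since two copies of it cannot overlap, but it would matter for other patterns.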

Note that the results are shown formatted in a popup window for immediate gratification and also in the result pane for further use. There are a lot of sequences matching the pattern! How many? And how many would you expect by chance?

How many? That’s the easy one. I just counted the list (using * to indicate the previous result). How many expected by chance? Not much worse. You’ve done this sort of calculation many times in the past and will do so many times in the future. You should reach the conclusion that most of the matches are garbage.
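Here is one way the chance calculation might go, written as a short Python sketch. It rests on simplifying assumptions that are mine, not the presentation's: the four nucleotides are equally frequent and independent, the genome is treated as roughly 4.6 million nucleotides long (a round figure for E. coli K-12), and both strands are searched.

```python
def match_probability(choices_per_position, base_freq=0.25):
    """Chance that a random site matches, given how many bases are allowed at each position."""
    p = 1.0
    for n_allowed in choices_per_position:
        p *= n_allowed * base_freq
    return p

# TTAT[CA]CACA: eight fixed positions and one position that allows two bases.
choices = [1, 1, 1, 1, 2, 1, 1, 1, 1]
p_site = match_probability(choices)

genome_length = 4_600_000   # assumed round figure, not an exact genome size
strands = 2                 # the search covered both strands
expected_by_chance = strands * genome_length * p_site
print(p_site, expected_by_chance)
```

Under these assumptions a match is expected about once every 131,072 positions per strand, so a genome of this size should yield dozens of matches by chance alone.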

If a mere match to a DnaA-binding sequence is not informative, then how can we recognize an origin? What’s distinctive about the origin is that it contains a cluster of DnaA-binding sites. Unfortunately, it is difficult to recognize clusters of sites because the sites’ coordinates are not sorted. That’s the next step. (And then to clean up the screen)

That’s much better! With the sorted list, I can see the cluster of four DnaA-binding sites at the known origin of E. coli (at coordinate ~ ). Maybe there are other clusters? I’m not sure I’m up to peering through the entire list. However, I can see how I’d do it, examining each line with respect to its neighbors and keeping only those sites that are close to other sites. I need to automate this process to create the tool that can scan hundreds of genomes looking for origins of replication.
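The neighbor-by-neighbor examination described above can be sketched in a few lines of Python. This is only one possible approach, and the two thresholds (how far apart neighbors may be, and how many sites make a cluster) are hypothetical parameters chosen for illustration. The coordinates in the example are made up.

```python
def find_clusters(sorted_positions, max_gap=100, min_sites=3):
    """Group sorted coordinates into runs where each site lies within max_gap of the previous one."""
    clusters = []
    current = []
    for pos in sorted_positions:
        # A large gap ends the current run; keep it only if it is big enough.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters

# Made-up coordinates: one isolated site, one tight group of four, one pair.
positions = sorted([90050, 5000, 120, 5110, 5040, 90000, 5190])
clusters = find_clusters(positions)
print(clusters)
```

Because the function assumes sorted input, sorting the sites by coordinate is a prerequisite, just as it was for spotting the cluster by eye.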

Automation of this sort of thing will come later. Can't do everything at once. For now, I'll package the progress I've made to enable me to experiment easily. I'll take the steps I've developed and put them into a function.

My function consists of no more than what I did step by step. Now it has a name. Also, I generalized it to work with any genome, not just E. coli. Does it work?

Yes! Executing the function (now on my FUNCTION button) with E. coli as the argument gives exactly the same result as I got before. Will it work with other organisms?

Maybe! I tried it on Yersinia pestis (causative agent of the plague) and got a very provocative result. What are the odds that five DnaA sites would come up in the first 2000 nucleotides by chance? (Do the calculation.)
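One rough way to estimate those odds is to treat chance matches as rare, independent events and use a Poisson approximation. The Python sketch below makes several assumptions of my own: equal and independent base frequencies, the pattern TTAT[CA]CACA, both strands searched, and edge effects at the ends of the 2000-nucleotide window ignored.

```python
import math

def poisson_tail(k, lam, terms=40):
    """Approximate P(X >= k) for a Poisson variable by summing the first few tail terms."""
    return sum(
        math.exp(-lam) * lam**i / math.factorial(i)
        for i in range(k, k + terms)
    )

p_site = 2 / 4**9            # chance of a match at one position on one strand
window = 2000                # nucleotides examined
strands = 2
expected_sites = strands * window * p_site

p_five_or_more = poisson_tail(5, expected_sites)
print(expected_sites, p_five_or_more)
```

With only about 0.03 sites expected in the window, five or more chance matches come out at a probability on the order of one in several billion under these assumptions.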

With this function in hand, I can experiment, checking whether my method is any good. I will undoubtedly find that it could be improved in lots of ways. The ability to do quick experiments and gain rapid feedback enables my ideas to evolve.

Origin of DNA replication
Algorithm (where it stands)
* Search genome sequence for DnaA-binding sites
  - TTAT[CA]CACA
  - (not perfect – allow one mismatch?)
  - Use MATCHES-OF-PATTERN
* Sort sites by coordinate
  - Use SORT
* Look for clusters of sites
  - (How???)
(Eventually) Apply to all phage genomes
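The "allow one mismatch?" question in the list above could be explored with a sliding-window comparison. The Python below is a sketch of that idea, not a BioBIKE feature: it encodes TTAT[CA]CACA as a set of allowed bases per position and reports every window with at most a chosen number of mismatches. All names are hypothetical, and only one strand is scanned to keep the example short.

```python
# Allowed bases at each position of TTAT[CA]CACA.
ALLOWED = [set("T"), set("T"), set("A"), set("T"), set("CA"),
           set("C"), set("A"), set("C"), set("A")]

def sites_with_mismatches(seq, max_mismatches=1):
    """Return (1-based start, window, mismatch_count) for windows close enough to the pattern."""
    k = len(ALLOWED)
    found = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Count positions where the base is not among those allowed.
        mismatches = sum(base not in allowed for base, allowed in zip(window, ALLOWED))
        if mismatches <= max_mismatches:
            found.append((i + 1, window, mismatches))
    return found

# R3 (TTATCCAAA) differs from the pattern at a single position.
toy = "GG" + "TTATCCAAA" + "GG"
relaxed_hits = sites_with_mismatches(toy, max_mismatches=1)
strict_hits = sites_with_mismatches(toy, max_mismatches=0)
print(relaxed_hits, strict_hits)
```

Relaxing the match this way also raises the number of matches expected by chance, so the trade-off would need to be tested on a known origin before trusting it elsewhere.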

Morals of the Story
* Make problem tangible
  Abstractions can give you a comforting big picture, but you won't make any progress unless you can connect the abstractions to reality.
* Test ideas by experimentation
  Develop your methods using cases where the answer is already known.
* Package your insights into functions
  Start with an imperfect function and let it evolve as you gain more experience.
* Test the limits of your method
  Try weird cases. Figure out why the method fails (if it fails) and what would make it not work (if it works). Do lots of experiments.
* When things don't work (inevitable), cope
  Try something different. Try lots of somethings different.
* When things continue not to work, talk with others
  Sometimes pooled confusion can lead to light.
