[Bejerano Aut07/08] 1 MW 11:00-12:15 in Redwood G19 Profs: Serafim Batzoglou, Gill Bejerano TA: Cory McLean.


1 http://cs273a.stanford.edu [Bejerano Aut07/08] 1 MW 11:00-12:15 in Redwood G19 Profs: Serafim Batzoglou, Gill Bejerano TA: Cory McLean

2 Lecture 5: UCSC Source Tree, Genome Assemblies, Genomic Variation, Repeats

3 UCSC Resources: Data, Tools & Code. Underlying Database (MySQL): visualize, search & download

4 History of the Code: Started in 1999 in C after Java proved hopelessly unportable across browsers. Early modules include a Worm genome browser (Intronerator), and GigAssembler, which produced the working draft of the human genome. In 2001 a few other grad students started working on the code. In 2002 hired staff to help with Genome Browser. Currently project employs ~20 full-time people. [Jim Kent, 2004]

5 Lagging Edge Software: C language - compilers still available! CGI Scripts - portable if not pretty. SQL database - at least MySQL is free.

6 Problems with C: Missing booleans and strings. No real objects. Must free things.

7 Advantages of C: Very fast at runtime. Very portable. Language is simple. No tangled inheritance hierarchy. Excellent free tools are available. Libraries and conventions can compensate for language weaknesses.

8 Coping with Missing Data Types in C: #define boolean int. Fixing lack of real string type much harder: lineFile/common modules and autoSql code generator make parsing files relatively painless; dyString module not a horrible string ‘class’.
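To make the slide's two workarounds concrete, here is a minimal sketch of a boolean macro plus a growable string 'class'. The names `dynStr`, `dynStrNew`, `dynStrAppend` and `dynStrFree` are hypothetical and chosen for illustration; this is not the actual dyString API from the UCSC source tree, only one plausible way such a module could look.

```c
#include <stdlib.h>
#include <string.h>

#define boolean int
#define TRUE 1
#define FALSE 0

/* Hypothetical growable string; NOT the actual UCSC dyString API. */
struct dynStr
    {
    char *string;     /* Null-terminated buffer. */
    size_t size;      /* Characters in use, not counting the terminator. */
    size_t capacity;  /* Bytes allocated for the buffer. */
    };

struct dynStr *dynStrNew(size_t initialCapacity)
/* Allocate an empty string. Returns NULL if out of memory. */
{
struct dynStr *ds = malloc(sizeof(*ds));
if (ds == NULL)
    return NULL;
if (initialCapacity == 0)
    initialCapacity = 16;
ds->string = malloc(initialCapacity);
if (ds->string == NULL)
    {
    free(ds);
    return NULL;
    }
ds->string[0] = '\0';
ds->size = 0;
ds->capacity = initialCapacity;
return ds;
}

boolean dynStrAppend(struct dynStr *ds, const char *text)
/* Append text, doubling the buffer as needed. Returns FALSE if out of memory. */
{
size_t len = strlen(text);
size_t needed = ds->size + len + 1;
if (needed > ds->capacity)
    {
    size_t newCapacity = ds->capacity;
    char *grown;
    while (newCapacity < needed)
        newCapacity *= 2;
    grown = realloc(ds->string, newCapacity);
    if (grown == NULL)
        return FALSE;
    ds->string = grown;
    ds->capacity = newCapacity;
    }
memcpy(ds->string + ds->size, text, len + 1);  /* Copies the terminator too. */
ds->size += len;
return TRUE;
}

void dynStrFree(struct dynStr **pDs)
/* Free the string and set the caller's pointer to NULL. */
{
struct dynStr *ds = *pDs;
if (ds != NULL)
    {
    free(ds->string);
    free(ds);
    *pDs = NULL;
    }
}
```

A caller would create a string with `dynStrNew(0)`, build it up with repeated `dynStrAppend` calls, read the result from the `string` field, and release it with `dynStrFree(&ds)`.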

9 Object Oriented Programming in C: Build objects around structures. Make families of functions with names that start with the structure name, and that take the structure as the first argument. Implement polymorphism/virtual functions with function pointers in structure. Inheritance is still difficult. Perhaps this is not such a bad thing.
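The conventions on this slide can be sketched in a few lines. The `feature` structure and all function names below are invented for this example and do not come from the UCSC tree. The sketch shows functions prefixed with the structure name that take the structure as first argument, and a function pointer standing in for a virtual method.

```c
struct feature
/* Hypothetical base 'class': an annotated genomic interval. */
    {
    const char *name;                 /* Feature name. */
    int start, end;                   /* Half-open interval [start, end). */
    int (*score)(struct feature *f);  /* 'Virtual' method, set per subtype. */
    };

int featureSize(struct feature *f)
/* Ordinary method: same code for every subtype. */
{
return f->end - f->start;
}

static int gapScore(struct feature *f)
/* Gap implementation of score: length in bases. */
{
return featureSize(f);
}

static int geneScore(struct feature *f)
/* Gene implementation of score: a fixed value in this toy example. */
{
(void)f;
return 1000;
}

struct feature featureGap(const char *name, int start, int end)
/* 'Constructor' that wires in the gap implementation of score. */
{
struct feature f = {name, start, end, gapScore};
return f;
}

struct feature featureGene(const char *name, int start, int end)
/* 'Constructor' that wires in the gene implementation of score. */
{
struct feature f = {name, start, end, geneScore};
return f;
}

int featureScore(struct feature *f)
/* Polymorphic call: dispatches through the function pointer. */
{
return f->score(f);
}
```

Code that holds a `struct feature *` can call `featureScore` without knowing which 'subtype' it has. Nothing here shares fields between subtypes, which is the part the slide describes as still difficult.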

10 struct dnaSeq
/* A dna sequence in one-letter-per-base format. */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's. Null terminated */
    int size;             /* Number of bases. */
    };
struct dnaSeq *dnaSeqFromString(char *string);
/* Convert string containing sequence and possibly
 * white space and numbers to a dnaSeq. */
void dnaSeqFree(struct dnaSeq **pSeq);
/* Free dnaSeq and set pointer to NULL. */
void dnaSeqFreeList(struct dnaSeq **pList);
/* Free list of dnaSeq's. */
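The slide gives only the declarations. As a rough illustration of what could sit behind them, the sketch below implements the three functions under stated assumptions: letters are kept and lowercased, everything else (white space, digits, punctuation) is dropped, and `name` is left NULL because the declaration takes no name argument. This is not the UCSC implementation, which may differ in all of these choices.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct dnaSeq
/* A dna sequence in one-letter-per-base format (as on the slide). */
    {
    struct dnaSeq *next;  /* Next in list. */
    char *name;           /* Sequence name. */
    char *dna;            /* a's c's g's and t's. Null terminated */
    int size;             /* Number of bases. */
    };

struct dnaSeq *dnaSeqFromString(char *string)
/* Sketch: keep letters (lowercased), skip white space, digits and the rest.
 * Returns NULL if out of memory. The name field is left NULL (assumption). */
{
struct dnaSeq *seq = calloc(1, sizeof(*seq));
char *s;
int size = 0;
if (seq == NULL)
    return NULL;
seq->dna = malloc(strlen(string) + 1);
if (seq->dna == NULL)
    {
    free(seq);
    return NULL;
    }
for (s = string; *s != '\0'; ++s)
    {
    if (isalpha((unsigned char)*s))
        seq->dna[size++] = (char)tolower((unsigned char)*s);
    }
seq->dna[size] = '\0';
seq->size = size;
return seq;
}

void dnaSeqFree(struct dnaSeq **pSeq)
/* Free dnaSeq and set pointer to NULL. */
{
struct dnaSeq *seq;
if (pSeq == NULL || *pSeq == NULL)
    return;
seq = *pSeq;
free(seq->name);
free(seq->dna);
free(seq);
*pSeq = NULL;
}

void dnaSeqFreeList(struct dnaSeq **pList)
/* Free every element of the list and set the list pointer to NULL. */
{
struct dnaSeq *seq = *pList;
while (seq != NULL)
    {
    struct dnaSeq *next = seq->next;  /* Save before the node is freed. */
    dnaSeqFree(&seq);
    seq = next;
    }
*pList = NULL;
}
```

Passing a pointer to the pointer lets the free functions clear the caller's variable, which guards against accidental reuse of freed memory.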

11 UCSC Code Tree Summary. To conclude: Source tree is installed for you. All programs under utils/ should work. Code under hg/ requires the MySQL DB (or at least it thinks it does). Very useful resource: http://genomewiki.ucsc.edu/ If in trouble, use the “contact us” link to search Q&A. Then come ask us/shoot UCSC helpdesk an e-mail.

12 Lights, Action, Rolling: 2001, HGC, Celera

13 The Sequencing of the Human Genome
Lander: So the genes from which most of the work was done come from Buffalo, New York.
Krulwich: From Buffalo, New York?
Lander: Yes. It's mostly a guy from Buffalo and a woman from Buffalo. But that's because the laboratory that was making--...
Lander: The laboratory that prepared the large DNA libraries that were used was a laboratory in Buffalo. And so they put an ad in the Buffalo newspapers, and they got random volunteers from Buffalo, and they got about 20 of them. They then erased all the labels and chose at random this sample and that sample and that sample. So nobody knows who they are. We don't have any links back to who they are, and that's deliberate.
Eric Lander, NOVA interview, 2001

14 Meet Your Genome [Human Molecular Genetics, 3rd Edition]

15 Heterochromatin as an example

16 The Human Genome is “Finished” [HGC, 2004]

17 “Unfinished Business in a Finished Genome”: 341 remaining gaps: 33 Heterochromatic, 35 Euchromatic Boundaries, 273 Euchromatic Interior regions. Centromeric, Telomeric gaps. Acrocentric, rDNA clusters: chr. 13, 14, 15, 21, 22

18 Assembly Gap Types

19 Mind the Gap

20 Fluorescent in situ hybridization (FISH) [Eichler et al, 2004]

21 Euchromatic Interior Gap, Unplaced Sequence

22 The _random Chromosomes

23 hg18.chr1_random... Some genomes are in much worse shape. Some have _random chroms that are (sadly) called some other name (but look the same). _randoms are a great place to meet contaminants: pieces of local technician DNA; sequence from the vector used in the protocol; the odd tube from another genome project being sequenced at the same genome center.

24 Mistaking (Haplotype) Variation for Segmental Dups

25 Wave of the Future [Shendure et al, 2004]

26 SNPs: A Single Nucleotide Polymorphism is a source of variance in a genome. A SNP ("snip") is a single base mutation in DNA. SNPs are the simplest form and most common source of genetic polymorphism in the human genome (90% of all human DNA polymorphisms) [Hegele, 2004]. Slide annotation on the 90% figure: not any more...
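As a toy illustration of single-base differences, the sketch below compares two already-aligned sequences and records the positions where they disagree. The function name and the choice to skip `-` and `n` characters are assumptions made for this example. A pairwise comparison like this only surfaces candidate sites; calling something a SNP additionally requires evidence that the variant is present in the population.

```c
#include <ctype.h>

static int isUncalled(char base)
/* Gap or unknown base: cannot support a single-base difference. */
{
return base == '-' || tolower((unsigned char)base) == 'n';
}

int findSingleBaseDiffs(const char *a, const char *b, int *positions, int maxOut)
/* Hypothetical helper: compare two aligned sequences base by base and store
 * up to maxOut 0-based positions where they differ (case-insensitive).
 * Stops at the end of the shorter sequence. Returns the number stored. */
{
int i, found = 0;
for (i = 0; a[i] != '\0' && b[i] != '\0' && found < maxOut; ++i)
    {
    if (isUncalled(a[i]) || isUncalled(b[i]))
        continue;
    if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
        positions[found++] = i;
    }
return found;
}
```

For example, comparing `acgtacgt` with `acgaacct` reports positions 3 and 6.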

27 Larger Scale DNA Mutation: We knew this was happening to DNA, at all length scales. We did not know how frequent, nor how prevalent in the human population these changes are... you are here

28 Copy Number Variation (CNVs): so... how representative is the reference genome? [Redon et al, 2006]

29 J.C. Venter Goes to Buffalo: serious representation problem [Khaja et al, 2006]

30 Large Scale Variation & Disease [Lupski, 2007]

31 Don’t Panic G E N O M E

32 Meanwhile, back in Your Genome

33 [Adapted from Lunter]

42 Inferring Phylogeny Using Repeats [Nishihara et al, 2006]

43 Simple Repeats: Every possible motif of mono-, di-, tri- and tetranucleotide repeats is vastly overrepresented in the human genome. These are called microsatellites. Longer repeating units are called minisatellites. The really long ones are called satellites. Highly polymorphic in the human population. Highly heterozygous in a single individual. As a result microsatellites are used in paternity testing, forensics, and the inference of demographic processes. There is no clear definition of how many repetitions make a simple repeat, nor how imperfect the different copies can be. Highly variable between genomes: e.g., using the same search criteria the mouse & rat genomes have 2-3 times more microsatellites than the human genome. They’re also longer in mouse & rat.
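Since the slide stresses that there is no agreed definition, any detector has to pick its own criteria. The sketch below is one arbitrary choice, not a standard tool: it reports perfect tandem repeats whose unit is at most `maxUnit` bases long and occurs at least `minCopies` times in a row. The structure, function name and thresholds are assumptions for illustration, and imperfect copies are not handled at all.

```c
#include <string.h>

struct simpleRepeat
/* One perfect tandem repeat found in a sequence (hypothetical type). */
    {
    int start;     /* 0-based start position. */
    int unitSize;  /* Length of the repeated motif. */
    int copies;    /* Number of consecutive copies. */
    };

int findSimpleRepeats(const char *dna, int maxUnit, int minCopies,
                      struct simpleRepeat *out, int maxOut)
/* Scan left to right. At each position try every unit size from 1 to maxUnit,
 * count perfect consecutive copies, and keep the candidate spanning the most
 * bases (ties go to the shorter unit). Returns the number of repeats stored. */
{
int n = (int)strlen(dna);
int found = 0;
int i = 0;
while (i < n && found < maxOut)
    {
    int unit, bestUnit = 0, bestCopies = 0;
    for (unit = 1; unit <= maxUnit; ++unit)
        {
        int copies = 1;
        while (i + (copies + 1) * unit <= n
               && memcmp(dna + i, dna + i + copies * unit, (size_t)unit) == 0)
            ++copies;
        if (copies >= minCopies && copies * unit > bestCopies * bestUnit)
            {
            bestUnit = unit;
            bestCopies = copies;
            }
        }
    if (bestUnit > 0)
        {
        out[found].start = i;
        out[found].unitSize = bestUnit;
        out[found].copies = bestCopies;
        ++found;
        i += bestUnit * bestCopies;  /* Skip past the repeat just reported. */
        }
    else
        ++i;
    }
return found;
}
```

Changing `minCopies` or `maxUnit` changes what counts as a repeat, which is why counts are only comparable between genomes when the same search criteria are used for each.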

