
1 Conditional Random Fields
[Figure: state trellis with states 1 … K at each of positions x_1 … x_K]

2 CS262 Lecture 9, Win07, Batzoglou
"Features" that depend on many positions in x

  V_l(i+1) = max_k [ V_k(i) + g(k, l, x, i+1) ]

  where g(k, l, x, i) = Σ_{j=1…n} f_j(k, l, x, i) × w_j

[Figure: state π_i depends on the previous state π_{i-1} and on many positions of x, e.g. x_1 … x_10]
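A minimal sketch of this feature-weighted Viterbi recurrence in Python; the state names, feature functions, and weights passed in are illustrative, not from the lecture:

```python
def viterbi(x, states, features, weights):
    """Feature-weighted Viterbi: V_l(i+1) = max_k [V_k(i) + g(k, l, x, i+1)].

    features: list of f_j(k, l, x, i) functions; weights: matching w_j.
    """
    def g(k, l, i):
        # g(k, l, x, i) = sum_j f_j(k, l, x, i) * w_j
        return sum(w * f(k, l, x, i) for f, w in zip(features, weights))

    V = {l: g(None, l, 0) for l in states}   # no predecessor at i = 0
    backptr = []
    for i in range(1, len(x)):
        V_new, ptr = {}, {}
        for l in states:
            # best previous state k for current state l
            k_best = max(states, key=lambda k: V[k] + g(k, l, i))
            V_new[l] = V[k_best] + g(k_best, l, i)
            ptr[l] = k_best
        V = V_new
        backptr.append(ptr)

    # trace back the highest-scoring parse
    last = max(V, key=V.get)
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]
```

With a single emission-style feature f(k, l, x, i) = [x_i = l], this reduces to picking the state matching each symbol, which makes a convenient sanity check.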

3 "Features" that depend on many positions in x
The score of a parse depends on all of x at each position.
Can still do Viterbi, because state π_i only "looks" at the previous state π_{i-1} and the constant sequence x.
[Figure: HMM vs. CRF graphical models over states π_1 … π_6 and observations x_1 … x_6]

4 How many parameters are there, in general?
Arbitrarily many parameters!
For example, let f_j(k, l, x, i) depend on x_{i-5}, x_{i-4}, …, x_{i+5}.
Then we would have up to K² × |Σ|^11 parameters!
Advantage: a powerful, expressive model
  Example: "if there are more than 50 sixes in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 sixes, this is evidence we are in the Fair state"
  Interpretation: the casino player is afraid of being caught, so he switches to Fair when he sees too many sixes
  Example: "if there are any CG-rich regions in the vicinity (a window of 2,000 positions), then favor predicting lots of genes in this region"
Question: how do we train these parameters?
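The non-local casino feature from the first example might be sketched as follows; the signature f_j(k, l, x, i) and the state name "Fair" follow the slide, but the exact window boundaries are an assumption:

```python
def f_fair_evidence(k, l, x, i):
    """1.0 if there are more than 50 sixes in the last 100 rolls but at
    most 3 sixes in the 18 rolls around i, and the current state l is
    Fair; 0.0 otherwise. x is a sequence of die rolls (ints 1-6)."""
    last100 = x[max(0, i - 100):i]       # the last 100 rolls before i
    around = x[max(0, i - 9):i + 9]      # a window of 18 rolls around i
    if l == "Fair" and last100.count(6) > 50 and around.count(6) <= 3:
        return 1.0
    return 0.0
```

Note the feature ignores the previous state k entirely; a CRF feature is free to use any subset of (k, l, x, i).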

5 Conditional Training
Hidden Markov Model training:
  Given a training sequence x and its "true" parse π
  Maximize P(x, π)
Disadvantage:
  P(x, π) = P(π | x) P(x)
    P(π | x): the quantity we care about, so as to get a good parse
    P(x): a quantity we don't care so much about, because x is always given

6 Conditional Training
P(x, π) = P(π | x) P(x), so P(π | x) = P(x, π) / P(x)

Recall F(j, x, π) = # times feature f_j occurs in (x, π)
                  = Σ_{i=1…N} f_j(π_{i-1}, π_i, x, i)   ; count of f_j in (x, π)

In HMMs, denote by w_j the weight of the j-th feature: w_j = log(a_kl) or log(e_k(b))

Then:
  HMM: P(x, π) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ]
  CRF: Score(x, π) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ]
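The global feature counts F(j, x, π) and the unnormalized score can be computed directly from these definitions; the feature functions and weights below are placeholders:

```python
import math

def feature_counts(features, x, pi):
    """F(j, x, pi) = sum_i f_j(pi_{i-1}, pi_i, x, i) for each feature j.
    pi[0] has no predecessor, so None stands in for the start state."""
    return [sum(f(pi[i - 1] if i > 0 else None, pi[i], x, i)
                for i in range(len(x)))
            for f in features]

def crf_score(features, weights, x, pi):
    """Unnormalized CRF score: exp[ sum_j w_j * F(j, x, pi) ]."""
    F = feature_counts(features, x, pi)
    return math.exp(sum(w * Fj for w, Fj in zip(weights, F)))
```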

7 Conditional Training
In HMMs:
  P(π | x) = P(x, π) / P(x)
  P(x, π) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ]
  P(x) = Σ_π exp[ Σ_{j=1…n} w_j × F(j, x, π) ] =: Z

Then, in a CRF we can do the same to normalize Score(x, π) into a probability:
  P_CRF(π | x) = exp[ Σ_{j=1…n} w_j × F(j, x, π) ] / Z

QUESTION: Why is this a probability?
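The partition function Z is a sum over all parses. A sketch with two equivalent computations, assuming illustrative states, features, and weights: a brute-force enumeration (only feasible for tiny examples) and the forward-style dynamic program that computes the same Z in O(N K²):

```python
import math
from itertools import product

def partition_brute(states, features, weights, x):
    """Z = sum over every parse pi of exp[ sum_j w_j F(j, x, pi) ]."""
    total = 0.0
    for pi in product(states, repeat=len(x)):
        s = sum(w * f(pi[i - 1] if i > 0 else None, pi[i], x, i)
                for f, w in zip(features, weights)
                for i in range(len(x)))
        total += math.exp(s)
    return total

def partition_forward(states, features, weights, x):
    """Same Z via a forward-style sum-over-paths recursion."""
    def g(k, l, i):
        return sum(w * f(k, l, x, i) for f, w in zip(features, weights))
    alpha = {l: math.exp(g(None, l, 0)) for l in states}
    for i in range(1, len(x)):
        alpha = {l: sum(alpha[k] * math.exp(g(k, l, i)) for k in states)
                 for l in states}
    return sum(alpha.values())
```

Dividing any parse's score by Z yields a value in (0, 1], and by construction the values sum to 1 over all parses, which answers the slide's question.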

8 Conditional Training
1. We are given a training set of sequences x and "true" parses π
2. Calculate Z by a sum-over-paths algorithm similar to the HMM forward algorithm
   We can then easily calculate P(π | x)
3. Calculate the partial derivative of log P(π | x) with respect to each parameter w_j (not covered; akin to forward/backward):
     d/dw_j log P(π | x) = F(j, x, π) − E_{π'}[F(j, x, π')]
   Update each parameter by gradient ascent!
4. Continue until convergence to the optimal set of weights

log P(π | x) = Σ_{j=1…n} w_j × F(j, x, π) − log Z is concave in w, so gradient ascent converges to the global optimum!
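One gradient step can be sketched by brute force, computing the expectation E_{π'}[F(j, x, π')] by enumerating all parses (real implementations use forward/backward instead); the states, features, and learning rate are illustrative:

```python
import math
from itertools import product

def grad_step(states, features, weights, x, pi_true, lr=0.1):
    """One gradient-ascent update: w_j += lr * (F(j,x,pi) - E[F(j,x,pi')])."""
    def counts(pi):
        return [sum(f(pi[i - 1] if i > 0 else None, pi[i], x, i)
                    for i in range(len(x))) for f in features]

    # unnormalized score of every parse, and their sum Z
    paths = list(product(states, repeat=len(x)))
    scores = [math.exp(sum(w * c for w, c in zip(weights, counts(p))))
              for p in paths]
    Z = sum(scores)

    # expected feature counts under the current P(pi | x)
    expect = [sum(s / Z * counts(p)[j] for p, s in zip(paths, scores))
              for j in range(len(features))]

    F_true = counts(list(pi_true))
    return [w + lr * (Ft - Ej)
            for w, Ft, Ej in zip(weights, F_true, expect)]
```

Starting from w = 0, a feature that fires more often on the true parse than on average gets its weight pushed up, as expected.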

9 Conditional Random Fields – Summary
1. Ability to incorporate complicated, non-local feature sets
   Does away with some independence assumptions of HMMs
   Parsing is still equally efficient
2. Conditional training
   Trains parameters that are best for parsing, not for modeling
   Needs labeled examples: sequences x and "true" parses π
   (Can train on unlabeled sequences, but it is unreasonable to train too many parameters this way)
   Training is significantly slower: many iterations of forward/backward

10 DNA Sequencing

11 DNA sequencing
How we obtain the sequence of nucleotides of a species
…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

12 Which representative of the species? Which human?
Answer one: …
Answer two: it doesn't matter
Polymorphism rate: the number of letter differences between two different members of a species
  Humans: ~1/1,000
  Other organisms have much higher polymorphism rates → population size!


14 Human population migrations
Out of Africa, Replacement:
  Single mother of all humans (Eve), ~150,000 years ago
  Single father of all humans (Adam), ~70,000 years ago
  Humans out of Africa ~40,000 years ago replaced others (e.g., Neandertals)
  Evidence: mtDNA
Multiregional Evolution:
  Fossil records show a continuous change of morphological features
  Proponents of the theory doubt mtDNA and other genetic evidence

15 Why humans are so similar
A small population that interbred reduced the genetic variation
Out of Africa ~40,000 years ago
Expected heterozygosity: H = 4Nu / (1 + 4Nu)
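As a quick numeric check of the slide's formula, a sketch where N is the effective population size and u the per-site mutation rate; the values used below are illustrative, not from the lecture:

```python
def heterozygosity(N, u):
    """Expected heterozygosity H = 4*N*u / (1 + 4*N*u)."""
    theta = 4 * N * u
    return theta / (1 + theta)
```

With a small effective population size N, the product 4Nu stays small and H is low, which is the slide's point about why humans are so similar.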

16 Migration of human variation
http://info.med.yale.edu/genetics/kkidd/point.html


19 Human variation in the Y chromosome


22 DNA Sequencing – Overview (1975 – 2015)
Gel electrophoresis
  Predominant, old technology, by F. Sanger
Whole-genome strategies
  Physical mapping
  Walking
  Shotgun sequencing
Computational fragment assembly
The future: new sequencing technologies
  Pyrosequencing, single-molecule methods, …
  Assembly techniques
Future variants of sequencing
  Resequencing of humans
  Microbial and environmental sequencing
  Cancer genome sequencing

23 DNA Sequencing
Goal: find the complete sequence of A, C, G, T's in DNA
Challenge: there is no machine that takes long DNA as input and gives the complete sequence as output
  Can only sequence ~500 letters at a time

24 DNA Sequencing – vectors
[Figure: DNA is shaken into fragments; a fragment is inserted into a vector, a circular genome (bacterium, plasmid), at a known location (restriction site)]

25 Different types of vectors

  Vector                                  Size of insert (bp)
  Plasmid                                 2,000 – 10,000 (can control the size)
  Cosmid                                  40,000
  BAC (Bacterial Artificial Chromosome)   70,000 – 300,000
  YAC (Yeast Artificial Chromosome)       > 300,000 (not used much recently)

26 DNA Sequencing – gel electrophoresis
1. Start at a primer (restriction site)
2. Grow the DNA chain
3. Include dideoxynucleotides (modified A, C, G, T)
4. These stop the reaction at all possible points
5. Separate the products by length, using gel electrophoresis

27 Electrophoresis diagrams

28 Challenging to read answer


31 Reading an electropherogram
1. Filtering
2. Smoothing
3. Correction for length compressions
4. A method for calling the letters: PHRED
PHRED: PHil's Read EDitor (by Phil Green)
Several better methods exist, but labs are reluctant to change

32 Output of PHRED: a read
A read: 500 – 1,000 nucleotides

  A   C   G   A   A   T   C   A   G   …   A
  16  18  21  23  25  15  28  30  32  …   21

Quality scores: Q = -10 × log10 Prob(Error)
Reads can be obtained from the leftmost and rightmost ends of the insert
Double-barreled sequencing (1990): both the leftmost and rightmost ends are sequenced, and the reads are paired
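The quality-score definition converts directly between error probability and PHRED score; for example, Q20 corresponds to a 1% chance the base call is wrong:

```python
import math

def error_prob_to_quality(p):
    """PHRED quality: Q = -10 * log10(P(error))."""
    return -10 * math.log10(p)

def quality_to_error_prob(q):
    """Inverse mapping: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)
```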

33 Method to sequence longer regions
[Figure: a genomic segment is cut many times at random ("shotgun"); one or two reads of ~500 bp are obtained from each segment]

34 Reconstructing the Sequence (Fragment Assembly)
Cover the region with reads at high redundancy
Overlap and extend reads to reconstruct the original genomic region
[Figure: overlapping reads tiling the genomic region]
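The overlap-and-extend idea can be sketched as a toy greedy assembler: repeatedly merge the pair of reads with the longest suffix-prefix overlap. This assumes error-free reads; real assemblers must also handle sequencing errors, repeats, and reverse complements:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def assemble(reads):
    """Greedily merge the best-overlapping pair until one sequence remains."""
    reads = list(reads)
    while len(reads) > 1:
        n, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(reads)
                      for j, b in enumerate(reads) if i != j)
        merged = reads[i] + reads[j][n:]          # extend by the overhang
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads[0]
```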

