
1 Administrivia What: LTI Seminar When: Friday Oct 30, 2009, 2:00pm - 3:00pm Where: 1305 NSH Faculty Host: Noah Smith Title: SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia Speaker: Dr. C. Lee Giles, Pennsylvania State University. Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building an integrated search engine and digital library that focuses on all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, table indexing, etc. … Counts for two writeups if you attend!

2 Two Page Status Report on Project – due Wed 11/2 at 9am
This is a chance to tell me how your project is progressing: what you've accomplished, and what you plan to do in the future. There's no fixed format, but here are some things you might discuss. What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it; what did you do to get acquainted with the data? Do you plan on looking at the same problem, or have you changed your plans? If you plan on writing code, what have you written so far, in what languages, and what do you still need to do? If you plan on using off-the-shelf code, what have you installed, and what experiences have you had with it? If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?

3 Brin’s 1998 paper

4 Poon and Domingos – continued! plus Bellare & McCallum
Mostly pilfered from Pedro’s slides

5 Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author, title) pairs (e.g., “Isaac Asimov”, “The Robots of Dawn”).
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match heuristically chosen subsets of the occurrences: order, URL prefix, prefix, middle, suffix.
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2. [some workshop, 1998]
Result (the loop alternates Relation → Patterns and Patterns → Relation): 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences + 5M pages → 3947 occurrences → 105 patterns → … → 15,257 books.
But: mostly learned “science fiction books”, at least in early rounds; some manual intervention; special regexes for author/title were used.
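To make the loop concrete, here is a rough Python sketch of the bootstrapping cycle. This is an illustrative simplification, not Brin's actual system: the (prefix, middle, suffix) pattern form is assumed, and the order and URL-prefix components of real DIPRE patterns are ignored.

```python
import re

def dipre(seed_pairs, corpus, rounds=3, min_support=2):
    """Sketch of a DIPRE-style bootstrap. `corpus` is a list of
    (url, text) pairs; patterns are simplified to
    (prefix, middle, suffix) string triples."""
    known = set(seed_pairs)
    for _ in range(rounds):
        # Step 2: find occurrences of known (author, title) pairs.
        contexts = {}
        for url, text in corpus:
            for author, title in known:
                i, j = text.find(author), text.find(title)
                if 0 <= i and i + len(author) <= j:
                    pat = (text[max(0, i - 10):i],                    # prefix
                           text[i + len(author):j],                   # middle
                           text[j + len(title):j + len(title) + 10])  # suffix
                    contexts[pat] = contexts.get(pat, 0) + 1
        # Step 3: keep patterns supported by several occurrences.
        patterns = [p for p, n in contexts.items() if n >= min_support]
        # Step 4: extract new pairs matching the patterns.
        for prefix, middle, suffix in patterns:
            rx = re.compile(re.escape(prefix) + r"(.{2,60}?)" +
                            re.escape(middle) + r"(.{2,80}?)" +
                            re.escape(suffix))
            for url, text in corpus:
                for author, title in rx.findall(text):
                    known.add((author, title))
        # Step 5: repeat with the grown relation.
    return known
```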

6 Markov Networks: [Review]
Undirected graphical models. (Example graph over the variables Smoking, Cancer, Asthma, Cough.) Potential functions defined over cliques, e.g. Φ(Smoking, Cancer):

Smoking  Cancer  Φ(S,C)
False    False   4.5
False    True    4.5
True     False   2.7
True     True    4.5
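For review, the joint distribution a Markov network defines, in both the clique-potential form above and the equivalent log-linear form that MLNs build on:

```latex
% Markov network joint distribution: product of clique potentials,
% normalized by the partition function Z; equivalently log-linear
% in features f_i with weights w_i.
P(x) \;=\; \frac{1}{Z}\prod_{k}\Phi_{k}\bigl(x_{\{k\}}\bigr)
      \;=\; \frac{1}{Z}\exp\Bigl(\sum_{i} w_{i}\, f_{i}(x)\Bigr),
\qquad
Z \;=\; \sum_{x'}\prod_{k}\Phi_{k}\bigl(x'_{\{k\}}\bigr)
```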

7 First-Order Logic
Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y).
Literal: a predicate or its negation.
Clause: a disjunction of literals.
Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob).
World (model, interpretation): an assignment of truth values to all ground predicates.

8 Markov Logic: Intuition
A logical KB is a set of hard constraints on the set of possible worlds. Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible. Give each formula a weight (higher weight → stronger constraint).

9 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B), with weighted formulas 1.5: Smokes(x) => Cancer(x) and 1.1: Friends(x,y) => (Smokes(x) <=> Smokes(y)). (Ground network diagram over the atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).)

10 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). (Ground network diagram over the atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).)

11 Example: Friends & Smokers
(Ground network diagram as before. The ground formula Smokes(Anna) => Cancer(Anna), with weight 1.5, adds a factor over the pair of atoms Smokes(Anna), Cancer(Anna).)

12 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). (Ground network diagram over the same eight atoms.)

13 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). (Ground network diagram as before. The ground formula Friends(A,B) => (Smokes(A) <=> Smokes(B)), with weight 1.1, adds a factor over the triple of atoms Friends(A,B), Smokes(A), Smokes(B).)

14 Markov Logic Networks
An MLN is a template for ground Markov networks. Probability of a world x:

P(x) = \frac{1}{Z} \exp\Bigl(\sum_i w_i\, n_i(x)\Bigr)

where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
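A minimal Python sketch of this computation; `count_true_groundings` is supplied by the caller and stands in for whatever formula language is used, so it is an assumption, not a real library call:

```python
import math

def world_score(formulas, world, count_true_groundings):
    """Unnormalized probability of a world x under an MLN:
    exp(sum_i w_i * n_i(x)). `formulas` is a list of (weight, formula)
    pairs; `count_true_groundings(formula, world)` implements n_i(x)
    and is a caller-supplied (hypothetical) helper."""
    return math.exp(sum(w * count_true_groundings(f, world)
                        for w, f in formulas))

# P(x) = world_score(formulas, x, n) / Z,
# where Z sums world_score over all possible worlds.
```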

15 Weight Learning
Parameter tying: groundings of the same clause share a weight.
Generative learning: pseudo-likelihood. Discriminative learning: conditional likelihood [like CRFs - but we need to do inference. They use a Collins-like method that computes expectations near a MAP soln. WC]
The likelihood gradient for weight i is

\frac{\partial}{\partial w_i} \log P_w(x) \;=\; n_i(x) - \mathbb{E}_w\bigl[n_i(x)\bigr]

the number of times clause i is true in the data, minus the expected number of times clause i is true according to the MLN.

16 MAP/MPE Inference
Problem: find the most likely state of the world given evidence. This is just the weighted MaxSAT problem. Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997]).

17 The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p:
            flip a random variable in c
        else:
            flip the variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
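A runnable Python sketch of the algorithm above. The clause encoding (signed integer literals) and the `threshold` convention (the target total satisfied weight) are illustrative assumptions:

```python
import random

def maxwalksat(clauses, n_vars, max_tries=10, max_flips=1000,
               threshold=0.0, p=0.5):
    """Sketch of MaxWalkSAT [Kautz et al., 1997]. A clause is
    (weight, [literals]); literal +v / -v means variable v must be
    True / False. Variables are numbered 1..n_vars."""
    def sat_weight(assign):
        return sum(w for w, lits in clauses
                   if any(assign[abs(l)] == (l > 0) for l in lits))

    best, best_w = None, float('-inf')
    for _ in range(max_tries):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            if sat_weight(assign) > threshold:
                return assign
            unsat = [lits for w, lits in clauses
                     if not any(assign[abs(l)] == (l > 0) for l in lits)]
            if not unsat:                       # everything satisfied
                return assign
            c = random.choice(unsat)
            if random.random() < p:
                v = abs(random.choice(c))       # random-walk move
            else:                               # greedy move
                v = max((abs(l) for l in c),
                        key=lambda u: sat_weight({**assign, u: not assign[u]}))
            assign[v] = not assign[v]
        w = sat_weight(assign)
        if w > best_w:
            best, best_w = assign, w
    return best                                 # best solution found
```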

18 MAP=WalkSat, Expectations=????
MCMC? Deterministic dependencies break MCMC, and near-deterministic ones make it very slow. Solution: combine MCMC and WalkSAT → the MC-SAT algorithm [Poon & Domingos, 2006]

19 Slice Sampling [Damien et al. 1999]
Slice sampling is a representative of auxiliary-variable methods. Suppose we are given a difficult distribution P(x) with a low-probability region between two modes. This is exactly the type that causes trouble for Gibbs sampling, because it is very slow to traverse that region. Slice sampling gets around the problem. It begins with a random start point. Suppose we have reached the k-th sample x(k). We first sample the next auxiliary value u(k) uniformly from the vertical bar from 0 to P(x(k)). Then we gather all x such that P(x) is above u(k); these constitute the slice. We then sample the next point x(k+1) uniformly from this slice, and the process goes on. With slice sampling, we can jump from one mode to another in a single sampling step, bypassing the low-probability region. (Figure: density P(x) with the slice at height u(k) and the points x(k), x(k+1).)
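A Python sketch of one-dimensional slice sampling. The slide specifies only the two uniform draws; the stepping-out/shrinkage interval construction used here is one standard way to realize "sample uniformly from the slice" and is an assumption:

```python
import math, random

def slice_sample(p, x0, n_samples, width=1.0):
    """Slice-sampling sketch for a 1-D unnormalized density p(x):
    draw u uniformly from (0, p(x)), then draw the next x uniformly
    from the slice {x : p(x) > u}."""
    samples, x = [], x0
    for _ in range(n_samples):
        u = random.uniform(0, p(x))            # vertical slice level
        lo = x - width * random.random()       # randomly placed interval
        hi = lo + width
        while p(lo) > u:                       # step out to the left
            lo -= width
        while p(hi) > u:                       # step out to the right
            hi += width
        while True:                            # shrink on rejection
            x_new = random.uniform(lo, hi)
            if p(x_new) > u:
                x = x_new
                break
            if x_new < x:
                lo = x_new
            else:
                hi = x_new
        samples.append(x)
    return samples

# e.g. a bimodal density with a low-probability region between modes:
# slice_sample(lambda z: math.exp(-z**2) + math.exp(-(z - 4)**2), 0.0, 1000)
```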

20 The MC-SAT Algorithm
X(0) ← a random solution satisfying all hard clauses
for k ← 1 to num_samples do
    M ← Ø
    for all Ci satisfied by X(k-1) do
        with probability 1 - exp(-wi), add Ci to M
    endfor
    X(k) ← a uniformly random solution satisfying M   (using MaxWalkSat)
endfor
So we arrive at the MC-SAT algorithm: simple and elegant. We initialize with a random solution that satisfies all the hard clauses. In each sampling step, we first choose a random subset M of the currently satisfied clauses, taking each clause with probability 1 - e^(-wi). Then the next sample is drawn uniformly from the solutions of M, using "SampleSat": MaxWalkSat + simulated annealing.
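A Python sketch of the loop above. `sample_sat` stands in for the SampleSat solver (MaxWalkSAT + simulated annealing) and is assumed, not implemented here:

```python
import math, random

def mc_sat(clauses, n_vars, num_samples, sample_sat):
    """Sketch of MC-SAT [Poon & Domingos, 2006]. `clauses` is a list of
    (weight, literals) pairs, with weight == float('inf') for hard
    clauses; literal +v / -v means variable v True / False.
    `sample_sat(clause_set, n_vars)` is an assumed helper returning a
    (near-)uniform random solution of the clause set."""
    hard = [c for c in clauses if c[0] == float('inf')]
    x = sample_sat(hard, n_vars)          # X(0): satisfies all hard clauses
    samples = []
    for _ in range(num_samples):
        m = []
        for w, lits in clauses:
            satisfied = any(x[abs(l)] == (l > 0) for l in lits)
            # keep each satisfied clause with probability 1 - e^{-w};
            # hard clauses (w = inf) are always kept
            if satisfied and random.random() < 1 - math.exp(-w):
                m.append((w, lits))
        x = sample_sat(m, n_vars)         # X(k): uniform solution of M
        samples.append(dict(x))
    return samples
```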

21 What can you do with MLNs?

22 Entity Resolution
Problem: given a database, find duplicate records.
Predicates: HasToken(token, field, record), SameField(field, record, record), SameRecord(record, record)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
SameField(f,r,r’) => SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
(The “+” notation gives each grounding of the marked variables its own weight, e.g. one weight per token/field combination.)

23 Entity Resolution Can also resolve fields:
Predicates: HasToken(token, field, record), SameField(field, record, record), SameRecord(record, record)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
SameField(f,r,r’) <=> SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
SameField(f,r,r’) ^ SameField(f,r’,r”) => SameField(f,r,r”)
P. Singla & P. Domingos, “Entity Resolution with Markov Logic”, in Proc. ICDM-2006.

24 Hidden Markov Models
obs = { Obs1, …, ObsN }; state = { St1, …, StM }; time = { 0, …, T }
Predicates: State(state!, time), Obs(obs!, time) (the “!” marks an argument with exactly one true grounding per value of the other arguments, e.g. exactly one state per time step)
State(+s,0)
State(+s,t) => State(+s',t+1)
State(+s,t) => State(+s,t+1) [variant we'll use - WC]
Obs(+o,t) => State(+s,t)

25 What did P&D do with MLNs?

26 Information Extraction (simplified)
Problem: extract a database from text or semi-structured sources. Example: extract a database of publications from citation lists (the “CiteSeer problem”). Two steps: (1) segmentation: use an HMM to assign tokens to fields; (2) entity resolution: use logistic regression and transitivity.

27 Motivation for joint extraction and matching

28 Information Extraction (simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
SameField(+f,c,c’) <=> SameCit(c,c’)
SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)

29 Information Extraction (simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
SameField(+f,c,c’) <=> SameCit(c,c’)
SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)
More: H. Poon & P. Domingos, “Joint Inference in Information Extraction”, in Proc. AAAI-2007.

30 Information Extraction (less simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Example groundings of Token(+t,i,c) => InField(i,+f,c), written as clauses:
!Token("aardvark",i,c) v InField(i,"author",c)
!Token("zymurgy",i,c) v InField(i,"author",c)
!Token("zymurgy",i,c) v InField(i,"venue",c)

31 Information Extraction (less simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
=> InField(midpointOfC, "title", c) [computed off-line - WC]

32 Information Extraction (less simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
Center(c,i) => InField(i, "title", c)

33 Information Extraction (less simplified)
Token(+t,i,c) => InField(i,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
=> InField(midpointOfC, "title", c)
Initials tend to appear in the author or venue field:
Token(w,i,c) ^ IsAlphaChar(w) ^ FollowBy(c,i,".") => InField(c,"author",i) v InField(c,"venue",i)
Positions before the last non-venue initial are usually not title or venue:
LastInitial(c,i) ^ LessThan(j,i) => !InField(c,"title",j) ^ !InField(c,"venue",j)
FirstInitial(c,i) ^ LessThan(i,j) => InField(c,"author",j)
Positions after the first “venue keyword” are usually not author or title:
FirstVenueKeyword(c,i) ^ LessThan(i,j) => !InField(c,"author",j) ^ !InField(c,"title",j)

34 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): true if c[i..j] and c’[i’..j’] are both “title-like”, i.e., contain no punctuation and don’t violate the rules above, and are “similar”, i.e., start with the same trigram and end with the same token.
SimilarVenue(c,c’): true if c and c’ don’t contain conflicting venue keywords (e.g., “journal” vs. “proceedings”).

35 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): …
SimilarVenue(c,c’): …
JointInferenceCandidate(c,i,c’): true if the trigram starting at i in c also appears in c’, the trigram is a possible title, and there is punctuation before the trigram in c’ but not in c.

36 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): … SimilarVenue(c,c’): … JointInferenceCandidate(c,i,c’): …
Original rule: [InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)]
Jnt-Seg rule: InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’)) => InField(i+1,+f,c)
Why is this joint? Recall we also have:
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)

37 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): … SimilarVenue(c,c’): … JointInferenceCandidate(c,i,c’): …
Jnt-Seg-ER rule: InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’) ^ SameCitation(c,c’)) => InField(i+1,+f,c)

38 Results: segmentation
Percent error reduction for best joint model

39 Results: matching Fraction of clusters correctly constructed using transitive closure of pairwise decisions. (Chart; baselines: Cora F-S F1 and Cora TFIDF max F1.)

40 William’s summary
MLNs are a compact, elegant way of describing a Markov network. Standard learning methods work, but the network may be very, very large and inference may be expensive. MLNs don’t eliminate feature engineering (e.g., the complicated “feature” predicates above). Experimental results for joint matching/NER are not that strong overall: relative to cascading segmentation and then matching, joint inference improves segmentation, but maybe not matching, and it needs to be carefully restricted (for efficiency?).

41 Bellare & McCallum

42 Outline
Goal: given (DBLP record, citation-text) pairs that do match, learn to segment citations.
Methods: learn a CRF to align the record and the text (sort of like learning an edit distance); then generate alignments and use them as training data for a linear-chain CRF that does segmentation (a.k.a. extraction). This second CRF does not need records to work.

43 Alignment
Notation: alignment features depend on the alignment a and the x’s; extraction features depend on a, y1, and x2.

44 Learning for alignment…
Generalized expectation criterion: rather than minimizing E_data[f] - E_model[f] plus a penalty term on the weights, minimize a weighted squared difference between E_model[f] and p, where p is the user’s prior on the value of the feature. “We simulate user-specified expectation criteria [i.e. p’s] with statistics on manually labeled citation texts”: the top 10 features by mutual information, p discretized into 11 bins, weight w = 10. The model expectation is estimated as a sum of marginal probabilities divided by the size of the variable set.
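A hedged LaTeX reconstruction of such a criterion, following the description above; the exact functional form in Bellare & McCallum may differ, and the symbols λ, σ, and the variable set V are assumptions:

```latex
% GE-style criterion (sketch): penalize squared distance between the
% model expectation of feature f and the user's target p, plus the
% usual Gaussian penalty on the weights theta.
\max_{\theta}\;
  -\,\lambda\,\bigl(p - \mathbb{E}_{\theta}[f]\bigr)^{2}
  \;-\;\frac{\lVert\theta\rVert^{2}}{2\sigma^{2}},
\qquad
\mathbb{E}_{\theta}[f]\;\approx\;
  \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} p_{\theta}(v)
% the second expression is the "sum of marginal probabilities divided
% by the size of the variable set" mentioned on the slide
```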

45 Results On 260 records, 522 record-text pairs

46 Results
Systems compared:
CRF trained with extraction criteria derived from labeled data
Trained on records partially aligned with high-precision rules
Trained on DBLP records
…and also using partial match to DB records at test time
“Gold standard”: hand-labeled extraction data

47 Alignments and expectations
Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998

48 HMM Example
Sample output: x^T = heehahaha, s^T = 122121212
(Two-state HMM diagram: states 1 and 2, transitions Pr(1->2) and Pr(2->1), emission tables:)
Pr(1->x): a 0.3, e 0.5, o 0.2
Pr(2->x): d 0.3, h 0.5, b 0.2

49 HMM Inference
Key point: Pr(s_i = l) depends only on Pr(l’ -> l) and s_{i-1}, so you can propagate probabilities forward through the trellis. (Trellis diagram: states l = 1 … K by time steps t = 1 … T, with observations x1, x2, …, xT.)
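A small Python sketch of this forward propagation; the dictionary-based parameter representation and the example transition values are assumptions:

```python
def forward(init, trans, emit, xs):
    """Forward algorithm sketch: alpha[t][l] = Pr(x_1..x_t, s_t = l).
    Each column depends only on the previous column and Pr(l' -> l),
    so probabilities propagate left to right through the trellis.
    init[l], trans[l_prev][l], emit[l][x] are dictionaries."""
    states = list(init)
    alpha = [{l: init[l] * emit[l].get(xs[0], 0.0) for l in states}]
    for x in xs[1:]:
        prev = alpha[-1]
        alpha.append({l: emit[l].get(x, 0.0) *
                         sum(prev[lp] * trans[lp].get(l, 0.0)
                             for lp in states)
                      for l in states})
    return alpha   # summing the final column gives Pr(x_1..x_T)

# Example with the two-state HMM above (transition values assumed):
# init  = {1: 0.5, 2: 0.5}
# trans = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.5, 2: 0.5}}
# emit  = {1: {'a': 0.3, 'e': 0.5, 'o': 0.2},
#          2: {'d': 0.3, 'h': 0.5, 'b': 0.2}}
# forward(init, trans, emit, "hah")
```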

50 Pair HMM Notation Andrew used “null”

51 Pair HMM Example
(Single-state pair HMM with emission table:)
e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<h,t>  0.05
<-,h>  0.01
...    ...
(Pr(e) sums to 1 over all emissions.)

52 Pair HMM Example
(Same single-state pair HMM; emission table now including <e,-> 0.05:)
e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<e,->  0.05
<h,t>  0.05
<-,h>  0.01
...    ...
(Pr(e) sums to 1 over all emissions.)
Sample run: z^T = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>. Strings x, y produced by z^T: x = heehee, y = teehe. Notice that (x, y) is also produced by z^4 + <e,e>, <e,-> and many other edit strings.
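A tiny Python illustration of how an edit string determines the pair of strings (pairs written as tuples, '-' standing for a gap):

```python
def strings_from_edit_string(z):
    """Recover (x, y) from a pair-HMM emission sequence such as
    [('h','t'), ('e','e'), ('e','-')]; '-' emits no character."""
    x = ''.join(a for a, b in z if a != '-')
    y = ''.join(b for a, b in z if b != '-')
    return x, y

# Example from the slide:
# strings_from_edit_string([('h','t'), ('e','e'), ('e','e'),
#                           ('h','h'), ('e','-'), ('e','e')])
# -> ('heehee', 'teehe')
```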

53 Distances based on pair HMMs

54 Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
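A Python sketch of this DP for a single emitting state; the emission-table representation is an assumption. M[i][j] sums the probabilities of all edit strings generating x[:i] and y[:j], which is exactly why the example above has many edit strings per string pair:

```python
def pair_hmm_forward(x, y, emit):
    """Pair-HMM forward DP (single emitting state).
    emit[(a, b)] is the probability of emitting pair <a, b>;
    emit[(a, '-')] extends x only, emit[('-', b)] extends y only.
    M[i][j] = Pr of generating x[:i] and y[:j]."""
    n, m = len(x), len(y)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # aligned pair <x[i-1], y[j-1]>
                M[i][j] += M[i-1][j-1] * emit.get((x[i-1], y[j-1]), 0.0)
            if i > 0:             # gap in y: <x[i-1], ->
                M[i][j] += M[i-1][j] * emit.get((x[i-1], '-'), 0.0)
            if j > 0:             # gap in x: <-, y[j-1]>
                M[i][j] += M[i][j-1] * emit.get(('-', y[j-1]), 0.0)
    return M[n][m]

# e.g. with the emission table from the slide:
# emit = {('a','a'): 0.10, ('e','e'): 0.10, ('h','h'): 0.10,
#         ('e','-'): 0.05, ('h','t'): 0.05, ('-','h'): 0.01}
# pair_hmm_forward('heehee', 'teehe', emit)
```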

55 Pair HMM Inference
(Trellis diagram: rows v = 1 … K, columns t = 1 … T.)

56 Pair HMM Inference
One difference from the ordinary HMM trellis: after i emissions of a pair HMM, we do not know the column position. (Diagram: the same v-by-t trellis, with cells annotated by emission counts i = 1, 2, 3 showing that one value of i can land in several columns.)

57 Multiple states
(Three-state pair HMM: a substitution state SUB and insertion states IX, IY, each with its own emission table.)
SUB emits aligned pairs, e.g.: <a,a> 0.10, <e,e> …, <h,h> …, <h,t> 0.01, <-,h> …
IX emits pairs with a gap in the second string, e.g.: <a,-> 0.11, <e,-> 0.21, <h,-> …
IY emits pairs with a gap in the first string.

58 An extension: multiple states
Conceptually, add a “state” dimension to the model: one v-by-t trellis layer per state (l = 1: SUB; l = 2: IX; …). EM methods generalize easily to this setting.

