
1 Administrivia What: LTI Seminar When: Friday Oct 30, 2009, 2:00pm - 3:00pm Where: 1305 NSH Faculty Host: Noah Smith Title: SeerSuite: Enterprise Search and Cyberinfrastructure for Science and Academia Speaker: Dr. C. Lee Giles, Pennsylvania State University. Cyberinfrastructure or e-science has become crucial in many areas of science, as data access often defines scientific progress. Open source systems have greatly facilitated the design, implementation, and support of cyberinfrastructure. However, there exists no open source integrated system for building an integrated search engine and digital library that focuses on all phases of information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, table indexing, etc. … Counts for two writeups if you attend!

2 Two Page Status Report on Project – due Wed 11/2 at 9am
This is a chance to tell me how your project is progressing: what you've accomplished, and what you plan to do in the future. There's no fixed format, but here are some things you might discuss. What dataset will you be using? What does it look like (e.g., how many entities are there, how many tokens, etc.)? Looking over the data is always a good first step before you start working with it; what did you do to get acquainted with the data? Do you plan on looking at the same problem, or have you changed your plans? If you plan on writing code, what have you written so far, in what languages, and what do you still need to do? If you plan on using off-the-shelf code, what have you installed, and what experiences have you had with it? If you've run a baseline system on the data and gotten some results, what are they? Are they consistent with what you expected?

3 Brin’s 1998 paper

4 Poon and Domingos – continued! plus Bellare & McCallum
Mostly pilfered from Pedro’s slides

5 Idea: exploit “pattern/relation duality”:
1. Start with some seed instances of (author, title) pairs (e.g., “Isaac Asimov”, “The Robots of Dawn”).
2. Look for occurrences of these pairs on the web.
3. Generate patterns that match heuristically chosen subsets of the occurrences: order, URL prefix, prefix, middle, suffix.
4. Extract new (author, title) pairs that match the patterns.
5. Go to 2. [some workshop, 1998]
Result (the loop alternates Relation → Patterns and Patterns → Relation): 24M web pages + 5 books → 199 occurrences → 3 patterns → 4047 occurrences + 5M pages → 3947 occurrences → 105 patterns → … → 15,257 books.
But: mostly learned “science fiction books”, at least in early rounds; some manual intervention; special regexes for author/title were used.
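To make the loop concrete, here is a rough Python sketch of the bootstrapping cycle. This is an illustrative simplification, not Brin's actual system: the (prefix, middle, suffix) pattern form is assumed, and the order and URL-prefix components of real DIPRE patterns are ignored.

```python
import re

def dipre(seed_pairs, corpus, rounds=3, min_support=2):
    """Sketch of a DIPRE-style bootstrap. `corpus` is a list of
    (url, text) pairs; patterns are simplified to
    (prefix, middle, suffix) string triples."""
    known = set(seed_pairs)
    for _ in range(rounds):
        # Step 2: find occurrences of known (author, title) pairs.
        contexts = {}
        for url, text in corpus:
            for author, title in known:
                i, j = text.find(author), text.find(title)
                if 0 <= i and i + len(author) <= j:
                    pat = (text[max(0, i - 10):i],                    # prefix
                           text[i + len(author):j],                   # middle
                           text[j + len(title):j + len(title) + 10])  # suffix
                    contexts[pat] = contexts.get(pat, 0) + 1
        # Step 3: keep patterns supported by several occurrences.
        patterns = [p for p, n in contexts.items() if n >= min_support]
        # Step 4: extract new pairs matching the patterns.
        for prefix, middle, suffix in patterns:
            rx = re.compile(re.escape(prefix) + r"(.{2,60}?)" +
                            re.escape(middle) + r"(.{2,80}?)" +
                            re.escape(suffix))
            for url, text in corpus:
                for author, title in rx.findall(text):
                    known.add((author, title))
        # Step 5: repeat with the grown relation.
    return known
```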

6 Markov Networks: [Review]
Undirected graphical models. (Example graph over the variables Smoking, Cancer, Asthma, Cough.) Potential functions defined over cliques, e.g. Φ(Smoking, Cancer):

Smoking  Cancer  Φ(S,C)
False    False   4.5
False    True    4.5
True     False   2.7
True     True    4.5
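For review, the joint distribution a Markov network defines, in both the clique-potential form above and the equivalent log-linear form that MLNs build on:

```latex
% Markov network joint distribution: product of clique potentials,
% normalized by the partition function Z; equivalently log-linear
% in features f_i with weights w_i.
P(x) \;=\; \frac{1}{Z}\prod_{k}\Phi_{k}\bigl(x_{\{k\}}\bigr)
      \;=\; \frac{1}{Z}\exp\Bigl(\sum_{i} w_{i}\, f_{i}(x)\Bigr),
\qquad
Z \;=\; \sum_{x'}\prod_{k}\Phi_{k}\bigl(x'_{\{k\}}\bigr)
```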

7 First-Order Logic
Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x, y).
Literal: a predicate or its negation.
Clause: a disjunction of literals.
Grounding: replace all variables by constants. E.g.: Friends(Anna, Bob).
World (model, interpretation): an assignment of truth values to all ground predicates.

8 Markov Logic: Intuition
A logical KB is a set of hard constraints on the set of possible worlds. Let's make them soft constraints: when a world violates a formula, it becomes less probable, not impossible. Give each formula a weight (higher weight → stronger constraint).

9 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B), with weighted formulas 1.5: Smokes(x) => Cancer(x) and 1.1: Friends(x,y) => (Smokes(x) <=> Smokes(y)). (Ground network diagram over the atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).)

10 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). (Ground network diagram over the atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).)

11 Example: Friends & Smokers
(Ground network diagram as before. The ground formula Smokes(Anna) => Cancer(Anna), with weight 1.5, adds a factor over the pair of atoms Smokes(Anna), Cancer(Anna).)

12 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). (Ground network diagram over the same eight atoms.)

13 Example: Friends & Smokers
Two constants: Anna (A) and Bob (B). (Ground network diagram as before. The ground formula Friends(A,B) => (Smokes(A) <=> Smokes(B)), with weight 1.1, adds a factor over the triple of atoms Friends(A,B), Smokes(A), Smokes(B).)

14 Markov Logic Networks
An MLN is a template for ground Markov networks. Probability of a world x:

P(x) = \frac{1}{Z} \exp\Bigl(\sum_i w_i\, n_i(x)\Bigr)

where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
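A minimal Python sketch of this computation; `count_true_groundings` is supplied by the caller and stands in for whatever formula language is used, so it is an assumption, not a real library call:

```python
import math

def world_score(formulas, world, count_true_groundings):
    """Unnormalized probability of a world x under an MLN:
    exp(sum_i w_i * n_i(x)). `formulas` is a list of (weight, formula)
    pairs; `count_true_groundings(formula, world)` implements n_i(x)
    and is a caller-supplied (hypothetical) helper."""
    return math.exp(sum(w * count_true_groundings(f, world)
                        for w, f in formulas))

# P(x) = world_score(formulas, x, n) / Z,
# where Z sums world_score over all possible worlds.
```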

15 Weight Learning
Parameter tying: groundings of the same clause share a weight.
Generative learning: pseudo-likelihood. Discriminative learning: conditional likelihood [like CRFs - but we need to do inference. They use a Collins-like method that computes expectations near a MAP soln. WC]
The likelihood gradient for weight i is

\frac{\partial}{\partial w_i} \log P_w(x) \;=\; n_i(x) - \mathbb{E}_w\bigl[n_i(x)\bigr]

the number of times clause i is true in the data, minus the expected number of times clause i is true according to the MLN.

16 MAP/MPE Inference
Problem: find the most likely state of the world given evidence. This is just the weighted MaxSAT problem. Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997]).

17 The MaxWalkSAT Algorithm
for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if ∑ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p:
            flip a random variable in c
        else:
            flip the variable in c that maximizes ∑ weights(sat. clauses)
return failure, best solution found
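A runnable Python sketch of the algorithm above. The clause encoding (signed integer literals) and the `threshold` convention (the target total satisfied weight) are illustrative assumptions:

```python
import random

def maxwalksat(clauses, n_vars, max_tries=10, max_flips=1000,
               threshold=0.0, p=0.5):
    """Sketch of MaxWalkSAT [Kautz et al., 1997]. A clause is
    (weight, [literals]); literal +v / -v means variable v must be
    True / False. Variables are numbered 1..n_vars."""
    def sat_weight(assign):
        return sum(w for w, lits in clauses
                   if any(assign[abs(l)] == (l > 0) for l in lits))

    best, best_w = None, float('-inf')
    for _ in range(max_tries):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        for _ in range(max_flips):
            if sat_weight(assign) > threshold:
                return assign
            unsat = [lits for w, lits in clauses
                     if not any(assign[abs(l)] == (l > 0) for l in lits)]
            if not unsat:                       # everything satisfied
                return assign
            c = random.choice(unsat)
            if random.random() < p:
                v = abs(random.choice(c))       # random-walk move
            else:                               # greedy move
                v = max((abs(l) for l in c),
                        key=lambda u: sat_weight({**assign, u: not assign[u]}))
            assign[v] = not assign[v]
        w = sat_weight(assign)
        if w > best_w:
            best, best_w = assign, w
    return best                                 # best solution found
```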

18 MAP=WalkSat, Expectations=????
MCMC? Deterministic dependencies break MCMC, and near-deterministic ones make it very slow. Solution: combine MCMC and WalkSAT → the MC-SAT algorithm [Poon & Domingos, 2006]

19 Slice Sampling [Damien et al. 1999]
Slice sampling is a representative of auxiliary-variable methods. Suppose we are given a difficult distribution P(x) with a low-probability region between two modes. This is exactly the type that causes trouble for Gibbs sampling, because it is very slow to traverse that region. Slice sampling gets around the problem. It begins with a random start point. Suppose we have reached the k-th sample x(k). We first sample the next auxiliary value u(k) uniformly from the vertical bar from 0 to P(x(k)). Then we gather all x such that P(x) is above u(k); these constitute the slice. We then sample the next point x(k+1) uniformly from this slice, and the process goes on. With slice sampling, we can jump from one mode to another in a single sampling step, bypassing the low-probability region. (Figure: density P(x) with the slice at height u(k) and the points x(k), x(k+1).)
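A Python sketch of one-dimensional slice sampling. The slide specifies only the two uniform draws; the stepping-out/shrinkage interval construction used here is one standard way to realize "sample uniformly from the slice" and is an assumption:

```python
import math, random

def slice_sample(p, x0, n_samples, width=1.0):
    """Slice-sampling sketch for a 1-D unnormalized density p(x):
    draw u uniformly from (0, p(x)), then draw the next x uniformly
    from the slice {x : p(x) > u}."""
    samples, x = [], x0
    for _ in range(n_samples):
        u = random.uniform(0, p(x))            # vertical slice level
        lo = x - width * random.random()       # randomly placed interval
        hi = lo + width
        while p(lo) > u:                       # step out to the left
            lo -= width
        while p(hi) > u:                       # step out to the right
            hi += width
        while True:                            # shrink on rejection
            x_new = random.uniform(lo, hi)
            if p(x_new) > u:
                x = x_new
                break
            if x_new < x:
                lo = x_new
            else:
                hi = x_new
        samples.append(x)
    return samples

# e.g. a bimodal density with a low-probability region between modes:
# slice_sample(lambda z: math.exp(-z**2) + math.exp(-(z - 4)**2), 0.0, 1000)
```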

20 The MC-SAT Algorithm
X(0) ← a random solution satisfying all hard clauses
for k ← 1 to num_samples do
    M ← Ø
    for all Ci satisfied by X(k-1) do
        with probability 1 - exp(-wi), add Ci to M
    endfor
    X(k) ← a uniformly random solution satisfying M   (using MaxWalkSat)
endfor
So we arrive at the MC-SAT algorithm: simple and elegant. We initialize with a random solution that satisfies all the hard clauses. In each sampling step, we first choose a random subset M of the currently satisfied clauses, taking each clause with probability 1 - e^(-wi). Then the next sample is drawn uniformly from the solutions of M, using "SampleSat": MaxWalkSat + simulated annealing.
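A Python sketch of the loop above. `sample_sat` stands in for the SampleSat solver (MaxWalkSAT + simulated annealing) and is assumed, not implemented here:

```python
import math, random

def mc_sat(clauses, n_vars, num_samples, sample_sat):
    """Sketch of MC-SAT [Poon & Domingos, 2006]. `clauses` is a list of
    (weight, literals) pairs, with weight == float('inf') for hard
    clauses; literal +v / -v means variable v True / False.
    `sample_sat(clause_set, n_vars)` is an assumed helper returning a
    (near-)uniform random solution of the clause set."""
    hard = [c for c in clauses if c[0] == float('inf')]
    x = sample_sat(hard, n_vars)          # X(0): satisfies all hard clauses
    samples = []
    for _ in range(num_samples):
        m = []
        for w, lits in clauses:
            satisfied = any(x[abs(l)] == (l > 0) for l in lits)
            # keep each satisfied clause with probability 1 - e^{-w};
            # hard clauses (w = inf) are always kept
            if satisfied and random.random() < 1 - math.exp(-w):
                m.append((w, lits))
        x = sample_sat(m, n_vars)         # X(k): uniform solution of M
        samples.append(dict(x))
    return samples
```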

21 What can you do with MLNs?

22 Entity Resolution
Problem: given a database, find duplicate records.
Predicates: HasToken(token, field, record), SameField(field, record, record), SameRecord(record, record)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
SameField(f,r,r’) => SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
(The “+” notation gives each grounding of the marked variables its own weight, e.g. one weight per token/field combination.)

23 Entity Resolution Can also resolve fields:
Predicates: HasToken(token, field, record), SameField(field, record, record), SameRecord(record, record)
HasToken(+t,+f,r) ^ HasToken(+t,+f,r’) => SameField(f,r,r’)
SameField(f,r,r’) <=> SameRecord(r,r’)
SameRecord(r,r’) ^ SameRecord(r’,r”) => SameRecord(r,r”)
SameField(f,r,r’) ^ SameField(f,r’,r”) => SameField(f,r,r”)
P. Singla & P. Domingos, “Entity Resolution with Markov Logic”, in Proc. ICDM-2006.

24 Hidden Markov Models
obs = { Obs1, …, ObsN }; state = { St1, …, StM }; time = { 0, …, T }
Predicates: State(state!, time), Obs(obs!, time) (the “!” marks an argument with exactly one true grounding per value of the other arguments, e.g. exactly one state per time step)
State(+s,0)
State(+s,t) => State(+s',t+1)
State(+s,t) => State(+s,t+1) [variant we'll use - WC]
Obs(+o,t) => State(+s,t)

25 What did P&D do with MLNs?

26 Information Extraction (simplified)
Problem: extract a database from text or semi-structured sources. Example: extract a database of publications from citation lists (the “CiteSeer problem”). Two steps: (1) segmentation: use an HMM to assign tokens to fields; (2) entity resolution: use logistic regression and transitivity.

27 Motivation for joint extraction and matching

28 Information Extraction (simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
SameField(+f,c,c’) <=> SameCit(c,c’)
SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)

29 Information Extraction (simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)
SameField(+f,c,c’) <=> SameCit(c,c’)
SameField(f,c,c’) ^ SameField(f,c’,c”) => SameField(f,c,c”)
SameCit(c,c’) ^ SameCit(c’,c”) => SameCit(c,c”)
More: H. Poon & P. Domingos, “Joint Inference in Information Extraction”, in Proc. AAAI-2007.

30 Information Extraction (less simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Example groundings of Token(+t,i,c) => InField(i,+f,c), written as clauses:
!Token("aardvark",i,c) v InField(i,"author",c)
!Token("zymurgy",i,c) v InField(i,"author",c)
!Token("zymurgy",i,c) v InField(i,"venue",c)

31 Information Extraction (less simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(“.”,i,c) <=> InField(i+1,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
=> InField(midpointOfC, "title", c) [computed off-line - WC]

32 Information Extraction (less simplified)
Predicates: Token(token, position, citation), InField(position, field, citation), SameField(field, citation, citation), SameCit(citation, citation)
Token(+t,i,c) => InField(i,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
Center(c,i) => InField(i, "title", c)

33 Information Extraction (less simplified)
Token(+t,i,c) => InField(i,+f,c)
f != f’ => (!InField(i,+f,c) v !InField(i,+f’,c))
Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)
InField(i,+f,c) ^ !HasComma(c,i) => InField(i+1,+f,c)
=> InField(1,"author",c)
=> InField(2,"author",c)
=> InField(midpointOfC, "title", c)
Initials tend to appear in the author or venue field:
Token(w,i,c) ^ IsAlphaChar(w) ^ FollowBy(c,i,".") => InField(c,"author",i) v InField(c,"venue",i)
Positions before the last non-venue initial are usually not title or venue:
LastInitial(c,i) ^ LessThan(j,i) => !InField(c,"title",j) ^ !InField(c,"venue",j)
FirstInitial(c,i) ^ LessThan(i,j) => InField(c,"author",j)
Positions after the first “venue keyword” are usually not author or title:
FirstVenueKeyword(c,i) ^ LessThan(i,j) => !InField(c,"author",j) ^ !InField(c,"title",j)

34 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): true if c[i..j] and c’[i’..j’] are both “title-like”, i.e., contain no punctuation and don’t violate the rules above, and are “similar”, i.e., start with the same trigram and end with the same token.
SimilarVenue(c,c’): true if c and c’ don’t contain conflicting venue keywords (e.g., “journal” vs. “proceedings”).

35 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): …
SimilarVenue(c,c’): …
JointInferenceCandidate(c,i,c’): true if the trigram starting at i in c also appears in c’, the trigram is a possible title, and there is punctuation before the trigram in c’ but not in c.

36 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): … SimilarVenue(c,c’): … JointInferenceCandidate(c,i,c’): …
Original rule: [InField(i,+f,c) ^ !HasPunct(c,i) => InField(i+1,+f,c)]
Jnt-Seg rule: InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’)) => InField(i+1,+f,c)
Why is this joint? Recall we also have:
Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i’,c’) ^ InField(i’,+f,c’) => SameField(+f,c,c’)

37 Information Extraction (less simplified)
SimilarTitle(c,i,j,c’,i’,j’): … SimilarVenue(c,c’): … JointInferenceCandidate(c,i,c’): …
Jnt-Seg-ER rule: InField(i,+f,c) ^ !HasPunct(c,i) ^ (!exists c’: JointInferenceCandidate(c,i,c’) ^ SameCitation(c,c’)) => InField(i+1,+f,c)

38 Results: segmentation
Percent error reduction for best joint model

39 Results: matching Fraction of clusters correctly constructed using transitive closure of pairwise decisions. (Chart; baselines: Cora F-S F1 and Cora TFIDF max F1.)

40 William’s summary
MLNs are a compact, elegant way of describing a Markov network. Standard learning methods work, but the network may be very, very large and inference may be expensive. MLNs don’t eliminate feature engineering (e.g., the complicated “feature” predicates above). Experimental results for joint matching/NER are not that strong overall: relative to cascading segmentation and then matching, joint inference improves segmentation, but maybe not matching, and it needs to be carefully restricted (for efficiency?).

41 Bellare & McCallum

42 Outline
Goal: given (DBLP record, citation-text) pairs that do match, learn to segment citations.
Methods: learn a CRF to align the record and the text (sort of like learning an edit distance); then generate alignments and use them as training data for a linear-chain CRF that does segmentation (a.k.a. extraction). This second CRF does not need records to work.

43 Alignment
Notation: alignment features depend on the alignment a and the x’s; extraction features depend on a, y1, and x2.

44 Learning for alignment…
Generalized expectation criterion: rather than minimizing E_data[f] - E_model[f] plus a penalty term on the weights, minimize a weighted squared difference between E_model[f] and p, where p is the user’s prior on the value of the feature. “We simulate user-specified expectation criteria [i.e. p’s] with statistics on manually labeled citation texts”: the top 10 features by mutual information, p discretized into 11 bins, weight w = 10. The model expectation is estimated as a sum of marginal probabilities divided by the size of the variable set.
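A hedged LaTeX reconstruction of such a criterion, following the description above; the exact functional form in Bellare & McCallum may differ, and the symbols λ, σ, and the variable set V are assumptions:

```latex
% GE-style criterion (sketch): penalize squared distance between the
% model expectation of feature f and the user's target p, plus the
% usual Gaussian penalty on the weights theta.
\max_{\theta}\;
  -\,\lambda\,\bigl(p - \mathbb{E}_{\theta}[f]\bigr)^{2}
  \;-\;\frac{\lVert\theta\rVert^{2}}{2\sigma^{2}},
\qquad
\mathbb{E}_{\theta}[f]\;\approx\;
  \frac{1}{|\mathcal{V}|}\sum_{v \in \mathcal{V}} p_{\theta}(v)
% the second expression is the "sum of marginal probabilities divided
% by the size of the variable set" mentioned on the slide
```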

45 Results On 260 records, 522 record-text pairs

46 Results
Systems compared:
CRF trained with extraction criteria derived from labeled data
Trained on records partially aligned with high-precision rules
Trained on DBLP records
…and also using partial match to DB records at test time
“Gold standard”: hand-labeled extraction data

47 Alignments and expectations
Simplified version of the idea: from Learning String Edit Distance, Ristad and Yianilos, PAMI 1998

48 HMM Example
Sample output: x^T = heehahaha, s^T = 122121212
(Two-state HMM diagram: states 1 and 2, transitions Pr(1->2) and Pr(2->1), emission tables:)
Pr(1->x): a 0.3, e 0.5, o 0.2
Pr(2->x): d 0.3, h 0.5, b 0.2

49 HMM Inference
Key point: Pr(s_i = l) depends only on Pr(l’ -> l) and s_{i-1}, so you can propagate probabilities forward through the trellis. (Trellis diagram: states l = 1 … K by time steps t = 1 … T, with observations x1, x2, …, xT.)
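A small Python sketch of this forward propagation; the dictionary-based parameter representation and the example transition values are assumptions:

```python
def forward(init, trans, emit, xs):
    """Forward algorithm sketch: alpha[t][l] = Pr(x_1..x_t, s_t = l).
    Each column depends only on the previous column and Pr(l' -> l),
    so probabilities propagate left to right through the trellis.
    init[l], trans[l_prev][l], emit[l][x] are dictionaries."""
    states = list(init)
    alpha = [{l: init[l] * emit[l].get(xs[0], 0.0) for l in states}]
    for x in xs[1:]:
        prev = alpha[-1]
        alpha.append({l: emit[l].get(x, 0.0) *
                         sum(prev[lp] * trans[lp].get(l, 0.0)
                             for lp in states)
                      for l in states})
    return alpha   # summing the final column gives Pr(x_1..x_T)

# Example with the two-state HMM above (transition values assumed):
# init  = {1: 0.5, 2: 0.5}
# trans = {1: {1: 0.5, 2: 0.5}, 2: {1: 0.5, 2: 0.5}}
# emit  = {1: {'a': 0.3, 'e': 0.5, 'o': 0.2},
#          2: {'d': 0.3, 'h': 0.5, 'b': 0.2}}
# forward(init, trans, emit, "hah")
```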

50 Pair HMM Notation Andrew used “null”

51 Pair HMM Example
(Single-state pair HMM with emission table:)
e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<h,t>  0.05
<-,h>  0.01
...    ...
(Pr(e) sums to 1 over all emissions.)

52 Pair HMM Example
(Same single-state pair HMM; emission table now including <e,-> 0.05:)
e      Pr(e)
<a,a>  0.10
<e,e>  0.10
<h,h>  0.10
<e,->  0.05
<h,t>  0.05
<-,h>  0.01
...    ...
(Pr(e) sums to 1 over all emissions.)
Sample run: z^T = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>. Strings x, y produced by z^T: x = heehee, y = teehe. Notice that (x, y) is also produced by z^4 + <e,e>, <e,-> and many other edit strings.
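A tiny Python illustration of how an edit string determines the pair of strings (pairs written as tuples, '-' standing for a gap):

```python
def strings_from_edit_string(z):
    """Recover (x, y) from a pair-HMM emission sequence such as
    [('h','t'), ('e','e'), ('e','-')]; '-' emits no character."""
    x = ''.join(a for a, b in z if a != '-')
    y = ''.join(b for a, b in z if b != '-')
    return x, y

# Example from the slide:
# strings_from_edit_string([('h','t'), ('e','e'), ('e','e'),
#                           ('h','h'), ('e','-'), ('e','e')])
# -> ('heehee', 'teehe')
```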

53 Distances based on pair HMMs

54 Pair HMM Inference Dynamic programming is possible: fill out matrix left-to-right, top-down
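A Python sketch of this DP for a single emitting state; the emission-table representation is an assumption. M[i][j] sums the probabilities of all edit strings generating x[:i] and y[:j], which is exactly why the example above has many edit strings per string pair:

```python
def pair_hmm_forward(x, y, emit):
    """Pair-HMM forward DP (single emitting state).
    emit[(a, b)] is the probability of emitting pair <a, b>;
    emit[(a, '-')] extends x only, emit[('-', b)] extends y only.
    M[i][j] = Pr of generating x[:i] and y[:j]."""
    n, m = len(x), len(y)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # aligned pair <x[i-1], y[j-1]>
                M[i][j] += M[i-1][j-1] * emit.get((x[i-1], y[j-1]), 0.0)
            if i > 0:             # gap in y: <x[i-1], ->
                M[i][j] += M[i-1][j] * emit.get((x[i-1], '-'), 0.0)
            if j > 0:             # gap in x: <-, y[j-1]>
                M[i][j] += M[i][j-1] * emit.get(('-', y[j-1]), 0.0)
    return M[n][m]

# e.g. with the emission table from the slide:
# emit = {('a','a'): 0.10, ('e','e'): 0.10, ('h','h'): 0.10,
#         ('e','-'): 0.05, ('h','t'): 0.05, ('-','h'): 0.01}
# pair_hmm_forward('heehee', 'teehe', emit)
```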

55 Pair HMM Inference
(Trellis diagram: rows v = 1 … K, columns t = 1 … T.)

56 Pair HMM Inference
One difference from the ordinary HMM trellis: after i emissions of a pair HMM, we do not know the column position. (Diagram: the same v-by-t trellis, with cells annotated by emission counts i = 1, 2, 3 showing that one value of i can land in several columns.)

57 Multiple states
(Three-state pair HMM: a substitution state SUB and insertion states IX, IY, each with its own emission table.)
SUB emits aligned pairs, e.g.: <a,a> 0.10, <e,e> …, <h,h> …, <h,t> 0.01, <-,h> …
IX emits pairs with a gap in the second string, e.g.: <a,-> 0.11, <e,-> 0.21, <h,-> …
IY emits pairs with a gap in the first string.

58 An extension: multiple states
Conceptually, add a “state” dimension to the model: one v-by-t trellis layer per state (l = 1: SUB; l = 2: IX; …). EM methods generalize easily to this setting.

