Yang Yang*, Yizhou Sun+, Jie Tang*, Bo Ma#, and Juanzi Li*: Entity Matching across Heterogeneous Sources. *Tsinghua University, +Northeastern University.


1 Yang Yang*, Yizhou Sun+, Jie Tang*, Bo Ma#, and Juanzi Li*: Entity Matching across Heterogeneous Sources. *Tsinghua University, +Northeastern University, #Carnegie Mellon University. Data and code available at: http://arnetminer.org/document-match/

2 Motivation: Given an entity in a source domain, we aim to find its matched entities in the target domain. Example: product-patent matching.

3 Problem. Input 1: a dual-source corpus $\{C_1, C_2\}$, where $C_t = \{d_1, d_2, \dots, d_n\}$ is a collection of entities. Input 2: a matching relation matrix $L$, where $L_{ij} = 1$ if $d_i$ and $d_j$ are matched, $L_{ij} = 0$ if they are not matched, and $L_{ij} = {?}$ if the relation is unknown.
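As a minimal sketch of Input 2, the matching relation matrix can be held as a nested list in which `None` stands for the unknown ("?") entries. The entity names, the `build_relation_matrix` helper, and the choice of rows for the source collection and columns for the target collection are all assumptions made for illustration; none of them come from the slides.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical entity ids, used only for illustration.
source_entities = ["product_a", "product_b"]
target_entities = ["patent_x", "patent_y", "patent_z"]

# Known labels only: 1 = matched, 0 = not matched.
# Pairs absent from this dict are treated as unknown ("?").
known_labels: Dict[Tuple[str, str], int] = {
    ("product_a", "patent_x"): 1,
    ("product_a", "patent_y"): 0,
    ("product_b", "patent_z"): 1,
}


def build_relation_matrix(
    sources: List[str],
    targets: List[str],
    labels: Dict[Tuple[str, str], int],
) -> List[List[Optional[int]]]:
    """Return L as nested lists; None marks an unknown relation."""
    return [[labels.get((s, t)) for t in targets] for s in sources]


L = build_relation_matrix(source_entities, target_entities, known_labels)
```

Here `L[0][0]` is 1 (matched), `L[0][1]` is 0 (not matched), and `L[0][2]` is `None` (unknown), mirroring the three cases in the definition.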

4 Challenges. 1. The two domains have little or no overlap in content (daily expression vs. professional expression).

5 Challenges. 1. The two domains have little or no overlap in content. 2. How to model the topic-level relevance probability?

6 Cross-Source Topic Model

7 Basic Assumption: For entities from different sources, their matching relations and hidden topics influence each other. Question: how can the known matching relations be leveraged to link the hidden topic spaces of the two sources?

8 Cross-Sampling. Step 1: d_1 and d_2 are matched.

9 Cross-Sampling. Sample a new term w_1 for d_1. Step 2: Toss a coin c; if c = 0, sample w_1's topic according to d_1.

10 Cross-Sampling. Sample a new term w_1 for d_1. Step 3: Otherwise, sample w_1's topic according to d_2.
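The three cross-sampling steps can be sketched as a small generative routine. This is an illustration under stated assumptions, not the authors' implementation: the names `cross_sample_term`, `theta_d1`, `theta_d2`, `topic_word`, and `coin_p` are hypothetical, the coin bias is left as a free parameter, and drawing the word from a per-topic word distribution is the usual topic-model convention rather than something stated on the slides.

```python
import random
from typing import List, Sequence, Tuple


def sample_index(probs: Sequence[float], rng: random.Random) -> int:
    """Draw an index from a discrete distribution given as weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def cross_sample_term(
    theta_d1: Sequence[float],   # topic distribution of d1 (assumed given)
    theta_d2: Sequence[float],   # topic distribution of the matched d2
    topic_word: List[Sequence[float]],  # assumed per-topic word distributions
    coin_p: float,               # assumed probability that the coin shows c = 1
    rng: random.Random,
) -> Tuple[int, int, int]:
    """Sample one term for d1, where d1 and d2 are known to be matched."""
    # Toss a coin c.
    c = 1 if rng.random() < coin_p else 0
    # If c = 0, use d1's own topic distribution; otherwise use d2's.
    theta = theta_d1 if c == 0 else theta_d2
    z = sample_index(theta, rng)
    # Assumption: the word is then drawn from the chosen topic.
    w = sample_index(topic_word[z], rng)
    return c, z, w


# Toy setup: d1 only uses topic 0, d2 only uses topic 1.
rng = random.Random(0)
theta_d1 = [1.0, 0.0]
theta_d2 = [0.0, 1.0]
topic_word = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
samples = [
    cross_sample_term(theta_d1, theta_d2, topic_word, 0.5, rng)
    for _ in range(200)
]
```

In this toy setup the sampled topic always equals the coin outcome, which makes it easy to see how a matched entity's topics leak into d_1's terms whenever c = 1.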

11 Cross-Source Topic Model: Step 1 and Step 2 (step details not preserved in this transcript).

12 Model Learning: Variational EM. The slide lists the model parameters, the variational parameters, the E-step, and the M-step (formulas not preserved in this transcript).

13 Experiments. Task I: Product-patent matching. Task II: Cross-lingual matching.

14 Task I: Product-Patent Matching. Given a Wiki article describing a product, find all patents relevant to the product. Data set: 13,085 Wiki articles; 15,00 patents from USPTO; 1,060 matching relations in total.

15 Experimental Results. Training: 30% of the matching relations, randomly chosen. Compared methods: Content Similarity based on LDA (CS+LDA): cosine similarity between two articles' topic distributions extracted by LDA. Random Walk based on LDA (RW+LDA): random walk on a graph whose edges indicate the topic similarity between articles. Relational Topic Model (RTM): generally used to model links between documents. Random Walk based on CST (RW+CST): the same as RW+LDA, but uses CST instead of LDA.
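The CS+LDA baseline scores a pair of articles by the cosine similarity of their topic distributions. A minimal sketch of that scoring step follows; the topic vectors are made-up numbers standing in for LDA output, and `cosine_similarity` and `rank_candidates` are hypothetical helper names, not part of the released code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two equal-length topic distributions."""
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_p * norm_q)


def rank_candidates(
    query_topics: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return candidates sorted by decreasing similarity to the query."""
    scored = [
        (name, cosine_similarity(query_topics, topics))
        for name, topics in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up topic distributions over K = 3 topics.
product_topics = [0.7, 0.2, 0.1]
patent_topics = {
    "patent_x": [0.6, 0.3, 0.1],
    "patent_y": [0.1, 0.1, 0.8],
}
ranking = rank_candidates(product_topics, patent_topics)
```

With these numbers `patent_x` ranks above `patent_y`, since its topic mixture is much closer to the product's.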

16 Task II: Cross-lingual Matching. Given an English Wiki article, we aim to find a Chinese article reporting the same content. Data set: 2,000 English articles from Wikipedia; 2,000 Chinese articles from Baidu Baike; each English article corresponds to one Chinese article.

17 Experimental Results. Training: 3-fold cross validation. Compared methods: Title Only: considers only the (translated) titles of articles. SVM-S: a well-known cross-lingual Wikipedia matching toolkit. LFG [1]: mainly considers the structural information of Wiki articles. LFG+LDA: adds a content feature (topic distributions) to LFG by employing LDA. LFG+CST: adds a content feature to LFG by employing CST. [1] Zhichun Wang, Juanzi Li, Zhigang Wang, and Jie Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. WWW'12, pp. 459-468.
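The slides do not say how the three folds were formed. One common way to set up 3-fold cross validation over the 2,000 article pairs is sketched below; the `k_fold_splits` helper, the fixed seed, and the use of integer ids for the pairs are assumptions for illustration only.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(
    items: Sequence[int], k: int = 3, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) lists for each of k shuffled folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal items round-robin so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical ids for the 2,000 English-Chinese article pairs.
pair_ids = list(range(2000))
splits = list(k_fold_splits(pair_ids, k=3, seed=0))
```

Each pair appears in exactly one test fold, so every matching relation is evaluated once while the other two folds serve as training data.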

18 Parameter Analysis. Figure panels: (a) Number of topics K; (b) Ratio; (c) Precision; (d) Convergence analysis.

19 Apple vs. Samsung: Topics highly relevant to both Apple and Samsung, as found by CST. (Topic titles are hand-labeled.)

20 Conclusion. We study the problem of entity matching across heterogeneous sources. We propose the cross-source topic model (CST), which integrates topic extraction and entity matching into a unified framework. We conduct two experimental tasks to demonstrate the effectiveness of CST.

21 Thank You!

22 Problem: Given an entity in a source domain, we aim to find its matched entities in the target domain. Examples: given a textual description of a product, find related patents in a patent database; given an English Wiki page, find related Chinese Wiki pages; given a specific disease, find all related drugs.

