1 Yang Yang *, Yizhou Sun +, Jie Tang *, Bo Ma #, and Juanzi Li * Entity Matching across Heterogeneous Sources *Tsinghua University + Northeastern University.

Slides:



Advertisements
Similar presentations
The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Advertisements

Suleyman Cetintas 1, Monica Rogati 2, Luo Si 1, Yi Fang 1 Identifying Similar People in Professional Social Networks with Discriminative Probabilistic.
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Linked data: P redicting missing properties Klemen Simonic, Jan Rupnik, Primoz Skraba {klemen.simonic, jan.rupnik,
Linking Named Entity in Tweets with Knowledge Base via User Interest Modeling Date : 2014/01/22 Author : Wei Shen, Jianyong Wang, Ping Luo, Min Wang Source.
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
1 Multi-topic based Query-oriented Summarization Jie Tang *, Limin Yao #, and Dewei Chen * * Dept. of Computer Science and Technology Tsinghua University.
Intelligent Systems Lab. Recognizing Human actions from Still Images with Latent Poses Authors: Weilong Yang, Yang Wang, and Greg Mori Simon Fraser University,
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
1 Social Influence Analysis in Large-scale Networks Jie Tang 1, Jimeng Sun 2, Chi Wang 1, and Zi Yang 1 1 Dept. of Computer Science and Technology Tsinghua.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
1 Yuxiao Dong *$, Jie Tang $, Sen Wu $, Jilei Tian # Nitesh V. Chawla *, Jinghai Rao #, Huanhuan Cao # Link Prediction and Recommendation across Multiple.
Generative Topic Models for Community Analysis
Statistical Models for Networks and Text Jimmy Foulds UCI Computer Science PhD Student Advisor: Padhraic Smyth.
Learning to Advertise. Introduction Advertising on the Internet = $$$ –Especially search advertising and web page advertising Problem: –Selecting ads.
MANISHA VERMA, VASUDEVA VARMA PATENT SEARCH USING IPC CLASSIFICATION VECTORS.
1 Heterogeneous Cross Domain Ranking in Latent Space Bo Wang 1, Jie Tang 2, Wei Fan 3, Songcan Chen 1, Zi Yang 2, Yanzhu Liu 4 1 Nanjing University of.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.
Query session guided multi- document summarization THESIS PRESENTATION BY TAL BAUMEL ADVISOR: PROF. MICHAEL ELHADAD.
Dongyeop Kang1, Youngja Park2, Suresh Chari2
1 Yang Yang, Jie Tang, Juanzi Li Tsinghua University Walter Luyten, Marie-Francine Moens Katholieke Universiteit Leuven Lu Liu Northwestern University.
Multilingual Synchronization focusing on Wikipedia
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
1 A Discriminative Approach to Topic- Based Citation Recommendation Jie Tang and Jing Zhang Presented by Pei Li Knowledge Engineering Group, Dept. of Computer.
Correlated Topic Models By Blei and Lafferty (NIPS 2005) Presented by Chunping Wang ECE, Duke University August 4 th, 2006.
Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large- scale Data Collections Xuan-Hieu PhanLe-Minh NguyenSusumu Horiguchi GSIS,
Mining Cross-network Association for YouTube Video Promotion Ming Yan, Jitao Sang, Changsheng Xu*. 1 Institute of Automation, Chinese Academy of Sciences,
Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling Shoushan Li †‡ Rong Wang † Huanhuan Liu † Chu-Ren Huang.
A MIXED MODEL FOR CROSS LINGUAL OPINION ANALYSIS Lin Gui, Ruifeng Xu, Jun Xu, Li Yuan, Yuanlin Yao, Jiyun Zhou, Shuwei Wang, Qiaoyun Qiu, Ricky Chenug.
Interoperable Visualization Framework towards enhancing mapping and integration of official statistics Haitham Zeidan Palestinian Central.
DOCUMENT UPDATE SUMMARIZATION USING INCREMENTAL HIERARCHICAL CLUSTERING CIKM’10 (DINGDING WANG, TAO LI) Advisor: Koh, Jia-Ling Presenter: Nonhlanhla Shongwe.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
 Goal recap  Implementation  Experimental Results  Conclusion  Questions & Answers.
LOGO Identifying Opinion Leaders in the Blogosphere Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng CIKM 2007 Advisor : Dr. Koh Jia-Ling Speaker : Tu.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
1 Yang Yang *, Yizhou Sun +, Jie Tang *, Bo Ma #, and Juanzi Li * Entity Matching across Heterogeneous Sources *Tsinghua University + Northeastern University.
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
Towards Total Scene Understanding: Classification, Annotation and Segmentation in an Automatic Framework N 工科所 錢雅馨 2011/01/16 Li-Jia Li, Richard.
Link Distribution on Wikipedia [0407]KwangHee Park.
Topic-Factorized Ideal Point Estimation Model for Legislative Voting Network Yupeng Gu †, Yizhou Sun †, Ning Jiang ‡, Bingyu Wang †, Ting Chen † † Northeastern.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,
Leveraging Knowledge Bases for Contextual Entity Exploration Categories Date:2015/09/17 Author:Joonseok Lee, Ariel Fuxman, Bo Zhao, Yuanhua Lv Source:KDD'15.
Multilingual Information Retrieval using GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.
University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia 전자전기컴퓨터공학 부 USN 연구실 G
Link Distribution in Wikipedia [0324] KwangHee Park.
Multi-Modal Bayesian Embeddings for Learning Social Knowledge Graphs Zhilin Yang 12, Jie Tang 1, William W. Cohen 2 1 Tsinghua University 2 Carnegie Mellon.
Topic Modeling for Short Texts with Auxiliary Word Embeddings
MINING DEEP KNOWLEDGE FROM SCIENTIFIC NETWORKS
Xiang Li,1 Lili Mou,1 Rui Yan,2 Ming Zhang1
Cross-lingual Knowledge Linking Across Wiki Knowledge Bases
Lecture 24: NER & Entity Linking
Research at Open Systems Lab IIIT Bangalore
Navi 下一步工作的设想 郑 亮 6.6.
Adaptive entity resolution with human computation
MEgo2Vec: Embedding Matched Ego Networks for User Alignment Across Social Networks Jing Zhang+, Bo Chen+, Xianming Wang+, Fengmei Jin+, Hong Chen+, Cuiping.
Web Mining Department of Computer Science and Engg.
Example: Academic Search
Actively Learning Ontology Matching via User Interaction
Topic: Semantic Text Mining
Presentation transcript:

1 Yang Yang *, Yizhou Sun +, Jie Tang *, Bo Ma #, and Juanzi Li * Entity Matching across Heterogeneous Sources *Tsinghua University + Northeastern University Data&Code available at: # Carnegie Mellon University

2 Apple Inc. VS Samsung Co. A patent infringement suit starts from –Lasts 2 years, involves $158+ million and 10 countries. –7 / patents are involved. SAMSUNG devices accused by APPLE. Apple’s patent How to find patents relevant to a specific product? How to find patents relevant to a specific product?

3 Cross-Source Entity Matching Given an entity in a source domain, we aim to find its matched entities from target domain. –Product-patent matching; –Cross-lingual matching; –Drug-disease matching. Product-Patent matching

4 Problem C1C1 C2C2 {C 1, C 2 }, where C t ={d 1, d 2, …, d n } is a collection of entities L ij = 1, d i and d j are matched 0, not matched ?, unknown Input 2: Matching relation matrix Input 1: Dual source corpus

5 Two domains have less or no overlapping in content Challenges 1 1 Daily expression vs Professional expression

6 Two domains have less or no overlapping in content Challenges 1 1 How to model the topic- level relevance probability 2 2 ???

7 Cross-Source Topic Model Our Approach

8 Rank Baseline Ranking candidates by topic similarity Topic extraction query Little-overlapping content -> disjoint topic space

9 Cross-Sampling Toss a coin C If C=1, sample topics according to d n ’s topic distribution If C=0, sample topics according to the topic distribution of d’ m d n is matched with d’ m How latent topics influence matching relations? Bridge topic space by leveraging known matching relations.

10 Inferring Matching Relation Infer matching relations by leveraging extracted topics.

11 Cross-Source Topic Model Step 1: Step 2: Latent topicsMatching relations

12 Model Learning Variational EM –Model parameters: –Variational parameters: –E-step: –M-step:

13 Task I: Product-patent matching Task II: Cross-lingual matching Experiments

14 Task I: Product-Patent Matching Given a Wiki article describing a product, finding all patents relevant to the product. Data set: –13,085 Wiki articles; –15,000 patents from USPTO; –1,060 matching relations in total.

15 Experimental Results Training : 30% of the matching relations randomly chosen. Content Similarity based on LDA (CS+LDA): cosine similarity between two entities’ topic distribution extracted by LDA. Random Walk based on LDA (RW+LDA): random walk on a graph where edges indicate the hyperlinks between Wiki articles and citations between patents. Relational Topic Model (RTM): used to model links between documents. Random Walk based on CST (RW+CST): uses CST instead of LDA comparing with RW+LDA.

16 Task II: Cross-lingual Matching Given an English Wiki article,we aim to find a Chinese article reporting the same content. Data set: –2,000 English articles from Wikipedia; –2,000 Chinese articles from Baidu Baike; –Each English article corresponds to one Chinese article.

17 Experimental Results Training: 3-fold cross validation Title Only: only considers the (translated) title of articles. SVM-S: famous cross-lingual Wikipedia matching toolkit. LFG [1] : mainly considers the structural information of Wiki articles. LFG+LDA: adds content feature (topic distributions) to LFG by employing LDA. LFG+CST: adds content feature to LFG by employing CST. [1] Zhichun Wang, Juanzi Li, Zhigang Wang, and Jie Tang. Cross-lingual Knowledge Linking Across Wiki Knowledge Bases. WWW'12. pp

18 Topics Relevant to Apple and Samsung (Topic titles are hand-labeled)

19 Prototype System competitor Radar Chart: topic comparison Basic information comparison: #patents, business area, industry, founded year, etc.

20 Conclusion Study the problem of entity matching across heterogeneous sources. Propose the cross-source topic model, which integrates the topic extraction and entity matching into a unified framework. Conduct two experimental tasks to demonstrate the effectiveness of CST.

21 Yang Yang *, Yizhou Sun +, Jie Tang *, Bo Ma #, and Juanzi Li * Entity Matching across Heterogeneous Sources *Tsinghua University + Northeastern University Data&Code available at: # Carnegie Mellon University Thank You!

22 Apple Inc. VS Samsung Co. A patent infringement lawsuit starts from –Nexus S, Epic 4G, Galaxy S 4G, and the Samsung Galaxy Tab, infringed on Apple’s intellectual property: its patents, trademarks, user interface and style. –Lasts over 2 years, involves $158+ million. How to find patents relevant to a specific product?

23 Problem Given an entity in a source domain, we aim to find its matched entities from target domain. –Given a textural description of a product, finding related patents in a patent database. –Given an English Wiki page, finding related Chinese Wiki pages. –Given a specific disease, finding all related drugs.

24 Basic Assumption For entities from different sources, their matching relations and hidden topics are influenced by each other. How to leverage the known matching relations to help link hidden topic spaces of two sources?

25 Cross-Sampling d 1 and d 2 are matched … 1 1

26 Cross-Sampling Sample a new term w 1 for d Toss a coin c, if c=0, sample w 1 ’s topic according to d 1

27 Cross-Sampling Sample a new term w 1 for d Otherwise sample w 1 ’s topic according to d 2

28 Parameter Analysis (a) Number of topics K(b) Ratio (c) Precision(d) Convergence analysis

29 Experimental Results Training : 30% of the matching relations randomly chosen. Content Similarity based on LDA (CS+LDA): cosine similarity between two articles’ topic distribution extracted by LDA. Random Walk based on LDA (RW+LDA): random walk on a graph where edges indicate the hyperlinks between Wiki articles and citations between patents. Relational Topic Model (RTM): used to model links between documents. Random Walk based on CST (RW+CST): uses CST instead of LDA comparing with RW+LDA.

30 Experimental Results Training : 30% of the matching relations randomly chosen. Content Similarity based on LDA (CS+LDA): cosine similarity between two articles’ topic distribution extracted by LDA. Random Walk based on LDA (RW+LDA): random walk on a graph where edges indicate the hyperlinks between Wiki articles and citations between patents. Relational Topic Model (RTM): used to model links between documents. Random Walk based on CST (RW+CST): uses CST instead of LDA comparing with RW+LDA.