Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.

Slides:



Advertisements
Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Advertisements

Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Chapter 5: Introduction to Information Retrieval
INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Introduction to Information Retrieval
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
1/1/ A Knowledge-based Approach to Citation Extraction Min-Yuh Day 1,2, Tzong-Han Tsai 1,3, Cheng-Lung Sung 1, Cheng-Wei Lee 1, Shih-Hung Wu 4, Chorng-Shyong.
Case Study: BibFinder BibFinder: A popular CS bibliographic mediator –Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect,
DISTRIBUTED COMPUTING & MAP REDUCE CS16: Introduction to Data Structures & Algorithms Thursday, April 17,
The Search Engine Architecture CSCI 572: Information Retrieval and Search Engines Summer 2010.
An Approach to Evaluate Data Trustworthiness Based on Data Provenance Department of Computer Science Purdue University.
Search Engines and Information Retrieval
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
MANISHA VERMA, VASUDEVA VARMA PATENT SEARCH USING IPC CLASSIFICATION VECTORS.
Introduction to Databases CIS 5.2. Where would you find info about yourself stored in a computer? College Physician’s office Library Grocery Store Dentist’s.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.
MARS: Applying Multiplicative Adaptive User Preference Retrieval to Web Search Zhixiang Chen & Xiannong Meng U.Texas-PanAm & Bucknell Univ.
Aparna Kulkarni Nachal Ramasamy Rashmi Havaldar N-grams to Process Hindi Queries.
Alert Correlation for Extracting Attack Strategies Authors: B. Zhu and A. A. Ghorbani Source: IJNS review paper Reporter: Chun-Ta Li ( 李俊達 )
Search Engines and Information Retrieval Chapter 1.
Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.
Online Autonomous Citation Management for CiteSeer CSE598B Course Project By Huajing Li.
PageRank for Product Image Search Kevin Jing (Googlc IncGVU, College of Computing, Georgia Institute of Technology) Shumeet Baluja (Google Inc.) WWW 2008.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
Exploring Online Social Activities for Adaptive Search Personalization CIKM’10 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li Presented by: Rick Knowles 7 April 2005.
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Now, please open your book to page 60, and let’s talk about chapter 9: How Data is Stored.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Semantic Wordfication of Document Collections Presenter: Yingyu Wu.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
Excel 2007 Part (3) Dr. Susan Al Naqshbandi
Session 1 Module 1: Introduction to Data Integrity
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
Bloom Cookies: Web Search Personalization without User Tracking Authors: Nitesh Mor, Oriana Riva, Suman Nath, and John Kubiatowicz Presented by Ben Summers.
October 1, 2013Computer Vision Lecture 9: From Edges to Contours 1 Canny Edge Detector However, usually there will still be noise in the array E[i, j],
A System for Automatic Personalized Tracking of Scientific Literature on the Web Tzachi Perlstein Yael Nir.
LexPageRank: Prestige in Multi-Document Text Summarization Gunes Erkan, Dragomir R. Radev (EMNLP 2004)
1 CS 430: Information Discovery Lecture 5 Ranking.
Microsoft Office 2013 Try It! Chapter 4 Storing Data in Access.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
ENHANCING CLUSTER LABELING USING WIKIPEDIA David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab SIGIR’09.
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
September 2003, 7 th EDG Conference, Heidelberg – Roberta Faggian, CERN/IT CERN – European Organization for Nuclear Research The GRACE Project GRid enabled.
Data Mining What is to be done before we get to Data Mining?
General Architecture of Retrieval Systems 1Adrienn Skrop.
BRIEF INTRODUCTION: AND EPID 745, April 3, 2012 Xiaoguang Ma.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
1 Yoel Kortick Senior Librarian Working with the Alma Community Zone and Electronic Resources.
Data Integrity & Indexes / Session 1/ 1 of 37 Session 1 Module 1: Introduction to Data Integrity Module 2: Introduction to Indexes.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
OUTLINE Basic ideas of traditional retrieval systems
Data Mining Chapter 6 Search Engines
The Search Engine Architecture
Presentation transcript:

Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li

Outline Introduction Matching Citations and documents Learning from Observations Cluster Repair Evaluation

Introduction In research repositories, citations represent important knowledge regarding work contexts. The citation relationships form a data structure generally known as a “citation graph”, where documents are vertices and citations are directed edges between citing and cited documents. The methods to construction citation graphs –manual information extraction –autonomous citation indexing (ACI)

Introduction Popular ACI systems –The CiteSeer Digital Library (a collection of over 725,000 documents with over 8 million citations ) –Google Scholar (433 million document and citation records ) Typical ACI process –Extract citations from research papers –Parse subfields to build accurate metadata for each citation –Link citations to documents

Introduction Typical problems in the ACI process –The citation parsers error-prone and often produce noisy results –Errors in the citation text (such as typos) –Identity uncertainty in document matching –For an automatic DL system, the identity of documents is uncertain (canonical metadata of the document can be incomplete or inaccurate) In such cases, citation metadata can be used to correct the canonical metadata of documents

Our Research Goals Provide better document metadata Reduce the cost of maintenance Allow the development of flexible APIs into CiteSeer citation graph system Maintain data security despite an open, wiki-like approach to user-contributed metadata changes Provide better citation matching compared to the current system

Matching Citations and documents Current offline approach –Citations are grouped according to their extracted metadata –The citation group is linked to a real document in the repository (exist inside the ACI system and yet not collected)

Matching Citations and documents Remember citations are themselves documents Treat citations and documents differently brings a lot of unnecessary complications into the system Citations pointing to a document in the ACI system can be represented by the document’s identity To represent the document which a citation points to and not in the current system, we use the notion of “virtual document”, which takes on the extracted metadata of the citation.

Matching Citations and documents Once the document enters the system, the corresponding virtual record is then updated with a pointer to the document file, making it a “real” document record. There are no citation edges pointing to an external unknown resource. All edges are internal in the document database and “real” and “virtual” documents can be searched in the same index space. We use Lucene to match documents online.

Learning from Observations A problem of generating beliefs in the identity of a document based on observational evidence. records may be linked with many information sources –Extracted document metadata –Extracted citations –External records (from DBLP, ACM) –User correction We focus on metadata elements with small variability in correct representations, such as names, titles, dates, etc.

Learning from Observations We use Bayesian Belief network to construct canonical metadata –Decide the canonical value X from all observations on X. –Each network BEL(X) is to develop degrees of belief in each possible value X, and X is chosen based on the value with the largest belief score. –Given a prior belief vector BEL(x), BEL(x) can be updated with a new observation ox using only a local computation.

Learning from Observations An example –An example observation vector o(x) may be (0, 0, 1, 0), indicating that o(2) is the observed value for x. –This vector must then be adjusted based on our confidence in the observation. This is achieved using a confidence matrix –assigning C=0.7 to o results in an actual message of (0.1,0.1,0.7,0.1) sent to X.

Cluster Repair Adjusting metadata dynamically in response to new evidence can lead to inconsistencies in citation groups. repairCluster(R) Find matching citations M for R For each citation C in GR If C is not contained in M Add C to REVOKE Set GR = M Reset belief vectors For each citation C in GR If C is not contained in REVOKE Update belief vectors using C If metadata changes repairCluster(R)

Cluster Repair Voting privilege –To prevent unbounded iterations, once a citation C1 is removed from GR, it can return to GR but it cannot influence metadata belief vectors for the remainder of the repairCluster iterations. –At the end of a repairCluster call stack, the non-voting citations regain voting privileges.

Evaluation Ten frequently referenced document records were selected from the top of CiteSeer’s most-cited document list along with all corresponding citations. 9,121 citations were used in the final test set. the data set was run through a noise generation program to purposely add some noise into the citation records. –Randomly insert a word into the title. –Randomly delete a word from the title. –Randomly insert an author name. –Randomly delete an author name. –Randomly misspell a word in the title. –Randomly misspell an author name. –Mistakes in the publication year attribute. Corresponding parameters are provided to control the probability with which a certain category of noise will occur, varying from 0 to 1. A noise rate of 0 means the original version of citation texts are adopted, without any intended modifications. A noise rate of 1 means a type of noise is destined to happen.

Index-Based Citation Clustering Lucene’s fuzzy query is utilized to match citations to documents. We vary the similarity threshold to observe the precision and recall

Index-Based Citation Clustering Noise is introduced into the citation data to test the capability of the matching algorithm to handle inaccurate inputs.

Metadata Determination and Cluster Repair Confidence in the document metadata was arbitrarily set at 0.8, and confidence in citation data was set at 0.5. The cluster repair algorithm was then used to iteratively query the citation index and repair the document’s metadata until convergence. Only title, author, and year metadata was tested for accuracy.

Metadata Determination and Cluster Repair