Marcos André Gonçalves

Presentation transcript:

FLUX-CiM: Flexible Unsupervised Extraction of Citation Metadata. Eli Cortez, Altigran S. da Silva, Filipe Mesquita and Edleno S. de Moura (Federal University of Amazonas, Brazil); Marcos André Gonçalves (Federal University of Minas Gerais, Brazil)

Outline Introduction Related work The FLUX-CiM method Experiments Conclusion and future work

Introduction Citation management is a central aspect of modern digital libraries. Citations serve as: evidence of the impact of particular scientific articles; auxiliary evidence in information retrieval (e.g., classification). Bibliographic measures that rely on citations have inspired modern web link analysis algorithms such as PageRank.

Introduction Citation management involves data cleaning and removal of duplicates. These techniques rely on the assumption that we can correctly identify the main components within a citation. This is not an easy task: data entry errors, multiple citation formats, large-scale citation data, etc.

Introduction Our method is based on a knowledge base (KB) that helps extract the components of citations in any given format. The FLUX-CiM method is based on: estimating the probability of a set of terms occurring as a given citation field; use of generic structural properties of bibliographic citations. It is important to note that, in our case, the KB is built automatically, which gives our method a high level of automation and flexibility. The method is considered unsupervised because we do not rely on a learning method.

General View [Figure: general view of FLUX-CiM, with extracted fields labeled Author, Conference, Title and Place.]

Related Work

Related Work General Extraction [Laender et al., SigRec/02] Survey of existing extraction tools. [Embley et al., DKE/99] Extraction using manually constructed ontologies. (SigRec: SIGMOD Record; DKE: Data and Knowledge Engineering)

Related Work Citation Extraction [Han et al., JCDL/03] SVM classification-based method for metadata extraction. [M. Y. Day et al., IEEE IRI/05] Metadata extraction based on an ontological knowledge representation; also requires a manually constructed ontology; fixed number of citation patterns.

Contributions FLUX-CiM Knowledge-Base automatically built Does not consider any particular citation pattern Flexible and Unsupervised

The FLUX-CiM Method Basic Concepts

The FLUX-CiM Method Basic Concepts Knowledge Base A set of pairs, each associating a metadata field with the set of its known occurrences. The construction process is trivial. Example:
KB = { (Author, O_Author), (Title, O_Title) }
O_Author = { “J. K. Rowling”, “Galadriel Waters”, “Beatrix Potter” }
O_Title = { “Harry Potter and the Half-Blood Prince”, “A guide to Harry Potter”, “Peter Rabbit’s Halloween” }
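The knowledge base structure described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming each source record is already available as a dictionary from field name to value; the function name `build_kb` and the record layout are hypothetical and not taken from the presentation.

```python
from collections import defaultdict


def build_kb(records):
    """Group field values by metadata field: {field: set of occurrences}.

    `records` is assumed to be an iterable of dicts such as
    {"Author": "...", "Title": "..."} (a hypothetical layout).
    """
    kb = defaultdict(set)
    for record in records:
        for field, value in record.items():
            kb[field].add(value)
    return dict(kb)


# Two records built from the occurrences shown on the slide.
records = [
    {"Author": "J. K. Rowling",
     "Title": "Harry Potter and the Half-Blood Prince"},
    {"Author": "Galadriel Waters",
     "Title": "A guide to Harry Potter"},
]
kb = build_kb(records)
```

Here `kb["Author"]` plays the role of the occurrence set for the Author field and `kb["Title"]` the one for the Title field.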

The FLUX-CiM Method Basic Concepts Citation string: a text portion encompassing a complete citation from the list of citations in a file. p-delimiters (potential delimiters): any character other than A,…,Z, a,…,z, 0,…,9. Example citation string: Jobim A. C., Gilberto J. Bossa Nova: A new Harmonic Algorithm. MPB Surveys, 26(11):1022-1036 (1995)

The FLUX-CiM Method Method Steps

The FLUX-CiM Method Method Steps The proposed method is divided into four steps: Blocking; Matching; Binding; Joining.

The FLUX-CiM Method Method Steps Blocking Hypothesis In a citation string, every field value is bounded by a p-delimiter, but not all p-delimiters bound a value.

The FLUX-CiM Method Method Steps Blocking Splitting a citation string into substrings that we call blocks; Considering the position of the p-delimiter in a citation string; Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , 26 ( 11 ) : 1022 - 1036 ( 1995 )
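The blocking step can be sketched as a split on p-delimiters. The sketch below assumes that whitespace separates terms inside a block rather than blocks themselves, since the slide's example keeps "Jobim A" and "Bossa Nova" as single blocks; the names `P_DELIMITER` and `split_into_blocks` are hypothetical. For brevity it discards the delimiters, although the later binding step would need their positions.

```python
import re

# Assumption: a p-delimiter is any character other than A-Z, a-z, 0-9;
# whitespace is treated as a term separator inside a block, not as a
# block boundary. Runs of adjacent p-delimiters act as one boundary.
P_DELIMITER = re.compile(r"[^A-Za-z0-9\s]+")


def split_into_blocks(citation):
    """Split a citation string into blocks at p-delimiters."""
    parts = P_DELIMITER.split(citation)
    # Normalize inner whitespace and drop empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]


citation = ("Jobim A. C., Gilberto J. Bossa Nova: A new Harmonic Algorithm. "
            "MPB Surveys, 26(11):1022-1036 (1995)")
blocks = split_into_blocks(citation)
```

For the example citation this yields eleven blocks, from "Jobim A" through "1995", matching the segmentation shown on the slide.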

The FLUX-CiM Method Method Steps Matching Associating each block with a bibliographic metadata field according to the occurrences in the KB. To estimate the probability that a given term belongs to a field, we use a function that we call FF (Field Frequency).
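The slides do not give the definition of FF, so the following is a simplified stand-in rather than the authors' function. It scores a block against a field by averaging, over the block's terms, the share of the term's KB occurrences that fall in that field. The names `terms`, `term_counts`, `field_score` and `match_block`, and the 0.5 threshold, are all assumptions made for illustration.

```python
import re
from collections import Counter


def terms(text):
    """Lowercased alphanumeric terms of a string."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def term_counts(kb):
    """For each field, count how many occurrences contain each term."""
    counts = {}
    for field, occurrences in kb.items():
        counter = Counter()
        for occurrence in occurrences:
            counter.update(set(terms(occurrence)))
        counts[field] = counter
    return counts


def field_score(block, field, counts):
    """Average share of each term's KB occurrences found in `field`."""
    block_terms = terms(block)
    if not block_terms:
        return 0.0
    total = 0.0
    for term in block_terms:
        everywhere = sum(counter[term] for counter in counts.values())
        if everywhere:
            total += counts[field][term] / everywhere
    return total / len(block_terms)


def match_block(block, counts, threshold=0.5):
    """Return the best-scoring field, or None if the block stays unmatched."""
    best = max(counts, key=lambda field: field_score(block, field, counts))
    return best if field_score(block, best, counts) >= threshold else None


kb = {
    "Author": {"J. K. Rowling", "Galadriel Waters", "Beatrix Potter"},
    "Title": {"Harry Potter and the Half-Blood Prince",
              "A guide to Harry Potter"},
}
counts = term_counts(kb)
```

With this toy KB, `match_block("Harry Potter", counts)` favors Title even though "Potter" also occurs in an author name, while a block whose terms never occur in the KB, such as "Bossa Nova", is left unmatched for the binding step.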

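The FLUX-CiM Method Method Steps Matching Example citation: Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , 26 ( 11 ) : 1022 - 1036 ( 1995 )
Blocks after matching (??? marks an unmatched block):
Jobim A: Author
C: ???
Gilberto J: Author
Bossa Nova: ???
A new Harmonic Algorithm: Title
MPB Surveys: Journal
26: Vol
11: N
1022: Pages
1036: Pages
1995: Year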

The FLUX-CiM Method Method Steps Binding Associate the remaining unmatched blocks with fields, using the information generated by the matching step and the knowledge base. There are 3 distinct cases: Homogeneous Neighborhood; Partial Neighborhood; Heterogeneous Neighborhood.

The FLUX-CiM Method Method Steps Binding – Homogeneous Neighborhood An unmatched block lies between blocks matched to the same field. In the example, the unmatched block “C” lies between two Author blocks and is bound to Author, while “Bossa Nova” remains unmatched: Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , 26 ( 11 ) : 1022 - 1036 ( 1995 ) (Speaker note: mention the Partial Neighborhood case.)
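The homogeneous case can be sketched as a single pass over the block labels. This is a minimal illustration, assuming unmatched blocks are represented as `None` and that a run of several consecutive unmatched blocks is handled the same way as a single one; the function name `bind_homogeneous` is hypothetical.

```python
def bind_homogeneous(labels):
    """Bind unmatched blocks (None) enclosed by blocks of the same field."""
    bound = list(labels)
    i = 0
    while i < len(bound):
        if bound[i] is not None:
            i += 1
            continue
        # Find the end of the current run of unmatched blocks.
        j = i
        while j < len(bound) and bound[j] is None:
            j += 1
        left = bound[i - 1] if i > 0 else None
        right = bound[j] if j < len(bound) else None
        # Homogeneous neighborhood: same matched field on both sides.
        if left is not None and left == right:
            for k in range(i, j):
                bound[k] = left
        i = j
    return bound


# Labels of the first six blocks of the example after matching.
labels = ["Author", None, "Author", None, "Title", "Journal"]
bound = bind_homogeneous(labels)
```

The second block is bound to Author, while the fourth stays unmatched because its neighbors (Author and Title) differ, which is the heterogeneous case.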

The FLUX-CiM Method Method Steps Binding – Heterogeneous Neighborhood Let's consider the example below. We must decide whether the block “Bossa Nova” should be associated with Author or Title. Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , Blocks so far: Jobim A (Author), C (Author), Gilberto J (Author), Bossa Nova (???), A new Harmonic Algorithm (Title), MPB Surveys (Journal).

The FLUX-CiM Method Method Steps Binding – Heterogeneous Neighborhood Evaluate the p-delimiters surrounding the unmatched blocks. Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , In the example, “Bossa Nova” is preceded by “.” and followed by “:”. “.” is likely to be a delimiter between Author and Title; “:” is likely to be a character occurring in the values of Title. We would therefore choose to associate “Bossa Nova” with Title rather than with Author.

The FLUX-CiM Method Method Steps Joining Joins together blocks associated with the same field to form the values of that field. The solution we adopt relies on the information available in the KB: it uses the average number of terms for a given metadata field.

The FLUX-CiM Method Method Steps Joining Example citation: Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , 26 ( 11 ) : 1022 - 1036 ( 1995 ) [Figure: the example citation shown before and after joining, with blocks labeled Author, Title, Journal, Vol, N, Pages and Year.]
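One possible reading of the joining heuristic is sketched below; it is not the authors' procedure, whose details the slides do not give. Adjacent blocks with the same field are merged, and a new value is started once the merged value would exceed the average number of terms for that field. The function name `join_blocks` and the averages in `avg_terms` are hypothetical; in the method they would be derived from the KB. Delimiters are dropped for simplicity.

```python
def join_blocks(blocks, labels, avg_terms):
    """Merge adjacent same-field blocks into (field, value) pairs.

    A value stops growing once adding the next block would exceed the
    average number of terms for the field (assumed heuristic). Fields
    without an entry in `avg_terms` are merged without limit.
    """
    values = []
    for block, field in zip(blocks, labels):
        limit = avg_terms.get(field, float("inf"))
        if values and values[-1][0] == field:
            current = values[-1][1]
            if len(current.split()) + len(block.split()) <= limit:
                values[-1] = (field, current + " " + block)
                continue
        values.append((field, block))
    return values


blocks = ["Jobim A", "C", "Gilberto J", "Bossa Nova",
          "A new Harmonic Algorithm", "MPB Surveys",
          "26", "11", "1022", "1036", "1995"]
labels = ["Author", "Author", "Author", "Title", "Title", "Journal",
          "Vol", "N", "Pages", "Pages", "Year"]
# Hypothetical per-field averages, standing in for values from the KB.
avg_terms = {"Author": 3, "Title": 8, "Pages": 2}
values = join_blocks(blocks, labels, avg_terms)
```

With these assumed averages the three Author blocks become two author values, because appending "Gilberto J" to "Jobim A C" would exceed three terms, while the two Title blocks merge into a single title.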

Experiments

Experiments Setup The method was applied to 2 domains: Health Science (HS) Computer Science (CS) Similar experiments were conducted in both domains.

Experiments Setup We use a citation metadata collection (BibTeX entries) to generate the knowledge base of each specific domain. It is important to note that there is no intersection between the KB and the Test Collection.

KB
Domain  Size  # Fields  Source
HS      5000  6         PubMed
CS      1950  1..10     Web Sites

Test Collection
Domain  Size  # Fields  Source
HS      2000  6         PubMed
CS      300   1..10     ACM DL

Experiments Verifying the Blocking Hypothesis We count the field values that were bounded by some p-delimiter. As expected: 100% of the field values were bounded in HS; 99.8% of the field values were bounded in CS.

Experiments Block-Level Results Show how correctly the blocks were associated with their respective fields. Values are given in the order Precision, Recall and F-Measure.
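The three measures can be computed per field from block labels as sketched below. The sketch assumes the usual definitions (precision over blocks assigned to the field, recall over blocks that truly belong to it, F-measure as their harmonic mean); the slides do not spell these out, and the function name `precision_recall_f` is hypothetical.

```python
def precision_recall_f(predicted, gold, field):
    """Block-level precision, recall and F-measure for one field."""
    correct = sum(1 for p, g in zip(predicted, gold)
                  if p == field and g == field)
    n_predicted = sum(1 for p in predicted if p == field)
    n_gold = sum(1 for g in gold if g == field)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: one Author block and one Title block left unmatched (None).
gold = ["Author", "Author", "Author", "Title", "Title"]
predicted = ["Author", None, "Author", None, "Title"]
p, r, f = precision_recall_f(predicted, gold, "Author")
```

In this toy case every block labeled Author is correct but one true Author block was missed, so precision is perfect while recall is lower, the same pattern that unmatched blocks produce after the matching step.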

Experiments Block-Level Results Computer Science (UB = unmatched blocks after matching)

          Matching                UB (%)  Binding
Field     P (%)   R (%)   F               P (%)   R (%)   F
Author    99.78   79.29   0.88    20.63   99.82   98.96   0.99
Title     98.11   90.43   0.94     7.83   97.19   97.61   0.97
Journal   95.80   97.86   0.96     1.43   95.80   97.86   0.96
Date      99.70   97.38   0.98     2.04   97.98   99.13   0.98
Pages     97.87   98.71   0.98     1.29   97.06   99.14   0.98
Volume    100.0   98.25   0.99     0.00   100.0   98.25   0.99
Others    99.18   95.93   0.97     3.04   98.88   98.18   0.98
Average   98.80   95.56   0.96     4.54   98.34   98.37   0.98

On average, less than 5% of blocks are unmatched.

Experiments Block-Level Results Health Science (UB = unmatched blocks after matching)

          Matching                UB (%)  Binding
Field     P (%)   R (%)   F               P (%)   R (%)   F
Author    99.04   94.33   0.96     4.96   98.89   99.26   0.99
Title     93.71   90.54   0.92     6.17   92.90   95.96   0.94
Journal   97.51   89.22   0.93     2.22   97.15   89.32   0.93
Date      99.85   99.50   0.99     0.35   99.85   99.50   0.99
Pages     99.90   99.45   0.99     0.35   99.70   99.45   0.99
Volume    98.53   99.51   0.99     0.20   97.96   99.56   0.98
Average   98.09   95.42   0.96     2.38   97.74   97.17   0.97

In general, in both domains, our method reaches high precision.

Experiments Field Level Results Effectiveness of the whole extraction process.

Health Science
Field     P (%)   R (%)   F-measure
Author    99.57   99.04   0.98
Title     84.88   85.14   0.85
Journal   97.23   89.35   0.93
Date      99.85   99.50   0.99
Pages     99.70   99.20   0.99
Volume    98.20   98.75   0.98
Average   96.41   95.16   0.95

Computer Science
Field     P (%)   R (%)   F-measure
Author    93.85   95.58   0.94
Title     93.00   93.00   0.93
Journal   95.71   97.81   0.96
Date      91.75   97.44   0.97
Pages     97.00   97.84   0.97
Volume    100.0   98.25   0.99
Others    98.04   97.73   0.97
Average   96.92   97.08   0.97

The high accuracy levels reached after matching and binding remain after joining. The F-measure of the Title field revealed a large overlap with terms of the Journal field.

Experiments Citation Level Results How well each citation was extracted by our method.

Domain  P (%)   R (%)   F-Measure
HS      94.82   95.10   0.94
CS      95.85   96.22   0.96

Even with different citation styles, our method still achieves good results in both domains, without relying on any pattern.

Experiments Recall values achieved in citation extraction More than 82% of the citations got 100% of recall. This means that all fields were correctly extracted.

Experiments Performance of FLUX-CiM as the size of the KB increases. With only 50 records we got an F-measure of 0.75. After 3000 records the F-measure remains almost the same up to 10,000 records. This means that our method does not require a large KB to achieve good extraction.

Anecdotes

Conclusions and Future Work

Conclusions Novel approach to extracting components of citations in any given format In this method: The KB is automatically built No particular citation standard adopted

Future Work Use of feedback techniques to automatically expand the KB. Application of our method to extracting information from other formats and sources (e.g., addresses, paper headings). For instance, it would be interesting to automatically populate a digital library with metadata directly from the web sites of recent conferences.

Questions ???