Marcos André Gonçalves

FLUX-CiM: Flexible Unsupervised Extraction of Citation Metadata
Eli Cortez, Altigran S. da Silva, Filipe Mesquita and Edleno S. de Moura (Federal University of Amazonas, Brazil)
Marcos André Gonçalves (Federal University of Minas Gerais, Brazil)

Outline
Introduction
Related work
The FLUX-CiM method
Experiments
Conclusion and future work

Introduction
Citation management is a central aspect of modern digital libraries.
Citations serve as:
Evidence of the impact of particular scientific articles;
Auxiliary evidence in information retrieval tasks (e.g., classification).
Bibliographic measures that rely on citations have inspired modern web link analysis algorithms such as PageRank.

Introduction
Citation management involves: data cleaning; removal of duplicates.
These techniques rely on the assumption that we can correctly identify the main components within a citation.
This is not an easy task: data entry errors, multiple citation formats, large-scale citation data, etc.

Introduction
Our method is based on a knowledge base (KB) that helps extract the components of citations in any given format.
The FLUX-CiM method is based on:
Estimating the probability of a set of terms occurring as a given citation field;
Using generic structural properties of bibliographic citations.
It is important to note that, in our case, the KB is built automatically. This gives our method a high level of automation and flexibility.
The method is considered unsupervised because it does not rely on a learning method.

General View
[Figure: FLUX-CiM labeling the parts of a citation with fields such as Author, Title, Conference and Place]

Related Work

Related Work: General Extraction
[Laender et al., SigRec/02] Survey of existing extraction tools (SigRec: SIGMOD Record)
[Embley et al., DKE/99] Extraction using manually constructed ontologies (DKE: Data and Knowledge Engineering)

Related Work: Citation Extraction
[Han et al., JCDL/03] SVM classification-based method for metadata extraction
[M. Y. Day et al., IEEE IRI/05] Metadata extraction based on an ontological knowledge representation. Also requires a manually constructed ontology and supports only a fixed number of citation patterns.

Contributions
FLUX-CiM:
The knowledge base is built automatically
Does not consider any particular citation pattern
Flexible and unsupervised

The FLUX-CiM Method Basic Concepts

Knowledge Base
A set of pairs, each made of a metadata field and the set of known occurrences of that field. The construction process is trivial.
Example:
KB = { (Author, O_Author), (Title, O_Title) }
O_Author = { "J. K. Rowling", "Galadriel Waters", "Beatrix Potter" }
O_Title = { "Harry Potter and the Half-Blood Prince", "A guide to Harry Potter", "Peter Rabbit's Halloween" }
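To make the "trivial" construction concrete, here is a minimal Python sketch. It assumes that existing metadata records are available as dictionaries mapping field names to values; the function name `build_kb` and the way the example values are grouped into records are hypothetical and only serve the illustration.

```python
from collections import defaultdict

def build_kb(records):
    """Group the field values of existing records by field name.

    Returns a dict mapping each field to its set of known occurrences.
    """
    kb = defaultdict(set)
    for record in records:
        for field, value in record.items():
            kb[field].add(value)
    return dict(kb)

# Illustrative records only: the pairing of authors and titles is assumed.
records = [
    {"Author": "J. K. Rowling", "Title": "Harry Potter and the Half-Blood Prince"},
    {"Author": "Galadriel Waters", "Title": "A guide to Harry Potter"},
    {"Author": "Beatrix Potter", "Title": "Peter Rabbit's Halloween"},
]
kb = build_kb(records)
```

Because the KB only groups values by field, it can be generated from any existing metadata collection without manual labeling.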

Citation string: a text portion encompassing a complete citation from the list of citations in a file.
Example: Jobim A. C., Gilberto J. Bossa Nova: A new Harmonic Algorithm. MPB Surveys, 26(11): (1995)
p-delimiters (potential delimiters): any character other than A,…,Z, a,…,z, 0,…,9.

The FLUX-CiM Method Method Steps

The proposed method is divided into four steps: Blocking; Matching; Binding; Joining.

Blocking Hypothesis In a citation string, every field value is bounded by a p-delimiter, but not all p-delimiters bound a value.

Blocking
Splits a citation string into substrings that we call blocks, according to the positions of the p-delimiters in the citation string.
Example: Jobim A . C ., Gilberto J . Bossa Nova : A new Harmonic Algorithm . MPB Surveys , 26 ( 11 ) : ( )
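The blocking step can be sketched in Python as follows. The p-delimiter test follows the definition given earlier; the splitting rule is an assumption read off the example, where terms separated only by whitespace (such as "Bossa Nova") stay in the same block. The function names are hypothetical.

```python
import re

def is_p_delimiter(ch):
    """A p-delimiter is any character other than A-Z, a-z, 0-9."""
    return re.fullmatch(r"[A-Za-z0-9]", ch) is None

def split_into_blocks(citation):
    """Split a citation string into blocks.

    Assumption: whitespace alone does not end a block, so the split
    happens at runs of non-alphanumeric, non-whitespace characters.
    """
    parts = re.split(r"[^A-Za-z0-9\s]+", citation)
    return [part.strip() for part in parts if part.strip()]

citation = ("Jobim A. C., Gilberto J. Bossa Nova: A new Harmonic Algorithm. "
            "MPB Surveys, 26(11): (1995)")
blocks = split_into_blocks(citation)
# blocks: Jobim A | C | Gilberto J | Bossa Nova | A new Harmonic Algorithm
#         | MPB Surveys | 26 | 11 | 1995
```

This sketch discards the delimiters themselves; a fuller version would keep them, since the binding step inspects the p-delimiters around unmatched blocks.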

Matching
Associates each block with a bibliographic metadata field according to the occurrences in the KB.
To estimate the probability that a given term belongs to a field, we use a function that we call FF (Field Frequency).
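A minimal Python sketch of this step follows. The slides do not give the formula for FF, so the sketch assumes that FF(t, m) is the fraction of KB occurrences containing term t that belong to field m, and it combines the per-term values of a block with a noisy-OR. Both choices, and all function names, are assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercased alphanumeric terms of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())

def field_frequencies(kb):
    """Assumed FF: share of the occurrences containing a term, per field."""
    counts = defaultdict(lambda: defaultdict(int))
    for field, occurrences in kb.items():
        for occurrence in occurrences:
            for term in set(tokenize(occurrence)):
                counts[term][field] += 1
    return {
        term: {field: n / sum(per_field.values())
               for field, n in per_field.items()}
        for term, per_field in counts.items()
    }

def match_block(block, ff):
    """Return the best-scoring field for a block, or None if unmatched."""
    miss = {}  # per field: product of (1 - FF) over the block's terms
    for term in tokenize(block):
        for field, p in ff.get(term, {}).items():
            miss[field] = miss.get(field, 1.0) * (1.0 - p)
    if not miss:
        return None  # no term of the block occurs in the KB
    # Noisy-OR score is 1 - miss, so the smallest miss wins.
    return min(miss, key=miss.get)

kb = {
    "Author": ["J. K. Rowling", "Galadriel Waters", "Beatrix Potter"],
    "Title": ["Harry Potter and the Half-Blood Prince",
              "A guide to Harry Potter", "Peter Rabbit's Halloween"],
}
ff = field_frequencies(kb)
labels = [match_block(b, ff)
          for b in ["Galadriel Waters", "Harry Potter", "Bossa Nova"]]
```

In this toy KB the term "potter" occurs in one author and two titles, so it favors Title; a block with no known term, such as "Bossa Nova", stays unmatched and is left to the binding step.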

Matching
[Figure: labels assigned to the blocks of the example citation after matching]
Jobim A: Author; C: ???; Gilberto J: Author; Bossa Nova: ???; A new Harmonic Algorithm: Title; MPB Surveys: Journal; 26: Vol; 11: N; remaining blocks: Pages, Pages, Year

Binding
Associates the remaining unmatched blocks with fields, using the information generated by the matching step and the knowledge base.
There are 3 distinct cases: Homogeneous Neighborhood; Partial Neighborhood; Heterogeneous Neighborhood.

Binding – Homogeneous Neighborhood
An unmatched block lies between two blocks matched to the same field.
[Figure: in the example citation, the unmatched block "C" lies between the Author blocks "Jobim A" and "Gilberto J" and is bound to Author; "Bossa Nova" is still unmatched]
(Presenter note: mention Partial Neighborhood.)

Binding – Heterogeneous Neighborhood
Let's consider the example below. We must decide whether the block "Bossa Nova" should be associated with Author or with Title.
[Figure: Jobim A: Author; C: Author; Gilberto J: Author; Bossa Nova: ???; A new Harmonic Algorithm: Title; MPB Surveys: Journal]

Binding – Heterogeneous Neighborhood
Evaluate the p-delimiters surrounding the unmatched blocks:
"." is likely to be a delimiter between Author and Title;
":" is likely to be a character occurring in the values of Title.
Then, we would choose to associate "Bossa Nova" with Title rather than with Author.
[Figure: Jobim A: Author; C: Author; Gilberto J: Author; Bossa Nova: Title; A new Harmonic Algorithm: Title; MPB Surveys: Journal]
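The binding cases can be sketched in Python as below. This is a sketch under stated assumptions, not the authors' procedure: the slides do not say how the likelihood of a p-delimiter is estimated, so the `stats` counter (how often a delimiter was seen between two fields) is hypothetical, as are its counts. Treating a block with a single matched neighbor as the partial case is also an assumed reading.

```python
from collections import Counter

def bind(fields, delims, stats):
    """Assign a field to each unmatched block (None) in one citation.

    fields[i] is the field of block i, or None if unmatched.
    delims[i] is the p-delimiter between block i and block i + 1.
    stats[(left, d, right)] counts delimiter d between two fields.
    """
    bound = list(fields)
    for i, field in enumerate(fields):
        if field is not None:
            continue
        left = bound[i - 1] if i > 0 else None
        right = fields[i + 1] if i + 1 < len(fields) else None
        if left is None or right is None or left == right:
            # Homogeneous case, or only one matched neighbor (assumed
            # to be the partial case): copy the available neighbor.
            bound[i] = left if left is not None else right
            continue
        # Heterogeneous case: compare the two readings of the delimiters.
        d_left, d_right = delims[i - 1], delims[i]
        as_left = stats[(left, d_left, left)] + stats[(left, d_right, right)]
        as_right = stats[(left, d_left, right)] + stats[(right, d_right, right)]
        bound[i] = left if as_left > as_right else right  # ties go right
    return bound

# Hypothetical delimiter counts, for illustration only.
stats = Counter({
    ("Author", ".", "Author"): 2,
    ("Author", ".", "Title"): 9,
    ("Title", ":", "Title"): 4,
})
fields = ["Author", None, "Author", None, "Title", "Journal"]
delims = [".", ".,", ".", ":", "."]
bound = bind(fields, delims, stats)
```

With these counts, "C" is bound to Author by the homogeneous case, and "Bossa Nova" is bound to Title because "." is more often seen between Author and Title and ":" inside Title.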

Joining
Joins blocks associated with the same field to form the values of that field.
The solution we adopt relies on the information available in the KB, namely the average number of terms in the values of a given metadata field.
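One possible reading of this step is sketched below in Python. The slides only state that the average number of terms per field is used, so the rule applied here is an assumption: adjacent blocks of the same field are merged, and for fields that may hold several values (the hypothetical `multi_valued` parameter) a new value starts once the rounded-up KB average would be exceeded.

```python
import math
import re

def average_terms(occurrences):
    """Average number of alphanumeric terms per KB occurrence."""
    counts = [len(re.findall(r"[A-Za-z0-9]+", occ)) for occ in occurrences]
    return sum(counts) / len(counts)

def join_blocks(labeled_blocks, kb, multi_valued=("Author",)):
    """Merge adjacent blocks of the same field into field values."""
    values = []  # list of (field, [blocks])
    for block, field in labeled_blocks:
        if values and values[-1][0] == field:
            current = values[-1][1]
            size = sum(len(b.split()) for b in current) + len(block.split())
            # Only multi-valued fields are capped by the KB average.
            limit = (math.ceil(average_terms(kb[field]))
                     if field in multi_valued else None)
            if limit is None or size <= limit:
                current.append(block)
                continue
        values.append((field, [block]))
    # Delimiters are not restored in this sketch.
    return [(field, " ".join(parts)) for field, parts in values]

kb = {"Author": ["J. K. Rowling", "Galadriel Waters", "Beatrix Potter"]}
labeled = [
    ("Jobim A", "Author"), ("C", "Author"), ("Gilberto J", "Author"),
    ("Bossa Nova", "Title"), ("A new Harmonic Algorithm", "Title"),
    ("MPB Surveys", "Journal"),
]
joined = join_blocks(labeled, kb)
```

Here the KB authors average about 2.3 terms, rounded up to 3, so "Jobim A" and "C" form one author value and "Gilberto J" starts a second one, while the two Title blocks merge into a single title.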

Joining
[Figure: the example citation before and after joining; blocks labeled with the same field are merged into the values of the fields Author, Title, Journal, Vol, N, Pages and Year]

Experiments

Experiments Setup
The method was applied to 2 domains: Health Science (HS) and Computer Science (CS).
Similar experiments were conducted in both domains.

Experiments Setup
We use a citation metadata collection (BibTeX entries) to generate the knowledge base of each specific domain.

KB
Domain  Size  # Fields  Source
HS      5000  6         PubMed
CS      1950  1..10     Web Sites

Test Collection
Domain  Size  # Fields  Source
HS      2000  6         PubMed
CS      300   1..10     ACM DL

It is important to note that there is no intersection between the KB and the Test Collection.

Experiments Verifying the Blocking Hypothesis
We count the field values that were bounded by some p-delimiter. As expected: 100% of the field values were bounded in HS; 99.8% of the field values were bounded in CS.

Experiments Block-Level Results
Show how correctly the blocks were associated to their respective field. Values are expressed in order of Precision, Recall and F-Measure.
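As a quick check on how the three values relate, the sketch below assumes that F-Measure denotes the balanced harmonic mean of precision and recall (the slides do not state the formula). The function name is hypothetical; the input is the Author row of the Computer Science matching results reported in this section.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Author field, Computer Science, matching step: P = 99.78%, R = 79.29%.
f_author = f_measure(0.9978, 0.7929)
print(round(f_author, 2))  # prints 0.88
```

The result agrees with the F value of 0.88 reported for that row, which supports reading F as the balanced measure.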

Experiments Block-Level Results: Computer Science
(UB: unmatched blocks after the matching step)

         Matching                    Binding
Field    P (%)   R (%)  F     UB (%)  P (%)   R (%)  F
Author   99.78   79.29  0.88  20.63   99.82   98.96  0.99
Title    98.11   90.43  0.94   7.83   97.19   97.61  0.97
Journal  95.80   97.86  0.96   1.43   95.80   97.86  0.96
Date     99.70   97.38  0.98   2.04   97.98   99.13  0.98
Pages    97.87   98.71  0.98   1.29   97.06   99.14  0.98
Volume   100.0   98.25  0.99   0.00   100.0   98.25  0.99
Others   99.18   95.93  0.97   3.04   98.88   98.18  0.98
Average  98.80   95.56  0.96   4.54   98.34   98.37  0.98

On average, less than 5% of the blocks are unmatched.

Experiments Block-Level Results: Health Science
(UB: unmatched blocks after the matching step)

         Matching                    Binding
Field    P (%)   R (%)  F     UB (%)  P (%)   R (%)  F
Author   99.04   94.33  0.96   4.96   98.89   99.26  0.99
Title    93.71   90.54  0.92   6.17   92.90   95.96  0.94
Journal  97.51   89.22  0.93   2.22   97.15   89.32  0.93
Date     99.85   99.50  0.99   0.35   99.85   99.50  0.99
Pages    99.90   99.45  0.99   0.35   99.70   99.45  0.99
Volume   98.53   99.51  0.99   0.20   97.96   99.56  0.98
Average  98.09   95.42  0.96   2.38   97.74   97.17  0.97

In general, in both domains, our method reaches high precision.

Experiments Field Level Results
Effectiveness of the whole extraction process.

Health Science
Field    P (%)   R (%)  F-measure
Author   99.57   99.04  0.98
Title    84.88   85.14  0.85
Journal  97.23   89.35  0.93
Date     99.85   99.50  0.99
Pages    99.70   99.20  0.99
Volume   98.20   98.75  0.98
Average  96.41   95.16  0.95

Computer Science
Field    P (%)   R (%)  F-measure
Author   93.85   95.58  0.94
Title    93.00   93.00  0.93
Journal  95.71   97.81  0.96
Date     91.75   97.44  0.97
Pages    97.00   97.84  0.97
Volume   100.0   98.25  0.99
Others   98.04   97.73  0.97
Average  96.92   97.08  0.97

The high accuracy reached after matching and binding is preserved after joining.
The F-measure of the Title field revealed a large overlap between its terms and those of the Journal field.

Experiments Citation Level Results
How well each citation was extracted by our method.

Domain  P (%)   R (%)  F-Measure
HS      94.82   95.10  0.94
CS      95.85   96.22  0.96

Even with different citation styles, our method still achieves good results in both domains, without relying on any pattern.

Experiments Recall values achieved in citation extraction
More than 82% of the citations reached 100% recall, which means that all of their fields were correctly extracted.

Experiments: Performance of FLUX-CiM as the size of the KB increases
With only 50 records, we reached an F-measure of 0.75.
After 3000 records, the F-measure remains almost the same as further samples are added.
This means that our method does not require a large KB to achieve good extraction quality.

Anecdotes

Conclusions and Future Work

Conclusions
A novel approach to extracting the components of citations in any given format.
In this method:
The KB is built automatically;
No particular citation standard is adopted.

Future Work Use of feedback techniques to automatically expand the KB
Application of our method to extracting information from other formats and sources (e.g., addresses, paper headings). For instance, it would be interesting to automatically populate a digital library with metadata taken directly from the web sites of recent conferences.

Questions ???
