A research literature search engine with abbreviation recognition. Cheng-Tao Chu, Pei-Chin Wang
Outline: Features; Demo; Issues involved; Implementation (Tailored Edit Distance; Probabilistic Model; Translation Model; Score Combination); Evaluation; Q&A
Features: Given a query containing author names, proceedings names, or title keywords, return the relevant papers. Retrieve the desired papers even when author or proceedings names are abbreviated. Web interface for querying and user evaluation.
Demo It’s show time
Issues involved: Tag an arbitrary query into author, proceedings, and other keyword fields. Recognize the author: P. Raghavan -> Prabhakar Raghavan, Padma Raghavan, … Raghavan. Estimate the probability of each possible candidate.
Issues involved (cont.): Recognize the proceedings name, which takes more than a look-up table: IJCAI -> International Joint Conference on AI, IJCAI Workshop. How to combine the weights of the candidates: the score from Lucene, the score for a possible author, and the score for a possible proceedings.
Implementation (system diagram; labels only): DBLP XML, Parser, Tagger, Database, Query, Browser, Search Engine, Retrieved Documents, Probabilistic Model, Tailored Edit Distance
Tailored Edit Distance: Heuristic: award for consecutive matches; award for matching capitalized characters; higher penalty for substitution, lower penalty for insertion/deletion. Probabilistic representation: transform the edit-distance cost into a probability; normalize the cost; use training data to estimate the distribution.
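The heuristic above can be sketched as a small dynamic program. This is a minimal illustration, not the authors' implementation: the function names `tailored_edit_cost` and `normalized_cost`, every cost and award value, and the normalization by the longer string's length are assumptions chosen for readability. The training-data step (estimating a distribution from the costs) is not shown.

```python
import math

def tailored_edit_cost(abbr, full, sub_cost=1.5, indel_cost=1.0,
                       run_bonus=0.5, cap_bonus=1.0):
    """Alignment cost between an abbreviation and a candidate full string.

    Lower is better. All numeric parameters are illustrative assumptions.
    """
    n, m = len(abbr), len(full)
    inf = math.inf
    # match_end[i][j]: best cost for abbr[:i] vs full[:j] when the last step is a match
    # other_end[i][j]: best cost when the last step is a substitution, insertion, or deletion
    match_end = [[inf] * (m + 1) for _ in range(n + 1)]
    other_end = [[inf] * (m + 1) for _ in range(n + 1)]
    other_end[0][0] = 0.0
    for i in range(1, n + 1):
        other_end[i][0] = i * indel_cost
    for j in range(1, m + 1):
        other_end[0][j] = j * indel_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = abbr[i - 1], full[j - 1]
            if a == b:
                # Award for consecutive matching: cheaper if the previous step was also a match.
                cost = min(match_end[i - 1][j - 1] - run_bonus, other_end[i - 1][j - 1])
                # Award for matching a capitalized character.
                if a.isupper():
                    cost -= cap_bonus
                match_end[i][j] = cost
            else:
                # Substitution is penalized more than a single insertion or deletion.
                diag = min(match_end[i - 1][j - 1], other_end[i - 1][j - 1])
                other_end[i][j] = diag + sub_cost
            up = min(match_end[i - 1][j], other_end[i - 1][j]) + indel_cost    # drop a char of abbr
            left = min(match_end[i][j - 1], other_end[i][j - 1]) + indel_cost  # skip a char of full
            other_end[i][j] = min(other_end[i][j], up, left)
    return min(match_end[n][m], other_end[n][m])

def normalized_cost(abbr, full):
    """Divide by the longer length so long candidates are not penalized for length alone."""
    return tailored_edit_cost(abbr, full) / max(len(abbr), len(full), 1)

if __name__ == "__main__":
    for candidate in ["Prabhakar Raghavan", "Padma Smith"]:
        print(candidate, normalized_cost("PR", candidate))
```

Tracking two tables (last step a match or not) is what lets the consecutive-match award depend on the previous step. Because of the awards, costs can be negative, so a mapping to probabilities has to handle that range.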
Conceptual Histogram
Probabilistic Model: Translation model. Use the tailored edit distance to estimate the distribution. Return a distribution over candidate names (assuming independence between the full name and its abbreviation given the evidence). Network structure (diagram nodes): Full Name; First Name, Middle Name, Last Name; First Ini., Mid. Ini., Last Ini.
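One way to picture "return a distribution of candidate names" is a per-field product that is normalized over the candidates. The sketch below is an assumption-laden stand-in: `field_likelihood` uses made-up placeholder values where the slides would use the tailored edit distance, the prior counts are invented, and "John Smith" is a hypothetical distractor.

```python
def field_likelihood(observed, full):
    """Placeholder for P(observed token | full name part); the values are illustrative only."""
    obs = observed.rstrip(".").lower()
    full = full.lower()
    if not obs:
        return 1.0    # field absent from the query: uninformative
    if obs == full:
        return 1.0    # exact match
    if len(obs) == 1 and full.startswith(obs):
        return 0.5    # the observed token is a matching initial
    return 0.01       # mismatch

def candidate_distribution(query, candidates):
    """query: (first, middle, last); candidates: {(first, middle, last): prior weight}."""
    scores = {}
    for name, prior in candidates.items():
        score = float(prior)
        # Each name part is treated independently, mirroring the first/middle/last split.
        for observed, full in zip(query, name):
            score *= field_likelihood(observed, full)
        scores[name] = score
    total = sum(scores.values())
    if total == 0:
        return {}
    return {name: score / total for name, score in scores.items()}

if __name__ == "__main__":
    candidates = {
        ("Prabhakar", "", "Raghavan"): 3,   # invented prior counts
        ("Padma", "", "Raghavan"): 1,
        ("John", "", "Smith"): 2,           # hypothetical distractor
    }
    for name, p in candidate_distribution(("P.", "", "Raghavan"), candidates).items():
        print(name, round(p, 4))
```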
Score Combination: Lucene score formula. Assign a weight to each candidate as its combination score. Set idf(t) to (weight of that term + original idf(t)). Assign a boost value to each term in the query.
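The idf adjustment can be illustrated with a toy scorer. This is a simplified sketch that is only loosely modeled on a TF-IDF sum of tf * idf^2 * boost; it omits the other factors of Lucene's actual formula, does not call Lucene, and the names `idf` and `combined_score` as well as the sample documents and weights are assumptions.

```python
import math

def idf(term, docs):
    """Toy inverse document frequency over a list of tokenized documents."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def combined_score(doc, weighted_terms, docs, boosts=None):
    """Score one document; weighted_terms maps each candidate term to its weight."""
    boosts = boosts or {}
    score = 0.0
    for term, weight in weighted_terms.items():
        tf = math.sqrt(doc.count(term))
        adjusted_idf = weight + idf(term, docs)   # weight of that term + original idf(t)
        score += tf * adjusted_idf ** 2 * boosts.get(term, 1.0)
    return score

if __name__ == "__main__":
    docs = [
        ["prabhakar", "raghavan", "web"],
        ["padma", "raghavan", "sparse"],
        ["web", "search"],
    ]
    # Hypothetical candidate weights, e.g. taken from a candidate-name distribution.
    weights = {"prabhakar": 0.75, "padma": 0.25}
    for doc in docs:
        print(doc, combined_score(doc, weights, docs))
```

With equal document frequencies, the candidate with the larger weight ends up with the larger adjusted idf, so documents matching the more probable expansion rank higher.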
Evaluation: Test data construction. Evaluation by precision on the test data. User evaluation. Comparison with Google Scholar.
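For reference, precision on a test query is the fraction of retrieved papers that are relevant. A minimal sketch follows; the function name and the paper identifiers are hypothetical, and how the test data was actually constructed is not specified here.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that appear in the relevant set."""
    if not retrieved:
        return 0.0
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)

if __name__ == "__main__":
    retrieved = ["paper1", "paper2", "paper3", "paper4"]   # hypothetical result list
    relevant = {"paper1", "paper3"}                        # hypothetical ground truth
    print(precision(retrieved, relevant))
```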
Q&A