Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang
Problem Definition: Given the full name of a database researcher, find his/her homepage. Homepage definition: (to be discussed in class)
Related Work See previous group’s slides
Architecture
Name -> Domain Dictionary, Personal Dictionary, Heuristics -> Weighting function -> Homepage
Domain Dictionary: to distinguish database-related webpages from the rest
Heuristics: to distinguish personal homepages from common sites
Domain Dictionary
A set of words that are common in the database community.
Our approach: DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)
Domain Dictionary
Dictionary Building: parse documents from each source into 2-word phrases and calculate their frequency.
DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)
DBWorld: data mine 4.47E-03, dbworld messag 4.38E-03, paper submiss 3.78E-03, program committe 3.10E-03, import date 2.98E-03, state univers 2.74E-03, intern confer 2.73E-03, comput scienc 2.70E-03, hong kong 2.65E-03, camera readi 2.56E-03, data manag 2.33E-03
DBConference: queri process 1.63E-02, mobil databas 1.36E-02, languag featur 1.09E-02, data manag 1.09E-02, xqueri implement, queri languag 8.17E-03, queri optim, process data, data mine, research prototyp, databas architectur
Contrast Area: program committe, mathemat scienc, mathemat physic, intern confer, date june, intern institut, schr dinger, erwin schr, dinger intern, degli studi
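The dictionary-building step can be sketched roughly as follows. This is a minimal sketch, not the group's actual code: the function names are hypothetical, frequencies are assumed to be relative (phrase count divided by total phrase count), and the stemming that the sample phrases suggest is left out for brevity.

```python
import re
from collections import Counter


def two_word_phrases(text):
    """Split text into lowercase words and return adjacent 2-word phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_dictionary(documents):
    """Build one phrase-frequency table from all documents of a source.

    Phrases never span two documents. Frequencies are relative counts
    (an assumption; the slides do not state the normalization).
    """
    counts = Counter()
    for doc in documents:
        counts.update(two_word_phrases(doc))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phrase: n / total for phrase, n in counts.items()}
```

One such table would be built per source (DBWorld, DBConference, Contrast Area).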
Domain Dictionary (cont.)
Similarity Measuring:
(1) Parse the webpage into 2-word phrases, and calculate their frequency.
(2) Use the cosine similarity measure based on phrase frequency to get a score from each dictionary: S_dbworld, S_dbconf, S_contrast.
(3) Combine S_dbworld, S_dbconf, and (1 - S_contrast) using the geometric average.
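Steps (2) and (3) could look roughly like this. It is a sketch under assumptions: each dictionary and the page are represented as phrase-to-frequency mappings, and the function names are hypothetical. The clamp on (1 - S_contrast) only guards against floating-point rounding.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two phrase -> frequency mappings."""
    dot = sum(v * freq_b.get(k, 0.0) for k, v in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def domain_score(page_freq, dbworld, dbconf, contrast):
    """Geometric average of S_dbworld, S_dbconf and (1 - S_contrast)."""
    s_dbworld = cosine_similarity(page_freq, dbworld)
    s_dbconf = cosine_similarity(page_freq, dbconf)
    s_contrast = cosine_similarity(page_freq, contrast)
    # Clamp so rounding error can never make the product negative.
    inverted = max(0.0, 1.0 - s_contrast)
    return (s_dbworld * s_dbconf * inverted) ** (1.0 / 3.0)
```

With a geometric average, a page scores zero as soon as any one of the three factors is zero, so a page must resemble both database sources and differ from the contrast area.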
Personal Dictionary A set of words related to the specific person that we are looking for. Our approach: use DBLP to find information about co-authors, keywords of research, and conferences
Personal Dictionary
(1) Given a researcher’s name, find his/her DBLP page.
(2) Build the personal dictionary, using Term Frequency and Entry Frequency (# publication entries where a term appears).
(3) Use the cosine measure to evaluate the similarity between a webpage and this personal dictionary.
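Steps (2) and (3) might be sketched as below. Assumptions: the DBLP publication entries are already available as plain strings (fetching and parsing the DBLP page is not shown), and the weight of a term is its term frequency multiplied by the fraction of entries containing it. The slides do not say how the two frequencies are combined, so that formula and all function names are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_personal_dictionary(entries):
    """Weight each term by TF * (EF / number of entries). Assumed formula."""
    tf = Counter()  # total occurrences of each term
    ef = Counter()  # number of entries in which each term appears
    for entry in entries:
        terms = tokenize(entry)
        tf.update(terms)
        ef.update(set(terms))
    n = len(entries)
    if n == 0:
        return {}
    return {term: tf[term] * ef[term] / n for term in tf}


def cosine_similarity(a, b):
    """Cosine similarity between two term -> weight mappings."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def personal_score(page_text, personal_dict):
    """Similarity between a webpage's term counts and the personal dictionary."""
    return cosine_similarity(Counter(tokenize(page_text)), personal_dict)
```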
Heuristics
Rules to distinguish a homepage from other websites.
Our Heuristics:
In title: Name, “Homepage”, “DBLP”, “eventseer”
In URL: A version of the person’s name, “citeseer”
In body: Visual cues, specific keywords {University, Department, Professor, Research, Homepage}
Co-occurrence of “publication” and the person’s name.
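A rule-based scorer along these lines is one way to read the heuristics above. Everything numeric here is hypothetical. Treating “DBLP”, “eventseer” and “citeseer” as negative cues (signs of a common site rather than a homepage) is an interpretation, not something the slides state. “A version of the person’s name” is simplified to any single name part, and visual cues are not modeled.

```python
BODY_KEYWORDS = ("university", "department", "professor", "research", "homepage")
COMMON_SITE_TITLE_CUES = ("dblp", "eventseer")  # assumed to be negative cues
COMMON_SITE_URL_CUES = ("citeseer",)            # assumed to be a negative cue


def heuristic_score(name, url, title, body):
    """Score a page with simple title/URL/body rules (illustrative weights)."""
    name_l, url_l = name.lower(), url.lower()
    title_l, body_l = title.lower(), body.lower()
    score = 0.0

    # Title rules
    if name_l in title_l:
        score += 1.0
    if "homepage" in title_l or "home page" in title_l:
        score += 1.0
    if any(cue in title_l for cue in COMMON_SITE_TITLE_CUES):
        score -= 2.0

    # URL rules: any part of the name counts as a "version" of it
    if any(part in url_l for part in name_l.split()):
        score += 1.0
    if any(cue in url_l for cue in COMMON_SITE_URL_CUES):
        score -= 2.0

    # Body rules
    score += 0.2 * sum(keyword in body_l for keyword in BODY_KEYWORDS)
    if "publication" in body_l and name_l in body_l:
        score += 1.0
    return score
```

The name and URLs used to try this out (for example a page at a made-up `cs.example.edu` address) would be placeholders; the relative ordering of pages matters more than the absolute score.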
Recall…
Name -> Domain Dictionary, Personal Dictionary, Heuristics -> Weighting function -> Homepage
Combining Scores Experimentally assign weights for the previous scoring functions. Return the URL with the highest score.
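The combination step could be as simple as a weighted sum followed by an argmax. This is a sketch: the slides only say that the weights are assigned experimentally, so the linear form, the weight values, the score names and the URLs below are all placeholders.

```python
def combined_score(scores, weights):
    """Weighted sum of the component scores (assumed linear combination)."""
    return sum(weights[name] * scores[name] for name in weights)


def best_homepage(candidates, weights):
    """Return the candidate URL with the highest combined score.

    candidates maps URL -> {"domain": ..., "personal": ..., "heuristics": ...}.
    """
    return max(candidates, key=lambda url: combined_score(candidates[url], weights))


# Placeholder weights and candidates, for illustration only.
example_weights = {"domain": 0.3, "personal": 0.3, "heuristics": 0.4}
example_candidates = {
    "http://cs.example.edu/~doe/": {"domain": 0.6, "personal": 0.7, "heuristics": 0.9},
    "http://dblp.example.org/jane_doe": {"domain": 0.8, "personal": 0.9, "heuristics": 0.1},
}
```

Because each scoring function is just one named entry in the dictionaries, adding another scoring function only requires a new key and a weight for it.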
Strengths
Disambiguates between people with the same name, provided that only one of them works in the database field.
Fits well in the DBLife architecture, since our algorithm runs offline over the whole list of researchers that we get from DBLP.
Strengths (cont.)
Incremental architecture: finds new researchers through DBLP; finds new domain-related words through DBWorld.
Modular architecture: we can add more scoring functions.
Limitations
Can’t distinguish between pages that look like the homepage that we are looking for.
Can’t distinguish between people with the same name who work in the same area (databases).
Depends on Google, DBLP, and DBWorld.
Demo …
Questions ?
Thank you!