Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang.


1 Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang

2 Problem Definition: Given the full name of a database researcher, find his/her homepage. Homepage definition: (to be discussed in class)

3 Related Work See previous group’s slides

4 Architecture Input: Name. Scoring components: Personal Dictionary; Domain Dictionary (to distinguish database-related webpages from the rest); Heuristics (to distinguish personal homepages from common sites). Combined by: Weighting function. Output: Homepage.

5 Domain Dictionary A set of words that are common in the database community. Our approach: DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)

6 Domain Dictionary Dictionary Building: parse documents from each source into 2-word phrases and calculate their frequency. Entries by source (phrase, frequency):
DBWorld: data mine 4.47E-03; dbworld messag 4.38E-03; paper submiss 3.78E-03; program committe 3.10E-03; import date 2.98E-03; state univers 2.74E-03; intern confer 2.73E-03; comput scienc 2.70E-03; hong kong 2.65E-03; camera readi 2.56E-03; data manag 2.33E-03
DBConference: queri process 1.63E-02; mobil databas 1.36E-02; languag featur 1.09E-02; data manag 1.09E-02; xqueri implement 0.008174387; queri languag 8.17E-03; queri optim 0.005449591; process data 0.005449591; data mine 0.005449591; research prototyp 0.005449591; databas architectur 0.005449591
Contrast Area: program committe 0.019085487; mathemat scienc 0.007952286; mathemat physic 0.006361829; intern confer 0.0055666; date june 0.005168986; intern institut 0.004373758; schr dinger 0.003976143; erwin schr 0.003976143; dinger intern 0.003976143; degli studi 0.003578529
DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)
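The dictionary-building step can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the function name `bigram_frequencies` is hypothetical, frequencies are assumed to be normalized by the total number of phrases, and the stemming step implied by forms such as "queri" is left out.

```python
import re
from collections import Counter


def bigram_frequencies(text):
    """Return the relative frequency of each adjacent word pair in text.

    Assumption: words are lowercase alphabetic tokens and no stemming
    is applied (the slide's phrases appear to be stemmed).
    """
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter(zip(words, words[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {" ".join(pair): count / total for pair, count in pairs.items()}
```

Calling it once per source (DBWorld, DBConference, contrast area) would give three phrase-frequency dictionaries of the kind listed on the slide.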

7 Domain Dictionary (cont.) Similarity Measuring: (1) Parse the webpage into 2-word phrases and calculate their frequency. (2) Use the cosine similarity measure based on phrase frequency to get a score from each dictionary: S_dbworld, S_dbconf, S_contrast. (3) Combine S_dbworld, S_dbconf, and (1 - S_contrast) using the geometric mean.
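A small sketch of steps (2) and (3), assuming each dictionary and the webpage are represented as mappings from phrase to frequency. The names `cosine_similarity` and `domain_score` are hypothetical, and clamping (1 - S_contrast) at zero is an added safeguard against floating-point rounding, not something the slide specifies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two phrase -> frequency mappings."""
    dot = sum(value * b.get(phrase, 0.0) for phrase, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def domain_score(page, dbworld, dbconf, contrast):
    """Geometric mean of S_dbworld, S_dbconf and (1 - S_contrast)."""
    s_dbworld = cosine_similarity(page, dbworld)
    s_dbconf = cosine_similarity(page, dbconf)
    s_contrast = cosine_similarity(page, contrast)
    # Clamp so rounding error cannot produce a negative factor.
    not_contrast = max(0.0, 1.0 - s_contrast)
    return (s_dbworld * s_dbconf * not_contrast) ** (1.0 / 3.0)
```

Because the geometric mean multiplies its factors, a page scores near zero if it resembles neither database source or if it closely matches the contrast area.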

8 Personal Dictionary A set of words related to the specific person that we are looking for. Our approach: use DBLP to find information about co-authors, keywords of research, and conferences

9 Personal Dictionary (1) Given a researcher's name, find his/her DBLP page. (2) Build the personal dictionary, using Term Frequency and Entry Frequency (the number of publication entries in which a term appears). (3) Use the cosine measure to evaluate the similarity between a webpage and this personal dictionary.
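Step (2) might look like the sketch below. The slide does not say how Term Frequency and Entry Frequency are combined, so the weight used here (term frequency times the fraction of entries containing the term) is purely an assumption, as is the hypothetical function name `personal_dictionary`. The input is assumed to be a list of publication-entry strings already extracted from the researcher's DBLP page; fetching and parsing that page is out of scope.

```python
import re
from collections import Counter


def personal_dictionary(entries):
    """Build a term -> weight mapping from publication entries.

    Assumed weighting: tf(term) * ef(term) / number_of_entries, where
    tf counts every occurrence and ef counts entries containing the term.
    """
    term_freq = Counter()
    entry_freq = Counter()
    for entry in entries:
        terms = re.findall(r"[a-z]+", entry.lower())
        term_freq.update(terms)
        entry_freq.update(set(terms))  # count each term once per entry
    n = len(entries)
    if n == 0:
        return {}
    return {t: term_freq[t] * entry_freq[t] / n for t in term_freq}
```

The resulting mapping can then be compared against a webpage's term frequencies with a cosine measure, as step (3) describes.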

10 Heuristics Rules to distinguish a homepage from other websites. Our Heuristics: In title: name, “Homepage”, “DBLP”, “eventseer”. In URL: a version of the person’s name, “citeseer”. In body: visual cues, specific keywords {University, Department, Professor, Research, Homepage}, co-occurrence of “publication” and the person’s name.
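One way such rules could be turned into a score is sketched here. Several things are assumptions rather than part of the slides: the function name `heuristic_score`, all point values, the reading that “DBLP”, “eventseer”, and “citeseer” count against a page while the other cues count for it, and the use of the last name as the "version of the person's name" in the URL. Visual cues are not modeled.

```python
BODY_KEYWORDS = ("university", "department", "professor", "research", "homepage")


def heuristic_score(name, title, url, body):
    """Score a candidate page; higher means more homepage-like.

    All weights below are illustrative placeholders.
    """
    name_l, title_l = name.lower(), title.lower()
    url_l, body_l = url.lower(), body.lower()
    parts = name_l.split()
    last_name = parts[-1] if parts else ""

    score = 0.0
    # Title cues
    if name_l and name_l in title_l:
        score += 1.0
    if "homepage" in title_l or "home page" in title_l:
        score += 1.0
    if any(site in title_l for site in ("dblp", "eventseer")):
        score -= 2.0  # assumed to indicate an aggregator site
    # URL cues
    if last_name and last_name in url_l:
        score += 1.0
    if "citeseer" in url_l:
        score -= 2.0  # assumed to indicate an aggregator site
    # Body cues
    score += 0.2 * sum(keyword in body_l for keyword in BODY_KEYWORDS)
    if "publication" in body_l and name_l and name_l in body_l:
        score += 1.0
    return score
```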

11 Recall… Input: Name. Scoring components: Personal Dictionary, Domain Dictionary, Heuristics. Combined by: Weighting function. Output: Homepage.

12 Combining Scores Experimentally assign weights for the previous scoring functions. Return the URL with the highest score.
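The combination step could be as simple as the following sketch. A weighted sum is assumed here because the slide only says that weights are assigned experimentally; the function name `best_homepage`, the score keys, and the example weights are all hypothetical.

```python
def best_homepage(candidates, weights):
    """Return the URL with the highest weighted score, or None if empty.

    candidates: URL -> {component name -> score}
    weights:    component name -> weight (chosen experimentally)
    """
    def total(scores):
        return sum(w * scores.get(component, 0.0) for component, w in weights.items())

    return max(candidates, key=lambda url: total(candidates[url]), default=None)


# Illustrative weights only; the slides do not give the actual values.
example_weights = {"domain": 0.3, "personal": 0.3, "heuristics": 0.4}
```

Keeping the weights in a separate mapping makes it easy to add another scoring function later without changing the selection logic.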

13 Strengths Disambiguates between people with the same name, provided that only one of them works in the database field. Fits well in the DBLife architecture, since our algorithm runs offline over the whole list of researchers that we get from DBLP.

14 Strengths (cont.) Incremental architecture: finds new researchers through DBLP; finds new domain-related words through DBWorld. Modular architecture: we can add more scoring functions.

15 Limitations Can’t distinguish between pages that look like the homepage we are looking for. Can’t distinguish between people with the same name who work in the same area (databases). Depends on Google, DBLP, and DBWorld.

16 Demo …

17 Questions ?

18 Thank you!

