Linh Harvesting useful data from researchers’ homepages.

1 Linh Harvesting useful data from researchers’ homepages

2 15-Aug-08 Outline
- Researchers’ homepages
- Challenges
- Related works

3 Researchers’ homepages
- Lots of useful information about the researchers themselves
  - Basic information
  - Contact information
  - Educational history
  - Publications

4 Challenges
- Different layouts
  - Templates
  - Personal pages
- Different content
  - Pages introducing researchers
  - CV-like
  - Personal pages
- Different content structures
  - Tables / lists
  - Natural language text

5 Challenges
- Different data presentations
  - hangli at microsoft dot com
  - junyang
  - erafalin(at)
  - Natalio.Krasnogor -replace all this by at symbol-
  - wmt then the at-sign then uci dot edu
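
The obfuscated e-mail forms listed on this slide can be partly undone with a few substitution rules. The sketch below is only an illustration, not the method used in the talk: the pattern lists, the function name `normalize_email`, and the `example.org` address in the usage lines are assumptions, and real homepages use many more variants than these rules cover.

```python
import re

# Hypothetical substitution rules; tried in order, first match wins for "@".
AT_PATTERNS = [
    r"\s*\(\s*at\s*\)\s*",           # "user(at)host"
    r"\s*\[\s*at\s*\]\s*",           # "user[at]host"
    r"\s+then the at-sign then\s+",  # "user then the at-sign then host"
    r"\s+at\s+",                     # "user at host"
]
DOT_PATTERNS = [
    r"\s*\(\s*dot\s*\)\s*",
    r"\s*\[\s*dot\s*\]\s*",
    r"\s+dot\s+",
]

# Deliberately loose check that the result looks like an address.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def normalize_email(text):
    """Return a plain address for a known obfuscation, or None."""
    candidate = text.strip()
    # Replace only the first "at" marker that matches.
    for pattern in AT_PATTERNS:
        candidate, count = re.subn(pattern, "@", candidate, count=1,
                                   flags=re.IGNORECASE)
        if count:
            break
    # Replace every spelled-out "dot".
    for pattern in DOT_PATTERNS:
        candidate = re.sub(pattern, ".", candidate, flags=re.IGNORECASE)
    return candidate if EMAIL_RE.match(candidate) else None


print(normalize_email("hangli at microsoft dot com"))
print(normalize_email("wmt then the at-sign then uci dot edu"))
print(normalize_email("user(at)example.org"))
```

Rules of this kind are brittle: a phrase such as "look at me dot com" would also be rewritten, so in practice the output would need to be checked against context (for example, proximity to a "Contact" label).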

6 Related works – Tang et al (2008)
- Tang et al. (2008) – ArnetMiner
  - Separate text into tokens (5 token types)
  - Assign possible tags to each token (CRF)
  - Extract profile properties (Amilcare tool and SVM)
  - F1 = 83.37% (1,000 researchers)
- Name disambiguation: may be simpler in our case
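
For reference, F1 is the harmonic mean of precision and recall. The sketch below shows how such a score is computed from extraction counts; the function name and the counts in the usage line are made-up illustrations, not figures from Tang et al.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """F1 = 2PR / (P + R), with P and R derived from raw counts."""
    if true_pos == 0:
        return 0.0  # no correct extractions: precision and recall are both 0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct fields, 2 spurious, 2 missed.
print(round(f1_from_counts(8, 2, 2), 4))
```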

7 Related works - Cai et al (2003)
- Cai et al (2003) - Visual-based content structure extraction
  - Independent of the underlying document representation
  - Visual-based Page Segmentation (VIPS)
  - Combines DOM structure and visual cues (tag, color, text, size)
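
To give a feel for block segmentation on a homepage, here is a much simplified sketch. It is not VIPS: it uses only the DOM (via Python's standard `html.parser`) and treats heading tags as the single "visual" separator, whereas VIPS works with rendered layout information. The class name, tag sets, and sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch only.
HEADING_TAGS = {"h1", "h2", "h3", "h4"}
BLOCK_TAGS = {"div", "p", "table", "tr", "ul", "ol", "li", "section"}


class BlockSegmenter(HTMLParser):
    """Collect (heading, text) pairs, one per block-level element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._label = None       # most recent heading text
        self._buf = []           # text of the current block
        self._in_heading = False
        self._heading_buf = []

    def flush(self):
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append((self._label, text))
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.flush()
            self._in_heading = True
            self._heading_buf = []
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._label = " ".join(" ".join(self._heading_buf).split())
            self._in_heading = False
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        (self._heading_buf if self._in_heading else self._buf).append(data)


def segment(html):
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.blocks


page = ("<h2>Contact</h2><p>Office 12</p>"
        "<h2>Publications</h2><ul><li>Paper A</li><li>Paper B</li></ul>")
for label, text in segment(page):
    print(label, "|", text)
```

Labelling each block with its nearest preceding heading lets a later extraction step look for e-mail addresses only under "Contact" and for paper titles only under "Publications", which is the kind of speed and accuracy gain segmentation is meant to provide.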

8 Related works - Cai et al (2003)

9 Related works - Cai et al (2003)
- Strengths
  - Domain independent
  - Layout independent
  - No training data required
  - Good results in the evaluation report (97% of pages correctly detected)
- Applicability
  - Can be used to improve the speed and correctness of retrieval
  - Homepage layouts vary in complexity

10 References
- J. Tang, D. Zhang, and L. Yao. Social network extraction of academic researchers. In Proc. of ICDM’2007, pp.
- D. Cai, S. Yu, J.R. Wen and W.Y. Ma (2003). Extracting content structure for web pages based on visual representation. In the 5th APWC, pp.
- C.H. Lee (2004). PARCELS: PARser for Content Extraction and Logical Structure (Stylistic detection). Honours Thesis, School of Computing, NUS.
- J. Chen, K. Xiao (2008). Perception-oriented Online news extraction. In JCDL 2008, pp. 363
- Amilcare Webpage
- Wikipedia Webpage
- W3Schools Webpage

11 Linh
