Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary.

Slides:



Advertisements
Similar presentations
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
Advertisements

Using Query Patterns to Learn the Durations of Events Andrey Gusev joint work with Nate Chambers, Pranav Khaitan, Divye Khilnani, Steven Bethard, Dan Jurafsky.
Problem Semi supervised sarcasm identification using SASI
Mining External Resources for Biomedical IE Why, How, What Malvina Nissim
© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
Introduction to Supervised Machine Learning Concepts PRESENTED BY B. Barla Cambazoglu February 21, 2014.
LEDIR : An Unsupervised Algorithm for Learning Directionality of Inference Rules Advisor: Hsin-His Chen Reporter: Chi-Hsin Yu Date: From EMNLP.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.
Predicting Text Quality for Scientific Articles AAAI/SIGART-11 Doctoral Consortium Annie Louis : Louis A. and Nenkova A Automatically.
Information Extraction CS 652 Information Extraction and Integration.
CS4705.  Idea: ‘extract’ or tag particular types of information from arbitrary text or transcribed speech.
Event Extraction: Learning from Corpora Prepared by Ralph Grishman Based on research and slides by Roman Yangarber NYU.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
Predicting the Semantic Orientation of Adjective Vasileios Hatzivassiloglou and Kathleen R. McKeown Presented By Yash Satsangi.
An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.
1 Natural Language Processing for the Web Prof. Kathleen McKeown 722 CEPSR, Office Hours: Wed, 1-2; Tues 4-5 TA: Yves Petinot 719 CEPSR,
Extracting Interest Tags from Twitter User Biographies Ying Ding, Jing Jiang School of Information Systems Singapore Management University AIRS 2014, Kuching,
Mining the Medical Literature Chirag Bhatt October 14 th, 2004.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
Artificial Intelligence Research Centre Program Systems Institute Russian Academy of Science Pereslavl-Zalessky Russia.
Mining and Summarizing Customer Reviews
Webpage Understanding: an Integrated Approach
Information Extraction Yahoo! Labs Bangalore Rajeev Rastogi Yahoo! Labs Bangalore.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
APPLICATIONS OF DATA MINING IN INFORMATION RETRIEVAL.
Bootstrapping Information Extraction from Semi-Structured Web Pages Andy Carlson (Machine Learning Department, Carnegie Mellon) Charles Schafer (Google.
The CoNLL-2013 Shared Task on Grammatical Error Correction Hwee Tou Ng, Yuanbin Wu, and Christian Hadiwinoto 1 Siew.
Review of the web page classification approaches and applications Luu-Ngoc Do Quang-Nhat Vo.
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Knowledge from Text Using Information Extraction.
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer 1 Ting-Hao (Kenneth) Huang Yun-Nung (Vivian) Chen Lingpeng Kong
Presenter: Lung-Hao Lee ( 李龍豪 ) January 7, 309.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Presenter: Shanshan Lu 03/04/2010
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Feature Detection in Ajax-enabled Web Applications Natalia Negara Nikolaos Tsantalis Eleni Stroulia 1 17th European Conference on Software Maintenance.
CS 6998 NLP for the Web Columbia University 04/22/2010 Analyzing Wikipedia and Gold-Standard Corpora for NER Training William Y. Wang Computer Science.
Bootstrapping for Text Learning Tasks Ramya Nagarajan AIML Seminar March 6, 2001.
Intelligent Web Topics Search Using Early Detection and Data Analysis by Yixin Yang Presented by Yixin Yang (Advisor Dr. C.C. Lee) Presented by Yixin Yang.
Project Overview Vangelis Karkaletsis NCSR “Demokritos” Frascati, July 17, 2002 (IST )
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Matwin Text classification: In Search of a Representation Stan Matwin School of Information Technology and Engineering University of Ottawa
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
1 Language Specific Crawler for Myanmar Web Pages Pann Yu Mon Management and Information System Engineering Department Nagaoka University of Technology,
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Supertagging CMSC Natural Language Processing January 31, 2006.
Automatic recognition of discourse relations Lecture 3.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.
BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES Andrew Carson and Charles Schafer.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon -Smit Shilu.
Institute of Informatics & Telecommunications NCSR “Demokritos” Spidering Tool, Corpus collection Vangelis Karkaletsis, Kostas Stamatakis, Dimitra Farmakiotou.
Harnessing the Deep Web : Present and Future -Tushar Mhaskar Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, Alon Halevy January 7,
Research at Open Systems Lab IIIT Bangalore
Social Knowledge Mining
The CoNLL-2014 Shared Task on Grammatical Error Correction
Automatic Detection of Causal Relations for Question Answering
Web Mining Department of Computer Science and Engg.
Extracting Why Text Segment from Web Based on Grammar-gram
KnowItAll and TextRunner
Presentation transcript:

Researcher affiliation extraction from homepages I. Nagy, R. Farkas, M. Jelasity University of Szeged, Hungary

Scientific social information Research interest Education Previous and current affiliations, projects Professional memberships Teaching activities Students, supervisors Personal (nationality, age)

Source of social information Social sites (e.g. ) –structured, lots of information –coverage? Citation databases –limited information (coauthors, affiliations, citations) Homepages –thought to be important by the researcher himself –almost every researcher has (a) homepage –unstructured

Web Content Mining Early systems (’ ): expert rules Seed-driven systems –Input: seed pairs of target information –Extract patterns from unlabeled text (e.g. Web) –Exploits redundancy (celebs) –High precision Researcher homepage –Long tail, high recall required

Case study: affiliation information [affiliation;position;start date;end date] Frequently given Experiences can be generalised useful: –Collegial relationships (whether they worked with the same group at the same time) –Do American or European researchers change their workplace more often?

Architecture Locating the homepage of the researcher –name disambiguation Locating the relevant parts of the site –pages (focused crawling), parts Extracting information tuples –Weakly supervised setting Normalisation –For every source of information

Manually tagged corpus 455 sites, 5282 pages for 89 researchers three-level deep annotation hierarchy with 44 classes manual annotation in the original HTML format (WYSWYG) with hyperlinks low inter-annotation agreement focus on affiliation

Sample

Textual information 47% textual, 24% itemised, 29% hybrid Structured: wrapper induction Textual paragraph: longer than 40 characters and contains at least one verb

Relevant parts Every researcher has (a) homepage Every homepage can be found in the top10 Google response (query=name) „CV site” always in depth 1 Textual paragraphs contain cluewords –class conditional prob. based 1-DNF –filtering 70k irrelevant paragraphs

Slot detection It is not a NER –just affiliation related entities –surface features are insufficient Standard procedure (CRF) –with domain specific lists as extra feature –domain specific segmentation –70% phrase level F-measure, one- researcher-leave-out (37% by lists/regexp)

Subject detection Sometimes information about supervisors, colleagues Hypothesis: paragraphs are „homogeneous” Two procedures –NER for person names (trained on CoNLL) –personal pronouns –~70% accuracy on gold standard and on predicted too

Collecting information tuples affiliation is the head Heuristic: assign each year and position_type to the nearest affiliation ~90% accuracy using the gold- standard labels ~70% accuracy using the labels predicted by the system (FPs count as misclassified)

Problematic issues „I am a Ph.D. Student working under the supervision of Prof. NAME” „Hewlett-Packard Labs in Palo Alto” „Ph.D. from MIT in Physics” „Department of Computer Science, [Waterloo University] BASELINE ” „I lead the Distributed Systems Group” In-domain name detection Enumeration detection is important (syntactic parsers?)

Conclusions Information from homepages of researchers Special nature of the tasks: –long tail –small labeled corpus –lack of domain-specific parsers Several well defined subtasks Basic solutions for each subtask

Thank you!