Charis Ermopoulos Yong Yang Hanna Zhong Qian Yang.


Problem Definition: Given the full name of a database researcher, find his or her homepage. Homepage definition: (to be discussed in class)

Related Work: See the previous group's slides.

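Architecture: The Name is the input to three scoring components (Domain Dictionary, Personal Dictionary, Heuristics), whose scores feed a Weighting function that outputs the Homepage.
Domain Dictionary: to distinguish database-related webpages from the rest.
Heuristics: to distinguish personal homepages from common sites.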

Domain Dictionary: A set of words that are common in the database community. Our approach: DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)

Domain Dictionary: Dictionary building. Parse documents from each source into 2-word phrases and calculate their frequency.
DBWorld:
  data mine            4.47E-03
  dbworld messag       4.38E-03
  paper submiss        3.78E-03
  program committe     3.10E-03
  import date          2.98E-03
  state univers        2.74E-03
  intern confer        2.73E-03
  comput scienc        2.70E-03
  hong kong            2.65E-03
  camera readi         2.56E-03
  data manag           2.33E-03
DBConference:
  queri process        1.63E-02
  mobil databas        1.36E-02
  languag featur       1.09E-02
  data manag           1.09E-02
  xqueri implement
  queri languag        8.17E-03
Remaining phrases from the DBConference and Contrast Area columns (no frequencies given): queri optim, process data, data mine, research prototyp, databas architectur, program committe, mathemat scienc, mathemat physic, intern confer, date june, intern institut, schr dinger, erwin schr, dinger intern, degli studi
DBWorld + DBConference - Contrast Area = Our Dictionary (Virtual)
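The dictionary-building step above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `phrase_frequencies` is hypothetical, frequencies are normalized by the total number of phrases (the slide does not say how they are normalized), and stemming is skipped even though the phrases on the slide appear to be stemmed.

```python
import re
from collections import Counter

def phrase_frequencies(text):
    """Return the relative frequency of each adjacent 2-word phrase in text.

    Assumption: frequency = phrase count / total number of phrases.
    No stemming is applied in this sketch.
    """
    words = re.findall(r"[a-z]+", text.lower())
    # Pair every word with its successor to form 2-word phrases.
    phrases = [" ".join(pair) for pair in zip(words, words[1:])]
    counts = Counter(phrases)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phrase: n / total for phrase, n in counts.items()}

# Made-up sample text, not taken from DBWorld.
sample = "Data mining and data management. Data mining papers."
freqs = phrase_frequencies(sample)
```

Running the same function over each source (DBWorld messages, conference pages, contrast-area pages) would give one frequency table per source.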

Domain Dictionary (cont.): Similarity measuring.
(1) Parse the webpage into 2-word phrases and calculate their frequency.
(2) Use the cosine similarity measure based on phrase frequency to get a score from each dictionary: S_dbworld, S_dbconf, S_contrast.
(3) Combine S_dbworld, S_dbconf, and (1 - S_contrast) using the geometric average.
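A minimal Python sketch of the three similarity steps above, assuming each dictionary and the webpage are already available as phrase-to-frequency mappings. The function names `cosine` and `domain_score` are hypothetical, and the clamp on (1 - S_contrast) is an added safeguard against floating-point rounding, not something the slides mention.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse phrase-frequency vectors (dicts)."""
    dot = sum(value * b.get(phrase, 0.0) for phrase, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def domain_score(page, dbworld, dbconf, contrast):
    """Geometric average of S_dbworld, S_dbconf and (1 - S_contrast)."""
    s_dbworld = cosine(page, dbworld)
    s_dbconf = cosine(page, dbconf)
    s_contrast = cosine(page, contrast)
    # Clamp so rounding error can never make the factor negative.
    not_contrast = max(0.0, 1.0 - s_contrast)
    return (s_dbworld * s_dbconf * not_contrast) ** (1.0 / 3.0)

# Toy vectors for illustration only.
page = {"data mine": 1.0}
dbworld = {"data mine": 1.0}
dbconf = {"data mine": 1.0, "queri process": 1.0}
contrast = {"mathemat physic": 1.0}
score = domain_score(page, dbworld, dbconf, contrast)
```

Because the geometric average multiplies the three factors, a page that matches the contrast dictionary perfectly, or matches neither database dictionary, scores zero.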

Personal Dictionary: A set of words related to the specific person we are looking for. Our approach: use DBLP to find information about co-authors, research keywords, and conferences.

Personal Dictionary:
(1) Given a researcher's name, find his or her DBLP page.
(2) Build the personal dictionary using Term Frequency and Entry Frequency (the number of publication entries in which a term appears).
(3) Use the cosine measure to evaluate the similarity between a webpage and this personal dictionary.
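Step (2) above could look roughly like the following Python sketch. The slides do not say how Term Frequency and Entry Frequency are combined, so the product used here is purely an assumption, as are the function name `personal_dictionary` and the made-up publication entries. Fetching and parsing the DBLP page is left out.

```python
import re
from collections import Counter

def personal_dictionary(entries):
    """Weight each term from a researcher's publication entries.

    Term Frequency  = total occurrences of the term across all entries.
    Entry Frequency = number of entries in which the term appears.
    Assumption: the weight is their product (not specified in the slides).
    """
    term_freq = Counter()
    entry_freq = Counter()
    for entry in entries:
        terms = re.findall(r"[a-z]+", entry.lower())
        term_freq.update(terms)
        entry_freq.update(set(terms))  # count each term once per entry
    return {term: term_freq[term] * entry_freq[term] for term in term_freq}

# Hypothetical entries standing in for text taken from a DBLP page.
entries = [
    "Query optimization for data streams. VLDB",
    "Data integration and data cleaning. SIGMOD",
]
weights = personal_dictionary(entries)
```

The resulting term-to-weight mapping can then be compared with a webpage's term vector using the cosine measure from step (3).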

Heuristics: Rules to distinguish a homepage from other websites. Our heuristics:
In title: name, "Homepage", "DBLP", "eventseer".
In URL: a version of the person's name, "citeseer".
In body: visual cues; specific keywords {University, Department, Professor, Research, Homepage}; co-occurrence of "publication" and the person's name.
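One way the rules above might be turned into a score is sketched below in Python. Several things here are assumptions rather than the authors' method: the numeric weights are invented, "DBLP", "eventseer", and "citeseer" are treated as negative cues (pages on common sites rather than personal homepages), the "version of the person's name" check only looks for the last name in the URL, and visual cues are omitted. The name `heuristic_score` and the example URLs are hypothetical.

```python
def heuristic_score(name, url, title, body):
    """Score how homepage-like a page looks; all weights are illustrative."""
    name_l = name.lower()
    name_parts = name_l.split()
    url_l, title_l, body_l = url.lower(), title.lower(), body.lower()
    score = 0.0

    # Title cues.
    if name_l in title_l:
        score += 1.0
    if "homepage" in title_l or "home page" in title_l:
        score += 1.0
    if any(site in title_l for site in ("dblp", "eventseer")):
        score -= 2.0  # assumed negative: a listing site, not a homepage

    # URL cues (simplified: last name only).
    if name_parts and name_parts[-1] in url_l:
        score += 1.0
    if "citeseer" in url_l:
        score -= 2.0  # assumed negative

    # Body cues: keyword hits, plus "publication" co-occurring with the name.
    keywords = ("university", "department", "professor", "research", "homepage")
    score += 0.2 * sum(keyword in body_l for keyword in keywords)
    if "publication" in body_l and name_l in body_l:
        score += 1.0
    return score

home = heuristic_score(
    "Jane Doe",
    "http://cs.example.edu/~doe/",
    "Jane Doe - Homepage",
    "Professor, Department of Computer Science, Example University. "
    "Publications of Jane Doe",
)
listing = heuristic_score(
    "Jane Doe",
    "http://dblp.example.org/doe",
    "DBLP: Jane Doe",
    "Publications of Jane Doe",
)
```

With these made-up inputs the personal page scores higher than the listing page, which is the behavior the heuristics are meant to produce.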

Recall the architecture: the Name goes to the Domain Dictionary, Personal Dictionary, and Heuristics, whose scores go to the Weighting function, which returns the Homepage.

Combining Scores: Assign weights to the previous scoring functions experimentally. Return the URL with the highest score.
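A small Python sketch of this final step, assuming the scores are combined as a weighted sum (the slides only say that weights are assigned experimentally, so the linear form is an assumption). The function name `best_homepage`, the weight values, and the candidate URLs are all hypothetical.

```python
def best_homepage(candidates, weights):
    """Return the candidate URL with the highest weighted score.

    candidates: {url: {"domain": s1, "personal": s2, "heuristics": s3}}
    weights:    {"domain": w1, "personal": w2, "heuristics": w3}
    """
    def combined(scores):
        # Assumed combination: weighted sum of the individual scores.
        return sum(weights[key] * scores[key] for key in weights)

    return max(candidates, key=lambda url: combined(candidates[url]))

# Placeholder weights; in practice they would be tuned experimentally.
weights = {"domain": 0.4, "personal": 0.4, "heuristics": 0.2}
candidates = {
    "http://cs.example.edu/~doe/": {"domain": 0.8, "personal": 0.7, "heuristics": 0.9},
    "http://listing.example.org/doe": {"domain": 0.9, "personal": 0.2, "heuristics": 0.1},
}
best = best_homepage(candidates, weights)
```

Keeping the weights in a separate mapping fits the modular design: adding another scoring function only means adding one more key to both dictionaries.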

Strengths: Disambiguates between people with the same name, provided that only one of them works in the databases field. Fits well into the DBLife architecture, since our algorithm runs offline over the whole list of researchers that we get from DBLP.

Strengths (cont.): Incremental architecture: finds new researchers through DBLP and new domain-related words through DBWorld. Modular architecture: we can add more scoring functions.

Limitations: Can't distinguish between pages that look like the homepage we are looking for. Can't distinguish between people with the same name who work in the same area (databases). Depends on Google, DBLP, and DBWorld.

Demo …

Questions?

Thank you!