Hypersearching the Web. Hira Bashir, June 22, 2010. Paper authors: Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan & Andrew Tomkins.

Presentation transcript:

… CHAOS!

I’m a Search Engine! Enter your text here: “Learn Pashto Language” [Search]

Index
Word | Web pages containing the word
A | Webpage 1, Webpage 2, …, Webpage L
Abate | Webpage 1, Webpage 2, …, Webpage M
Abash | Webpage 1, Webpage 2, …, Webpage N
Challenge: creating, maintaining, and using the index
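
The word-to-pages table above is what is usually called an inverted index. A minimal sketch of how one could be built and queried follows; the page names and texts are made up for illustration, and `build_index` and `search` are hypothetical helper names, not part of any system the slides describe.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted list of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)


# Illustrative pages only (assumed data, not from the slides).
pages = {
    "Webpage 1": "learn pashto language basics",
    "Webpage 2": "pashto poetry and songs",
    "Webpage 3": "cheap airfare deals",
}
index = build_index(pages)
print(search(index, "Learn Pashto"))  # prints ['Webpage 1']
```

Even in this toy form, the three challenges on the slide are visible: the index has to be created from every page, rebuilt when pages change, and consulted for each query word.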

Ranking Function Heuristics
Learn Pashto Language
“Welcome to the world’s best resource to learn… - Pashto Poetry - Pashto Poets - Greetings in Pashto”
“Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan or Baluchistan. This text would cover the basic vocabulary to familiarize you with this language. If you know Urdu, it would be easier to learn written…”
“Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan. Pashto poetry and Pashto songs have been a highlight of the Pashto language. Pashto has a rich vocabulary. Pashto is much older than…”
What’s the problem?

Spamming: “Cheap airfare cheap airfare cheap airfare cheap airfare…!!!”
“OK, I do spamming, but it ain’t just me!”
- Synonymy
- Polysemy
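
To see why keyword repetition works as spam, here is a small sketch that assumes the ranking heuristic is a plain count of query-word occurrences. That heuristic is an assumption made for illustration; the slides do not say which heuristics a given engine uses. The function name `term_count_score` and both sample texts are hypothetical.

```python
def term_count_score(query, text):
    """Score a page by how many times the query words occur in it."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Assumed sample pages: one ordinary page and one keyword-stuffed page.
honest = "compare cheap airfare offers from several airlines before you book"
spam = "cheap airfare " * 20

query = "cheap airfare"
honest_score = term_count_score(query, honest)
spam_score = term_count_score(query, spam)
print(honest_score, spam_score)  # prints 2 40
```

Under this assumed heuristic the stuffed page outranks the ordinary one by a wide margin, although it carries no information.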

Any Solution?

WordNet Project (by George A. Miller & colleagues at Princeton University)
- Index-based search engine
- Semantic network
- Group of human linguists
- Car, Automobile: now search

Any Challenges?

Polysemy would aggravate!

And even with Synonymy….

Polysemy would aggravate! FAQs BOTs Browse

Clever Project at IBM: Are we ignoring something useful? Yes... more than a billion carefully placed hyperlinks!

Some issues in search queries:
- Harvard example
- IBM example
- Website design
Let’s fix this! Query list: override with predetermined RIGHT answers
An observation

Clever Project at IBM: Underlying approach/idea. A link from Location A to Location B is an implicit endorsement of Location B by Location A.

Clever Project at IBM: Finding authoritative sites on broad topics with the help of hyperlinks
Hub -> (recommendation) -> Authority
Hub: - My Fav Links - Commercial links - Personal inventories

Clever Project at IBM: What computational method is used to identify hubs and authorities?
Candidate pages: Page 1, Page 2, Page 3, …
Good hub? Yes, No, Yes, …
Good authority? No, Yes, …
Initial estimates are made by guessing. The estimate about hubs is used to improve the guess about authorities, and the guess about authorities is used to improve the estimate about hubs.
Where do the best hubs point most heavily?

Clever Project at IBM: More light on implementation
Topic: Acupuncture
- Initial list (200 pages): from any standard text index, such as AltaVista
- Augmented list (the root set): the initial list plus the pages that link to and from the pages in the initial list
- Every page gets an initial authority score and an initial hub score
- Authority score: sum of the hub scores of the other locations pointing to it
- Hub score: sum of the authority scores of the other locations it points to
These scores are re-adjusted iteratively until the results are fine-tuned and start settling down!
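
The update rule described above can be sketched in a few lines. This is an illustrative toy, not Clever's actual implementation: the link graph, the starting score of 1.0, the length normalization, and the function name `hits` are all assumptions.

```python
import math


def hits(links, rounds=5):
    """links maps each page to the list of pages it links to.

    Returns (hub, auth) dictionaries after the given number of rounds.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}   # assumed initial guess
    auth = {p: 1.0 for p in pages}  # assumed initial guess
    for _ in range(rounds):
        # Authority score: sum of hub scores of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages it points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so that the scores settle instead of growing without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth


# Toy root set (assumed): three link lists pointing at two content pages.
links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB"],
    "hub3": ["authA"],
}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # prints authA
```

In this toy graph `authA` ends up as the strongest authority because all three hubs point to it, and `hub1` and `hub2` score above `hub3` because they point to both authorities.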

Clever Project at IBM: In visual terms...

Clever Project at IBM: Looking at the mathematics behind...
- Vector: the hub scores and the authority scores
- Matrix: numerical values defining the hyperlinking structure of the root set
- Iteratively multiplying the vectors by the matrix gives the result: hub and authority vectors (eigenvectors) that equilibrate to stable values!
Observations:
- If the root set’s size is 3,000 pages, 5 rounds of calculations will be enough to steady the hub and authority scores
- The result is independent of the initial scores
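
One common way to write the vector and matrix view down is shown below. The symbols are assumptions introduced here, not notation from the slides: $A$ is the link matrix of the root set, with $A_{ij} = 1$ if page $i$ links to page $j$ and $0$ otherwise, and $a^{(k)}$, $h^{(k)}$ are the authority and hub vectors after round $k$.

```latex
\begin{align}
  a^{(k+1)} &= A^{\mathsf{T}} h^{(k)}, &
  h^{(k+1)} &= A \, a^{(k+1)}, \\
  a^{(k+1)} &= A^{\mathsf{T}} A \, a^{(k)}, &
  h^{(k+1)} &= A A^{\mathsf{T}} h^{(k)}.
\end{align}
```

The second line follows by substituting one update into the other. With the vectors rescaled to unit length after each round, repeated multiplication is the power method, so $a^{(k)}$ and $h^{(k)}$ approach the principal eigenvectors of $A^{\mathsf{T}} A$ and $A A^{\mathsf{T}}$. That is the sense in which the scores equilibrate.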

Clever Project at IBM: A by-product of Clever!
The iterative process separates websites into clusters (Cluster 1, Cluster 3).
Example: Abortion, separated into pro-life and pro-choice clusters.

Clever Project at IBM: A larger perspective... Under the chaotic cover, an inherent, albeit inchoate, order, based on how pages are linked!

Clever Project at IBM: Garfield Measure
Paper 1 … References … Paper 4 …
Paper 2 … References … Paper 4 …
Paper 3 … References … Paper 2 …
Paper 4 … References … Paper 1 …
Paper 5 … References … Paper 4 …
(High) impact factor: a metric that judges a paper by the number of citations it gets
Reference: Eugene Garfield, founder of Science Citation Index
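
The citation counting on this slide can be checked with a short sketch. The `references` dictionary simply transcribes the five papers shown; everything else (variable names, the use of `Counter`) is an illustrative choice.

```python
from collections import Counter

# Each paper mapped to the papers listed in its references (from the slide).
references = {
    "Paper 1": ["Paper 4"],
    "Paper 2": ["Paper 4"],
    "Paper 3": ["Paper 2"],
    "Paper 4": ["Paper 1"],
    "Paper 5": ["Paper 4"],
}

# A paper's score is the number of times other papers cite it.
citations = Counter(cited for refs in references.values() for cited in refs)

for paper in sorted(references):
    print(paper, citations[paper])
```

Paper 4 is cited three times, Papers 1 and 2 once each, and Papers 3 and 5 not at all, so Paper 4 is the one with the high score under this measure.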

Any Challenges?

Any solution to this…?

Improvement to Garfield Measure
Definition of importance: Journal 1 has weight X, Journal 2 has weight Y
But…
An iterative method for computing a stable set of adjusted scores (they called them influence weights)
So there was indeed a very early solution!
No distinction between authority and hub
Reference: Gabriel Pinski and Francis Narin (1976), CHI Research

A fundamental difference… Traditional printed scientific literature vs. the Web: hubs needed!

Investigating Power of Hyperlinks
Ranking measure: link based, related to the “influence weight” idea (Pinski, Narin)
Haphazard jumps -> heavily visited locations
- Random traversal
- Finding a single kind of universally important page intuitively
In practice (web page, no. of links to it): Web 1101, Web 2304, Web 3060
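
The random-traversal idea can be sketched as a surfer who mostly follows links and sometimes jumps to an arbitrary page. The sketch below is an illustration under stated assumptions, not the ranking code of any real engine: the damping value 0.85, the number of rounds, the handling of pages without outgoing links, the toy graph, and the name `random_surfer_rank` are all choices made here.

```python
def random_surfer_rank(links, damping=0.85, rounds=50):
    """Estimate how often a random surfer visits each page.

    links maps each page to the list of pages it links to. With probability
    `damping` the surfer follows a link; otherwise it jumps to a random page.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(rounds):
        # Haphazard jumps give every page a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: from a page with no links, jump anywhere.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Toy graph (assumed): three pages link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = random_surfer_rank(links)
print(max(rank, key=rank.get))  # prints C
```

Page C is the most heavily visited location because three pages link to it. Page A ranks above B and D even though only one page links to it, because that one page is the heavily visited C.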

Difference
Clever: - Different root set for each search - Forward & backward
Google: - Initial ranking retained - Faster - Forward (link to link)
Sociological phenomenon

Future...
- Integrating text & hyperlinks: overcomes a shortcoming
- Listing Web resources
- Knitting communities
- Next 5 years? Challenges? Fundamental changes?

Questions?