Hypersearching the Web Hira Bashir - June 22, 2010 Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan & Andrew Tomkins.


1 Hypersearching the Web Hira Bashir - June 22, 2010 Soumen Chakrabarti, Byron Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan & Andrew Tomkins

2

3 … … … … … … … … … …

4 … … … … … … … … … … CHAOS!

5 Enter your text here… Search I’m a Search Engine!

6 Learn Pashto Language| Search I’m a Search Engine!

7 Learn Pashto Language Search I’m a Search Engine!

8 Words → WebPages containing the word. A → Webpage 1, Webpage 2, …, Webpage L. Abate → Webpage 1, Webpage 2, …, Webpage M. Abash → Webpage 1, Webpage 2, …, Webpage N.

9 Words → WebPages containing the word. A → Webpage 1, Webpage 2, …, Webpage L. Abate → Webpage 1, Webpage 2, …, Webpage M. Abash → Webpage 1, Webpage 2, …, Webpage N. Index

10 Words → WebPages containing the word. A → Webpage 1, Webpage 2, …, Webpage L. Abate → Webpage 1, Webpage 2, …, Webpage M. Abash → Webpage 1, Webpage 2, …, Webpage N. Challenge: Creating, maintaining, using Index
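The word-to-webpages table on the slides above is an inverted index. A minimal sketch of building and querying one follows; the page names, page texts, and the `build_index` and `search` helpers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the sorted list of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical toy corpus standing in for Webpage 1 ... Webpage N.
pages = {
    "webpage1": "learn pashto language online",
    "webpage2": "pashto poetry and songs",
    "webpage3": "learn urdu language",
}
index = build_index(pages)
matches = search(index, "Learn Pashto Language")
```

Even in this toy form, the index has to be rebuilt whenever a page changes, which hints at the "creating, maintaining, using" challenge named on the slide.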

11 Ranking Function Heuristics

12 Ranking Function Heuristics Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan or Baluchistan. This text would cover the basic vocabulary to familiarize you with this language. If you know Urdu, it would be easier for you to learn written…………. Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan. Pashto poetry and Pashto songs have been a highlight of the Pashto language. Pashto has a rich vocabulary. Pashto is much older than………………………………….

13 Ranking Function Heuristics Learn Pashto Language Welcome to the world’s best resource to learn…. -Pashto Poetry -Pashto Poets -Greetings in Pashto Welcome to the world’s best resource to learn…. Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan or Baluchistan. This text would cover the basic vocabulary to familiarize you with this language. If you know Urdu, it would be easier for you to learn written…………. Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan. Pashto poetry and Pashto songs have been a highlight of the Pashto language. Pashto has a rich vocabulary. Pashto is much older than………………………………….

14 Ranking Function Heuristics Learn Pashto Language Welcome to the world’s best resource to learn…. -Pashto Poetry -Pashto Poets -Greetings in Pashto Welcome to the world’s best resource to learn…. Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan or Baluchistan. This text would cover the basic vocabulary to familiarize you with this language. If you know Urdu, it would be easier for you to learn written…………. Pashto is the national language of Afghanistan. Pashto is extensively spoken by Pashtun tribes in Pakistan’s rural cities. Pashto has two major dialects. The Pashto spoken in Peshawar differs from the Pashto spoken in Afghanistan. Pashto poetry and Pashto songs have been a highlight of the Pashto language. Pashto has a rich vocabulary. Pashto is much older than…………………………………. What’s the problem?

15 Cheap airfare cheap airfare cheap airfare cheap airfare…………….!!!
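To see why a purely text-based ranking heuristic invites this kind of spam, consider a sketch that scores a page by how often the query terms occur in it. The `term_count_score` function and both page texts are assumptions made for illustration, not the ranking function of any real engine.

```python
def term_count_score(text, query):
    """Score a page by the total number of query-term occurrences in its text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Hypothetical pages: one honest, one stuffed with the query phrase.
honest = "compare cheap airfare deals from several airlines"
spam = "cheap airfare " * 20

query = "cheap airfare"
ranked = sorted(
    [("honest", honest), ("spam", spam)],
    key=lambda page: term_count_score(page[1], query),
    reverse=True,
)
top_page = ranked[0][0]
```

Under this heuristic the stuffed page wins simply by repeating the phrase, which is the problem the slide is pointing at.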

16

17 Ok, I do spamming but it ain’t just me!

18 Synonymy Polysemy Cheap airfare cheap airfare cheap airfare cheap airfare…………….!!! Ok, I do spamming but it ain’t just me!

19 Any Solution?

20 WordNet Project Index-based Search Engine By George A. Miller & Colleagues At Princeton University

21 WordNet Project Index-based Search Engine By George A. Miller & Colleagues At Princeton University Semantic Network Group of human linguistics

22 WordNet Project Index-based Search Engine By George A. Miller & Colleagues At Princeton University Car Automobile Semantic Network

23 WordNet Project Index-based Search Engine By George A. Miller & Colleagues At Princeton University Car Automobile Now search Semantic Network
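The car/automobile idea can be sketched as query expansion: before searching, each query word is widened to the set of its synonyms. The tiny `SYNONYMS` table and the `expand_query` and `page_matches` helpers below are hypothetical stand-ins for a semantic network; this is not the WordNet API.

```python
# Hypothetical two-entry semantic network; a real one would be far larger.
SYNONYMS = {
    "car": {"automobile"},
    "automobile": {"car"},
}


def expand_query(query, synonyms=SYNONYMS):
    """Turn each query word into the set of acceptable alternatives."""
    return [{word} | synonyms.get(word, set()) for word in query.lower().split()]


def page_matches(text, expanded_query):
    """A page matches if it contains at least one alternative for every query word."""
    words = set(text.lower().split())
    return all(words & alternatives for alternatives in expanded_query)


expanded = expand_query("car dealers")
hit = page_matches("used automobile dealers in town", expanded)
miss = page_matches("used truck dealers in town", expanded)
```

Expansion of this kind addresses synonymy only. A word with several senses would pull in the synonyms of every sense, which is why polysemy gets worse.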

24 Any Challenges?

25 Polysemy would be aggravated!

26 And even with Synonymy….

27 Polysemy would be aggravated! FAQs BOTs Browse

28 Clever Project -at IBM Are we ignoring something useful…

29 Clever Project -at IBM Yes…

30 ...More than a billion carefully placed hyperlinks! Clever Project -at IBM

31 Some issues in search queries: - Harvard example - IBM example - Website Design

32 Some issues in search queries: - Harvard example - IBM example - Website Design Let’s fix this! Query List --- ---- Query List --- ---- Override with Predetermined RIGHT answers

33 Some issues in search queries: - Harvard example - IBM example - Website Design Let’s fix this! Query List --- ---- Query List --- ---- Override with Predetermined RIGHT answers An Observation

34 Clever Project -at IBM Underlying Approach/Idea Location A link to Location B Location B An implicit endorsement of Location B by Location A

35 Hub Authority Recommendation Hub - My Fav Links -Commercial Links -Personal Inventories Clever Project -at IBM Finding authoritative sites on broad topics with the help of hyperlinks

36 Clever Project -at IBM What computational method is used to identify hubs and authorities? Page 1 Page 2 Page 3. Candidate Pages Good Hub Good Authority Yes No Yes. No Yes. Initial estimates by guessing Estimate about Hubs Guess about Authorities Used to Improve

37 Clever Project -at IBM What computational method is used to identify hubs and authorities? Guess about Authorities Used to Improve Hub Estimate about Hubs

38 Clever Project -at IBM What computational method is used to identify hubs and authorities? Guess about Authorities Used to Improve Hub Estimate about Hubs

39 Clever Project -at IBM What computational method is used to identify hubs and authorities? Guess about Authorities Used to Improve Hub Estimate about Hubs Where do the best hubs point most heavily?

40 Clever Project -at IBM More light on implementation Topic: Acupuncture Initial List (200 pages) By any standard text index, such as AltaVista

41 Clever Project -at IBM More light on implementation Topic: Acupuncture Initial List (200 pages) By any standard text index, such as AltaVista Augmented List: pages that link to and from the pages in the Initial List (1,000 to 5,000 pages)

42 Clever Project -at IBM More light on implementation Topic: Acupuncture Initial List (200 pages) Augmented List (1,000 to 5,000 pages) = Root Set. Initial Authority Score, Initial Hub Score. Authority score: sum of the hub scores of the locations pointing to it. Hub score: sum of the authority scores of the locations it points to.

43 Clever Project -at IBM More light on implementation Topic: Acupuncture Initial List (200 pages) Augmented List (1,000 to 5,000 pages) = Root Set. Initial Authority Score, Initial Hub Score. Authority score: sum of the hub scores of the locations pointing to it. Hub score: sum of the authority scores of the locations it points to. These scores are re-adjusted iteratively until the results are fine-tuned and start settling down!
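The iterative re-adjustment can be sketched in a few lines. This is a minimal illustration of the hub/authority update, not Clever's actual implementation: the `hits` and `normalize` functions, the three-hub toy graph, the starting score of 1.0, and the per-round normalization are all assumptions.

```python
import math


def normalize(scores):
    """Scale scores so that their squares sum to 1 (keeps values from blowing up)."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {page: value / norm for page, value in scores.items()}


def hits(links, rounds=5):
    """links maps each page to the pages it links to; returns (hub, authority) scores."""
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {page: 1.0 for page in pages}
    for _ in range(rounds):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {page: sum(auth[dst] for dst in links.get(page, ())) for page in pages}
        auth = normalize(auth)
        hub = normalize(hub)
    return hub, auth


# Hypothetical root set: three link lists pointing at two candidate authorities.
links = {
    "hub1": ["authA", "authB"],
    "hub2": ["authA", "authB"],
    "hub3": ["authA"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
```

In this toy graph the page linked from all three hubs ends up as the top authority, and the hubs that point to both authorities outrank the one that points to only one.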

44 Clever Project -at IBM In visual terms...

45 Clever Project -at IBM Looking at the mathematics behind... Vector (hub scores & authority scores). Matrix (numerical values defining the hyperlink structure of the root set). Multiplied iteratively. Result: hub & authority vectors (eigenvectors), equilibrated to certain numbers!

46 Clever Project -at IBM Looking at the mathematics behind... Vector (hub scores & authority scores). Matrix (numerical values defining the hyperlink structure of the root set). Multiplied iteratively. Result: hub & authority vectors (eigenvectors), equilibrated to certain numbers! Observations -If the root set’s size is 3000 pages, 5 rounds of calculations will be enough to steady the hub and authority scores -The algorithm is independent of the initial scores
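One way to write out the vector and matrix view, as a sketch: the symbols below are not on the slides and are introduced here as assumptions. Let $A$ be the link matrix of the root set, with $A_{ij} = 1$ if page $i$ links to page $j$ and $0$ otherwise, and let $a^{(k)}$ and $h^{(k)}$ be the authority and hub vectors after round $k$.

```latex
\begin{aligned}
a^{(k+1)} &= A^{\top} h^{(k)}, \\
h^{(k+1)} &= A \, a^{(k+1)}, \\
\text{so}\quad a^{(k+1)} &= A^{\top} A \, a^{(k)}, \qquad h^{(k+1)} = A A^{\top} h^{(k)}.
\end{aligned}
```

If each vector is rescaled to unit length after every round, this is power iteration. Provided the largest eigenvalue is unique and the starting vector is not orthogonal to its eigenvector, $a^{(k)}$ settles on the principal eigenvector of $A^{\top} A$ and $h^{(k)}$ on that of $A A^{\top}$, which is consistent with the observation that the result does not depend on the initial scores.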

47 Clever Project -at IBM Iterative Process Separation of Websites Cluster 1 Cluster 3 A By-product of Clever!

48 Clever Project -at IBM Abortion Iterative Process Separation of Websites Cluster 1 Cluster 3 Pro-life Pro-choice A By-product of Clever!

49 Clever Project -at IBM Chaotic cover A Larger Perspective... Based on how pages are linked! Inherent albeit inchoate order

50 Clever Project -at IBM Paper 2 … References … paper 4 … Garfield Measure Paper 4 … References … paper 1 … Paper 5 … References … paper 4 … Paper 1 … References … paper 4 … Paper 3 … References … paper 2 … Reference: Eugene Garfield, founder of Science Citation Index

51 Clever Project -at IBM Paper 2 … References … paper 4 … Garfield Measure Paper 4 … References … paper 1 … Paper 5 … References … paper 4 … Paper 1 … References … paper 4 … Paper 3 … References … paper 2 … Reference: Eugene Garfield, founder of Science Citation Index

52 Clever Project -at IBM Paper 2 … References … paper 4 … Garfield Measure (High) Impact Factor Paper 4 … References … paper 1 … Paper 5 … References … paper 4 … Paper 1 … References … paper 4 … Paper 3 … References … paper 2 … A metric that judges a paper by the number of citations it gets Reference: Eugene Garfield, founder of Science Citation Index
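Counting citations for the five papers on the slide takes only a few lines. The reference lists below are read directly off the slide; the variable names and the use of a plain count as the score are assumptions of this sketch.

```python
from collections import Counter

# Reference lists as shown on the slide: each paper maps to the papers it cites.
references = {
    "paper 1": ["paper 4"],
    "paper 2": ["paper 4"],
    "paper 3": ["paper 2"],
    "paper 4": ["paper 1"],
    "paper 5": ["paper 4"],
}

# Score each paper by how many times it is cited.
citation_counts = Counter(
    cited for cited_list in references.values() for cited in cited_list
)
most_cited, count = citation_counts.most_common(1)[0]
```

Paper 4 is cited by papers 1, 2, and 5, so it scores highest. Every citation counts equally here, regardless of who is citing.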

53 Any Challenges?

54

55 Any solution to this…?

56 Improvement to Garfield Measure Journal 1 Weight = X Journal 2 Weight = Y

57 Improvement to Garfield Measure Definition of Importance Journal 1 Weight = X Journal 2 Weight = Y But… c

58 Improvement to Garfield Measure Definition of Importance Journal 1 Weight = X Reference: Gabriel Pinski and Francis Narin (1976), CHI Research Journal 2 Weight = Y But… c An iterative method for computing a stable set of adjusted scores (they called them influence weights) So there was indeed a very early solution! Authority Hub No distinction

59 A fundamental difference… Traditional Printed Scientific Literature Web Hub Needed!

60 Investigating Power of Hyperlinks

61 Ranking Measure “Influence Weight” Idea (Pinski, Narin) Related to Link based

62 Investigating Power of Hyperlinks Ranking Measure “Influence Weight” Idea (Pinski, Narin) Related to Link based Heavily visited location Haphazard jumps

63 Investigating Power of Hyperlinks Ranking Measure “Influence Weight” Idea (Pinski, Narin) Related to Link based Heavily visited location Web 1101 Web 2304 Web 3060 Web page No. of Links to it In practice.. Haphazard jumps

64 Investigating Power of Hyperlinks Ranking Measure “Influence Weight” Idea (Pinski, Narin) Related to Link based Heavily visited location -Random Traversal -Finding a single kind of universally important page intuitively Web 1101 Web 2304 Web 3060 Web page No. of Links to it In practice.. Haphazard jumps
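The random traversal with haphazard jumps can be sketched as follows: a surfer follows a random link from the current page most of the time and occasionally jumps to a random page, and a page's rank is how often the surfer lands on it. This is a simplified illustration, not Google's implementation; the `random_surfer_rank` function, the jump probability of 0.15, the round count, and the three-page graph are assumptions.

```python
def random_surfer_rank(links, jump_probability=0.15, rounds=50):
    """Estimate how often a random surfer visits each page (scores sum to 1)."""
    pages = sorted(set(links) | {dst for targets in links.values() for dst in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(rounds):
        # Haphazard jump: every page gets an equal share of the jump probability.
        new_rank = {page: jump_probability / n for page in pages}
        for page in pages:
            # A page without outgoing links is treated as linking to every page.
            targets = links.get(page) or pages
            share = (1.0 - jump_probability) * rank[page] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" both link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = random_surfer_rank(links)
most_visited = max(rank, key=rank.get)
```

Unlike the hub/authority scores, this produces one query-independent score per page: a single measure of general importance rather than a ranking built for each search.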

65 Difference. Clever: different root set for each search; forward & backward links. Google: initial ranking retained; faster; forward (link to link). Sociological Phenomenon

66 Future... Integrating Text & Hyperlinks Overcomes a shortcoming Listing Web resources Knitting communities Next 5 years? Challenges Fundamental changes?

67 Questions?

68

