Rensselaer Polytechnic Institute CSCI-4220 – Network Programming David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition.

1 Rensselaer Polytechnic Institute CSCI-4220 – Network Programming David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

2 • What is search? • What are we searching for? • How many searches are processed per day? • What is the average number of words in text-based searches?

3 • Applications and varieties of search: • Web search • Site search • Vertical search • Enterprise search • Desktop search • As-you-type search • Proximity search

6 • Relevance • Search results contain information the searcher was looking for • Problems with vocabulary mismatch ▪ Homonyms (e.g. “Jersey shore”) • User relevance • Search results relevant to one user may be completely irrelevant to another user

7 • Precision • Proportion of retrieved documents that are relevant • How precise were the results? • Recall (and coverage) • Proportion of relevant documents that were actually retrieved • Did we retrieve all of the relevant documents? http://trec.nist.gov
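The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides; the function name `precision_recall` and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: IDs of documents the search engine returned
    relevant:  IDs of documents judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).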

8 • Timeliness and freshness • Search results contain information that is current and up-to-date • Performance • Users expect sub-second response times • Media • User devices are constantly changing (cellphones, mobile devices, tablets, etc.)

9 • Scalability • Designs that perform equally well as the system grows and expands ▪ Increased number of documents, number of users, etc. • Flexibility (or adaptability) • Tune search engine components to keep up with a changing landscape • Spam-resistance

10 • Gerard Salton (1927-1995) • Pioneer in information retrieval • Defined information retrieval as: • “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information” • This was 1968 (before the Internet and Web!)

11 • Structured information: • Often stored in a database • Organized via predefined tables, columns, etc. • Select all accounts with balances less than $200 • Unstructured information • Document text (headings, words, phrases) • Images, audio, video (often relies on textual tags)
account number    balance
7004533711        $498.19
7004533712        $781.05
7004533713        $147.15
7004533714        $195.75
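The structured query on this slide ("select all accounts with balances less than $200") can be sketched with Python's built-in `sqlite3` module. The table and column names (`accounts`, `account_number`, `balance`) are assumptions chosen for the example; the four rows are the ones shown on the slide.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_number TEXT PRIMARY KEY, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        ("7004533711", 498.19),
        ("7004533712", 781.05),
        ("7004533713", 147.15),
        ("7004533714", 195.75),
    ],
)

# Structured data lets the query state exactly which rows qualify.
low = conn.execute(
    "SELECT account_number, balance FROM accounts "
    "WHERE balance < ? ORDER BY account_number",
    (200,),
).fetchall()
conn.close()

for account_number, balance in low:
    print(account_number, balance)
```

A query like this has one exact answer set, which is the contrast with unstructured text, where matching is a question of relevance rather than a predicate on a column.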

12 • Search and IR have largely focused on text processing and documents • Search typically uses the statistical properties of text • Word counts • Word frequencies • But it typically ignores linguistic features (noun, verb, etc.)
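As a rough sketch of what "statistical properties of text" means in practice, the following computes word counts and relative word frequencies. The tokenizer (lowercase, split on anything that is not a letter, digit, or apostrophe) and the sample sentence are simplifying assumptions for the example, not the method from the slides.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (counts, freqs): raw word counts and relative frequencies."""
    # Simplistic tokenizer: lowercase, keep runs of letters/digits/apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    freqs = {word: count / total for word, count in counts.items()}
    return counts, freqs


# Made-up sample text, for illustration only.
text = "Search engines index documents. Search engines rank documents by relevance."
counts, freqs = word_frequencies(text)

for word, count in counts.most_common(3):
    print(word, count, freqs[word])
```

Nothing here looks at parts of speech or sentence structure; the text is treated purely as a bag of words.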

13 • Web crawlers adhere to a politeness policy: • GET requests to the same server are spaced a few seconds or minutes apart • A robots.txt file specifies what crawlers are allowed to crawl
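The robots.txt example that accompanied this slide did not survive in the transcript. As a stand-in, here is a small sketch using Python's standard `urllib.robotparser`. The rules, the crawler name `ExampleCrawler`, and the example.com URLs are all hypothetical; `Crawl-delay` is a widely used but non-standard directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string (no network access).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks each URL before fetching it.
allowed = parser.can_fetch("ExampleCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://www.example.com/private/data.html")

# Seconds to wait between requests to this host, if the site specifies it.
delay = parser.crawl_delay("ExampleCrawler")

print(allowed, blocked, delay)
```

In a real crawler the file would be fetched from the site's `/robots.txt` (for example with `set_url()` and `read()`), and the delay would be enforced per host between GET requests.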

14 Sitemaps: default priority is 0.5; some URLs might not be discovered by the crawler
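The default priority of 0.5 matches the Sitemaps protocol, so this slide most likely showed a sitemap file; that reading is an assumption, since the image is missing from the transcript. Under that assumption, the sketch below parses a small hypothetical sitemap and applies the default when a URL carries no `<priority>` element. The URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the second URL omits <priority>.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>http://www.example.com/archive/page1.html</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
DEFAULT_PRIORITY = 0.5  # protocol default when <priority> is absent

root = ET.fromstring(SITEMAP_XML)
entries = []
for url in root.findall("sm:url", NS):
    loc = url.findtext("sm:loc", namespaces=NS)
    priority = url.findtext("sm:priority", namespaces=NS)
    entries.append((loc, float(priority) if priority else DEFAULT_PRIORITY))

for loc, priority in entries:
    print(priority, loc)
```

Listing a URL in a sitemap also tells the crawler about pages it might never reach by following links.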

15 What about checking for updated pages?
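The slide's answer is not preserved in the transcript. One common approach, offered here as an assumption rather than as what the slide showed, is to ask the server for headers only (an HTTP HEAD request), optionally with an `If-Modified-Since` header so an unchanged page can be answered with `304 Not Modified`. The sketch below only builds such a request; the helper name `build_update_check` and the URL are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_update_check(url, last_crawled):
    """Build a HEAD request asking whether `url` changed since `last_crawled`.

    `last_crawled` must be a timezone-aware datetime.
    """
    # HTTP dates are expressed in GMT, e.g. "Fri, 15 Jan 2010 12:00:00 GMT".
    stamp = format_datetime(last_crawled.astimezone(timezone.utc), usegmt=True)
    return Request(url, method="HEAD", headers={"If-Modified-Since": stamp})


# Hypothetical URL and crawl time, for illustration only.
request = build_update_check(
    "http://www.example.com/index.html",
    datetime(2010, 1, 15, 12, 0, tzinfo=timezone.utc),
)

# urllib stores header names capitalized as "If-modified-since".
print(request.get_method(), request.get_header("If-modified-since"))
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access; note that `urlopen` reports a 304 response by raising `HTTPError`.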

16 • Freshness is essentially a Boolean value • Age measures the degree to which a crawled page is out of date
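The difference between the two measures can be sketched as follows, assuming the usual formulation: a crawled copy is fresh if the page has not changed since it was crawled, and otherwise its age is the time elapsed since that change. The function names and the use of plain numbers (days) for timestamps are assumptions made for the example.

```python
def is_fresh(crawl_time, last_modified):
    """Freshness is Boolean: the copy is fresh if the live page
    has not been modified since it was crawled."""
    return last_modified <= crawl_time


def age(now, crawl_time, last_modified):
    """Age is 0 for a fresh copy; otherwise it is the time elapsed
    since the live page changed."""
    if is_fresh(crawl_time, last_modified):
        return 0
    return now - last_modified


# Timestamps in days, invented for illustration.
# Page A: crawled on day 10, modified on day 12, checked on day 15.
page_a = (is_fresh(10, 12), age(15, 10, 12))
# Page B: crawled on day 10, last modified on day 8, checked on day 15.
page_b = (is_fresh(10, 8), age(15, 10, 8))

print(page_a, page_b)
```

Freshness only says whether a copy is stale; age also says how stale, so two equally "not fresh" pages can differ widely in age.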

