Web Technologies Search Engines

Presentation on theme: "Web Technologies Search Engines"— Presentation transcript:

1 Web Technologies Search Engines
ITEC547 Text Mining

2 Outline of Presentation
1 Early Search Engines
2 Indexing Text for Search
3 Indexing Multimedia
4 Queries
5 Searching an Index

3 1 Early Search Engines History, Problems, Solutions …

4 Search Engines
Open Text (1995-1997)
Magellan (1995-2001)
Infoseek (Go)
Snap (NBCi)
Direct Hit
Lycos (1994; reborn 1999)
WebCrawler (1994; reborn 2001)
Yahoo (1994; reborn 2002)
Excite (1995; reborn 2001)
HotBot (1996; reborn 2002)
Ask Jeeves (1998; reborn 2002)
Teoma
AltaVista (1995- )
LookSmart (1996- )
Overture (1998- )

Open Text (1995-1997): Yahoo's original search partner was also a popular web search site in its own right. The company crawled the web to gather listings, just as Google does today. Open Text decided to focus instead on enterprise search solutions, where it is currently successful. Web search operations closed in mid-1997.

Magellan (1995-2001): An early search engine that saw its popularity drop immediately after being purchased by Excite. It was closed in April 2001.

Infoseek: Launched in early 1995, Infoseek originally hoped to charge for searching. When that failed, the popular search engine shifted, like others, to depending on banner ads. Disney took a large stake in the company in 1998 and went down the "portal" path that other leading search engines had followed. The site was also renamed "Go." Its failure to make money caused Disney to shut down Go's own internal search capabilities abruptly in early 2001. Today, Go remains in operation, powered by Google.

Snap: Launched by CNET in 1997, Snap first used Infoseek, then Inktomi, then created its own directory of human-edited listings coupled with clickthrough technology that ranked results in part by what people clicked on. NBC later acquired a majority interest in the company, renamed it NBCi, and intended to win the "portal wars" with the site. But as with Disney and Infoseek, the site's internal search technology was abruptly closed. It is currently powered by meta search results from Infospace.

Direct Hit: When Google first appeared as the hot new search technology in 1998, so did Direct Hit, which measured what people clicked on in search results as a way to improve them. It gained a deal with HotBot and was offered as a search feature on other portals such as Lycos and MSN. It was purchased by Ask Jeeves in 2000, then neglected over the following years. The site was formally closed in early 2002.

5 Information Retrieval
The indexing and retrieval of textual documents.
Searching for pages on the World Wide Web is the most recent and perhaps most widely used IR application.
Concerned firstly with retrieving documents relevant to a query.
Concerned secondly with retrieving from large sets of documents efficiently.

6 Typical IR Task
Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to the query.

7 Typical IR System Architecture
[Figure: the document corpus and the query string are inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]

8 Early Search Engines
Initially used in academic or specialized domains.
Legal and specialized domains consume a large amount of textual information.
Use of expensive proprietary hardware and software.
High computational and storage requirements.
Boolean query model, for example: New AND York NOT City
Iterative search model: fetch documents in many steps.

9 Medline of the National Library of Medicine
Developed in the late 1960s and made available in 1971.
Based on inverted file organization.
Boolean query language.
Queries are broken down into numbered segments.
The results of one query segment are fed into the next query segment.
Each user is assigned a time slot.
If a cycle is not completed within the time slot, the most recent results are returned.
Query and browse operations are performed as separate steps.
Following a query, results are viewed.
Modifications start a new query-browse cycle.

10 Dialog
Broader subject content.
Specialized collections of data, available on payment.
Boolean query: each term is numbered and executed separately, then the results are combined.
Word patterns.
For multiword queries, the proximity operator W.

11 2 Indexing Text for Search
Reduce retrieval time, improve hit accuracy

12 Why Index
Simplest approach: search the text sequentially. The collection must be small.
Static or semi-static index.
Inverted index: a mapping from content, such as words or numbers, to its locations in a database file, in a document, or in a set of documents.
An entry can store documents, positions in documents, and weights.
Fuzzy matching, stemming, stopwords.

13 Example: Inverted Index
T0 : "it is what it is"
T1 : "what is it"
T2 : "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
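
The index on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization and lowercase folding; the function name `build_inverted_index` is illustrative and not part of the presentation.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Assumption: words are separated by whitespace, case is ignored.
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
for word in sorted(index):
    print(word, sorted(index[word]))
```

Document ids start at 0, which matches the sets shown on the slide.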

14 Example: Full Inverted Index
T0 : "it is what it is"
T1 : "what is it"
T2 : "it is a banana"
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
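
Storing (document, position) pairs is what makes phrase queries possible. The sketch below builds the full inverted index from the slide and uses it for a simple phrase lookup. The helper names `build_full_inverted_index` and `find_phrase` are illustrative, and whitespace tokenization is assumed.

```python
from collections import defaultdict


def build_full_inverted_index(docs):
    """Map each word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


def find_phrase(index, phrase):
    """Return ids of documents where the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    postings = [set(index.get(word, [])) for word in words]
    matches = set()
    for doc_id, pos in postings[0]:
        # Word k of the phrase must appear k positions after the first word.
        if all((doc_id, pos + k) in postings[k] for k in range(1, len(words))):
            matches.add(doc_id)
    return sorted(matches)


docs = ["it is what it is", "what is it", "it is a banana"]
full_index = build_full_inverted_index(docs)
print(full_index["is"])                        # [(0, 1), (0, 4), (1, 1), (2, 1)]
print(find_phrase(full_index, "what is it"))   # [1]
```

A document-only index could confirm that T0 contains all three words of "what is it", but only the positions show that they do not appear in that order.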

15 Inverted Index

16 Inverted Index

17 Google Index
A unique DocId is associated with each URL.
Hit: a word occurrence, recording:
wordID: 24-bit number
Word position
Font size relative to the rest of the document
Plain hit: in the document body
Fancy hit: in the URL, title, anchor text, or meta tags
The word occurrences of a web page are distributed across a set of barrels.

18 Architecture of the 1st Google Engine

19 Architecture of the 1st Google Engine

20 Architecture of the 1st Google Engine

21 3 Indexing Multimedia Broadcast and compress for seamless delivery

22 Indexing Multimedia
Forming an index for multimedia:
Use context: the surrounding text.
Add a manual description.
Analyze automatically and attach a description.

23 4 Queries

24 Queries
Keywords
Proximity
Patterns
Phrases
Ranges
Weights of keywords
Spelling mistakes

25 Queries
Boolean query:
No relevance measure.
May be hard to understand.
Multimedia query:
Find images of Everest.
Find x-rays showing the human rib cage.
Find companies whose stock prices have similar patterns.

26 Relevance
Relevance is a subjective judgment and may include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her intended use of the information (information need).

27 Keyword Search
The simplest notion of relevance is that the query string appears verbatim in the document.
A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
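
The two notions of relevance can be contrasted in code. This is a rough sketch: the function names are illustrative, the verbatim test is a plain substring check that ignores word boundaries, and the bag-of-words score simply counts occurrences of the query words.

```python
from collections import Counter


def verbatim_match(query, document):
    """Strict notion: the query string appears as-is in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: count how often the query words occur, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[word] for word in query.lower().split())


document = "what is it"
print(verbatim_match("what it", document))      # False: the words are not adjacent
print(bag_of_words_score("what it", document))  # 2: both words occur once
```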

28 Problems with Keywords
May not retrieve relevant documents that include synonymous terms.
“restaurant” vs. “café”
“PRC” vs. “China”
May retrieve irrelevant documents that include ambiguous terms.
“bat” (baseball vs. mammal)
“Apple” (company vs. fruit)
“bit” (unit of data vs. act of eating)

29 Relevance Feedback
The user enters query terms. Keywords may be weighted or not.
Links are returned.
The user marks the relevant and irrelevant ones.
$Q_{new} = \alpha Q + \beta \frac{1}{|R|} \sum_{i \in R} t_i - \gamma \frac{1}{|N|} \sum_{i \in N} t_i$
Here R is the set of results marked relevant and N is the set marked irrelevant.
If there is no negative feedback, the $\gamma$ term is 0.
The $t_i$ are the terms from the relevant and irrelevant sets marked by the user.
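
The feedback update on this slide has the form usually called the Rocchio formula. A small sketch follows, with queries and documents represented as term-weight dictionaries. The function name `rocchio` and the default values of `alpha`, `beta`, and `gamma` are assumptions chosen for illustration, not values given in the presentation.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the updated query vector as a dict of term -> weight.

    query:      dict of term -> weight
    relevant:   list of term-weight dicts marked relevant by the user
    irrelevant: list of term-weight dicts marked irrelevant by the user
    """
    terms = set(query)
    for doc in relevant + irrelevant:
        terms.update(doc)

    def mean_weight(docs, term):
        # An empty feedback set contributes 0, as noted on the slide.
        if not docs:
            return 0.0
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    return {
        term: alpha * query.get(term, 0.0)
        + beta * mean_weight(relevant, term)
        - gamma * mean_weight(irrelevant, term)
        for term in terms
    }


new_query = rocchio(
    {"apple": 1.0},
    relevant=[{"apple": 1.0, "fruit": 1.0}],
    irrelevant=[{"apple": 1.0, "iphone": 1.0}],
    alpha=1.0, beta=0.5, gamma=0.25,
)
print(new_query)
```

Terms from relevant results gain weight and terms seen only in irrelevant results become negative, which pushes the next search toward the fruit sense of "apple".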

30 5 Searching an Index

31 Searching an Inverted Index
Tokenize the query and search the index vocabulary for each query token.
Get the list of documents associated with each token.
Combine the lists of documents using the constraints specified in the query.
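
These three steps can be sketched for a Boolean query such as "New AND York NOT City". The code below is a simplified illustration: it assumes the query alternates terms and operators, evaluates strictly left to right with no parentheses or precedence, and uses a tiny made-up corpus. The function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_search(index, query):
    """Evaluate 'term OP term OP term ...' left to right (OP: AND, OR, NOT)."""
    tokens = query.lower().split()          # step 1: tokenize the query
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for i in range(1, len(tokens) - 1, 2):
        operator, term = tokens[i], tokens[i + 1]
        postings = index.get(term, set())   # step 2: documents for the token
        if operator == "and":               # step 3: combine the lists
            result &= postings
        elif operator == "or":
            result |= postings
        elif operator == "not":
            result -= postings
        else:
            raise ValueError(f"unknown operator: {operator}")
    return sorted(result)


# Hypothetical four-document corpus, for illustration only.
docs = ["new york city", "new york state", "york minster", "new delhi"]
index = build_index(docs)
print(boolean_search(index, "New AND York NOT City"))  # [1]
```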

32 Google Search
1. Tokenize the query and remove stopwords.
2. Translate the query words into wordIDs using the lexicon.
3. For every wordID, get the list of documents from the short inverted barrel and build a composite set of documents.
4. Scan the composite list of documents:
   Skip to the next document if the current document does not match.
   Compute a rank using the query and features.
   If there are no more documents, go to step 3 and use the full inverted barrels to find more documents.
   If there is a sufficient number of documents, go to step 5.
5. Sort the final document list by rank.

33 How are results ranked?
Weight type:
Location: title, URL, anchor, body
Size: relative font size
Capitalization
Count of occurrences
Closeness (proximity)
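
One way to read this slide is that each hit contributes a weight that depends on where it occurs, and that the number of occurrences matters only up to a point. The sketch below illustrates that idea for location and count only. The weights in `TYPE_WEIGHTS` and the cap `COUNT_CAP` are invented for illustration and are not Google's actual values.

```python
# Hypothetical weights per hit location; not taken from the presentation.
TYPE_WEIGHTS = {"title": 5.0, "url": 4.0, "anchor": 3.0, "body": 1.0}
# Hypothetical cap so that repeating a word many times stops adding score.
COUNT_CAP = 10


def score_document(hit_locations):
    """Score one document for one query word from the locations of its hits."""
    counts = {}
    for location in hit_locations:
        counts[location] = counts.get(location, 0) + 1
    return sum(
        TYPE_WEIGHTS.get(location, 0.0) * min(count, COUNT_CAP)
        for location, count in counts.items()
    )


print(score_document(["title", "body", "body"]))  # 7.0
print(score_document(["body"] * 20))              # 10.0, capped
```

Font size, capitalization, and proximity would enter as further factors in the same way. They are left out here to keep the sketch short.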

34 Evaluation
Response time
Quality
Recall: % of correct items that are selected, $\text{Recall} = \frac{TP}{TP + FN}$
Precision: % of selected items that are correct, $\text{Precision} = \frac{TP}{TP + FP}$
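
Both measures follow directly from the sets of selected and correct items. A minimal sketch, assuming items are identified by hashable ids; the function name `precision_recall` and the sample ids are illustrative.

```python
def precision_recall(selected, relevant):
    """Return (precision, recall) for a set of selected items."""
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)   # selected and correct
    fp = len(selected - relevant)   # selected but not correct
    fn = len(relevant - selected)   # correct but not selected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The engine returned documents 1-4; documents 2, 4 and 5 were actually relevant.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(precision, recall)
```

Here TP = 2, FP = 2 and FN = 1, so precision is 2/4 and recall is 2/3.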

35 Ranking Algorithms: Hyperlink
Popularity Ranking
Rank “popular” documents higher among the set of documents with specific keywords.
Determining “popularity”:
Access rate? How to get accurate data?
Bookmarks? Might be private.
Links to related pages? Use a web crawler to analyze external links.

36 Popularity/Prestige
Transfer of prestige: a link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.
Count of in-links/out-links.

37 Hypertext Induced Topic Search (HITS)
The HITS algorithm computes popularity using a set of related pages only.
Important web pages are cited by other important web pages or by a large number of less important pages.
Initially all pages have the same importance.

38 Hubs and Authorities
Hub: a page that stores links to many related pages. It may not in itself contain actual information on a topic.
Authority: a page that contains actual information on a topic. It may not store links to many related pages.
Each page gets a prestige value as a hub (hub-prestige) and another prestige value as an authority (authority-prestige).

39 Hubs and Authorities in twitter

40 Hubs and Authorities algorithm
1. Locate and build the subgraph.
2. Assign initial values to the hub and authority scores of each node.
3. Run a loop until convergence:
   Set the authority score of node x to the sum of the hub scores of all nodes y that link to x.
   Set the hub score of node x to the sum of the authority scores of all nodes y that x links to.
   Normalize the hub and authority scores of all nodes.
   Check for convergence: is the difference below the threshold?
4. Return the list of nodes sorted in descending order of hub and authority scores.
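
The loop above can be sketched as follows. The graph is given as a dictionary from each page to the pages it links to. Several details are assumptions not fixed by the slide: the initial score of 1.0, normalization by the Euclidean norm, computing hub scores from the freshly updated authority scores, and the `tol` and `max_iter` values.

```python
import math


def hits(links, tol=1e-8, max_iter=100):
    """Return (hub, authority) score dicts for a graph {page: [linked pages]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    def normalize(scores):
        norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
        return {node: value / norm for node, value in scores.items()}

    for _ in range(max_iter):
        # Authority of x: sum of hub scores of the pages that link to x.
        new_auth = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub of x: sum of authority scores of the pages x links to.
        new_hub = {
            node: sum(new_auth[target] for target in links.get(node, ()))
            for node in nodes
        }
        new_auth, new_hub = normalize(new_auth), normalize(new_hub)
        change = sum(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical graph: h1 and h2 are link lists, a1 and a2 are content pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this small graph a1 receives links from both hubs and ends up as the top authority, while h1 points to both authorities and ends up as the top hub.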

41 Page Rank Algorithm
Ranks based on citation statistics.
In/out links.
The rank of a page depends on the ranks of the pages that link to it.
$R_i = c \sum_{j \to i} \frac{R_j}{\text{outlinks of } j} + d$

42 Page Rank Algorithm
1. Locate and build the subgraph.
2. Save the number of out-links from every node in an array.
3. Assign a default PageRank to all nodes.
4. Run a loop until convergence:
   Compute a new PageRank score for every node: sum the PageRank score divided by the number of out-links of every node that links to it, and add the default rank source.
   Check convergence: is the difference between the new and old PageRank below the threshold?
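
A compact sketch of this loop is shown below, following the update $R_i = c \sum_{j \to i} R_j / \text{outlinks}(j) + d$. The values `c = 0.85`, the choice `d = (1 - c) / n`, the uniform starting rank, and the tolerance are common conventions assumed here, not values stated in the presentation. Pages without out-links are not given special treatment, so their rank is not redistributed.

```python
def pagerank(links, c=0.85, tol=1e-10, max_iter=200):
    """Return a dict of page -> rank for a graph {page: [linked pages]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    out_count = {node: len(links.get(node, ())) for node in nodes}
    d = (1.0 - c) / n                        # assumed default rank source
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        new_rank = {node: d for node in nodes}
        for source, targets in links.items():
            for target in targets:
                # Each page shares its rank equally among its out-links.
                new_rank[target] += c * rank[source] / out_count[source]
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical three-page graph in which every page has at least one out-link.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page C collects rank from both A and B and comes out on top, and A follows closely because it receives all of C's rank.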

43 But wait… There’s Homework!
1- Explain web crawling and the general architecture of a web crawler.
2- What is the use of robots.txt?
3- Find a web crawler code and explain how it can be used to collect information on ?
4- Crawl the social media to collect emu related info. (if you want bonus) ?

