CSCE 590 Web Scraping – Information Extraction II

1 CSCE 590 Web Scraping – Information Extraction II
Topics: Information Retrieval framework revisited
Readings: Scrapy User manual
March 16, 2017

2 Figure 23.1 Google as an IR engine

3 Figure 23.2 Architecture of an IR system

4 Google PageRank
Slide from Speech and Language Processing -- Jurafsky and Martin

5 Google PageRank continued
“PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."”
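The voting idea in the quote can be sketched as a simple power iteration. This is only an illustration, not Google's implementation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-page `toy_web` graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank.

    links maps each page to the list of pages it links to.
    damping=0.85 is a commonly quoted value, assumed here.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its "vote" evenly among the pages it links to,
                # so votes from highly ranked pages carry more weight.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: D links to C, but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, C collects votes from three pages and ends up ranked highest, while D, which receives no links, ends up lowest.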

6 False or spoofed PageRank
This spoofing technique, also known as 302 Google Jacking, was a known failing or bug in the system. Any page's PageRank could have been spoofed to a higher or lower number of the webmaster's choice and only Google has access to the real PageRank of the page. Spoofing is generally detected by running a Google search for a URL with questionable PageRank, as the results will display the URL of an entirely different site (the one redirected to) in its results.

7 Wolfram Alpha
n-grams it was the best of times it was the worst of times
Who was the president in 1888?
What is pi?
graphs on 11 vertices
*Who was Grover Cleveland's vice-president?

8 Bing http://www.bing.com

9

10 Vector Space Model
Vector Space Model: represents the terms that occur within the collection
Term weights
In a document, if
keyword1 occurs 4 times
keyword2 occurs 7 times
keyword3 occurs 0 times
…
keywordn occurs 3 times
then the vector representing the document is (4, 7, 0, …, 3)
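A minimal sketch of building such a term-count vector follows. The vocabulary, the tokenizing regular expression, and the sample document are hypothetical and chosen so that the result mirrors the (4, 7, 0, 3) pattern on the slide.

```python
import re
from collections import Counter


def term_vector(text, vocabulary):
    """Return raw term counts for text, one entry per vocabulary term."""
    # Lowercase and split on anything that is not a letter or digit.
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    # A Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document.
vocabulary = ["scrapy", "spider", "xpath", "crawl"]
document = "scrapy " * 4 + "spider " * 7 + "crawl " * 3
vector = term_vector(document, vocabulary)
print(vector)  # [4, 7, 0, 3]
```

The position of each count is fixed by the order of the vocabulary, so every document and query in the collection maps to a vector of the same length.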

11
Sim(query, document)
Dot-product
Vector length
Term-by-document matrix
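One common way to combine the dot product and vector length into a similarity score is the cosine of the angle between the two vectors. The sketch below assumes that is the measure intended here; the function names and the example vectors are illustrative only.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def cosine_similarity(u, v):
    """Dot product normalized by both vector lengths (0.0 if either is all zeros)."""
    denominator = length(u) * length(v)
    return dot(u, v) / denominator if denominator else 0.0


# Hypothetical query and document over the same four-term vocabulary.
query = [1, 1, 0, 0]
document = [4, 7, 0, 3]
score = cosine_similarity(query, document)
print(round(score, 4))
```

Dividing by the vector lengths keeps long documents from scoring higher simply because they contain more words.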

12 Figure 23.3 Visualization of the vector model

13 Inverse Document Frequency
Term Weighting
Document importance
Inverse Document Frequency (Spärck Jones 1972): “Assign higher weights to more discriminative words”
idf_i = log(N / n_i), where N = total # of documents and n_i = # of documents that contain term i
Tf-idf weighting (term-frequency x idf)
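The weighting can be sketched in a few lines, taking idf as log(N / n_i) and multiplying by the raw term frequency. The natural logarithm, the whitespace tokenization, and the three tiny documents are assumptions for illustration; real systems often use other log bases or smoothing.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_i: number of documents that contain each term (count a term once per doc).
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    # Weight = term frequency in the document times the term's idf.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical three-document collection.
docs = [
    "the spider crawls the web".split(),
    "the spider parses the page".split(),
    "the index ranks the page".split(),
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 4))
```

A term such as "the" that occurs in every document gets idf log(1) = 0 and so contributes nothing, while a term found in only one document gets the highest weight.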

14 Tf-idf weighted cosine
Also used in summarization (page 794)
Other topics: stemming, stop-list

15 Figure 23.4 Rank-Specific P and R for list of docs

16 Figure 23.5 Interpolated Precision
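As a rough sketch of what the two figures tabulate: precision and recall can be computed at each rank of a result list, and interpolated precision at a recall level is then the highest precision seen at that recall level or above. The function names and the relevance judgments below are hypothetical, not data from the figures.

```python
def rank_specific_pr(ranked_relevance, total_relevant):
    """Return (precision, recall) after each rank of a ranked result list."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant  # True counts as 1, False as 0
        points.append((hits / rank, hits / total_relevant))
    return points


def interpolated_precision(points, recall_level):
    """Maximum precision at any point whose recall is at least recall_level."""
    candidates = [p for p, r in points if r >= recall_level]
    return max(candidates, default=0.0)


# Hypothetical judgments for six ranked documents; 3 relevant documents exist.
judgments = [True, False, True, False, False, True]
points = rank_specific_pr(judgments, total_relevant=3)

# Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0.
eleven_point = [interpolated_precision(points, level / 10) for level in range(11)]
print([round(p, 2) for p in eleven_point])
```

Interpolation smooths the sawtooth shape of the raw precision values, which makes curves from different queries easier to average and compare.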

17 Figure 23.6

18 Ways to improve User Queries
Relevance feedback
Query expansion
Thesaurus

19 Figure 23.7 Factoid Question Answering
Columns: Wolfram-Alpha, Bing, Google
Where is the Louvre? 10 10+map
What is the abbreviation for limited partnership? - 9
What are the names of Odin’s ravens?
What currency is used in China?
What kind of nuts are used in marzipan?
What is the official language of Algeria?
What is the telephone number of the University of South Carolina? 7 SCState
How many pounds are in a stone?

20 Figure 23.8 Architecture

21 Figure 23.9 Question Typology

22 Figure 23.9 continued

23 Figure 23.10

24 Figure 23.11

25 Figure 23.12

26 Figure 23.13 Example Summarization

27 Figure 23.14

28 Figure 23.15

29 Figure 23.16

30 Figure 23.17 Features used in Supervised Classifiers

31 Figure 23.18

32 Figure 23.19 Summarization Tagging
Subject (S), Object (O), Oblique (X)

33 Figure 23.20 Rewriting References

34 Figure Examples

35 Figure 23.22

36 Google translate
What is the number of graphs on 10 vertices?
Translation: English » Arabic
ما هو عدد من الرسوم البيانية في 10 القمم؟
Translation: Arabic » English
What is the number of charts in the 10 top?

