1 The Last Lecture
Agenda
 – 1:40-2:00pm  Integrating XML and Search Engines: the Niagara way
 – 2:00-2:10pm  My concluding remarks (if any)
 – 2:10-2:45pm  Interactive summarization of the semester
    – Teaching evaluations (I leave)

2 This part based on Niagara slides

3 Niagara

4–13 (Niagara slides; no transcript text)

14 Generating a SEQL Query from XML-QL: a different kind of containment

15–27 (no transcript text for these slides)

28 “Review”

29 Main Topics
Approximately three equal parts:
 – Information retrieval
 – Information integration/aggregation
 – Information mining
 – Other topics as permitted by time
Useful course background:
 – CSE 310 Data Structures (also a 4xx course on Algorithms)
 – CSE 412 Databases
 – CSE 471 Intro to AI
(What I said on 1/17)

30 What we did by 4/30

31 Information Retrieval
Traditional model
 – Given: a set of documents; a query expressed as a set of keywords
 – Return: a ranked set of documents most relevant to the query
 – Evaluation:
    Precision: fraction of returned documents that are relevant
    Recall: fraction of relevant documents that are returned
    Efficiency
Web-induced headaches
 – Scale (billions of documents)
 – Hypertext (inter-document connections)
Consequently
 – Ranking that takes link structure into account (Authority/Hub; see the sketch below)
 – Indexing and retrieval algorithms that are ultra fast
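
The Authority/Hub ranking mentioned above is the HITS-style iteration discussed in class. Below is a minimal sketch in plain Python, assuming a toy link graph given as an adjacency dictionary; the graph, function name, and iteration count are illustrative choices, not anything taken from the slides:

def hits(links, iterations=50):
    """links: dict mapping a page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: a page is a good authority if good hubs point to it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub update: a page is a good hub if it points to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return auth, hub

# Toy link graph, purely illustrative.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
authorities, hubs = hits(graph)
print(sorted(authorities, key=authorities.get, reverse=True))

The two mutually reinforcing updates (good authorities are pointed to by good hubs, good hubs point to good authorities) are what tie the converged scores to the dominant eigenvectors of A^T A and A A^T, the connection alluded to on slide 34.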

32 Database Style Retrieval
Traditional model (relational)
 – Given: a single relational database (schema, instances); a relational (SQL) query
 – Return: all tuples satisfying the query (see the sketch below)
 – Evaluation:
    Soundness/completeness
    Efficiency
Web-induced headaches
 – Many databases, all partially complete, overlapping, with heterogeneous schemas and access limitations
 – Network (un)reliability
Consequently
 – Newer models of DB
 – Newer notions of completeness
 – Newer approaches for query planning
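
For contrast with the IR model on the previous slide, here is a minimal sketch of the traditional relational setup: one database, one declarative SQL query, and every tuple satisfying it comes back, sound and complete, with no notion of ranking or partial relevance. The table and its rows are made up for illustration (the course numbers are simply borrowed from slide 29):

# A tiny single-database example using Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (number TEXT, title TEXT, level INTEGER)")
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?)",
    [("CSE 310", "Data Structures", 300),
     ("CSE 412", "Databases", 400),
     ("CSE 471", "Intro to AI", 400)],
)

# Every tuple matching the predicate is returned, nothing more, nothing less.
for number, title in conn.execute(
        "SELECT number, title FROM courses WHERE level >= 400"):
    print(number, title)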

33 What about “mining”?
Didn’t do too much “data” mining, but did do some “web” mining:
 – Mining the link structure (A/H computation, etc.)
 – Clustering the search engine results (K-means; agglomerative clustering; a K-means sketch follows below)
 – Classification as part of focused crawling (the “distiller” approach)
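
Since clustering search-engine results with K-means is one of the “web mining” items above, here is a minimal K-means sketch. It runs on made-up 2-D points purely to show the assign/update loop; in the course setting the points would be document term vectors, and the distance would typically be cosine-based rather than Euclidean:

import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)  # pick k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)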

34 Interactive Review… 2:00-2:45: an interactive summarization of the class. Rather than me showing you the list of topics we covered, I thought up a more interesting approach for summarizing the class in *your* collective words. Here is how it will go: *Everyone* in the class will be called on to list one topic/technique/issue that they felt they learned from the course. Generic answers like “I learned about search engines” are discouraged in favor of specific answers (such as “I thought the connection between the dominant eigenvalues and the way A/H computation works was quite swell”). It is okay to list topics/issues that you got interested in even if they were just a bit beyond what we actually covered. Note that there is an expectation that when your turn comes you will mention something that has not been mentioned by the folks who spoke ahead of you. Since I get to decide the order in which to call on you, it is best if you jot down up to 5 things you thought you learned, so that the chance of your saying something different is higher.

35 Further headaches brought on by semi-structured retrieval
If everyone puts their pages in XML:
 – Introducing similarity-based retrieval into traditional databases
 – Standardizing on shared ontologies...

36 Learning Patterns (Web/DB Mining)
Traditional classification learning (supervised)
 – Given: a set of structured instances of a pattern (concept)
 – Induce: the description of the pattern
 – Evaluation:
    Accuracy of classification on the test data (see the sketch below)
    (Efficiency of learning)
Mining headaches
 – Training data is not obvious
 – Training data is massive
 – Training instances are noisy and incomplete
Consequently
 – Primary emphasis on fast classification, even at the expense of accuracy
 – 80% of the work is “data cleaning”
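
As a concrete instance of the supervised setup above (train on labeled instances, measure accuracy on held-out test data), here is a toy multinomial Naive Bayes classifier over bag-of-words documents. The training/test documents and labels are invented for illustration, and the sketch deliberately ignores the massive, noisy, incomplete-data headaches the slide lists:

from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (word_list, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return label_counts, word_counts, vocab

def classify(words, label_counts, word_counts, vocab):
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)          # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # Laplace smoothing so unseen words don't zero out a class.
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [("cheap pills buy now".split(), "spam"),
         ("meeting agenda for class".split(), "ham"),
         ("buy cheap meds now".split(), "spam"),
         ("lecture notes and class agenda".split(), "ham")]
test = [("buy cheap now".split(), "spam"),
        ("class lecture agenda".split(), "ham")]

model = train_nb(train)
correct = sum(classify(words, *model) == label for words, label in test)
print("test accuracy:", correct / len(test))

The printed accuracy on the two held-out examples is the evaluation the slide refers to; in the web setting the emphasis shifts toward how fast classification runs on large, noisy collections.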

