Web Programming Week 14 Old Dominion University Department of Computer Science CS 418/518 Fall 2006 Michael L. Nelson 11/27/06.


1 Web Programming Week 14 Old Dominion University Department of Computer Science CS 418/518 Fall 2006 Michael L. Nelson 11/27/06

2 Relational Data Model is a Special Case…
SELECT name, catches, yards, touchdowns
FROM VT_Boxscores, VT_Roster
WHERE game_id = "12" AND number = "4" AND year = "2006";

3 Unstructured Data is More Common…

4 Precision and Recall
Precision
–“ratio of the number of relevant documents retrieved over the total number of documents retrieved” (p. 10)
–how much extra stuff did you get?
Recall
–“ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database” (p. 10)
–note: assumes a priori knowledge of the denominator!
–how much did you miss?
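The two ratios above can be worked through on a toy query. This is a minimal sketch, assuming the set of relevant documents is known in advance (the a priori knowledge the slide warns about); the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document ids the system returned
    relevant:  set of document ids that are truly relevant (assumed known)
    """
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# and 6 relevant documents exist in the whole collection.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 7, 8, 9}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.75 0.5
```

One extra document was returned (precision 0.75) and half of the relevant documents were missed (recall 0.5).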

5 Precision and Recall
[Figure: precision plotted against recall, both axes running from 0 to 1; figure 1.2 in FBY]

6 LIKE & REGEXP
We can search rows with the “LIKE” (or “REGEXP”) operator
–http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
–for tables of any size, this will be s-l-o-w
–there is a better way…
mysql> SELECT id, name FROM VT_Roster WHERE name LIKE 'Se%'
    -> AND year='2006';
+----+---------------+
| id | name          |
+----+---------------+
|  7 | Sean Glennon  |
| 70 | Sergio Render |
+----+---------------+
2 rows in set (0.00 sec)

7 CREATE Table
mysql> CREATE TABLE recaps (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );
Query OK, 0 rows affected (0.00 sec)
can only create FULLTEXT on CHAR, VARCHAR or TEXT columns
“title” and “body” still available as regular columns
if you want to search only on “title”, you need to create a separate index
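A full-text index avoids scanning every row by mapping each word to the rows that contain it. The following is a rough sketch of that idea (an inverted index), not MySQL's actual storage format; the `rows` dictionary is a hypothetical stand-in for the recaps table, with bodies truncated as in the slides.

```python
import re
from collections import defaultdict

# Hypothetical rows modeled on the recaps table: id -> (title, body).
rows = {
    1: ("Hokies Blank UVa", "#17 Hokies ended the season..."),
    2: ("Hokies Put Wake in Their Place", "Sean Glennon threw for..."),
    3: ("Hokies Blank Kent State", "Virginia Tech overcame a sloppy..."),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows):
    """Map each word to the set of row ids whose title or body contains it."""
    index = defaultdict(set)
    for row_id, (title, body) in rows.items():
        for word in tokenize(title + " " + body):
            index[word].add(row_id)
    return index


index = build_index(rows)
print(sorted(index["sloppy"]))  # [3]
print(sorted(index["blank"]))   # [1, 3]
```

A lookup is then a single dictionary access per query word, instead of a pattern match against every row.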

8 INSERT
mysql> INSERT INTO recaps (title,body) VALUES
    -> ('Hokies Blank UVa', '#17 Hokies ended the season...'),
    -> ('Hokies Put Wake in Their Place', 'Sean Glennon threw for...'),
    -> ('Hokies Blank Kent State', 'Virginia Tech overcame a sloppy...');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

9 MATCH.. AGAINST
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('sloppy');
+----+-------------------------+------------------------------------------+
| id | title                   | body                                     |
+----+-------------------------+------------------------------------------+
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy...       |
+----+-------------------------+------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('Hokies');
Empty set (0.00 sec)
why?!

10 Ranking
If you are not in Boolean mode and the word appears in > 50% of the rows, then the word is considered a “stop word” and is not matched
–this makes sense for large collections (the word is not a good discriminator of records), but can lead to unexpected results for small collections
–this is why the search for “Hokies” returned no rows: the word appears in all three recaps
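The 50% rule can be simulated in a few lines. This is only an illustration of the rule as stated on the slide, not MySQL's implementation; the `docs` list is a hypothetical copy of the three recaps with title and body joined.

```python
import re

# Hypothetical stand-in for the recaps rows (title and body concatenated).
docs = [
    "Hokies Blank UVa #17 Hokies ended the season...",
    "Hokies Put Wake in Their Place Sean Glennon threw for...",
    "Hokies Blank Kent State Virginia Tech overcame a sloppy...",
]


def rows_containing(word, docs):
    """Return the 1-based positions of the documents that contain the word."""
    word = word.lower()
    return [
        i for i, text in enumerate(docs, start=1)
        if word in re.findall(r"[a-z0-9]+", text.lower())
    ]


def natural_language_match(word, docs):
    """Apply the 50% rule: a word found in more than half the rows matches nothing."""
    hits = rows_containing(word, docs)
    if len(hits) > 0.5 * len(docs):
        return []  # treated like a stopword
    return hits


print(natural_language_match("sloppy", docs))  # [3]
print(natural_language_match("Hokies", docs))  # []
```

With only three rows, any word that appears in two of them is already over the threshold, which is why small test tables give surprising results.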

11 Stopwords
Stopwords exist in stoplists or negative dictionaries
Idea: remove low semantic content
–index should only have “important stuff”
What not to index is domain dependent, but often includes:
–“small” words: a, and, the, but, of, an, very, etc.
–NASA ADS example: http://adsabs.harvard.edu/abs_doc/stopwords.html
–MySQL full-text index: http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

12 Stopwords
Punctuation, numbers often stripped or treated as stopwords
–precision suffers on searches for: NASA TM-3389, F-15, X.500, .NET, Tree::Suffix
MySQL also ignores words < 4 characters by default (ft_min_word_len), as if they were stopwords
–too bad for: “Liu”, “CFD”, “Ada”, etc.
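A small sketch shows how much these rules throw away. It assumes a tiny illustrative stoplist, splits on punctuation, drops pure numbers, and uses a minimum word length of 4; the tokenizer is hypothetical and simpler than MySQL's own parser.

```python
import re

STOPWORDS = {"a", "and", "the", "but", "of", "an", "very"}  # tiny illustrative stoplist
MIN_WORD_LEN = 4  # assumed minimum length, mirroring the 4-character rule above


def index_terms(text):
    """Return the words that would survive into the index under these rules."""
    words = re.findall(r"[a-z0-9]+", text.lower())  # punctuation acts as a separator
    return [
        w for w in words
        if len(w) >= MIN_WORD_LEN      # short words are dropped
        and not w.isdigit()            # pure numbers are dropped
        and w not in STOPWORDS         # stoplist words are dropped
    ]


print(index_terms("NASA TM-3389 and the F-15"))  # ['nasa']
print(index_terms("Liu used Ada for CFD"))       # ['used']
```

After filtering, a search for "F-15" or "Ada" has nothing left to match against.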

13 Getting the Rank
mysql> SELECT id, MATCH (title,body) AGAINST ('Sewell')
    -> FROM recaps;
+----+-----------------------------------------+
| id | MATCH (title,body) AGAINST ('Sewell')   |
+----+-----------------------------------------+
|  1 |                        0.65545833110809 |
|  2 |                                       0 |
|  3 |                                       0 |
+----+-----------------------------------------+
3 rows in set (0.00 sec)

14 Boolean Mode
Does not use the 50% threshold
Does use stopwords, length limitation
Operator list:
–http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('+Hokies' IN BOOLEAN MODE);
+----+-------------------------+------------------------------------------+
| id | title                   | body                                     |
+----+-------------------------+------------------------------------------+
|  1 | Hokies Blank UVa        | #17 Hokies ended the season...           |
|  2 | Hokies Put Wake in...   | Sean Glennon threw for...                |
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy...       |
+----+-------------------------+------------------------------------------+
3 rows in set (0.00 sec)
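The two most common Boolean operators are `+` (word must be present) and `-` (word must be absent). The sketch below imitates only those two operators plus bare optional words; it leaves out the stopword and length filters and the other operators in the MySQL list, and the function names and `docs` data are hypothetical.

```python
import re

# Hypothetical stand-in for the recaps rows (title and body concatenated).
docs = [
    "Hokies Blank UVa #17 Hokies ended the season...",
    "Hokies Put Wake in Their Place Sean Glennon threw for...",
    "Hokies Blank Kent State Virginia Tech overcame a sloppy...",
]


def boolean_match(query, text):
    """Evaluate a query made of +required, -excluded and bare optional words."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    if any(w in words for w in excluded):
        return False  # an excluded word is present
    if required:
        return all(w in words for w in required)
    return any(w in words for w in optional)  # no + terms: any optional word will do


def boolean_search(query, docs):
    """Return the 1-based positions of the documents matching the query."""
    return [i for i, text in enumerate(docs, start=1) if boolean_match(query, text)]


print(boolean_search("+Hokies", docs))          # [1, 2, 3]
print(boolean_search("+Hokies -sloppy", docs))  # [1, 2]
```

Because there is no frequency threshold here, "Hokies" matches every row even though it appears in all of them.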

15 Blind Query Expansion (AKA Automatic Relevance Feedback)
How does one keep up with Virginia Tech’s multiple names / nicknames?
–Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT
Idea: run the query with the requested terms, then take the results and re-run the query with the most relevant terms from the initial results
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('Virginia Tech');
+----+-------------------------+------------------------------------------+
| id | title                   | body                                     |
+----+-------------------------+------------------------------------------+
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy...       |
+----+-------------------------+------------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('Virginia Tech' WITH QUERY EXPANSION);
+----+-------------------------+------------------------------------------+
| id | title                   | body                                     |
+----+-------------------------+------------------------------------------+
|  1 | Hokies Blank UVa        | #17 Hokies ended the season...           |
|  2 | Hokies Put Wake in...   | Sean Glennon threw for...                |
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy...       |
+----+-------------------------+------------------------------------------+
3 rows in set (0.00 sec)
in this example, pretend “Virginia Tech” did not appear in the game recaps and that “Hokies” appears in > 50% of rows
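The two-pass idea can be sketched as follows. This is a simplified illustration, assuming "most relevant terms" means the most frequent non-query words in the first-pass results; MySQL's own term selection is not reproduced, and the documents, stoplist, and function names are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"a", "as", "in", "the", "their", "to"}  # tiny illustrative stoplist

# Hypothetical collection: id -> text.
docs = {
    1: "Hokies blank UVa",
    2: "Hokies put Wake in their place",
    3: "Virginia Tech overcame a sloppy start as Hokies fans watched the Hokies win",
    4: "Clemson beats Wake Forest",
}


def terms(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def search(query_terms, docs):
    """Return ids of documents containing at least one query term."""
    wanted = set(query_terms)
    return [doc_id for doc_id, text in docs.items() if wanted & set(terms(text))]


def search_with_expansion(query, docs, extra_terms=1):
    """Run the query, add the most frequent new terms from the hits, and re-run."""
    query_terms = terms(query)
    first_pass = search(query_terms, docs)
    counts = Counter()
    for doc_id in first_pass:
        counts.update(t for t in terms(docs[doc_id]) if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(extra_terms)]
    return search(query_terms + expansion, docs), expansion


results, expansion = search_with_expansion("Virginia Tech", docs)
print(expansion)  # ['hokies']
print(results)    # [1, 2, 3]
```

The first pass finds only document 3; because "hokies" is the most frequent new word there, the second pass also picks up documents 1 and 2. The same mechanism can pull in noise when the first-pass results are off topic, so expansion tends to suit short queries best.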

16 For More Information…
MySQL documentation:
–http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Chapter 12/13 “Building a Content Management System”
CS 751/851 “Introduction to Digital Libraries”
–http://www.cs.odu.edu/~mln/teaching/
–esp. “Information Retrieval Concepts” lecture
Is MySQL the right tool for your job?
–http://lucene.apache.org/
MySQL examples in this lecture based on those found at dev.mysql.com
content snippets taken from www.techsideline.com

