
Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE RMIT University 2004 Burak Görener 201195001 Doğuş University.


1 Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE RMIT University 2004 Burak Görener 201195001 Doğuş University

2 Search Engines... Need to evaluate queries extremely fast. Many queries involve phrases. Phrase queries should be supported with low disk overheads.

3 Introduction Most queries consist of a simple list of words. Some query terms must be ordered and adjacent, typically indicated by enclosing them in quotation marks. The standard way to evaluate phrase queries is to use an inverted index. An inverted index (II) uses a list of postings (each posting includes a document ID) and a list of offsets (ordinal word positions). It works by combining the postings lists of the query terms to find the documents in which they occur. This process is usually fast, but not always, because of common words.
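The phrase evaluation described on this slide can be sketched in a few lines. This is a minimal illustration, not the engine used in the paper: the function names `build_inverted_index` and `phrase_query` and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [ordinal word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_query(index, phrase):
    """Return {doc_id: [start positions]} where the words are ordered and adjacent."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return {}
    # Start from the first word's postings, then narrow with each later word.
    candidates = {d: set(p) for d, p in index[words[0]].items()}
    for k, word in enumerate(words[1:], start=1):
        postings = index[word]
        narrowed = {}
        for doc_id, starts in candidates.items():
            if doc_id in postings:
                offsets = set(postings[doc_id])
                # The k-th word must sit exactly k positions after the start.
                kept = {s for s in starts if s + k in offsets}
                if kept:
                    narrowed[doc_id] = kept
        candidates = narrowed
    return {d: sorted(s) for d, s in candidates.items()}

# Hypothetical sample documents.
docs = {
    1: "the tower of london is old",
    2: "london has a tower and a bridge",
    3: "visit the tower of london and the tower of pisa",
}
index = build_inverted_index(docs)
result = phrase_query(index, "tower of london")
```

Every postings list of every query word is touched, which is why a common word such as "of", with its very long list, dominates the cost.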

4 Introduction Cont. A common term requires several megabytes of postings for each GB of indexed data. A crude solution is to use stopping. Google ignored common words in phrase queries until 2002; until then, many queries were evaluated incorrectly.

5 Introduction Cont. A nextword index is like an inverted index. Its index terms are word pairs (a firstword and a nextword). For each index term (firstword) it stores a list of the words (nextwords) that follow that term; a firstword and a nextword occur as a pair. Its disadvantage is its storage size, and the nextword list must be processed linearly. Directly indexing the 10,000 most common phrase queries reduces query evaluation time by over 10%.

6 Next... Introduction (Fin) Properties of Phrase Queries Inverted Index in Phrase Queries Partial Phrase and Nextword Indexing Combining Phrase and Inverted Indexing Experimental Result Conclusion

7 Properties of Queries This research used Excite query logs from 1997 and 1999. These logs have similar properties. They contain 1,583,922 queries, including duplicates. 8.3% of these were explicit phrase queries. Overall, 5-10% of queries are explicit phrase queries. Queries were matched against a Web dataset of around 20 GB. Of the phrase queries, 11,103 or 8.4% include one of the three common words the, to, and of. In total, 14.4% of phrase queries include one of the 20 commonest terms.


9 Properties of Queries Common words play an important role! In tower of london, the common word can be safely neglected during evaluation. But this is not so in special names such as movie or brand names, for example End of days or The who. These queries are difficult to evaluate with stopwords removed. The query logs also include: To be or not to be; Who are we; All in all.

10 Properties of Queries Stopping may yield efficiency gains, but a significant number of queries cannot be correctly evaluated. The basic query tower of london is evaluated as tower – london. Stopping the 3 commonest words: 309 x 10^6 matches. Stopping the 20 commonest words: 490 x 10^6 matches. Stopping the 254 commonest words: 1693 x 10^6 matches. The most common confusion is between from and to: flights from london is mismatched with flights to london.
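A small sketch of why stopping causes these mismatches: once the common words are dropped, queries with different meanings become the same word list. The stopword set below is a made-up illustration, not one of the lists used in the experiments.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "to", "of", "from", "in", "on"}

def stop(query, stopwords=STOPWORDS):
    """Drop stopwords from a query, keeping the remaining words in order."""
    return [w for w in query.lower().split() if w not in stopwords]

from_london = stop("flights from london")
to_london = stop("flights to london")
tower = stop("tower of london")
```

Here `from_london` and `to_london` are identical after stopping, so the engine can no longer tell the two queries apart.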

11 Properties of Queries Other mismatch examples: So many roads -> how many road; Man in the moon -> man on the moon. Among the phrase queries: most have 2 words; 34% have 3 words; 1.3% have 6 or more words.

12 Properties of Queries Test data: the WT10g collection. It is 10.27 GB of Web data (HTML) in 1.67 million documents. It was crawled in 1997.

13 Most Frequent Words and Word Pairs

14 Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries Partial Phrase and Nextword Indexing Combining Phrase and Inverted Indexing Experimental Result Conclusion

15 Inverted Index It is a standard method for supporting queries on large text databases. It is fast for ranked query evaluation. It uses a two-level structure: the upper level is a vocabulary or lexicon; the lower level is a set of postings lists. Notation of Zobel and Moffat (1998): d is the document ID; f_dt is the frequency of the term in document d; o_x is a position of the term in document d.

16 Inverted Index Let's look at "hatful of hollow". This is the general structure of an inverted index. It contains term and document frequencies. Word positions are ordinal.

17 Inverted Index The inverted index evaluator is the open-source MG text retrieval engine, described by Witten et al. (1999). The inverted index for WT10g is 1,429 MB. The postings of the stopped words take 427 MB (490 stopwords). The stopped inverted index is 1,002 MB.

18 Inverted Index Performance results for the inverted index

19 Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing Combining Phrase and Inverted Indexing Experimental Result Conclusion

20 Phrase Indexes A phrase index is an inverted index in which the items stored are word sequences. A partial phrase index with a vocabulary of five popular phrases.

21 Phrase Indexes A phrase index for phrases of length L = 3 cannot be used efficiently for 2-word queries. Phrases with L >= 2 are stored as terms in a conventional inverted index. L = 2 is organized as a partial nextword index. Partial phrase index notation: d is the document ID and f_dp is the frequency of the phrase in document d. Offsets are not stored. This saves the cost of merging lists.
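A partial phrase index of this kind can be pictured as a dictionary whose keys are whole phrases. The sketch below is an assumption-laden illustration: the name `lookup_phrase` is hypothetical and the (d, f_dp) values are invented, not taken from the collection.

```python
# Each indexed phrase maps to a list of (d, f_dp) pairs; offsets are not stored.
# The document IDs and frequencies here are invented for the example.
phrase_index = {
    "lord of the rings": [(3, 2), (17, 1)],
    "britney spears": [(5, 4)],
}

def lookup_phrase(phrase_index, query):
    """Return the postings for an indexed phrase, or None if it is not indexed."""
    key = " ".join(query.lower().split())  # normalize case and spacing
    return phrase_index.get(key)

hit = lookup_phrase(phrase_index, "Lord of the  Rings")
miss = lookup_phrase(phrase_index, "tower of london")
```

A hit answers the query with a single lookup and no list merging; a miss (`None`) means the query has to be evaluated with another index.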

22 Phrase Indexes Examples are lord of the rings (19) and britney spears (59)* in 2001. Given a stream of queries over a long period and a fixed volume of memory, it may also be necessary to update the vocabulary or replace the least frequently used queries. This research does not experiment with this approach. * The number of times the same query was requested.

23 Nextword Indexes A phrase query can never have fewer than two words. A nextword index is similar to an inverted index. Term representation: f_wp is the document frequency; d is the document ID; f_dwp is the frequency of the term in d; o_x is a position of the term in d.
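One way to picture the nextword structure is as a nested mapping from firstword to nextword to postings. This is a minimal sketch under that assumption; the names `build_nextword_index` and `pair_postings` and the two sample documents are hypothetical.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Map firstword -> nextword -> {doc_id: [positions of the firstword]}."""
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        words = text.lower().split()
        for pos in range(len(words) - 1):
            first, nxt = words[pos], words[pos + 1]
            index[first][nxt].setdefault(doc_id, []).append(pos)
    return index

def pair_postings(index, first, nxt):
    """Return the postings of a firstword-nextword pair, or {} if it never occurs."""
    if first not in index or nxt not in index[first]:
        return {}
    return index[first][nxt]

# Hypothetical sample documents.
docs = {
    1: "in new york in new hampshire",
    2: "new york city",
}
nextword_index = build_nextword_index(docs)
new_york = pair_postings(nextword_index, "new", "york")
in_new = pair_postings(nextword_index, "in", "new")
```

In this layout the document frequency of a pair is `len(new_york)` and the in-document frequency is the length of each position list.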

24 Nextword Indexes A nextword index with two firstwords. An example: boulder municipal employee credit union. This can be grouped as boulder-municipal, employee-credit, and credit-union. Another example: historical railroads in new hampshire. It can be grouped with railroads-in in preference to in-new, as railroads is much less common than in.
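The grouping in these two examples can be reproduced with a simple greedy rule: take pairs in order of how rare their firstword is, and keep a pair only if it covers a word not yet covered. This is a sketch of one plausible selection rule, not necessarily the exact procedure in the paper; `choose_pairs` is a hypothetical name and the word frequencies are invented.

```python
def choose_pairs(words, frequency):
    """Greedily pick firstword-nextword pairs, rarest firstword first,
    until every query word is covered by at least one pair."""
    pairs = [(i, words[i], words[i + 1]) for i in range(len(words) - 1)]
    pairs.sort(key=lambda p: frequency.get(p[1], 0))
    covered = set()
    chosen = []
    for i, first, nxt in pairs:
        if not {i, i + 1} <= covered:  # the pair covers at least one new word
            chosen.append((i, first, nxt))
            covered.update((i, i + 1))
        if len(covered) == len(words):
            break
    # Report the chosen pairs in query order.
    return [(first, nxt) for _, first, nxt in sorted(chosen)]

# Invented collection frequencies, for illustration only.
freq = {
    "boulder": 10, "municipal": 50, "employee": 30, "credit": 40, "union": 60,
    "historical": 20, "railroads": 5, "in": 1000, "new": 300, "hampshire": 15,
}
credit_union = choose_pairs("boulder municipal employee credit union".split(), freq)
railroads = choose_pairs("historical railroads in new hampshire".split(), freq)
```

With these frequencies the pair in-new is never selected, because "in" is the most common firstword and both of its words are already covered by other pairs.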

25 Nextword Indexes The nextword index for the WT10g collection is 2.75 GB in size, about twice that of the inverted index file. Processing with the nextword index involves more complex structures than processing with an inverted index. Differences between the inverted index and the nextword index in queries.

26 Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing (Fin) Combining Phrase and Inverted Indexing Experimental Result Conclusion

27 Combining Nextword and Inverted Indexing The proposal is that common words be used only as firstwords in a partial nextword index.

28 Combining Phrase and Inverted Indexing As an example, the query new york city can be resolved by using the partial phrase index to find the locations of new york and merging them with the inverted index postings list for city.
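The merge step in the new york city example might look like the sketch below. It assumes, for the sake of illustration, that the phrase entry supplies start positions for new york; the function name `combine` and all postings values are hypothetical.

```python
def combine(phrase_postings, term_postings, phrase_length):
    """Keep phrase occurrences that are immediately followed by the term.

    phrase_postings: {doc_id: [start positions of the phrase]}
    term_postings:   {doc_id: [positions of the single term]}
    """
    matches = {}
    for doc_id, starts in phrase_postings.items():
        offsets = set(term_postings.get(doc_id, ()))
        # The term must appear right after the last word of the phrase.
        kept = [s for s in starts if s + phrase_length in offsets]
        if kept:
            matches[doc_id] = kept
    return matches

# Invented postings, for illustration only.
new_york = {1: [4], 2: [0, 7]}   # from the partial phrase index
city = {1: [9], 2: [2, 20]}      # from the inverted index
new_york_city = combine(new_york, city, phrase_length=2)
```

Only two lists are merged instead of three, and the list for the phrase is typically much shorter than the lists of its individual words.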

29 Three-Way Index Combination It includes a partial nextword index, a partial phrase index, and a full inverted index.

30 Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing (Fin) Combining Phrase and Inverted Indexing (Fin) Experimental Result Conclusion

31 Experimental Result All experiments were run on an Intel 700 MHz Pentium III based server with 2 GB of memory. Result of Inverted and Nextword Indexing. This table includes the memory usage of the combinations.

32 Result of Inverted and Nextword Indexing Results for n-term queries with inverted and nextword indexing

33 Result of Inverted Index and Phrase This test evaluates the 100, 1,000, and 10,000 most frequent distinct queries. The phrase index was less than 0.1% of the collection: 2.1 MB, 4.8 MB, and 12.8 MB. In the query logs, an american dictionary of the english language and los angeles department of water and power are among the 10,000 most common queries. Experimental results:

34 Result of Inverted Index, Nextword Index and Phrase This result is based on testing 66,000 queries, using a phrase index of the 10,000 most common queries, a nextword index (stopped words only), and an inverted index.

35 Next... Introduction (Fin) Properties of Phrase Queries (Fin) Inverted Index in Phrase Queries (Fin) Partial Phrase and Nextword Indexing (Fin) Combining Phrase and Inverted Indexing (Fin) Experimental Result(Fin) Conclusion


