Dataware’s Document Clustering and Query-By-Example Toolkits John Munson Dataware Technologies 1999 BRS User Group Conference.

1 Dataware’s Document Clustering and Query-By-Example Toolkits John Munson Dataware Technologies 1999 BRS User Group Conference

2 Document Clustering
• Automatically creates clusters of similar documents
• General benefit: provides an overview of the range of topics in a set
• Multiple specific uses
  – Familiarization with database before searching
  – Familiarization with a result set after searching
  – Assistance in category definition for other uses
    - Category tree construction
    - FAQ construction

3 Dataware’s Clustering Toolkit
• One API function
• Source of documents is a BRS result set
  – which could be backref 0 for entire database
  – Can specify certain fields for analysis
• Output indicates member documents for each cluster
• Application can specify number and max/min size of clusters, etc.
• US PTO (Patent and Trademark Office) plans to do category tree construction

4 How It Works
• Extracts keywords from each document
  – using our keyword-generation library
    - which is also in the BRS 6.3 keyword generation load filter
• Repeats these steps:
  – Compare document and cluster pairs using the keyword lists
    - How many keywords do two lists share, and how similar are their weights?
  – Combine the most similar pair into one cluster
• Stops when n clusters remain (n is configurable)
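
The merge loop on this slide can be sketched as below. This is a minimal illustration, not the toolkit's code: the slide does not say how keyword lists are compared or combined, so cosine similarity over keyword weights and a weight-summing merge are assumptions, and the names `cosine`, `merge`, and `cluster` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {keyword: weight} dicts (assumed measure)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def merge(a, b):
    """Combine two keyword lists by summing weights (assumed merge rule)."""
    out = dict(a)
    for k, w in b.items():
        out[k] = out.get(k, 0.0) + w
    return out


def cluster(docs, n):
    """Merge the most similar pair repeatedly until n clusters remain.

    docs: {doc_id: {keyword: weight}}
    Returns a list of (member_ids, combined_keywords) tuples.
    """
    n = max(n, 1)
    clusters = [([doc_id], dict(kw)) for doc_id, kw in docs.items()]
    while len(clusters) > n:
        # Compare every pair of clusters and pick the most similar one.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], merge(clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Every pass compares all remaining pairs, which is why pairwise comparison becomes expensive as the document count grows.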

5 How It Works
• Output is a list of clusters, including:
  – a cluster quality score
    - Measures how cohesive the cluster is
  – a ranked list of keywords describing the cluster
  – a ranked list of member documents
    - Highest-ranked docs are the most “central”
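
One way to produce the three outputs listed on this slide is sketched below. The slide does not define the quality score or how "central" is measured, so this sketch assumes a centroid of averaged keyword weights, ranks documents by cosine similarity to that centroid, and uses the mean similarity as a cohesion score. `describe_cluster` is a hypothetical name.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {keyword: weight} dicts (assumed measure)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def describe_cluster(members, top_k=5):
    """Summarize one cluster.

    members: {doc_id: {keyword: weight}} for the documents in the cluster.
    """
    if not members:
        raise ValueError("cluster has no member documents")
    # Centroid: average weight of each keyword across the member documents.
    centroid = {}
    for keywords in members.values():
        for k, w in keywords.items():
            centroid[k] = centroid.get(k, 0.0) + w / len(members)
    sims = {doc_id: cosine(kw, centroid) for doc_id, kw in members.items()}
    return {
        # Mean similarity to the centroid, used here as a cohesion score.
        "quality": sum(sims.values()) / len(sims),
        # Keywords ranked by average weight.
        "keywords": sorted(centroid, key=centroid.get, reverse=True)[:top_k],
        # Documents ranked from most to least central.
        "documents": sorted(sims, key=sims.get, reverse=True),
    }
```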

6 Speed Tricks
• Speed is a big issue in clustering
  – especially for interactive searching
  – Keyword extraction takes time
  – Pairwise comparisons don’t scale up well at all
  – Thus, we use a couple of speed tricks
    - One trick for database design
    - One trick inside the clustering function
• Trick 1: Pre-generate keywords
  – Use the BRS 6.3 keyword generation load filter
  – The filter produces a keyword paragraph that looks like this...

7 Speed Tricks
..Keywords: compartment (187.80). mass (156.56). methylhistidine (118.12)....
• At clustering time, we don’t need to do keyword analysis
  – Just retrieve keyword lists from engine
  – Cuts execution time in half
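
A pre-generated keyword paragraph in the form shown on this slide could be read back into a keyword/weight list roughly as follows. The pattern is inferred only from the single sample line above, so the exact paragraph syntax (single-word keywords, a weight in parentheses) is an assumption, and `parse_keywords` is a hypothetical helper rather than a BRS function.

```python
import re

# A keyword followed by its weight in parentheses, e.g. "mass (156.56)".
KEYWORD_RE = re.compile(r"([A-Za-z][\w-]*)\s*\((\d+(?:\.\d+)?)\)")


def parse_keywords(paragraph):
    """Return {keyword: weight} parsed from a keyword paragraph."""
    return {word: float(weight) for word, weight in KEYWORD_RE.findall(paragraph)}
```

For the sample line on the slide this yields the three keywords `compartment`, `mass`, and `methylhistidine` with their weights as floats.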

8 Speed Tricks
• Trick 2: Cluster a sample of the set (Cutting et al.)
  – Create the desired number of clusters from a small sample
  – Then compare the remaining documents only to those few clusters, not to all other documents
  – Saves a huge amount of execution time
• Another trick for result-set clustering:
  – Cluster only the top-ranked 100 to 1000 docs
• A final speed note: CPU speed helps a lot
  – Clustering is very processor-intensive
    - 2x CPU speed gives almost 2x clustering speed
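
Trick 2 can be sketched in two phases: pairwise clustering on the sample only, then a single pass that assigns each remaining document to its nearest cluster. This is an illustration under assumptions: cosine similarity over keyword weights, clusters represented by summed keyword weights, and a caller-supplied `sample_ids` list (in practice a random sample). `cluster_with_sample` is a hypothetical name.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two {keyword: weight} dicts (assumed measure)."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def cluster_with_sample(docs, n, sample_ids):
    """Cluster docs ({doc_id: {keyword: weight}}) into n groups via a sample."""
    n = max(n, 1)
    # Phase 1: expensive pairwise merging, but only over the small sample.
    clusters = [([d], dict(docs[d])) for d in sample_ids]
    while len(clusters) > n:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cosine(clusters[p[0]][1], clusters[p[1]][1]),
        )
        members_j, keywords_j = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i][0].extend(members_j)
        for k, w in keywords_j.items():
            clusters[i][1][k] = clusters[i][1].get(k, 0.0) + w
    # Phase 2: each remaining document is compared to the n clusters only.
    sampled = set(sample_ids)
    for doc_id, keywords in docs.items():
        if doc_id not in sampled:
            best = max(clusters, key=lambda c: cosine(keywords, c[1]))
            best[0].append(doc_id)
    return [members for members, _ in clusters]
```

With a sample of s documents out of N, the pairwise work depends on s alone, and the rest costs about (N - s) × n comparisons.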

9 Query-By-Example (QBE)
• Allows an example passage or document to serve as a query
• Useful when we already have some text or a document about our topic
  – “Find more like this”
  – No query formulation required
  – QBE analyzes the text, then constructs and executes a query

10 Dataware’s QBE Toolkit
• One API function
• Source of example text can be:
  – a text buffer
    - e.g. text selected with mouse
  – a BRS document (or documents) from a result set
    - e.g. selected from a title list
    - Can specify certain fields for analysis
  – a word list with weights or occurrence counts
• Output is a standard ranked document list

11 How It Works
• Extracts keywords from the example text
  – using... all together now... our keyword-generation library, yet again
• Keyword selection process likes words that:
  – occur frequently in the example text
  – are rare in the database as a whole
• Getting database statistics can be done:
  – using field qualification
    - most accurate but slow
  – using no qualification
    - still good, much faster
  – not at all
    - just use occurrence counts in example text
    - fastest, but trickier
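
The selection preference described on this slide (frequent in the example, rare in the database) resembles a TF-IDF style weight. The sketch below is a generic version of that idea, not the keyword-generation library's actual formula, which the slide does not give. The parameters `doc_freq` (documents containing each term) and `num_docs` stand in for database statistics and are assumptions; omitting them falls back to raw occurrence counts, the "not at all" option.

```python
import math
import re
from collections import Counter


def select_keywords(text, doc_freq=None, num_docs=None, top_k=10):
    """Return the top_k (term, weight) pairs for an example text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    weights = {}
    for term, tf in counts.items():
        if doc_freq is None or not num_docs:
            # No database statistics: use occurrence counts only.
            weights[term] = float(tf)
        else:
            # Frequent in the example, rare in the database scores highest.
            df = doc_freq.get(term, 0)
            weights[term] = tf * math.log((num_docs + 1) / (df + 1))
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Without statistics, very common words such as "the" tend to rank first, which is one plausible reading of why that mode is "trickier"; some extra filtering, such as a stopword list, would likely be needed.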

12 How It Works
• Performs a ranked search using the keywords and their weights
• Flexible fielding:
  – Analysis of example document(s) can use one set of BRS paragraphs
  – Search can use a different set
• Speed trick:
  – Generate keyword field for database (load filter)
  – Field-level index it
  – Use it for QBE searches
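
The ranked search over a pre-generated keyword field can be illustrated with a small in-memory inverted index. This is a stand-in for the BRS engine, not its API: `build_index` and `ranked_search` are hypothetical names, and scoring a document by the summed weights of the query keywords it contains is an assumed ranking rule.

```python
from collections import defaultdict


def build_index(doc_keywords):
    """Index a keyword field: {doc_id: [keyword, ...]} -> {keyword: {doc_id, ...}}."""
    index = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            index[keyword].add(doc_id)
    return index


def ranked_search(index, query_weights, limit=10):
    """Score each document by the summed weights of the query keywords it contains."""
    scores = defaultdict(float)
    for term, weight in query_weights.items():
        for doc_id in index.get(term, ()):
            scores[doc_id] += weight
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
```

Searching only the keyword field keeps the index small compared with indexing full text, which is consistent with the speed trick on the slide.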

13 That’s all, folks!

