
Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001.


1 Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

2 Overview: Classification Tools and Types; Consistent Controlled Classification Schemes (C.C.C.S.) Across All Content; Benefits of C.C.C.S.; Approaches to Portable Classification; Challenges; Examples; Q & A

3 Introduction Mark Shewhart, LexisNexis. One of the early innovators in building on-line databases and search tools, with classification. Currently providing an increasing range of tools, solutions and services to support the information needs of government organizations, companies, and individuals.

4 Uncontrolled Classification PROS: No manual development of classification algorithms or searches; aids in knowledge discovery & taxonomy development; adapts to changing terminology and topics. CONS: Difficulty providing meaningful labels to taxonomy; problematic on fine-grained rules. Examples: Verity, Semio, SRA's NetOwl Extractor, InXight's Thing Finder, LEXIS-NEXIS core-terms

5 Controlled Classification Machine Learning: Provide several hundred on-point samples per topic. Most systems do not allow for manual intervention. Examples - Verity, Semio, Autonomy, InXight, Purple Yogi, Webmind, Fulcrum, SmartLogik. Manually Created Algorithms: Human indexers manually create the algorithm for each topic. Examples - Any Boolean Search Engine, Verity, InXight Classifier, LEXIS-NEXIS SmartIndexing, Factiva Intelligent Indexing, Metacode, Sageware.

6 Controlled Classification Basic search tools with complex queries created by domain experts are a form of controlled classification. Natural Language: Verity, Alta-Vista, LexisNexis, West... Boolean: MS Site Server, Alta-Vista, LexisNexis, West, Factiva, Dialog... Enhanced (additional operators/control beyond Boolean): Verity, Semio...

7 Uses for Uncontrolled Classification Taxonomy Development: Several companies market tools focused on taxonomy development. Knowledge Discovery: Relationships between terms; new or changing terms.

8 Consistent Classification Scheme Everywhere Your Intranet, The Web, and Premium Content Providers: search all three using the same taxonomy. A consistent, controlled classification scheme facilitates data analysis & visualization - BIZ360, I2. Intra-document linking by taxonomy nodes. Investigative analysis of content.

9 Consistent Classification - One Stop Search Premium Content Your Intranet Web Content One Stop Search Mining

10 Consistent Classification - Locate & Link Dossier Case Law Patents Computer Company News Computing & Tech News Microsoft News Case with Microsoft as a Party Explore LEXIS-NEXIS for Microsoft Microsoft Web Site

11 Company Tracking and Analysis MICROSOFT CORP INTEL DELL COMPUTER CORP Your Companies User pre-selects companies to track.

12 Company Tracking and Analysis MICROSOFT CORP INTEL DELL COMPUTER CORP Your Companies User selects Microsoft Corp. Higher than average coverage flagged

13 Company Tracking and Analysis MICROSOFT CORP INTEL DELL COMPUTER CORP Your Companies The next day - User is back again Extremely high coverage flagged

14 Company Tracking and Analysis MICROSOFT CORP INTEL DELL COMPUTER CORP Your Companies Click on the red circle for News Topic Analysis

15 Company Tracking and Analysis MICROSOFT CORP INTEL DELL COMPUTER CORP Your Companies User clicks on the STOCKS bar for the news

16 Answer Set Navigation Executive Changes Stocks Lawsuits User clicks on Topic Analysis More Executive Changes More Stocks More Lawsuits

17 Consistent Classification - Trending Trend Analysis of Metadata NEXercise User Selected Indexing Terms: Download into Excel Spreadsheet Online Trading Electronic Commerce Internet Crime

18 Consistent Classification - Press Trending: Trending in the News
International Herald Tribune (Neuilly-sur-Seine, France), July 4, 2000, Tuesday … The National Security Agency certainly features regularly in Mr. Gertz's coverage. A Lexis-Nexis search lists 132 Gertz stories in The Washington Times going back to 1989 that have mentioned the agency.
The Washington Post, June 28, 2000,...easily discern one of the issues of greatest concern to voters: George W. Bush's position on the death penalty. A Nexis search Monday for stories mentioning Bush at least three times and the words "death penalty" or "executions" or "capital punishment" at least three …
The New York Times, June 14, 2000,...tally the Hotline political tip sheet keeps of how often possible vice-presidential choices merit a major media mention. Mr. Danforth had 10 mentions, compared with 49 for Gov. Tom Ridge of Pennsylvania, No. 1 on the 53-name list.
The Washington Times, May 05, 2000, … "A Nexis search of 'extreme right' over the past month scored 212 mentions; a Nexis search of 'extreme left' over the past month yielded 58 items.
MC Technology Marketing Intelligence, December 1, 1999 … We looked at such quantitative data as stock performance in 1999 and the number of press mentions (as shown in a Lexis-Nexis search),
Fortune, October 12, 1998, … Just how addicted to cliches are financial media editors? Here's a list of fave words and the number of stock market stories in which they appeared, generated by a Lexis-Nexis search from the end of August to Sept. 11: Turmoil: 1,559; plunge: 1,260; crash: 965; correction: 860; bear market: 750;...

19 Consistent Classification - Source Suggestion: Automatic Suggestion of Sources
LEXIS-NEXIS Suggest-a-Source, User Selected Indexing Term: Denver Broncos
LEXIS-NEXIS top Sources for Denver Broncos: Rocky Mountain News; Denver Post; Sports Network; Associated Press; Seattle Post-intelligencer; USA Today; Washington Post; Orlando Sentinel; Kansas City Star; Regal-fort Worth Star; San Diego Union Tribune
LEXIS-NEXIS Suggest-a-Source, User Selected Indexing Term: IPOs
LEXIS-NEXIS top Sources for IPOs: Cable News Network F; M&A Journal; AFX-Extel News; PR Newswire; Business Wire; Phillips Newsletter; Financial Times; Institutional Invest; IAC News; Business Times; Cable News Network; Asia Intelligence Wire; Financial Post; New York Post
What are these?

20 Consistent Classification - More Than a Cite List: Source Analyzer
Source Analyzer, User Selected Sources: Dayton Daily News; Washington Post; LA Times. Download into Excel Spreadsheet.
NEXIS Source Analyzer, Dayton Daily News Topics: 2697 Sports; 2616 Athletes; 2181 Basketball; 1871 Campaigns & Elections; 1772 College Sports; 1503 Cities; 1476 Lawyers; 1473 Baseball & Softball; 1438 High School Sports; 1345 Violent Crime; 1258 Litigation; 1207 Sentencing; 1158 Judges; 1132 American Football; 1086 Fundraising; 937 Television Programming; 931 Deaths & Obituaries; 857 Diseases & Disorders; 852 Settlements & Decisions; 837 Arrests
NEXIS Source Analyzer, Washington Post Topics: 11410 Sports; 8567 Campaigns & Elections; 7439 Athletes; 6415 Lawyers; 4665 Basketball; 4498 Violent Crime; 4393 Banking & Finance; 4265 Entertainment & Arts; 4155 Baseball & Softball; 3938 Judges; 3753 International Relations; 3703 Budget; 3675 College Sports; 3557 Cities; 3397 Litigation; 3384 Sentencing; 3243 Candidates; 3202 American Football; 3109 Television Programming; 2758 Fundraising
NEXIS Source Analyzer, Los Angeles Times Topics: 6080 Sports; 3375 Cities; 3101 Campaigns & Elections; 2915 High School Sports; 2815 Athletes; 2800 Lawyers; 2360 Basketball; 2347 Baseball & Softball; 2341 Letters & Comments; 2241 College Sports; 2188 Violent Crime; 2113 San Fernando Valley; 1918 Television Programming; 1851 Litigation; 1793 Judges; 1711 Deaths & Obituaries; 1504 Editorials & Opinions; 1410 Environment; 1391 Television Industry; 1380 Sentencing
Source Analyzer highlights Common Terms

21 Consistent Classification - More Than a Cite List: Source Analyzer
Source Analyzer, User Selected Sources: Financial Times; USA Today. Download into Excel Spreadsheet.
NEXIS Source Analyzer, Financial Times Topics: 61039 Banking & Finance; 32061 Mergers & Acquisitions; 18869 Telecommunications; 18112 Trade Agreements; 17499 Campaigns & Elections; 13484 Currencies; 11458 Computing & Technology; 11121 International Relations; 11056 Exchange Rates; 11009 Privatization; 10229 Emerging Markets; 10160 Energy; 9015 Joint Ventures; 8959 Stock Indexes; 8680 Debt; 8609 Budget; 8606 Automakers; 8424 Engineering; 8347 Central Banks; 8110 Taxes
NEXIS Source Analyzer, USA Today Topics: 30235 Sports; 17591 Athletes; 9006 Baseball & Softball; 9003 College Sports; 8989 Basketball; 8287 Television Programming; 7501 American Football; 7355 Campaigns & Elections; 6485 Lawyers; 6370 Banking & Finance; 5662 Olympics; 4975 Entertainment & Arts; 4884 Television Industry; 4469 Polls & Surveys; 3975 Litigation; 3832 Airlines; 3363 Judges; 3335 Violent Crime; 3331 International Relations; 2933 Network Television
Source Analyzer highlights Common Terms
The New Republic, JULY 26, 1999 … The U.S. section is lambasted for repeating what was reported in the American press. To prove it, Sullivan does a Nexis search on the topic of each article in a random issue and compares what he finds to The Economist. The results are not surprising.

22 Reporter Analysis: What is a reporter covering?
ByLine Analyzer, User Selected Reporter: Steve Schmidt. Download into Excel Spreadsheet.
NEXIS ByLine Analyzer, Steve Schmidt reported Topics: 13 CITIES; 10 NATIONAL PARKS; 10 CAMPAIGNS & ELECTIONS; 8 SUBURBS; 8 MARRIAGE; 7 THEME PARKS; 6 VIOLENT CRIME; 6 SECONDARY SCHOOLS; 5 SPORTS; 5 PUBLIC TRANSPORTATION
NEXIS ByLine Analyzer, Steve Schmidt reported Companies: 5 MICROSOFT CORP; 1 WALT DISNEY CO INC; 1 PACIFIC LUMBER CO; 1 PACIFIC BELL; 1 MAPES HOTEL; 1 DESTINATION PALM BEACH; 1 ALTURAS CASINO; 1 ALASKA AIR GROUP INC
NEXIS ByLine Analyzer, Steve Schmidt reported People: 4 DAVID KNIGHT; 3 SHAWN STINSON; 3 EMILIO ESTEVEZ; 3 CHARLIE SHEEN; 3 BILL GATES; 3 ALBERT GORE JR; 2 WILLIE L BROWN; 2 SCOTT HINSON; 2 PETE KNIGHT; 2 MICHAEL GONZALEZ
NEXIS ByLine Analyzer, Steve Schmidt reported Organizations: 4 SAN DIEGO STATE UNIVERSITY; 4 FEDERAL BUREAU OF INVESTIGATION; 3 SAN DIEGO CITY COUNCIL; 3 NATIONAL PARK SERVICE; 2 WILD HORSE ORGANIZED ASSISTANCE; 2 VALLEY MIDDLE SCHOOL; 2 UNIVERSITY OF CALIFORNIA (LOS ANGELES); 2 SAN DIEGO PADRES; 2 HELIX HIGH SCHOOL; 1 YOSEMITE INSTITUTE

23 Topic Analysis: Who's involved & who's reporting on the recent rash of bacteria-related product recalls?
Topic Analyzer, User Selected Topics: Product Recalls; Bacteria. Download into Excel Spreadsheet.
NEXIS Topics Analyzer, Top Reporters: 2 ROBERT WALKER; 2 NICOLE BAILEY; 2 LYNNE KOZIEY; 1 SHAWN OHLER; 1 SARAH GREEN; 1 QUINTIN ELLISON; 1 MATTHEW P BLANCHARD; 1 MARTHA M. HAMILTON; 1 MARLENE HABIB; 1 MARK BROWN; 1 LYLE HARVEY; 1 KATHERINE HARDING; 1 KAREN CLARK LEPOOLE; 1 JOHN TAYLOR; 1 JESSICA HANSEN; 1 IAN MCDOUGALL; 1 FRED ANKLAM JR; 1 DONNA CASEY; 1 DINA CAPPIELLO; 1 CHU SHOWWEI; 1 CHRISTINE WINTER; 1 BILL EGBERT; 1 BARBARA DURBIN
NEXIS Topic Analyzer, Top related Companies: 29 MOYER PACKING CO; 16 IBP INC; 12 PACKERLAND PACKING CO INC; 11 KRAFT FOODS; 6 LAKESIDE FARM INDUSTRIES; 5 PHILIP MORRIS COS INC; 5 FOOD SAFETY & INSPECTION SERVICE; 4 SNOW BRAND MILK PRODUCTS CO LTD; 3 GARDEN BOTANIKA INC; 2 XL FOODS; 2 STOP & SHOP SUPERMARKET CO; 2 LAKESIDE PACKERS; 2 GIANT FOOD STORES INC; 2 DEL GOULD MEATS INC; 2 COSTCO WHOLESALE CORP

24 Approaches Documents ASP Service Model Categories Service Provider Customer Internet

25 Approaches Port the Classification Application to run in the user's environment: Software, Intellectual Capital

26 Approaches Port the Intellectual Capital to another classification system's format & logic: Verity Users, Semio Users, Autonomy Users, Hummingbird Users, Inxight Users

27 Challenges Operator Incompatibility Parsing vs Inverted Word Index Tools Document Length Adjustments

28 Search Operator Compatibility Many Boolean search systems do not have a frequency operator, such as ATLEASTn( term ) at LexisNexis. Years ago, LexisNexis noticed that many experienced searchers were simulating a frequency operator by cascading an existing proximity operator: cat W/9999 cat W/9999 cat, to simulate ATLEAST3( cat ). How do we port an ATLEASTn() search to a system without a proximity operator, or to a system that does not cascade proximity operators?
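
The cascading trick on this slide can be sketched in a few lines of Python. This is only an illustration: `simulate_atleast` is a hypothetical helper, and the `W/9999` window is taken from the slide's example rather than from any documented limit.

```python
def simulate_atleast(term: str, n: int, window: int = 9999) -> str:
    """Build a cascaded-proximity query that approximates ATLEASTn(term).

    Hypothetical helper: repeats the term n times, joined by W/<window>.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return f" W/{window} ".join([term] * n)


print(simulate_atleast("cat", 3))  # cat W/9999 cat W/9999 cat
```

The output only helps on a target system that both has a proximity operator and lets it cascade, which is exactly the open question the slide raises.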

29 Porting Boolean Searches - Verity Example: ATLEASTn Operator
LNG Boolean: ATLEASTn( expr )
Verity: ( ( ( ( ( ( expr ) ) ) ) ) )
NOTE: ATLEASTn( expr1 or expr2 or … or exprX ) is equivalent to ATLEASTn( expr1 ) or ATLEASTn( expr2 ) or … or ATLEASTn( exprX )
ATLEASTn( expr1 and expr2 and … and exprX ) is equivalent to ATLEASTn( expr1 ) and ATLEASTn( expr2 ) and … and ATLEASTn( exprX )

30 Automatic Stemming - Precision Issues Many search engines perform automatic stemming. Stemming is needed for depluralization, which was assumed when the Search Advisor searches were created and tested. Unfortunately, this stemming allows words to match morphological variants other than singular/plural forms. For example, a search on CONSTITUTION may match CONSTITUTIONAL. This causes the ported searches to retrieve documents that the LN Boolean search does not. Some possible solutions:
Do nothing. The words are many times similar in concept. This would require more detailed domain by domain analysis.
Some search tools allow the user to put quotes around terms to turn off the stemming. If so, put quotes around all terms and generate additional terms in our search to simulate depluralization.
Put quotes around all terms and do NOT generate new terms. This omits depluralization as well. Huge recall hit I would imagine.
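
The second option (quote every term, then add plural variants by hand) might look roughly like the Python sketch below. Several things here are assumptions made for illustration: that the target engine turns stemming off for double-quoted terms, that it accepts `OR`, and that a naive English pluralizer is good enough. `quoted_with_plurals` is a hypothetical helper, not part of any product named in the deck.

```python
def quoted_with_plurals(terms):
    """Quote each term (assumed to switch stemming off) and OR it with a plural."""
    clauses = []
    for term in terms:
        # Naive English pluralization; a real port would need a proper word list.
        if term.endswith(("s", "x", "z", "ch", "sh")):
            plural = term + "es"
        elif term.endswith("y") and term[-2:-1] not in "aeiou":
            plural = term[:-1] + "ies"
        else:
            plural = term + "s"
        clauses.append(f'("{term}" OR "{plural}")')
    return clauses


for clause in quoted_with_plurals(["constitution", "party", "tax"]):
    print(clause)
```

With quoting in place, CONSTITUTION no longer matches CONSTITUTIONAL, while the generated plural keeps the depluralization behavior that the original searches assumed.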

31 Porting Boolean Searches - Recall Issues Proximity operators are impacted by differences in the set of non-searchable noise words. Porting LexisNexis searches to a system with fewer noise words will cause some documents matched by the LexisNexis search engine not to be retrieved. For example, the search ATTACHED w/5 POLE matches in LN but may not match elsewhere in the following text: "cable attached to the hopper which the gin-pole." This also occurs with terms joined by W/1 (really a phrase). We may also miss documents on the term SURETY CONTRACT when LN matched it in the phrase SURETY TO THE CONTRACT. Possible solution: increase n by 1 or 2 in the ported search. This could have precision impacts.
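
A small Python sketch makes the noise-word effect concrete. It assumes, for illustration only, that noise words are dropped before word positions are counted and that "to" and "the" are on the source system's noise list; `within` is a hypothetical helper, not an API of either engine.

```python
import re


def within(text, first, second, n, noise_words=frozenset()):
    """True if `first` and `second` occur within n searchable words of each other."""
    # Noise words are removed before positions are assigned (assumption).
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in noise_words]
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= n for i in first_pos for j in second_pos)


text = "surety to the contract"
print(within(text, "surety", "contract", 1, {"to", "the"}))  # source-style noise list
print(within(text, "surety", "contract", 1))                 # target with no noise words
print(within(text, "surety", "contract", 3))                 # widened window on the target
```

Under these assumptions the W/1 search matches only when the noise words are skipped, and widening n recovers the match on the target system at the cost of precision.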

32 Porting Uncontrolled Classification Tools To Yours
Weighted terms: 0.4 cat, 0.2 dog, 0.3 puppy, 0.4 mouse
Natural Language Search: cat, dog, puppy, mouse
Natural Language Search: cat, cat, cat, cat, dog, dog, puppy, puppy, puppy, mouse, mouse, mouse, mouse
New Weighted Natural Language Search that does not use TFIDF: cat(0.4), dog(0.2), puppy(0.3), mouse(0.4)
Many companies market uncontrolled classification tools that automatically create categories. Many cluster terms and assign weights different from TFIDF.
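
The repeated-term version of the search appears to come from turning each weight into a repeat count. A minimal Python sketch, assuming a scale factor of 10 (0.4 becomes four copies); `expand_weights` is a hypothetical helper:

```python
def expand_weights(weights, scale=10):
    """Repeat each term round(weight * scale) times to mimic term weighting."""
    terms = []
    for term, weight in weights.items():
        terms.extend([term] * round(weight * scale))
    return ", ".join(terms)


weights = {"cat": 0.4, "dog": 0.2, "puppy": 0.3, "mouse": 0.4}
print(expand_weights(weights))
```

This only approximates the weights on an engine whose natural-language ranking responds to repeated query terms, which is an assumption about the target system.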

33 LN Topical Indexing to Verity Example
#SUBJECT:
#CVTS:
#SUBJ=CATS & DOGS EXAMPLE
#TERMS:
#WEIGHT=1
#THRESH=5
#FREQLMT=4 {fl01 = 4}
#TERM01=cat
#TERM01=cats
#FREQLMT=4 {fl02 = 4}
#TERM02=dog
#TERM02=dogs

34 Word Concept Buckets
The #TERM01 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 can be represented in Verity as: ( ( ( (cat) ) ), ( ( cats ) ) ) )
The #TERM02 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 is represented in Verity as: ( ( ( (dog) ) ), ( ( dogs ) ) ) )

35 Word Concept Buckets: Examples of the TERM01 word concept counts (FL=4)
# cat/cats    ( ( ( (cat) ) ), ( ( cats ) ) ) )
0             0.00
1             0.25
2             0.50
3             0.75
4             1.00
5+            1.00
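
The scores in this table are consistent with capping the occurrence count at the frequency limit and dividing by that limit. A minimal Python sketch of that reading, with `bucket_score` as a hypothetical name:

```python
def bucket_score(count: int, freq_limit: int) -> float:
    """Scale a term count to 0.0-1.0, saturating at the frequency limit."""
    return min(count, freq_limit) / freq_limit


for count in range(6):
    print(count, bucket_score(count, 4))
```

With a frequency limit of 4, counts of 4 and above all score 1.00, which matches the 5+ row.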

36 Blocking Effect
#SUBJECT:
#CVTS:
#SUBJ=CAT DOG EXAMPLE
#TERMS:
#THRESH=4
#FREQLMT=5 {fl01 = 5}
#TERM01=cat dog
#FREQLMT=3 {fl02 = 3}
#TERM02=cat
#TERM02=dog
#BLOCK=cat food
#BLOCK=dog food
In SmartIndexing, we do not count cat if it is in the phrase cat dog. This is the Blocking Effect. This is not natural in search systems based on an inverted word index. Very unnatural - cats and dogs, sleeping together - total hysteria

37 Blocking Effect Verity has the operator which counts term frequency without the Blocking Effect. So the cat in cat dog is counted But … (cat) = (cat) - (cat dog) - (cat food) We have term counts with the blocking effect …. … Whoops! Verity does not have a operator!
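
The identity on this slide (blocked count of cat = count of cat, minus count of cat dog, minus count of cat food) can be checked with a short Python sketch. It is an illustration only, not SmartIndexing or Verity code: `count_phrase` and `blocked_count` are hypothetical helpers, and the sketch assumes each blocking phrase contains the term exactly once.

```python
import re


def count_phrase(words, phrase):
    """Count occurrences of a word or multi-word phrase in a token list."""
    target = phrase.split()
    size = len(target)
    return sum(1 for i in range(len(words) - size + 1)
               if words[i:i + size] == target)


def blocked_count(text, term, blocking_phrases):
    """Term count with the Blocking Effect: subtract hits inside blocking phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = count_phrase(words, term)
    return total - sum(count_phrase(words, p) for p in blocking_phrases)


text = "The cat ate cat food while the cat dog slept near another cat"
print(blocked_count(text, "cat", ["cat dog", "cat food"]))
```

In the sample sentence, cat occurs four times, once inside cat food and once inside cat dog, so the blocked count is two.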

38 Learning to Subtract Introducing ( b, a ), defined as b - a = ( ( ( b ), a ) ), where 0 <= a <= b <= 1. Follow the math: ( ( ( b ), a ) ) = ( ( b ) + a ) = ( 1 - b + a ) = 1 - ( 1 - b + a ) = 1 - 1 + b - a = b - a
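
The arithmetic on this slide suggests that subtraction is built from two primitives: one that maps a score x to 1 - x and one that adds scores. The Python sketch below follows that reading. The names `complement`, `score_sum`, and `subtract` are hypothetical stand-ins, since the operator names do not survive in this transcript.

```python
def complement(x: float) -> float:
    """Map a score x in [0, 1] to 1 - x."""
    return 1.0 - x


def score_sum(x: float, y: float) -> float:
    """Add two scores."""
    return x + y


def subtract(b: float, a: float) -> float:
    """Compute b - a using only complement and sum, for 0 <= a <= b <= 1."""
    if not 0.0 <= a <= b <= 1.0:
        raise ValueError("requires 0 <= a <= b <= 1")
    # complement(b) = 1 - b; adding a gives 1 - b + a; complementing gives b - a.
    return complement(score_sum(complement(b), a))


print(subtract(0.75, 0.25))
```

The condition 0 <= a <= b <= 1 keeps the intermediate value 1 - b + a inside the 0.0 to 1.0 score range.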

39 Actual Results from CATS & DOGS EXAMPLE
Cats & Dogs Test Summary, expected results: Score (Doc_)
              0 cat/cats      1 cat/cats      2 cat/cats      3 cat/cats      4 cat/cats      5+ cat/cats
0 dog/dogs    0.0 (CD1)       0.125 (CD7)     0.25 (CD11)     0.375 (CD14)    0.50 (CD16)     0.50 (CD17)
1 dog/dogs    0.125 (CD2)     0.25 (CD8)      0.375 (CD12)    0.50 (CD15)     0.625 (CD27)    0.625 (CD32)
2 dog/dogs    0.25 (CD3)      0.375 (CD9)     0.50 (CD13)     0.625 (CD23)    0.750 (CD28)    0.750 (CD33)
3 dog/dogs    0.375 (CD4)     0.50 (CD10)     0.625 (CD20)    0.750 (CD24)    0.875 (CD29)    0.875 (CD34)
4 dog/dogs    0.50 (CD5)      0.625 (CD18)    0.750 (CD21)    0.875 (CD25)    1.00 (CD30)     1.00 (CD35)
5+ dog/dogs   0.50 (CD6)      0.625 (CD19)    0.750 (CD22)    0.875 (CD26)    1.00 (CD31)     1.00 (CD36)
Cats & Dogs Test, actual results: Score (Doc_)
              0 cat/cats      1 cat/cats      2 cat/cats      3 cat/cats      4 cat/cats      5+ cat/cats
0 dog/dogs    0.0000 (CD1)    0.1247 (CD7)    0.2494 (CD11)   0.3746 (CD14)   0.4997 (CD16)   0.5000 (CD17)
1 dog/dogs    0.1247 (CD2)    0.2494 (CD8)    0.3742 (CD12)   0.4993 (CD15)   0.6244 (CD27)   0.6247 (CD32)
2 dog/dogs    0.2494 (CD3)    0.3742 (CD9)    0.4989 (CD13)   0.6240 (CD23)   0.7492 (CD28)   0.7494 (CD33)
3 dog/dogs    0.3746 (CD4)    0.4993 (CD10)   0.6240 (CD20)   0.7492 (CD24)   0.8743 (CD29)   0.8746 (CD34)
4 dog/dogs    0.4997 (CD5)    0.6244 (CD18)   0.7492 (CD21)   0.8743 (CD25)   0.9994 (CD30)   0.9997 (CD35)
5+ dog/dogs   0.5000 (CD6)    0.6247 (CD19)   0.7494 (CD22)   0.8743 (CD26)   0.9997 (CD31)   1.0000 (CD36)
Verity Threshold = THRESH/MAX = 5/8 = 0.625
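
The expected scores on this slide are consistent with summing the capped counts of both word concepts and dividing by the maximum possible total (4 + 4 = 8 with WEIGHT=1). The Python sketch below reproduces that arithmetic. `expected_score` is a hypothetical helper, and treating a score at or above the threshold as a match is an assumption.

```python
def expected_score(counts, freq_limits, weight=1):
    """Sum capped concept counts and normalize by the maximum possible total."""
    max_score = weight * sum(freq_limits)
    raw = sum(weight * min(c, fl) for c, fl in zip(counts, freq_limits))
    return raw / max_score


THRESHOLD = 5 / 8  # THRESH / MAX from the slide

# (cat count, dog count) pairs from the expected-results table
for cats, dogs in [(1, 0), (3, 2), (5, 1), (4, 4)]:
    score = expected_score((cats, dogs), (4, 4))
    print(cats, dogs, score, score >= THRESHOLD)
```

The actual Verity scores in the table fall slightly below these ideal values (0.6244 against 0.625, for example), so a ported threshold may need a small tolerance.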

40 Q & A Mark Shewhart Consulting Research Scientist LexisNexis mark.shewhart.3@lexis-nexis.com 937-865-6800 x4717

