1 Finding Stuff: -LSI and Database Searching- A Business Use Case Joe Tragert EBSCO Publishing Bentley June 26, 2006
2 Overview: EBSCO Publishing overview; Latent Semantic Indexing pros and cons; integrating diverse content types (the Executive Daily Brief use case); discovering obfuscated records (the US PTO example)
3 EBSCO Industries Ranked #162 in Forbes “America’s Largest Private Companies” in 2005
4 EBSCO Publishing. Research & reference solutions: Corporate, Medical, Academic, Public Library, K-12. 73 terabytes of content, configured into over 100 different proprietary full-text databases. Redistributes 100+ 3rd-party reference products. Founded in 1987; 550 employees worldwide; HQ in Ipswich, MA.
5 Latent Semantic Indexing Searching is focused on the words, not indices or metadata. The engine can be “trained” to optimize results by domain (engineering, medical, general business, etc.) Engine creates a vector space based upon the data it sees. All articles are placed within that vector space. Updates are quickly assigned values within the vector space, enabling real-time integration of RSS feeds. Multiple data sources are integrated rapidly, requiring a few hours to a few days.
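The vector-space construction on this slide can be written down in the textbook LSI formulation, a truncated singular value decomposition of a term-by-document matrix. The symbols below (A, U_k, Σ_k, V_k, k, d, q) are standard notation and an assumption here; the slides do not say how the engine is actually implemented.

```latex
% A: m x n term-by-document matrix; k: number of latent dimensions kept
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad k \ll \min(m, n)

% "Folding in" a new document or query with term vector d,
% without recomputing the decomposition:
\hat{d} = \Sigma_k^{-1} U_k^{\top} d

% Relevance of a document to a query in the latent space:
\operatorname{sim}(\hat{q}, \hat{d}) =
  \frac{\hat{q} \cdot \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Under this formulation, the fold-in step is what would make "updates are quickly assigned values" cheap: a new RSS item needs only one matrix-vector product, not a rebuild of the space.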
6 LSI Advantages. Conceptual search: concepts are matched, not key words; it is easier to create searches by using chunks of text as search “terms”; there is no need to understand thesauri or Boolean operators. Integrated content (databases, blogs, RSS, etc.): multiple databases can be searched at once (similar to federated search, but different…); since the words are searched, there is no need to normalize indices or record structures of source data sets. Real-time content: the engine can rapidly assign new content to the existing vector space, enabling integration of current content with archival material. Language agnostic: since all content is converted to values in the vector space, multiple languages can be searched and returned in a single result list.
7 LSI Disadvantages. Precision: matching concepts does not lead to the “one perfect article”. Multiple content types in one result set require robust filtering and refining functionality to minimize confusion. Default date-order sorting can “overwhelm” a result list. Multi-language search is seductive, but requires a quality translation feature to get the best utility from the results. It can be difficult for the “Google generation” to grasp the concept of “concepts”.
8 Why Use LSI? Structured data: users tend not to care about metadata. Currency is king: users tend to focus on “real time” content (news sites, blogs), but periodicals can provide real value. Skills: not everyone is a librarian… actually, most aren’t. Tools: slow to learn, slower to change. Perspective: impatient with complexity.
9 LSI Use Case: Customizable monitoring and alert service. Supports non-librarian corporate uses: brand management, corporate intelligence, general counsel, IP management, etc. Two types of search: Content Analyst LLC’s patented Concept Search™ and EBSCO’s keyword search. Multiple content types: premium business content (EBSCO structured content); newspapers; RSS feeds (blogs, news sites); licensed databases (USPTO, INSPEC, etc.); intranet repositories.
10 Multiple Content Types and Search Methods. 1. Users can set up folders and monitor for content related conceptually (same meaning, but different words) to key words or article “examples” already in the folders. 2. Users can search for immediate results that are related to words, articles, emails or external documents, using Concept Search or Key Word Search. 3. Users can link to “advanced” key word search options, thesauri, and visual searching.
11 Folders Are Determined by End Users. Users can add, delete or edit “alerts” (folders) as needed. Users put words, phrases, paragraphs, full articles, emails, MS Word docs, etc. into the folders. EDB (the Executive Daily Brief) adds matches to the folders. Results for a folder appear when the folder is selected. Users can easily make a result into a “concept” (example) and put it into a folder.
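The folder workflow on this slide (examples go in, matching content is added, a result can be promoted to a new example) can be sketched roughly as follows. Everything here is hypothetical: the `Folder` class, its method names, and the 0.3 threshold are illustrative, and a plain bag-of-words cosine stands in for the Concept Search engine, whose scoring the slides do not describe.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (a stand-in for the real concept vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return num / den if den else 0.0


class Folder:
    """Hypothetical alert folder: holds examples and collects matches."""

    def __init__(self, name, threshold=0.3):
        self.name = name
        self.threshold = threshold  # assumed cut-off, not from the slides
        self.examples = []
        self.matches = []

    def add_example(self, text):
        self.examples.append(vectorize(text))

    def offer(self, text):
        """Score incoming content against every example; keep it if close enough."""
        vec = vectorize(text)
        score = max((cosine(vec, ex) for ex in self.examples), default=0.0)
        if score >= self.threshold:
            self.matches.append((score, text))
            return True
        return False

    def promote(self, text):
        """Turn a result into a new example to be monitored."""
        self.add_example(text)


folder = Folder("knee implants")
folder.add_example("prosthetic knee implant system")
hit = folder.offer("new knee implant system announced")
miss = folder.offer("quarterly earnings report")
```

In this sketch a single good example is enough to pull in related items, which mirrors the slide's point that users supply text rather than construct queries.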
12 Structured Content in Familiar Layout. The full text is viewed in a pop-up window. The user can link to the source (the article on EBSCOhost, news site, the RSS feed provider, licensed database or intranet file). Users can email, save, or print the document, or add it to their folder as a new example to be monitored.
13 Linking to RSS Providers Simplifies Access. Selected RSS articles are viewed in a pop-up window. The user links to the source.
14 Results Are Refined, Interactively. Users can sort results by Date, Title, Publication and Relevance. Users can narrow results by Publication or Content Type. Users can delete previously read content, content of a specific relevance, or content published before a specific date.
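The sort, narrow, and delete operations listed on this slide amount to simple operations over a result list. A minimal sketch follows, assuming a hypothetical `Result` record with a 0 to 1 relevance score; the field names, function names, and sample data are invented for illustration and are not taken from the product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    title: str
    publication: str
    content_type: str
    published: date
    relevance: float  # hypothetical score in [0, 1]


# Sort key and direction per field: newest and most relevant first.
SORT_KEYS = {
    "date": (lambda r: r.published, True),
    "title": (lambda r: r.title.lower(), False),
    "publication": (lambda r: r.publication.lower(), False),
    "relevance": (lambda r: r.relevance, True),
}


def sort_results(results, by="relevance"):
    key, descending = SORT_KEYS[by]
    return sorted(results, key=key, reverse=descending)


def narrow(results, publication=None, content_type=None):
    """Keep only results from one publication and/or of one content type."""
    return [
        r for r in results
        if (publication is None or r.publication == publication)
        and (content_type is None or r.content_type == content_type)
    ]


def prune(results, min_relevance=0.0, not_before=None):
    """Drop results below a relevance floor or published before a date."""
    return [
        r for r in results
        if r.relevance >= min_relevance
        and (not_before is None or r.published >= not_before)
    ]


results = [
    Result("Knee implant review", "Journal A", "periodical", date(2006, 5, 1), 0.9),
    Result("Blog post on scooters", "Blog B", "rss", date(2006, 6, 20), 0.4),
    Result("Patent digest", "Journal A", "periodical", date(2005, 1, 15), 0.7),
]
newest_first = [r.title for r in sort_results(results, by="date")]
```

Sorting by relevance rather than date by default is one way to address the "default date order can overwhelm a result list" concern raised among the disadvantages.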
15 Alerts Controlled by End User. Users can set up email lists (groups and individuals) to automatically forward documents. Users can set a higher relevancy threshold for shared documents than for their own inbox (only send the “best” articles to colleagues).
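The two-threshold idea on this slide (a lower bar for one's own inbox, a higher bar for forwarding) could look something like the sketch below. The `Alert` class, its fields, and the threshold values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert with separate thresholds for self and colleagues."""
    owner: str
    inbox_threshold: float
    share_threshold: float
    share_list: list = field(default_factory=list)

    def recipients(self, relevance):
        """Who should receive a document with this relevance score."""
        out = []
        if relevance >= self.inbox_threshold:
            out.append(self.owner)
        if relevance >= self.share_threshold:
            out.extend(self.share_list)
        return out


alert = Alert(
    owner="analyst",
    inbox_threshold=0.5,
    share_threshold=0.8,
    share_list=["legal-team", "brand-manager"],
)
```

With these example values, a document scoring 0.6 reaches only the owner, while one scoring 0.9 is also forwarded to the share list.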
16 LSI Use Case: Find deliberately obscured patents. Compare prior art to current research. Monitor pending patents. Search patents in native languages: USPTO, European Patent Organization, Japan Patent Office. Expose patent search to more staff: bench scientists, competitive intelligence, risk managers.
17 Sneak Peek: EBSCO Patent Monitor. In development, with a Fall 2006 release. Use Concept Searching to identify “conceptually related patents”. Enable cross-database searching: patents (various sources), published STM literature, proprietary research & intranets.
18 Searching on “motorcycle” finds patents that do not include the term “motorcycle”
19 Patent #6,085,857 does not contain the word “motorcycle”, but it sure looks like one… aka: “motorcycle”
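How a search on one word can surface a document that never uses it can be illustrated with a toy LSI space. This is a sketch under stated assumptions, not the Content Analyst engine: the four "documents" are invented, the latent space has only two dimensions, and the decomposition is done with plain power iteration on the document Gram matrix instead of a library SVD so the example stays dependency-free. The query is folded into the space and compared to each document by cosine similarity.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def top_eigenpairs(matrix, k, iters=1000):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _ in range(iters):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in work])
        pairs.append((lam, v))
        # Deflate so the next pass converges to the next eigenvector.
        for i in range(n):
            for j in range(n):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def build_space(docs, k):
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Gram matrix: entry (i, j) is the term-count dot product of docs i and j.
    gram = [[float(sum(counts[i][t] * counts[j][t] for t in counts[i]))
             for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(gram, k)
    coords = [[v[j] for _, v in pairs] for j in range(n)]  # latent doc vectors
    return counts, pairs, coords


def fold_in(text, counts, pairs):
    """Project new text into the existing space without rebuilding it."""
    q = Counter(tokenize(text))
    overlap = [float(sum(q[t] * c[t] for t in q)) for c in counts]
    return [dot(v, overlap) / lam for lam, v in pairs]


# Invented toy corpus; docs[1] never mentions "motorcycle".
docs = [
    "motorcycle engine wheel saddle",
    "saddle riding vehicle engine wheel",
    "knee implant prosthesis joint",
    "knee joint implant surgery",
]
counts, pairs, coords = build_space(docs, k=2)
query = fold_in("motorcycle", counts, pairs)
sims = [cosine(query, c) for c in coords]
```

In this toy space the "saddle riding vehicle" document scores close to the query even though a keyword search for "motorcycle" would miss it, because it shares co-occurring terms (engine, wheel, saddle) with a document that does use the word.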
20 Running a concept search on the patent abstract creates an “instant context list”. These terms are found in the USPTO database and relate to “saddle-type riding vehicles.” Users can search the USPTO database to find those patents, or they can research the individuals to see who else is an expert…
21 The terms and names on the Instant Context list can indicate the true nature of the patent… Shinobu Tsutsumikoshi is a developer at Suzuki...
22 Search using a press release on the new Maxim Knee System and get hundreds of related patents…
23 US Patent #6,090,144 is about prosthetic knees even though the Maxim press release never used the term “prosthesis”
24 Finding Stuff: The Dead Mouse Test. LSI, key words, proximity, etc… The real question is not which mousetrap works better… just: did we kill the mouse?
25 Thank You. Joe Tragert, Director, Market Development, EBSCO Publishing. O: +800-653-2726 ext. 661. E: firstname.lastname@example.org