Search Engine Industry Trends – Impact for Digital Libraries Dr. John M. Lervik, CEO FAST 7th International Bielefeld Conference 2004.


1 Search Engine Industry Trends – Impact for Digital Libraries Dr. John M. Lervik, CEO FAST 7th International Bielefeld Conference 2004

2 Fast Search & Transfer (FAST)
Since 1997, FAST has grown globally
–Public company (OSE: FAST)
–200+ employees, 80 in R&D
–Profitable and well capitalized
Fast growing: > 900 customers & partners (Univ. Lib Bielefeld, HBZ, ZIB, Norwegian Natl Lib, Elsevier, LexisNexis, etc.)
#2 growing company in Europe 1998-2002
–Internet business sold to Overture/Yahoo!
–Acquired AltaVista software w/200 customers
Slide labels: Industrial Strength; Magic Quadrant: Most Visionary; Excellent Choice
Locations shown: Oslo, Tromsø, London, Munich, Rome, Boston, New York, Washington DC, Chicago, San Francisco, Rio de Janeiro, Tokyo

3 Mission-Critical Business Search
Search has become mission-critical & strategic:
–Internet portals: Google, MSN, Yahoo!, …
–E-commerce: Amazon, eBay, …
–Corporate web sites: Dell.com, IBM.com, ...
–Yellow Pages: SEAT PG, TPI PA, Findexa, …
–Directory services: Thomas Publishing, Bonnier…
–Mobile: Vodafone live!, …
Common purpose: Connect buyer with seller

4 Search Trends
The Google effect
–Users demand simple one-field search
–Users demand relevant results
–Paid search (advertisement) is the main business driver
Challenge: Search is much more difficult in the academic and corporate world
–Need to provide the relevant (correct) answer
–Web search: Provide a relevant answer
Solution: 3rd generation search technology
–Improved relevance through content and query analysis
–Tools for navigation, discovery, and visualization

5 Digital Library Challenges
Digital libraries face an information management challenge
–Huge and increasing amount of digital data
–Data/content aggregation, data store (repository), information retrieval & discovery, etc.
Increasing volumes and types of digital data
–Media types: Books, magazines, CDs, ...
–Media formats: Text/numbers (incl. metadata), audio files, images, video
–Must support various access patterns, copyright, etc.
Need flexible and efficient interfaces between information and users
–Search engine as unified information access layer

6 Current Role of Search - Point Solutions
[Diagram labels: The Corporation; Isolated Solutions. Search applications: SITE SEARCH, MAIL SEARCH, DMS SEARCH, BI SEARCH, CORPORATE SEARCH, ECOMMERCE SEARCH. Data sources: intranet documents; email (mail system); documents (DMS, CMS); RDBMS, ERP, CRM, legacy data, data warehouse, data marts.]

7 … to a Horizontal Search Platform…
Content sources feeding the Enterprise Search Platform (grouped on the slide as UNSTRUCTURED, STRUCTURED, REAL-TIME):
–RDBMS (JDBC, ODBC, SQLNet, DW, DM)
–Applications (e.g. ERM, CRM, Help Desk)
–Legacy Data (e.g. ISAM, VSAM, IMS)
–Message Queues (e.g. TIBCO, MQ-Series)
–DMS (e.g. MSoft CMS, Documentum)
–eMail Systems (e.g. Notes, Exchange)
–Files (e.g. Word, Excel, pdf, images, mp3)
–Portals (e.g. WebSphere, WebLogic)
–WWW (HTML, XML, WML, JavaScript)
–Private Webs (e.g. news feeds, Intranets)
–Direct Push
Search applications served by the platform: SITE SEARCH, MAIL SEARCH, BI SEARCH, DMS SEARCH, CORPORATE SEARCH, ECOMMERCE SEARCH, …
A common, unified service for intelligent, dynamic information retrieval
Other slide labels: Web services; GRID computing

8 Search Engine How It Works
[Architecture diagram labels. Content sources: Web content; files, documents; databases; custom applications; multimedia. Connectors: web crawler, file traverser, database connector, content push. Engine components: document processing pipeline; index files; search; query & result processing pipeline; filter; alert; tuning, administration. Traffic: query, results. Front ends: vertical applications, portals, custom front-ends, mobile devices.]
Open, modular, scalable architecture

9 Search Engine How It Works
Connect to content sources and get data
–Web pages (e.g. XML, HTML, WML): Crawler
–Files, documents (e.g. Word, Excel, pdf): File traverser
–Database content (e.g. Oracle, DB2): Database connectors
–Applications (e.g. Notes, Exchange, CMS/DMS): Application connectors
[Architecture diagram as on slide 8]
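
The connector idea on this slide can be pictured as several source-specific readers that all emit documents in one common shape. The following is a minimal Python sketch under stated assumptions: the names `Document`, `file_traverser`, `database_connector` and `collect_demo` are hypothetical and are not taken from any FAST product API, and SQLite stands in for the databases named on the slide.

```python
import os
import sqlite3
import tempfile
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Document:
    """Common shape every connector emits (hypothetical, for illustration)."""
    doc_id: str
    body: str
    metadata: Dict[str, str] = field(default_factory=dict)


def file_traverser(root: str) -> Iterator[Document]:
    """Walk a directory tree and yield one Document per text file."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                yield Document(path, handle.read(), {"source": "file"})


def database_connector(conn: sqlite3.Connection, table: str) -> Iterator[Document]:
    """Yield one Document per row of a table that has (id, text) columns."""
    # The table name is interpolated only because this is a toy example.
    query = f"SELECT id, text FROM {table} ORDER BY id"
    for row_id, text in conn.execute(query):
        yield Document(f"{table}:{row_id}", text, {"source": "db"})


def collect_demo() -> List[Document]:
    """Gather documents from a temporary directory and an in-memory database."""
    docs: List[Document] = []
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "a.txt"), "w", encoding="utf-8") as handle:
            handle.write("digital library report")
        docs.extend(file_traverser(root))
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER, text TEXT)")
    conn.execute("INSERT INTO articles VALUES (1, 'search engine trends')")
    docs.extend(database_connector(conn, "articles"))
    conn.close()
    return docs
```

Because every connector yields the same `Document` type, the later processing stages do not need to know where a document came from.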

10 Search Engine How It Works
Analyze and index content to make it searchable
–Convert and process content through pre-processing pipeline: lemmatization, entity extraction, taxonomy classification, ontology; custom logic (e.g. adding special tags)
–Write content to index files
[Architecture diagram as on slide 8]
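
A document pipeline followed by an index write, as described on this slide, might be sketched as below. This is an illustration only: the stage names are hypothetical, `toy_lemmatize` is a crude stand-in for real lemmatization (it only strips a trailing "s"), and the in-memory inverted index is an assumption, not the index file format of any product.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set

Stage = Callable[[dict], dict]


def tokenize(doc: dict) -> dict:
    """Lowercase the body and split it into alphanumeric tokens."""
    doc["tokens"] = re.findall(r"[a-z0-9]+", doc["body"].lower())
    return doc


def toy_lemmatize(doc: dict) -> dict:
    """Crude stand-in for lemmatization: strip a trailing 's' from longer tokens."""
    doc["tokens"] = [
        t[:-1] if len(t) > 3 and t.endswith("s") else t for t in doc["tokens"]
    ]
    return doc


def add_tag(doc: dict) -> dict:
    """Example of custom logic: attach a special tag to the document."""
    doc.setdefault("tags", []).append("processed")
    return doc


def run_pipeline(doc: dict, stages: List[Stage]) -> dict:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


def build_index(docs: List[dict], stages: List[Stage]) -> Dict[str, Set[str]]:
    """Process each document and record which documents contain each token."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        processed = run_pipeline(dict(doc), stages)
        for token in processed["tokens"]:
            index[token].add(processed["id"])
    return dict(index)
```

Keeping each stage as a plain function makes it easy to insert custom logic at any point, which is the kind of customization the slide alludes to.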

11 Search Engine How It Works
Analyze query
–Use query language or query API
–Convert and process query through query pipeline: linguistic processing; custom logic (e.g. query term modification/addition)
[Architecture diagram as on slide 8]
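
The query side can be sketched the same way as the document side: a raw query string passes through linguistic steps and then custom logic that modifies or adds terms. The functions, the stop-word list and the thesaurus below are all hypothetical examples chosen for illustration.

```python
import re
from typing import Dict, FrozenSet, List

STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "of"})  # illustrative only


def parse_query(raw: str) -> List[str]:
    """Lowercase the query and split it into terms."""
    return re.findall(r"[a-z0-9]+", raw.lower())


def drop_stop_words(terms: List[str]) -> List[str]:
    """Linguistic step: remove very common words."""
    return [t for t in terms if t not in STOP_WORDS]


def expand_synonyms(terms: List[str], thesaurus: Dict[str, List[str]]) -> List[str]:
    """Custom logic: add thesaurus synonyms after each original term."""
    expanded: List[str] = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def process_query(raw: str, thesaurus: Dict[str, List[str]]) -> List[str]:
    """Run the full query pipeline on a raw query string."""
    return expand_synonyms(drop_stop_words(parse_query(raw)), thesaurus)
```

For instance, `process_query("The history of the Car", {"car": ["automobile"]})` returns `["history", "car", "automobile"]`.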

12 Search Engine How It Works
Match query to content index
–Query- and content-adaptive matching
–Exploit all information and structure in the data
[Architecture diagram as on slide 8]

13 Search Engine How It Works
Return results to user
–Convert and process results through result pipeline: re-sort, filter for security, analyze for navigation and discovery (dynamic drill-down)
–Pass results on to application (generated or through API)
–Push results to alert engine and then external environment (e.g. mail, queue)
[Architecture diagram as on slide 8]
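
Two of the result-pipeline steps named here, filtering for security and re-sorting, can be sketched as small composable functions. The field names (`allowed_groups`, `score`) and the group-based access model are assumptions made for this example and are not described in the presentation.

```python
from typing import List, Set


def filter_for_security(results: List[dict], user_groups: Set[str]) -> List[dict]:
    """Keep only hits that at least one of the user's groups may read."""
    return [r for r in results if user_groups & set(r["allowed_groups"])]


def resort(results: List[dict], sort_field: str, descending: bool = True) -> List[dict]:
    """Re-sort hits on any field, e.g. score or publication year."""
    return sorted(results, key=lambda r: r[sort_field], reverse=descending)


def process_results(results: List[dict], user_groups: Set[str], sort_field: str) -> List[dict]:
    """Result pipeline: apply the security filter first, then re-sort."""
    return resort(filter_for_security(results, user_groups), sort_field)
```

Filtering before sorting means restricted hits never reach later stages such as drill-down counting or alerting.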

14 Search Engine Features: Relevant, Organized Information
Linguistic Analysis
–Auto-language detection
–Natural language processing
–Approximate matching (spelling)
–Lemmatization (grammar)
–Entity extraction, anti-phrasing
–Multiple dictionaries, thesauri
Taxonomy and Classification
–Structured, unstructured data
–Supervised, unsupervised categorization
–Dynamic classification
–Auto-taxonomy generation (terms, Web)
–Taxonomy toolkit
–Ontologies
Open, Flexible Relevancy Model
–Absolute and relative query boosting
–Relative document boosting
–Custom processing logic (pre-index, query)
–Rule-based matching
Powerful Query Language
–Exact matches, wildcards, multiple terms
–More like this (query by example), near
–Text, integer, Boolean expressions (infinite level of parentheses)
–Integer comparisons (>, =, <)
–Fuzzy queries, concept
Flexible Search and Sort
–Range searching
–Default sort, sort by field
–Static & dynamic teasers, any field
–Full inclusion, exclusion URI control
–Robot aware
Navigation, Discovery & Visualization
–Structured, unstructured data
–Dynamic drill-down (faceted browsing)
–Results-based binning
–Statistical analysis
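
As a small illustration of the "approximate matching (spelling)" feature listed above, the sketch below suggests the closest known term for a misspelled query word. It uses Python's standard `difflib` similarity matching as a stand-in technique; the presentation does not say how FAST implements approximate matching, and the vocabulary and cutoff are made up for the example.

```python
import difflib
from typing import List, Optional


def suggest(term: str, vocabulary: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest vocabulary term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A caller could run `suggest` on each query term that has no hits in the index and offer the result as a "did you mean" hint.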

15 Relevance & Information Discovery
Traditional: Result sets are typically lists of document identifiers
3rd generation: Result set depends on the query intentions
–Traditional result set lists
–Dynamic clustering: Supervised and unsupervised
–Live analytics (dynamic drill-down) for navigation and discovery
–Visualization...
2 ways to search:
- I know what I want, but I don't know where it is
- I'm not sure what I'm looking for, but I know how to get there
Slide labels: Intelligent Organization; The search bar; Live analytics

16 Traditional Result Set
There are 2 ways to search for anything:
- I know what I want, but I don't know where it is
- I'm not sure what I'm looking for, but I know how to get there
The search bar
Languages
–77 languages auto-detectable, searchable, sortable
–20 languages include advanced linguistics
–Multiple code sets for each language
Multiple field sorting
Linguistics
–Auto-language detection
–Approximate matching (spelling)
–Lemmatization (grammar)
–Phrase detection
–Anti-phrasing, stop words
–Proximity search
–Multiple dictionaries, thesauri
–Full search language (incl. text, integer, Boolean)

17 Relevance: Ranking – The FCASQ Framework
Completeness
–How well does the query match superior contexts like the title or the URL?
–Example: query=Mexico. Is "Mexico" or "University of New Mexico" the best match?
Authority
–Is the document considered an authority for this query?
–Examples: Web link cardinality, article references (citations), product revenue, page impressions, ...
Statistics
–How well do the contents of this document overall match the query?
–Examples: Proximity, context weights, tf-idf, degree of linguistic normalization, etc.
Quality
–What is the quality of the document?
–Examples: Homepage? Entry point to product group? Press release? ...
Freshness
–How fresh is the document compared to the time of the query?
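
One way to picture how the five FCASQ signals could feed a single rank value is a weighted average. This is purely an assumption for illustration: the slide names the signals but does not give a combination formula, and the weights, the 0..1 signal scale and the function name `fcasq_score` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Signals:
    """Per-document signals, each assumed to be normalized to the range 0..1."""
    freshness: float
    completeness: float
    authority: float
    statistics: float
    quality: float


# Hypothetical weights; a real deployment would tune these per application.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "freshness": 0.1,
    "completeness": 0.3,
    "authority": 0.2,
    "statistics": 0.3,
    "quality": 0.1,
}


def fcasq_score(signals: Signals, weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of the five signals (an assumed combination rule)."""
    total_weight = sum(weights.values())
    weighted = sum(w * getattr(signals, name) for name, w in weights.items())
    return weighted / total_weight
```

With the slide's own example, a page titled exactly "Mexico" would get a higher completeness value for the query "Mexico" than a page titled "University of New Mexico", and so would rank higher if the other signals were equal.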

18 Navigation & Discovery
There are 2 ways to search for anything:
- I know what I want, but I don't know where it is
- I'm not sure what I'm looking for, but I know how to get there
Live Analytics
Multi-Dimensional Navigation
–Taxonomic, ontological
–Clustering of extracted entities
–Field-based categories
Dynamic, Automatic Generation
–Auto-generated from configuration definitions
–Re-generated on each query
–Internal scoring for further refinement
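
The "re-generated on each query" behavior of field-based navigation can be sketched as counting field values over the current hits only, then recounting after the user narrows the result set. The function names and the sample records below are hypothetical and not drawn from any real data set.

```python
from collections import Counter
from typing import Any, Dict, List


def _as_list(value: Any) -> list:
    """Treat a missing value as no values and a scalar as a one-item list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def drill_down(hits: List[dict], fields: List[str]) -> Dict[str, Counter]:
    """Count field values over the current hits, so the facets track the query."""
    facets: Dict[str, Counter] = {name: Counter() for name in fields}
    for hit in hits:
        for name in fields:
            facets[name].update(_as_list(hit.get(name)))
    return facets


def refine(hits: List[dict], field: str, value: Any) -> List[dict]:
    """Narrow the hits to those carrying the chosen facet value."""
    return [hit for hit in hits if value in _as_list(hit.get(field))]
```

Calling `drill_down` again on the output of `refine` gives the next level of navigation, which is how a multi-dimensional drill-down can be built from one simple counting step.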

19 Automatically Extracted Entities

20 Information Discovery Example: Scirus Metadata

21 Information Discovery Example: Medical Information (Medline) – 12M Documents
Discovery
–MeSH keywords
–Publication year
–Journal
–Title
–Author(s)
–Chemical substances
–Etc.

22 Information Discovery Example: Medical Information – 12M Documents

23 Analytical Search Example: Author Analysis Data source: 12M Medline Publications

24 Example: Echocardiography - Author drill-down
Jim Seward, Mayo
Jim Seward: Publishing pattern; Co-Authors; Research Topics

25 Example 1: Scirus (www.scirus.com)
Scirus is the leading online search engine for scientific content
Twice winner of SEW Best Specialty Search Engine award
Scientific Web Pages: 140 million Web pages (.edu, .gov, .org, .com, …)
Proprietary Databases: 30M article records (Medline, ScienceDirect, …)
Value Added Functionalities
–Large-scale content aggregation
–Automatic content & page classification
–Query refinements (1-D drill-down)

26 Example 2: Norwegian National Library (www.nb.no)
One integrated search engine across many diverse projects
–One search interface for all catalogs, instead of searching in 100+ databases
–Information from objects of all types of media (multimedia, textual content, metadata)
–Used in in-house library production systems, end-user services and ongoing innovation projects
Projects
–The Digital Radio Archive (DRA): NRK Radio historical radio archive (300,000 programs)
–Culture Net Norway: The official gateway to Norwegian culture on the web
–The Digital Newspaper Library: 300,00 pages from 1763 onwards
–Cultural Heritage Ekofisk: Content related to the Ekofisk oil field (incl. OAI metadata harvester)
–The National Library's public web site
–Paradigma (Preservation, Arrangement & Retrieval of Assorted DIGital MAterials)
–The Nordic Web Archive (NWA): Harvesting and archiving of web documents

27 Summary
Search engines can do more than just search…
–Unified information access solution for digital libraries
–Open, scalable and modular architecture: Allows for customization
–Adapts to content and queries
–Powerful data discovery, navigation, and visualization
Many exciting technology developments to come
–More advanced content and query analysis
–Adaptive, personalized query- & content-sensitive matching
–Dynamic result set presentation, navigation, discovery, visualization
–Federation across external content applications

28 Thank you!

