What's the story with open source? Searching and monitoring news media with open source technology Charlie Hull, Flax BCS IRSG Search Solutions 2010

Presentation transcript:

What is Flax?
- Search engine specialists
- Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
- Based in Cambridge UK
- Contributors to and users of Xapian
- Recently selected as UK Authorized Partner by Lucid Imagination
- Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation

The challenges
- Content is created for publication, not for search
- Content isn't published consistently or available to all
- Ranking is never simple
- “We just want something like Google”
- Every system will have to scale beyond its originally planned size
- Every project is different

So how do we build news search?
Indexing
- Historical, daily & updates (i.e. later editions)
- Must cope with high volume, quickly
- Essential metadata: byline, title, source
- File format translation not always necessary
- BUT pre-processing sometimes required
- Content restriction & embargo data
Solution
- Lightweight, customisable index scripts using powerful open source libraries

So how do we build news search?

    from datetime import datetime

    import xapian
    import flax.core

    # Create a new Xapian database on disk
    db = xapian.WritableDatabase('db', xapian.DB_CREATE)

    # Describe the fields to index
    fm = flax.core.Fieldmap()
    fm.language = 'en'            # stem for English
    fm.setfield('mytext', False)  # freetext field
    fm.setfield('mydate', True)   # filter field
    fm.save(db)

    # Build and add one document
    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

So how do we build news search?
Searching
- Free text with Boolean operators
- Filters for metadata & date ranges
- Combine date and relevance ranking
- Faceted search where appropriate
- Saved searches & Alerting
- 'More like this'
- Content restriction & embargo filters
Solution
- Template-based user interface scripts, again using open source libraries
- Beware JavaScript & older browsers!
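
One way to read "combine date and relevance ranking" together with "embargo filters" is sketched below in plain Python. This is only an illustration, not how Flax or Xapian does it: the `Hit` type, the `half_life_days` and `date_weight` parameters, and the assumption that the engine's relevance score is already normalised to 0..1 are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Hit:
    """A search hit. Hypothetical type, not part of Flax or Xapian."""
    doc_id: str
    relevance: float                      # assumed normalised to 0..1
    published: datetime
    embargo_until: Optional[datetime] = None


def blended_score(hit: Hit, now: datetime,
                  half_life_days: float = 7.0,
                  date_weight: float = 0.3) -> float:
    """Mix the relevance score with an exponentially decaying recency score."""
    age_days = max((now - hit.published).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)   # 1.0 when new, 0.5 after one half-life
    return (1.0 - date_weight) * hit.relevance + date_weight * recency


def rank(hits: List[Hit], now: datetime) -> List[Hit]:
    """Drop hits still under embargo, then sort by the blended score."""
    visible = [h for h in hits
               if h.embargo_until is None or h.embargo_until <= now]
    return sorted(visible, key=lambda h: blended_score(h, now), reverse=True)
```

With these assumed weights a fresh, slightly less relevant story can outrank a month-old one, which is often what news users expect; the weights would need tuning per project.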

So how do we build news search?
Administration
- Indexing failures common
- Logging is essential
- Log to text as a first pass, reports later
Scalability
- Content is always growing
- Both indexing & searching must scale
- Open source search libraries provide distributed indexing, replication, remote indexes
- Not simple to get this right!
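
The "log to text as a first pass" advice can be sketched with Python's standard `logging` module. This is a minimal sketch, not the logging setup of any Flax project: the logger name, the `index_one` callback and the dictionary shape of an article are assumptions made for illustration.

```python
import logging
from typing import Callable, Dict, Iterable, Tuple


def make_index_logger(path: str) -> logging.Logger:
    """Return a logger that appends plain-text lines to the file at `path`."""
    logger = logging.getLogger("news_indexer")   # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:                      # avoid duplicate handlers on repeat calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def index_batch(articles: Iterable[Dict],
                index_one: Callable[[Dict], None],
                logger: logging.Logger) -> Tuple[int, int]:
    """Index each article; log failures and carry on rather than aborting the batch."""
    indexed = failed = 0
    for article in articles:
        try:
            index_one(article)
            indexed += 1
        except Exception:
            # logger.exception records the message plus the full traceback
            logger.exception("failed to index %s", article.get("id", "<no id>"))
            failed += 1
    logger.info("batch done: %d indexed, %d failed", indexed, failed)
    return indexed, failed
```

Because the output is plain text, reports can be built later by parsing the log file, which matches the "reports later" point on the slide.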

So how do we build news search?
Available open source technologies
- Languages: C/C++, Java, Python, JavaScript
- Search libraries: Xapian, Lucene
- Search bindings/servers: Xappy, Flax.core, Solr
- External libraries: pyparsing, CherryPy, xmllib, mxODBC, ...
- Presentation & UI: HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!

Some examples
Newspaper Licensing Agency – NLA Clipshare
- 20 million newspaper stories
- 6500 users
- Content from every major newspaper (and most regionals)
- Used by journalists, clippings agencies, media monitors
- Replacing internal systems at major newspapers
- One of very few ways to search content from all the papers within hours of publication

Some examples
Financial Times – press cuttings
- Web Service for easy integration
- XML source data
- Faceted search
- Area filters (whole article, body, headline, byline or any combination)
- Synonyms, spelling suggestions
- Built from scratch in a fortnight
- Designed as a prototype, scaled to production use without significant change

A different task – news monitoring
- Non-traditional use of search
- Many automated searches on incoming content
- Searches reflect complex client needs
- False positives require human checking
- False negatives should never occur!
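
Monitoring inverts the usual search flow: the queries are stored and each incoming article is run against them. The sketch below shows that idea with simple term sets in plain Python. It is not the query language used in any Flax system; the `Profile` fields and the tokeniser are assumptions chosen to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Profile:
    """A stored client search. Hypothetical structure for illustration."""
    client: str
    all_of: FrozenSet[str] = frozenset()    # every term must appear
    any_of: FrozenSet[str] = frozenset()    # at least one must appear (if given)
    none_of: FrozenSet[str] = frozenset()   # none may appear


def tokens(text: str) -> Set[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(profile: Profile, words: Set[str]) -> bool:
    if not profile.all_of <= words:
        return False
    if profile.any_of and not (profile.any_of & words):
        return False
    return not (profile.none_of & words)


def route(article_text: str, profiles: Iterable[Profile]) -> List[str]:
    """Return the clients whose stored profile matches one incoming article."""
    words = tokens(article_text)             # tokenise once, test many profiles
    return [p.client for p in profiles if matches(p, words)]
```

Since false negatives are the costly error here, a real profile would normally be written loosely and the resulting matches passed to human checkers, as the slide describes.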

A different task – news monitoring
An example: Durrants Ltd.
- Thousands of client search profiles
- Hundreds of thousands of articles per day
- Complex publication hierarchy
- Established pipeline
Solution
- Flexible query language allows for OCR errors, punctuation, fuzzy matching, weighting
- Supports features of previous engine
- Scalable master-slave architecture
- Accuracy improved in some cases from 95% rejected to 95% accepted
- Hardware budget 15% of previous system
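
To give a feel for what tolerating OCR errors can mean, here is a small stand-in using `difflib` from the Python standard library. It is not the matching used in the Durrants system, which the slides do not describe; the function name and the 0.8 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def fuzzy_contains(term: str, text: str, threshold: float = 0.8) -> bool:
    """True if any word in `text` is similar enough to `term`.

    Similarity is difflib's ratio (0..1); the default threshold is an
    assumed value that would need tuning against real OCR output.
    """
    term = term.lower()
    for word in re.findall(r"\w+", text.lower()):
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False
```

For example, `fuzzy_contains("Durrants", "cuttings from Durrauts Ltd")` accepts the misread "Durrauts", while a merely similar word such as "restaurants" stays below the threshold. Comparing every word against every term is slow at scale, so a production system would push this into the search engine itself.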

Why open source?
- Flexible, extendable
- Powerful & scalable
- Lower cost
- Commercial support available as necessary
- Freedom to innovate

Looking to the future
- More and more content including social media
- Multiple delivery platforms
- Search-powered websites & applications
- 'No-SQL'
- Cloud
- Search no longer a bolt-on, but a platform for innovation
- Open source no longer an outsider, but the obvious choice

Thank you! Questions?