What's the story with open source? Searching and monitoring news media with open source technology Charlie Hull, Flax BCS IRSG Search Solutions 2010 Photo.

1 What's the story with open source? Searching and monitoring news media with open source technology Charlie Hull, Flax BCS IRSG Search Solutions 2010 Photo source: http://www.flickr.com/photos/shironekoeuro/

3 www.flax.co.uk What is Flax?
- Search engine specialists
- Formed in 2001 from the ashes of Muscat Ltd and Webtop as Lemur Consulting Ltd
- Based in Cambridge, UK
- Contributors to and users of Xapian
- Recently selected as UK Authorized Partner by Lucid Imagination
- Customers include Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen
Apache Lucene and Solr are trademarks of The Apache Software Foundation

10 The challenges
- Content is created for publication, not for search
- Content isn't published consistently or available to all
- Ranking is never simple
- “We just want something like Google”
- Every system will have to scale beyond its originally planned size
- Every project is different

19 So how do we build news search?
Indexing
- Historical, daily & updates (i.e. later editions)
- Must cope with high volume, quickly
- Essential metadata – byline, title, source
- File format translation not always necessary
- BUT pre-processing sometimes required
- Content restriction & embargo data
Solution
- Lightweight, customisable index scripts using powerful open source libraries

20 So how do we build news search?
    from datetime import datetime

    import xapian
    import flax.core

    # Create a new writable Xapian database in ./db
    db = xapian.WritableDatabase('db', xapian.DB_CREATE)

    # Describe the fields once and store the mapping in the database
    fm = flax.core.Fieldmap()
    fm.language = 'en'            # stem for English
    fm.setfield('mytext', False)  # freetext field
    fm.setfield('mydate', True)   # filter field
    fm.save(db)

    # Build one document, index its fields, and add it
    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()
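
The pre-processing and embargo points from the indexing slide can be sketched in plain Python. This is only an illustration, not Flax's indexing script: the XML element names (`title`, `byline`, `source`, `embargo`, `body`) and the helper names `extract_fields` and `is_visible` are assumptions made up for the example.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Hypothetical article format; the element names are assumptions.
SAMPLE = """<article>
  <title>Budget cuts announced</title>
  <byline>A. Reporter</byline>
  <source>Example Gazette</source>
  <embargo>2010-02-03T12:00:00+00:00</embargo>
  <body>Full text of the story.</body>
</article>"""


def extract_fields(xml_text):
    """Pull the fields a news index needs out of one XML article."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name in ("title", "byline", "source", "body"):
        # findtext returns the default when the element is missing
        fields[name] = (root.findtext(name, default="") or "").strip()
    embargo = root.findtext("embargo")
    # Keep the embargo as a datetime so it can be indexed as a filter field
    fields["embargo"] = datetime.fromisoformat(embargo.strip()) if embargo else None
    return fields


def is_visible(fields, now):
    """An article is searchable once its embargo (if any) has passed."""
    return fields["embargo"] is None or now >= fields["embargo"]


fields = extract_fields(SAMPLE)
```

The resulting dictionary could then be fed field by field to whatever indexer is in use; the embargo value is kept as data so that the same check can be applied again at search time.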

30 So how do we build news search?
Searching
- Free text with Boolean operators
- Filters for metadata & date ranges
- Combine date and relevance ranking
- Faceted search where appropriate
- Saved searches & alerting
- 'More like this'
- Content restriction & embargo filters
Solution
- Template-based user interface scripts, again using open source libraries
- Beware JavaScript & older browsers!
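
One way to picture "combine date and relevance ranking" is to decay each hit's text relevance score by its age. The sketch below is an assumption for illustration only: the slides do not say how the two signals were combined, and the exponential half-life, the function names and the sample hits are all invented here.

```python
import math
from datetime import datetime


def combined_score(relevance, published, now, half_life_days=7.0):
    """Blend a text relevance score with an exponential recency decay."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    # The score halves every `half_life_days` days
    recency = math.pow(0.5, age_days / half_life_days)
    return relevance * recency


def rank(hits, now, half_life_days=7.0):
    """Sort (doc_id, relevance, published) tuples by the combined score."""
    return sorted(
        hits,
        key=lambda h: combined_score(h[1], h[2], now, half_life_days),
        reverse=True,
    )


now = datetime(2010, 10, 21)
hits = [
    ("old-but-relevant", 10.0, datetime(2010, 9, 1)),
    ("fresh", 4.0, datetime(2010, 10, 20)),
    ("week-old", 6.0, datetime(2010, 10, 14)),
]
ranked = [doc_id for doc_id, _, _ in rank(hits, now)]
```

A short half-life suits breaking news, where yesterday's story should usually outrank last month's; an archive search would want a much longer one, or pure relevance.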

34 So how do we build news search?
Administration
- Indexing failures common
- Logging is essential
- Log to text as a first pass, reports later
Scalability
- Content is always growing
- Both indexing & searching must scale
- Open source search libraries provide distributed indexing, replication, remote indexes
- Not simple to get this right!
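
"Log to text as a first pass" might look like the following sketch, which uses only Python's standard `logging` module. The logger name, the log format and the `index_one` callback are assumptions for the example; the point is that one bad article is recorded and skipped rather than stopping the whole batch.

```python
import logging


def make_index_logger(path):
    """Return a logger that appends plain-text lines to `path`."""
    logger = logging.getLogger("indexer")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def index_batch(articles, index_one, logger):
    """Index each (article_id, text) pair; log failures and carry on."""
    ok, failed = 0, 0
    for article_id, text in articles:
        try:
            index_one(article_id, text)
            ok += 1
        except Exception:
            # logger.exception records the traceback alongside the message
            logger.exception("failed to index %s", article_id)
            failed += 1
    logger.info("batch done: %d indexed, %d failed", ok, failed)
    return ok, failed
```

A text log like this can be grepped immediately and parsed into reports later, which matches the order of priorities on the slide.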

36 So how do we build news search?
Available open source technologies
- Languages – C/C++, Java, Python, JavaScript
- Search libraries – Xapian, Lucene
- Search bindings/servers – Xappy, Flax.core, Solr
- External libraries – pyparsing, CherryPy, xmllib, mxODBC, ...
- Presentation & UI – HTMLTemplate, MochiKit, jQuery, Yahoo! User Interface (YUI), ...
We can use whatever works!

38 Some examples
Newspaper Licensing Agency – NLA Clipshare
- 20 million newspaper stories
- 6500 users
- Content from every major newspaper (and most regionals)
- Used by journalists, clippings agencies, media monitors
- Replacing internal systems at major newspapers
- One of very few ways to search content from all the papers within hours of publication
http://www.nla-clipshare.com

43 Some examples
Financial Times – press cuttings
- Web Service for easy integration
- XML source data
- Faceted search
- Area filters (whole article, body, headline, byline or any combination)
- Synonyms, spelling suggestions
- Built from scratch in a fortnight
- Designed as a prototype, scaled to production use without significant change
http://presscuttings.ft.com
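
Area filters and facets can be illustrated without any search library. The sketch below is not the Financial Times system: the sample articles, the `section` field and the naive substring matching are assumptions chosen to keep the example short. A real engine would match indexed terms rather than substrings.

```python
from collections import Counter

# Invented sample data; `section` stands in for any facetable metadata field.
ARTICLES = [
    {"headline": "Bank profits rise", "byline": "J. Smith",
     "body": "Profits at the bank rose.", "section": "Companies"},
    {"headline": "Markets steady", "byline": "A. Jones",
     "body": "The bank sector led gains.", "section": "Markets"},
    {"headline": "Bank chief resigns", "byline": "J. Smith",
     "body": "The chief executive stepped down.", "section": "Companies"},
]


def search(term, areas=("headline", "body", "byline")):
    """Return articles containing `term` in any of the chosen areas."""
    term = term.lower()
    return [a for a in ARTICLES if any(term in a[area].lower() for area in areas)]


def facet_counts(results, field):
    """Count how many results fall under each value of a metadata field."""
    return Counter(a[field] for a in results)


everywhere = search("bank")
headline_only = search("bank", areas=("headline",))
facets = facet_counts(everywhere, "section")
```

Restricting `areas` narrows the result set, while the facet counts summarise whichever result set is current, so the two features compose naturally in a user interface.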

49 A different task – news monitoring
- Non-traditional use of search
- Many automated searches on incoming content
- Searches reflect complex client needs
- False positives require human checking
- False negatives should never occur!
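
Monitoring turns search around: the queries are stored and each incoming article is tested against all of them. A minimal sketch of that idea follows. The profile structure (`must`, `any`, `exclude`), the client names and the helper functions are hypothetical and far simpler than a real client profile language.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical stored searches: all of `must`, at least one of `any`
# (when non-empty), and none of `exclude`.
PROFILES = {
    "client-a": {"must": {"flax"}, "any": {"search", "xapian"}, "exclude": {"linen"}},
    "client-b": {"must": {"newspaper"}, "any": set(), "exclude": set()},
}


def matches(profile, terms):
    """Apply one stored search to the terms of one article."""
    if not profile["must"] <= terms:
        return False
    if profile["any"] and not (profile["any"] & terms):
        return False
    return not (profile["exclude"] & terms)


def route(article_text, profiles=PROFILES):
    """Return the names of all profiles that the incoming article satisfies."""
    terms = tokenize(article_text)
    return sorted(name for name, p in profiles.items() if matches(p, terms))
```

Looping over every profile for every article is fine for a sketch, but with thousands of profiles and hundreds of thousands of articles a day a production system would need something more selective.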

53 A different task – news monitoring
An example: Durrants Ltd.
- Thousands of client search profiles
- Hundreds of thousands of articles per day
- Complex publication hierarchy
- Established pipeline
Solution
- Flexible query language allows OCR errors, punctuation, fuzzy matching, weighting
- Supports features of previous engine
- Scalable master-slave architecture
- Accuracy improved in some cases from 95% rejected to 95% accepted
- Hardware budget 15% of previous system
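
To show what tolerating OCR errors and punctuation can mean in practice, here is a small sketch using the standard library's `difflib`. It is not the query language built for Durrants; the function name `fuzzy_contains` and the similarity cutoff of 0.8 are assumptions for illustration.

```python
import difflib
import re


def fuzzy_contains(text, term, cutoff=0.8):
    """True if any word in `text` is similar enough to `term`.

    Punctuation is dropped during tokenisation, and `cutoff` is the
    minimum difflib similarity ratio (0.0 to 1.0) counted as a match.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return bool(difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff))
```

For example, `fuzzy_contains("The Financia1 Times reported", "financial")` matches despite the OCR digit. Lowering the cutoff trades more false positives, which need human checking, for fewer false negatives.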

58 Why open source?
- Flexible, extendable
- Powerful & scalable
- Lower cost
- Commercial support available as necessary
- Freedom to innovate

66 Looking to the future
- More and more content including social media
- Multiple delivery platforms
- Search-powered websites & applications
- 'No-SQL'
- Cloud
- Search no longer a bolt-on, but a platform for innovation
- Open source no longer an outsider, but the obvious choice

67 Thank you! Questions?
charlie@flax.co.uk
www.flax.co.uk/blog
Twitter: @FlaxSearch
Photo source: http://www.flickr.com/photos/katerha/4259440136/

