Presentation on theme: "Efficient full-text search in databases Andrew Aksyonoff, Peter Zaitsev Percona Ltd. shodan (at) shodan.ru."— Presentation transcript:
Efficient full-text search in databases Andrew Aksyonoff, Peter Zaitsev Percona Ltd. shodan (at) shodan.ru
www.rit2007.ru Search in databases? Databases are continually growing everyone has got 1M records 10-100M record databases are not that rare 1B+ record databases which require full-text search do exist (most prominent example is Google) Open-source DBMS are widely used We will talk about MySQL The word on the street is that other DBMSes have similar problems Unfortunately, built-in solutions are not good enough for full-text search And especially so, if there is something beyond just full-text search required…
www.rit2007.ru Types of special requirements Just search is a key requirement, but… Amazing, but it happens rather rarely (in DBMS world) Rather a Web-search engine task Additional sorting is frequently required On a value different from relevance – for instance, on product price Additional filtering is frequently required For instance, by product category, or posting author ID Match grouping is frequently required For instance, by date, or by data source (eg. site) ID What do built-in solutions offer?
www.rit2007.ru Built-in MySQL FTS Pro – built-in, updates instantly Con – scales poorly Con – ignores word positions This causes ranking issues This causes phrase search to be slow Con – only 1 FT index per query (columns…) Con – does not interoperate with other indexes I.e. WHERE, ORDER/GROUP BY, LIMIT clauses would be handled separately and manually Conclusion – it is often unacceptable
www.rit2007.ru External engines shootout We tested a number of well-known (to us) open- source solutions Let the vendors advertise commercial solutions themselves MySQL FTS mnoGoSearch, http://mnogosearch.org/http://mnogosearch.org/ Designed for Web, but can do databases too (htdb) Lucene, http://lucene.apache.org/http://lucene.apache.org/ Popular Java full-text search library Sphinx, http://sphinxsearch.com/http://sphinxsearch.com/ Designed for full-text search in databases from day one
www.rit2007.ru ~3.5M records, ~5 GB text (from Wikipedia) mnoGoSearch dropped out of a race more details in EuroOscon2006 talk by Peter Zaitsev MySQLLuceneSphinx Indexing time, min162717684 Index size, MB301163282850 Match all, ms/q2863022 Match phrase, ms/q36922921 Match bool top-20, ms/q242913 Benchmarking results
www.rit2007.ru Existing solutions mnoGoSearch Con – indexing and searching time issues FATAL – did not complete indexing 5 GB in 24 hours Lucene Pro – instant index updates Pro – wildcard, fuzzy searches Con – integration cost (this is Java library) Con – filtering implementation (searching speed) Con – no support for grouping Sphinx Con – monolithic indexes Pro – everything else
www.rit2007.ru Sphinx – overview External solution for database search Two principal programs Indexer, used for re-indexing FT indexes Searchd, search daemon Easy integration Built-in support for MySQL, PostgreSQL Provides APIs for PHP, Python, Perl, Ruby, etc Provides MySQL Storage Engine High speed Indexing speed – 4-10 MB/sec Searching speed– avg 20-30 ms/q @ 5 GB, 3.5M docs
www.rit2007.ru Sphinx – ideology Indexes locally available databases A-la SQL document structure supported from day one Up to 256 full-text fields Any amount of attributes (integer/timestamp/etc) Fast re-indexing instead of slow searching Non-updateable index format – was initially chosen to maximize searching speed But then it turned out – that re-indexing is very fast, too In case of partial updates – we can still use re-indexing partial (delta) indexes once per N minutes
www.rit2007.ru Sphinx – searching Quality Always accounts for word positions, not just frequencies Scalability Up to 50-100 GB per 1 CPU Supports distributed searches Distributed indexes are fully transparent to client application Examples Boardreader.com – 500M+ records, 550+ GB text, 12 CPU cluster Mininova.org – not many records (less than 1M), but 2-3M searches per day
www.rit2007.ru Sphinx – advanced features Sorting On any attribute combination, SQL-like syntax Filtering matches with a condition Performed at earliest possible searching stage – for speed Attributes are always either kept in RAM, or copied multiple times all over the index in required order – for speed Fun fact – sometimes full scan of all matches and filtering those on Sphinx side are times faster than corresponding MySQL SELECT query – and are used in production instead…
www.rit2007.ru Sphinx – advanced features Grouping On any attribute Performed in fixed RAM Performed approximately (!) Performed quite efficiently (compared to MySQL etc) Query words highlighting Special service, which needs document bodies and the query passed to it MySQL Storage Engine Can be used for especially complex queries on MySQL side which can not be run fully on Sphinx side Can be used to simplify integration
www.rit2007.ru Conclusions Large and very large databases require external solutions for full-text search There is a number of requirements to such solutions beyond just searching (filtering, grouping, etc) There is a number of open-source solutions with different degrees of matching these requirements For most tasks, try Sphinx, http://sphinxsearch.com/ http://sphinxsearch.com/
Your consent to our cookies if you continue to use this website.