Empowering EPrints Search with Xapian

1 Empowering EPrints Search with Xapian
Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC
EPrints for Administrators, University of Southampton, 28th September 2011
EPrints Services, Web & Internet Science (WAIS) Research Group, Electronics & Computer Science, University of Southampton 2011.

2 Summary
- Review of EPrints Internal Search
- Indexing
- Searching
- Extras
- TO-DO’s
- Using & contributing
- Demo(s)

3 EPrints “Internal” Search - Overview
[Diagram: relationships between Search, DataSet, List, MetaField, Field and Condition, with multiplicities 1 and 1..n]

4 EPrints “Internal” Search – Overview (2)
- match = “EX” queries the main & auxiliary dataset tables
- match = “IN” queries the __rindex dataset table
- ordering is done via the __ordervalues_$langid dataset table
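The routing described on this slide can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, and it assumes table names are formed by prefixing the dataset id (e.g. `eprint__rindex`), which the slide does not state explicitly. Auxiliary tables are left out.

```python
def tables_for_search(dataset: str, match: str, langid: str = "en") -> dict:
    """Return the tables a condition and its ordering would touch.

    Assumption: tables are named "<dataset>", "<dataset>__rindex" and
    "<dataset>__ordervalues_<langid>". Auxiliary tables are omitted.
    """
    if match == "EX":
        # Exact matches go to the main dataset table.
        condition_table = dataset
    elif match == "IN":
        # Word-index matches go to the reverse index table.
        condition_table = f"{dataset}__rindex"
    else:
        raise ValueError(f"unsupported match type: {match!r}")
    return {
        "condition": condition_table,
        "ordering": f"{dataset}__ordervalues_{langid}",
    }
```

For example, `tables_for_search("eprint", "IN")` names the reverse index table for the condition and the English order-values table for sorting.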

5 EPrints “Internal” Search – Downsides
- Simple search is not scalable
- Lots of derived data in the DB (backup?)
- No relevance matching, so good matches do not surface
- No advanced features: suggestions, facets, boolean operators etc.
- Home-brewed: the code is hard to maintain and hard to extend
- Difficult to debug…

6 EPrints Xapian Search
- Introduced in 3.3
- Only integrated with the simple search
- Little flexibility in controlling what is indexed
- Advanced features “not really” enabled
- Searches every field (“text_index” is not respected)
- But the idea is good & worth building upon

7 Indexing
- Attempts to re-use EPrints’ default configuration:
  - the datasets’ field definitions (+ “text_index”)
  - the fields defined in the simple search (un-prefixed terms)
- But needs its own bits to define:
  - default indexing methods (by MetaField type)
  - facet-able indexes
  - order-able indexes
- May be used to declare derived indexes. Examples:
  - “open_access”: to filter references from open full-text documents
  - “year”: to filter by year of publication (rather than by date)
  - “image_orientation”: if you had an archive of images, you could extract the orientation via EXIF
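A derived index computes a value from the record rather than copying a field. The sketch below shows the idea for “year” and “open_access”. The record layout (a `date` string starting with the year, and documents carrying a `security` value of `"public"`) and all function names are assumptions made for illustration; this is not the plugin's configuration format.

```python
def derive_year(record: dict):
    """Year of publication, taken from the first four characters of the date."""
    value = record.get("date")
    return value[:4] if value else None


def derive_open_access(record: dict) -> bool:
    """True if at least one attached document is publicly readable."""
    return any(doc.get("security") == "public"
               for doc in record.get("documents", []))


# Hypothetical registry: derived index name -> function computing its value.
DERIVED_INDEXES = {
    "year": derive_year,
    "open_access": derive_open_access,
}


def derived_values(record: dict) -> dict:
    """Compute every derived index for one record."""
    return {name: fn(record) for name, fn in DERIVED_INDEXES.items()}
```

Keeping the derivations in a registry means a new index such as “image_orientation” would only need one more entry, followed by a re-index.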

8 Indexing - Classes
[Diagram: Xapian::Index, Config and the Xapian DB; IndexMethod (Fulltext, Name, etc.) and OrderMethod (Alpha., Name, etc.)]

9 Indexing – Extra information
- Indexes are prefixed by “_”, e.g. “_title”, so that the user query can be sanitised; otherwise users could run prefixed searches (and search fields they are not necessarily allowed to)
- Z notation: indicates a stemmed value or index: Z_title, Zhappi (internal Xapian convention)
- A script is available to re-process the Xapian indexes (similar to “epadmin reindex”, but it doesn’t re-index the EPrints internal indexes)
- Reserved indexes:
  - _id: keeps the internal id of the data-obj (/id/eprint/123)
  - _dataset: the dataset the record belongs to (‘eprint’, ‘user’…)
  - _configuration_md5: keeps an MD5 of the configuration the item was indexed against (useful?)
  - _index_timestamp: when the item was last indexed
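Two of the points above lend themselves to a short sketch: stripping the “_” prefix from user input, and building the reserved values stored with each record. Everything here is illustrative. The function names, the regular expression and the JSON-based hashing of the configuration are assumptions, not the plugin's actual code.

```python
import hashlib
import json
import re
import time

# A run of underscores that starts a "prefix:" construct, i.e. is not
# preceded by a letter or digit and is followed by "word:".
_INTERNAL_PREFIX = re.compile(r"(?<![A-Za-z0-9])_+(?=\w+:)")


def sanitise_user_query(query: str) -> str:
    """Remove leading underscores from prefixes typed by the user,
    so the query cannot address internal indexes such as "_title"."""
    return _INTERNAL_PREFIX.sub("", query)


def reserved_fields(dataset: str, objid: int, config: dict, now=None) -> dict:
    """Build the reserved index values for one data-obj.

    Assumption: the configuration is hashed from a key-sorted JSON dump,
    so that equal configurations always give the same MD5.
    """
    digest = hashlib.md5(
        json.dumps(config, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {
        "_id": f"/id/{dataset}/{objid}",
        "_dataset": dataset,
        "_configuration_md5": digest,
        "_index_timestamp": int(time.time() if now is None else now),
    }
```

Comparing a stored `_configuration_md5` with the hash of the current configuration would be one way to spot records indexed against an older configuration.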

10 Searching
- Again, attempts to re-use EPrints’ configuration:
  - simple search (mostly for ordering methods)
  - advanced/staff search: which fields to use (prefixed terms)
- Extra bits can be configured, such as which facets can be used on each search (simple, advanced, …)
- Only indexed data can be searched:
  - you cannot use a facet which has not been generated
  - you need to re-index your data if you change the simple search definition
  - the same applies if you add new order-able fields
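The constraint that only generated facets can be used can be sketched as a check of each search's facet list against what the index actually contains. The facet names and both data structures below are hypothetical and only stand in for the real configuration.

```python
# Hypothetical: facets that were generated at indexing time.
INDEXED_FACETS = {"year", "type"}

# Hypothetical: facets configured for each search.
SEARCH_FACETS = {
    "simple": ["year", "type"],
    "advanced": ["year", "subjects"],
}


def usable_facets(search_id: str) -> list:
    """Facets configured for a search that exist in the index."""
    return [f for f in SEARCH_FACETS.get(search_id, []) if f in INDEXED_FACETS]


def missing_facets(search_id: str) -> list:
    """Facets configured for a search that would require a re-index."""
    return [f for f in SEARCH_FACETS.get(search_id, []) if f not in INDEXED_FACETS]
```

Here `missing_facets("advanced")` reports `subjects`, signalling that the data must be re-indexed before that facet can be offered.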

11 Searching (2)
- Abstracted by Plugin::Search (original implementation)
- Tricky to make it work with EPrints’ UI because it expects an EPrints::Search object
- Plugin::Search::Internal is a wrapped EPrints::Search object (hack), so Plugin::Search::Xapian must emulate this behaviour

12 Searching – Classes & Op. Stack
[Diagram: /cgi/xapian, Search::XapianSearch, Paginate::Facets, Xapian::Facets, Plugin::Search::Xapian, Xapian DB]

13 Searching – Extra information
- May be used in a script
- Exports & feeds work
- Can be serialised/de-serialised (including facets), so it should work for Saved Searches (to test)

14 Extras
- “Related Items”
- Jiadi has developed a Bootstrap-based Pagination module:
  - more sexy
  - supports alternative “views” of the search results

15 TO-DO’s
- Range searching: possible in Xapian but not yet implemented (e.g )
- Some refactoring:
  - Xapian::Index -> Xapian::Indexer
  - Plugin::Search::Xapianv2 => Plugin::Search::Xapian (and replace the default EPrints’ Xapian implementation)
- Test with real life data (done to a certain extent...)
- Load & scalability testing (+ number of slots etc.)
- Multi-lang considerations (and related IndexMethod)

16 TO-DO’s – Would be nice
- Page displaying how a data-obj has been indexed:
  - prefixes
  - terms
  - facets & order-able fields
- Status page (cf. “Admin > Status”):
  - DB size
  - number of Documents
  - indexed datasets (and how)
- Weighting: supported (via conf.) but un-tested in real life

17 Internal Search vs Xapian Search
- Xapian is more of a user search
- The internal search is still required to:
  - get records from the database ($dataset->search()); this affects screens such as “Manage Deposits”, “Review” etc., which cannot wait for items to be indexed (direct DB calls)
  - possibly apply ACLs (if some items cannot be searched): it is safer to use the (MySQL) DB as the authority
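Using the database as the authority for access control can be sketched as a post-filter on the ranked hits. The function name and the idea of passing in a list of ids the DB reports as visible are assumptions for illustration; how visibility is actually queried is outside this sketch.

```python
def filter_by_acl(hit_ids, visible_in_db):
    """Keep only the hits the database reports as visible to the user.

    hit_ids: ids returned by the search index, best match first.
    visible_in_db: ids the database says this user may see.
    The relevance order of the surviving hits is preserved.
    """
    visible = set(visible_in_db)
    return [objid for objid in hit_ids if objid in visible]
```

Because the index may lag behind the database, filtering this way means a record that became restricted after it was indexed still does not appear in results.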

18 Debugging Xapian
- Plugin::Search::Xapian may be set to debug mode: this shows the processing and query building
- Xapian comes with an analysis tool, “delve”, to:
  - view the content of the Xapian DB or of selected Documents
  - see if a term exists in the DB (and in which Documents)
  - get other info (term frequency etc.)
- Knowing what Xapian is searching and how a data-obj is indexed is key to debugging most search-related issues
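The delve checks listed above map onto a handful of command-line flags. The helper below only builds the argument list and does not run anything. The flags (`-t` for a term's posting list, `-r` for a record's term list, `-d` for document data, `-v` for extra statistics) are given as I understand the tool's usage text, so confirm them with `delve --help` on your installation. Newer Xapian releases install the tool as `xapian-delve`, and the database path in the usage note is hypothetical.

```python
def delve_command(db_path, term=None, recno=None,
                  show_data=False, verbose=False, tool="delve"):
    """Build an argv list for Xapian's delve inspection tool."""
    argv = [tool]
    if term is not None:
        argv += ["-t", term]        # which Documents contain this term?
    if recno is not None:
        argv += ["-r", str(recno)]  # which terms does this Document hold?
    if show_data:
        argv.append("-d")           # also print the stored document data
    if verbose:
        argv.append("-v")           # extra info such as term frequencies
    argv.append(db_path)
    return argv
```

For example, `delve_command("/path/to/xapian-db", term="Zhappi")` yields the arguments needed to list the Documents containing the stemmed term, and the result can be passed to `subprocess.run` when needed.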

19 Using & Contributing
- Not quite at release stage, but it is currently isolated so it shouldn’t break your IR
- All the code is on GitHub:

20 Demos
- Simple search / facets / export / order
- Simple search with boolean operators, suggestions
- Advanced search / facets / export / order
- Related items (more data + cached citations)

21 Q&A & what’s next
- Let’s have a play?
- Code overview?
- Doc?

