
1 Apache Lucene in LexGrid

2 Lucene Overview
– High-performance, full-featured text search engine library.
– Written entirely in Java.
– An open source project available for free download: http://lucene.apache.org/

3 Lucene structure overview
– Index: contains a sequence of documents.
– Document: a sequence of fields.
– Field: a named sequence of terms.
– Term: a string.
 – The same string can be assigned to different fields.
 – Lucene indexes only text (Strings).

4 Index or Store Fields
Index:
– Used for searching.
– Stores statistics about terms in order to make term-based search more efficient.
– Inverted index: for a given term, it can list all the documents that contain it.
Store:
– Not used for searching.
– Helpful for debugging.
– The term is stored in the index literally.
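The inverted-index idea above can be sketched in plain Java. This is a toy illustration, not Lucene's actual data structures; the class and method names are invented for this sketch:

```java
import java.util.*;

// Minimal inverted index: maps each term to the set of document ids
// that contain it, so "which documents contain term X?" is one lookup.
public class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> documentsContaining(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "heart attack");
        idx.addDocument(2, "heart murmur");
        idx.addDocument(3, "bone fracture");
        System.out.println(idx.documentsContaining("heart")); // [1, 2]
    }
}
```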

5 Analyzer
– Responsible for breaking the text in each of the document fields into individual tokens.
– Tokens are the smallest pieces of information that you can search.
– You can use a different analyzer for each field so that fields can be treated differently.
– However, at search time your search analyzer must match your indexing analyzer in order to get good results.
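Why the search analyzer must match the indexing analyzer can be shown with a toy lower-casing "analyzer" (plain Java, invented for this sketch; real Lucene analyzers produce token streams). If indexing lower-cases tokens but the query is not run through the same analyzer, the lookup misses even though the text matches to a human reader:

```java
import java.util.*;

// Toy "analyzer": lower-case and split on whitespace.
public class AnalyzerMismatchDemo {
    static Set<String> analyze(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    public static void main(String[] args) {
        Set<String> indexedTokens = analyze("Myocardial Infarction");
        // Raw (unanalyzed) query term: not found, case differs.
        System.out.println(indexedTokens.contains("Myocardial")); // false
        // Query term run through the same analyzer: found.
        System.out.println(indexedTokens.containsAll(analyze("Myocardial"))); // true
    }
}
```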

6 The Mapping
Our indexer code reads LexGrid data from the database. The reader code assembles the concept information so that it can call this method:

protected void addConcept(String codingSchemeName, String codingSchemeId,
        String conceptCode, String propertyType, String property,
        String propertyValue, Boolean isActive, String presentationFormat,
        String language, Boolean isPreferred, String conceptStatus,
        String propertyId, String degreeOfFidelity, Boolean matchIfNoContext,
        String representationalForm, String[] sources,
        String[] usageContexts, Qualifier[] qualifiers)

Every time this method is called, it creates a Lucene document out of this information.

7 The Mapping (Cont..)
– The above method is called for every Presentation, Property, Definition, etc. in a concept code.
– This is all of the information from LexGrid that is currently stored in the index.
– When the Boolean parameters are indexed, they are stored as a 'T' or an 'F', if supplied.
– When constructing a field, we have to decide whether it will be analyzed, stored, and indexed.
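As a rough sketch of this mapping step, one call can be pictured as producing one bag of named fields. A plain Map stands in for a Lucene Document here; the helper method is invented for illustration, with field names following the slides and Booleans stored as 'T'/'F' as described above:

```java
import java.util.*;

// Sketch: each addConcept(...) call yields one "document",
// i.e. a set of named string fields (only a few parameters shown).
public class ConceptDocumentSketch {
    public static Map<String, String> toDocument(String codingSchemeName,
                                                 String conceptCode,
                                                 String propertyValue,
                                                 Boolean isActive) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("codingSchemeName", codingSchemeName);
        doc.put("conceptCode", conceptCode);
        if (propertyValue != null) doc.put("propertyValue", propertyValue);
        // Booleans are stored as 'T' or 'F', only if supplied.
        if (isActive != null) doc.put("isActive", isActive ? "T" : "F");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(toDocument("NCI_Thesaurus", "C1234", "Heart", true));
    }
}
```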

8 The Mapping (Cont..)
Here is the breakdown of the Lucene fields that we create:
– codingSchemeName -> “codingSchemeName” S
– codingSchemeId -> “codingSchemeId” S
– conceptCode -> “conceptCodeTokenized” T
– conceptCode -> “conceptCode” S
– conceptCode -> “conceptCodeLC” LC
– property -> “property” S
– language -> “language” S
– propertyType is special: if it is not supplied, it is automatically set to textualPresentation, definition, comment, instruction, or property, depending on the value of the property variable.
– propertyType -> “propertyType” S
The following fields are optional; they are only added if the provided values are non-null:
– propertyValue -> “propertyValue” ST
– propertyValue -> (lowercased) -> “untokenizedLCPropertyValue” LC
– *If normalization enabled* propertyValue -> “norm_propertyValue” T
– *If doubleMetaphone enabled* propertyValue -> “dm_propertyValue” T
– *If stemming enabled* propertyValue -> “stem_propertyValue” T
– isActive -> “inactive” S
– isPreferred -> “isPreferred” S
– presentationFormat -> “presentationFormat” S
– conceptStatus -> “conceptStatus” S
– propertyId -> “propertyId” S
– degreeOfFidelity -> “degreeOfFidelity” S
– representationalForm -> “representationalForm” S
– matchIfNoContext -> “matchIfNoContext” S
– sources -> “sources” S T*
– usageContexts -> “usageContexts” S T*
– qualifiers -> “qualifiers” S T**

9 The Mapping (Cont..)
Field “fields”:
– Added to each document.
– A list of the fields present in this document.
– Helps when searching for documents that contain (or don't contain) a particular field.
Field “UNIQUE_DOCUMENT_IDENTIFIER_FIELD”:
– Added to each document.
– Populated by the “codingSchemeName” plus a hyphen and a document counter value.
– Makes it easier to remove documents.

10 The Mapping (Cont..) Analyzers / Tokenizers
WhiteSpaceLowerCaseAnalyzer
– The default analyzer.
– Makes text lower case.
– Splits the text into tokens on white space and the following characters: '-', ';', '(', ')', '{', '}', '[', ']', '|'
– Removes the following characters: ',', '.', '/', '\\', '`', '\'', '"', '+', '*', '=', '@', '#', '$', '%', '^', '&', '?', '!'
– Used on the “conceptCode” and “propertyValue” fields.
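The rules above can be sketched as a small plain-Java tokenizer. This is an illustration of the described behavior, not the actual WhiteSpaceLowerCaseAnalyzer source:

```java
import java.util.*;

// Sketch of the described rules: lower-case, drop the "removed"
// characters, treat the separator characters like whitespace, split.
public class WhiteSpaceLowerCaseSketch {
    private static final String REMOVED  = ",./\\`'\"+*=@#$%^&?!";
    private static final String SPLIT_ON = "-;(){}[]|";

    public static List<String> tokenize(String text) {
        StringBuilder cleaned = new StringBuilder();
        for (char c : text.toLowerCase().toCharArray()) {
            if (REMOVED.indexOf(c) >= 0) continue;              // drop removed chars
            cleaned.append(SPLIT_ON.indexOf(c) >= 0 ? ' ' : c); // separators -> spaces
        }
        List<String> tokens = new ArrayList<>();
        for (String t : cleaned.toString().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Heart-Attack (Myocardial Infarction)!"));
        // [heart, attack, myocardial, infarction]
    }
}
```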

11 The Mapping (Cont..) Analyzers / Tokenizers
NormAnalyzer
– Used when normalization is enabled.
– Uses the WhiteSpaceLowerCaseAnalyzer and LVG Norm. For example, if the string “trees” is fed into the analyzer, Lucene will end up indexing “tree”.
– Used on the “norm_propertyValue” field.
EncoderAnalyzer
– Used when Double Metaphone indexing is enabled.
– Uses the WhiteSpaceLowerCaseAnalyzer and the Apache Commons Codec Double Metaphone algorithm.

12 Index Usage in LexBIG
Restrictions on CodedNodeSets are turned into Lucene queries.
Supported queries:
– LuceneQuery
– DoubleMetaphoneLuceneQuery
– StemmedLuceneQuery
– StartsWith
– ExactMatch
– Contains
– RegExp

13 Index Usage in LexBIG (Cont..)
Simple Queries:
– Queries are constructed using a default field, a term value, and an Analyzer, based on the user-specified query.
– For example, if you specify 'activeOnly', we add a section to the query that requires the “isActive” field to have a value of 'T'.
– Nearly all of the untokenized fields are handled this way.
– “startsWith” and “exactMatch” queries are also handled this way.
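A minimal sketch of what such a required clause looks like, rendered in Lucene's query-string syntax (a `+` prefix marks a required clause). The helper method is invented for illustration; field names follow the mapping slides:

```java
// Sketch: each simple restriction becomes one required clause
// on an untokenized field, in Lucene query-string syntax.
public class SimpleQuerySketch {
    public static String requiredClause(String field, String value) {
        return "+" + field + ":" + value;
    }

    public static void main(String[] args) {
        // An 'activeOnly' restriction combined with an exact concept-code match.
        String query = requiredClause("isActive", "T") + " "
                     + requiredClause("conceptCode", "C1234");
        System.out.println(query); // +isActive:T +conceptCode:C1234
    }
}
```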

14 Index Usage in LexBIG (Cont..)
Complex Queries:
– User queries containing boolean logic, embedded wild cards, etc.
– Rely on the Lucene QueryParser, providing the appropriate Analyzer and field depending on the type of search algorithm selected.
– For example, for a normalized search, we feed the matchText into a QueryParser with the NormAnalyzer and the “norm_propertyValue” field set as the default field. For the “LuceneQuery” match algorithm, we provide the WhiteSpaceLowerCaseAnalyzer and the “propertyValue” field.
– Wild card and fuzzy searches are supported.

15 Index Usage in LexBIG (Cont..)
Result from Lucene:
– A BitSet (an array of bits, either 1 or 0) with one bit per Lucene document. Each bit is set to 1 if the document matched the query, or 0 if it did not.
– We take advantage of the boundary documents.
– Combining the boundary bitSet with the user query bitSet gives all of the matching unique concept code data.
– Additional restrictions on a CodedNodeSet are resolved to a bitSet in the same way as above, and then the bitSets are AND'ed together.
– The bitSet is resolved into a CodeHolder object (containing only the conceptCode, codingScheme, and version, plus the score, if requested) which can be used for 'union', 'intersection' and 'difference'.
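The AND'ing of per-restriction bit sets described above can be shown directly with `java.util.BitSet` (the helper method is invented for this sketch):

```java
import java.util.BitSet;

// Each restriction resolves to a BitSet with one bit per Lucene
// document; AND'ing them keeps only documents matching every restriction.
public class BitSetCombineDemo {
    public static BitSet intersect(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone(); // don't mutate the inputs
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet matchesQuery = new BitSet();
        matchesQuery.set(1); matchesQuery.set(3); matchesQuery.set(5);

        BitSet matchesRestriction = new BitSet();
        matchesRestriction.set(3); matchesRestriction.set(4); matchesRestriction.set(5);

        System.out.println(intersect(matchesQuery, matchesRestriction)); // {3, 5}
    }
}
```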

16 Index Usage in LexBIG (Cont..)
– If the entire result is to be returned at once: each of the items in the CodeHolder is resolved into a ResolvedConceptReference through a series of SQL calls.
– If an iterator is requested: an Iterator object is created which holds the CodeHolder. Individual ResolvedConceptReferences are resolved from the SQL server as needed.

17 Questions ??

