Oracle Database 11g New Search Features and Roadmap

Roger Ford, Senior Principal Product Manager

Contents
- Oracle’s Search Products
- Oracle Text 11g New Features
- Oracle Text New Features
  - Entity Extraction
  - Name Search
  - Result Set Interface
- Search Product Roadmap
  - Oracle Text
  - Secure Enterprise Search

Oracle’s Search Products
Oracle Text
- A SQL and PL/SQL based toolkit for creating full-text search applications
- Free with all database versions
- Previously known as Context Option, interMedia Text
Secure Enterprise Search
- A complete search product based on Oracle Text capabilities
- Crawlers for data sources such as web, document repositories, databases
- End-user query application and APIs for embedding

Oracle Text 11g New Features
Composite Domain Indexes and SDATA sections
- Allows storage of structured info (e.g. numbers, dates) within the text index
- Makes for much faster “mixed” queries
Auto Lexer
- Automatic language recognition
- Segmentation and stemming for 32 languages
- Context-sensitive stemming for 23 of these languages
Off-line and time-limited index creation
- Enables rebuild of indexes offline in quiet periods for true 24x7 operation

Demo: Auto Lexer

11.2.0.2 New Features - Summary
Entity Extraction
- Find “entities” such as people, countries, cities, states, zip codes, phone numbers etc. in the text
- Use the default dictionary and rules, or define your own dictionary and rules based on regular expressions
Name Search (NDATA sections)
- Inexact searches; copes with misspellings, segmentation errors, contractions and word reversal
- Useful for many searches, but particularly good for names
ResultSet Interface
- Query request in XML and results returned as XML
- Avoids the SQL layer and the requirement to work within “SELECT” semantics

Entity Extraction
- Identify names, places, dates, times, etc.
- Tag each occurrence with type and subtype
- Entities are defined by DICTIONARY and RULES
- Implemented by the CTX_ENTITY package:
  - create_extract_policy – create a policy to which you can add extract rules; choose whether or not to use the built-in rules and dictionary
  - add_extract_rule – create an XML-based rule to define an entity
  - add_stop_entity – prevent defined entities from being used
  - compile – build the policy with its rules
  - extract – get an XML-based list of entities for a doc
- Can also use ctxload to load a user dictionary

Demo: Entity Extraction

Entities: built-in types
building, city, company, country, currency, date, day, email_address,
geo_political, holiday, location_other, month, non_profit,
organization_other, percent, person_jobtitle, person_name, person_other,
phone_number, postal_address, product, region, ssn, state, time_duration,
tod, url, zip_code

Entity Extraction – Example 1: Defaults
    ctx_entity.create_extract_policy('mypolicy');
    ctx_entity.compile('mypolicy');
    ctx_entity.extract('mypolicy', mydoc, mylang, myresults);

Output in "myresults":
    <entities>
      <entity id="0" offset="75" length="8" source="SuppliedDictionary">
        <text>New York</text>
        <type>city</type>
      </entity>
      <entity id="1" offset="55" length="16" source="SuppliedRule">
        <text>Hupplewhite Inc.</text>
        <type>company</type>
      </entity>
    </entities>
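The XML returned in "myresults" is easy to post-process on the client. The Python sketch below is illustrative only and not part of Oracle Text: it assumes the result has already been fetched as a string shaped like the output above, and the sample string and the helper name parse_entities are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the extract output shown on the slide.
SAMPLE = """<entities>
  <entity id="0" offset="75" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="55" length="16" source="SuppliedRule">
    <text>Hupplewhite Inc.</text>
    <type>company</type>
  </entity>
</entities>"""


def parse_entities(xml_text):
    """Return one dict per <entity> element, sorted by document offset."""
    root = ET.fromstring(xml_text)
    entities = []
    for node in root.findall("entity"):
        entities.append({
            "text": node.findtext("text"),
            "type": node.findtext("type"),
            "offset": int(node.get("offset")),
            "length": int(node.get("length")),
            "source": node.get("source"),
        })
    return sorted(entities, key=lambda e: e["offset"])


if __name__ == "__main__":
    for entity in parse_entities(SAMPLE):
        print(entity["offset"], entity["type"], entity["text"])
```

Sorting by offset puts the entities in the order they appear in the document, which is usually what a tagging or highlighting step needs.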

Entity Extraction – Example 2: User rule
    ctx_entity.create_extract_policy('mypolicy');
    ctx_entity.add_extract_rule('mypolicy', 5,
      '<rule>
         <expression>((North|South)? America)</expression>
         <type refid="1">xContinent</type>
       </rule>');
    ctx_entity.compile('mypolicy');
    ctx_entity.extract('mypolicy', mydoc, mylang, myresults);

Notes:
- Note the parentheses around the expression. refid="1" means take the first parenthesized expression, so "North America" or just "America".
- User-defined types must be prefixed with an "x", hence "xContinent".

Output:
    <entities>
      <entity id="0" offset="75" length="13" source="UserRule">
        <text>North America</text>
        <type>xContinent</type>
      </entity>
    </entities>
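To see how refid maps onto parenthesized groups, the same expression can be tried in any regular expression engine. The sketch below uses Python's re module purely to illustrate group numbering; Oracle Text's rule engine may differ in detail, and the helper name extract_group and the sample sentence are hypothetical.

```python
import re

# The expression from the user rule above; group 1 is the outer parenthesis.
RULE_EXPRESSION = r"((North|South)? America)"


def extract_group(text, refid=1):
    """Return the text captured by group number `refid`, or None if no match."""
    match = re.search(RULE_EXPRESSION, text)
    return match.group(refid) if match else None


if __name__ == "__main__":
    sentence = "Sales in North America grew."
    print(extract_group(sentence, 1))  # the whole parenthesized expression
    print(extract_group(sentence, 2))  # only the inner (North|South) group
```

Group 1 spans the whole match, which is why the rule wraps the full expression in an extra pair of parentheses.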

Entity Extraction – Adding a user dictionary
Create file ud.xml:
    <dictionary>
      <entities>
        <entity>
          <value>Dow Jones Industrial Average</value>
          <type>xIndex</type>
        </entity>
        <entity>
          <value>S&P 500</value>
          <type>xIndex</type>
        </entity>
      </entities>
    </dictionary>

Create the policy with CTXLOAD (rules can be added later):
    ctxload -user scott/tiger -extract -name pol1 -file ud.xml

Compile the policy:
    ctx_entity.compile('pol1');

Results:
    <entity id="69" offset="1010" length="7" source="UserDictionary">
      <text>S&P 500</text>
      <type>xIndex</type>
    </entity>

Entity Extraction – other stuff
Extracting only certain entity types:
    ctx_entity.extract('p1', mydoc, null, myresults, 'city,company,xContinent');

Name Search
Searching names has many difficulties:
- Spelling (steven = stephen)
- Alternate names (fred = alfred, chuck = charles)
- Transcription (copying from spoken to written form)
- Transliteration (copying from one writing system to another)
- Segmentation (Mary Jane, Maryjane)
- First, middle, and last name classification
Name search does intelligent matching across all these issues

Demo: Name Search

NDATA section type
Basic implementation for name search
Limitations:
- 511 characters
- 255 whitespace-delimited terms
- No offset information, therefore no:
  - Highlighting / Markup
  - NEAR or phrase search with NDATA
Uses WORDLIST preference attributes:
- NDATA_ALTERNATE_SPELLING
- NDATA_BASE_LETTER
- NDATA_THESAURUS (for alternate names; default thesaurus provided)
- NDATA_JOIN_PARTICLES (list such as 'de:du:mc:mac')
Query Syntax:
    NDATA(fieldname, search terms [, order [, proximity ] ] )
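An application that accepts a name from the user has to assemble the NDATA clause before passing it to the query. The Python sketch below shows one way to do that, following the bracket nesting in the syntax above. The helper name build_ndata_query and the section name "fullname" are hypothetical, and because the slides do not list the valid order and proximity values, the helper passes them through unchanged.

```python
def build_ndata_query(fieldname, terms, order=None, proximity=None):
    """Build an NDATA(...) query string following the syntax on the slide.

    `order` and `proximity` are passed through as given; their valid values
    are not listed in the slides, so none are assumed here.
    """
    if proximity is not None and order is None:
        # The syntax nests proximity inside order: [, order [, proximity ] ]
        raise ValueError("proximity can only be given together with order")
    parts = [fieldname, terms]
    if order is not None:
        parts.append(order)
    if proximity is not None:
        parts.append(proximity)
    return "NDATA(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    # "fullname" is a hypothetical NDATA section name.
    print(build_ndata_query("fullname", "mary jane"))
```

Real code should bind or sanitize user-supplied terms before building the query text; this sketch only shows how the clause is put together.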

Result Set Interface
Some queries are difficult to express in SQL, e.g. "Give me the top 5 hits in each category"
- Result set interface uses a simple text query and an XML result set descriptor
- Hitlist is returned in XML according to the result set descriptor
- Uses SDATA sections for:
  - Grouping
  - Counting

Result Set Example Query
    ctx_query.result_set('docidx', 'oracle', '
      <ctx_result_set_descriptor>
        <count/>
        <hitlist start_hit_num="1" end_hit_num="2" order="pubDate desc, score desc">
          <score/>
          <rowid/>
          <sdata name="author"/>
          <sdata name="pubDate"/>
        </hitlist>
        <group sdata="pubDate">
          <count/>
        </group>
        <group sdata="author">
          <count/>
        </group>
      </ctx_result_set_descriptor>
    ', rs);

Result Set Output
    <ctx_result_set>
      <hitlist>
        <hit>
          <score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
          <sdata name="AUTHOR">John</sdata>
          <sdata name="PUBDATE"> :00:00</sdata>
        </hit>
        <hit>
          <score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
        </hit>
      </hitlist>
      <count>100</count>

Result Set Output - Continued
      <groups sdata="PUBDATE">
        <group value=" :00:00"><count>25</count></group>
        <group value=" :00:00"><count>50</count></group>
        <group value=" :00:00"><count>25</count></group>
      </groups>
      <groups sdata="AUTHOR">
        <group value="John"><count>50</count></group>
        <group value="Mike"><count>25</count></group>
        <group value="Steve"><count>25</count></group>
      </groups>
    </ctx_result_set>
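Because the result set comes back as XML, a client can turn the group counts into a lookup structure with a standard XML parser. The Python sketch below is illustrative only: it assumes the result set has been fetched as a well-formed string, and the sample (author groups only) and the helper name group_counts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the AUTHOR groups in the output above.
SAMPLE = """<ctx_result_set>
  <count>100</count>
  <groups sdata="AUTHOR">
    <group value="John"><count>50</count></group>
    <group value="Mike"><count>25</count></group>
    <group value="Steve"><count>25</count></group>
  </groups>
</ctx_result_set>"""


def group_counts(xml_text):
    """Map each SDATA section name to a {group value: hit count} dict."""
    root = ET.fromstring(xml_text)
    result = {}
    for groups in root.findall("groups"):
        counts = {}
        for group in groups.findall("group"):
            counts[group.get("value")] = int(group.findtext("count"))
        result[groups.get("sdata")] = counts
    return result


if __name__ == "__main__":
    print(group_counts(SAMPLE))
```

A structure like this is enough to drive a simple navigation panel showing a hit count per author or date.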

Preview

Roadmap – merging Text and SES
Oracle Text
- Full control, full featured, fine-grained
- Index options, data storage options, lexer options, stoplists
- Use existing database
- RAC, Exadata
Secure Enterprise Search
- Built in database and mid-tier
- Crawlers for many sources
- Simple query interface
- End user GUI / API
- Embedded security

Coming Search Features
- Natural Language Processing enhancements
  - Ontology based classification
  - Question answering
- Automatic partitioning
- Query load balancing
- Full support for faceted navigation (MVDATA sections)
- Functional completeness for Result Set Interface
- Result Iterator – streaming support
- Parallel Query
- Replication support
  - Golden Gate / Logical Standby / Streams
- Operator improvements
  - NEAR2 – best query in one operator
  - MNOT – mild not, e.g. YORK mnot NEW YORK
  - Nested near
- Substring index and query performance improvements

Coming Search Features - Continued
- Multiple enhancements to query performance
- BIGIO leverages Secure Files CLOBs
- Automatic optimization of indexes with “stage index”
- Two level index – keep common search terms in memory
- Partition maintenance without reindexing
- Off-load filtering from database server
- Section specific index options
  - Choose different options, e.g. language, stopwords, PRINTJOINS for each section
- Regular expression based stopwords
- Forward Index
  - Hugely improved performance for highlighting, snippets
- PDF “Native” Highlighting
- Unlimited SDATA, MDATA and Field Sections

The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
