Oracle Text saves your time Oracle Text Search saves your time Anna Suwalska European Organization for Nuclear Research - Geneva OracleWorld Paris 2003.

1 Oracle Text saves your time Oracle Text Search saves your time Anna Suwalska European Organization for Nuclear Research - Geneva OracleWorld Paris 2003

2 Content
  CERN
  Engineering Data Management System at CERN
  Oracle Text
  How we profit from this technology
  Conclusion

3 Content: CERN

4 CERN - European Organization for Nuclear Research
  The world's largest particle physics research laboratory
  Founded in 1954, CERN today has 20 member states
  2400 staff
  Over 6500 scientists come here to use the research facilities
  500 universities, over 80 nationalities
  CERN explores what matter is made of and what forces hold it together
  The WWW was born here

5 LHC - The Large Hadron Collider Project

6 LHC - Cryodipole

7 Content: EDMS - Engineering Data Management System

8 EDMS - Engineering Data Management System
  Architecture diagram, top to bottom:
    EDMS Portal
    EDMS Common layer
    Axalant, MP5, Other DBs
  Functions covered: Design Data; Documents and Drawings; Asset tracking; Work management

15 EDMS - Engineering Data Management System
  Managing:
    Structures
    Complete life cycle for single and compound documents
    Document versioning
    Document approval processes (comments collector)
    Assemblies
    Equipment workflow, data
    Installation (jobs, locations, etc.)

16 EDMS mandate
  Manage a full description of the LHC project's engineering data over its lifetime (>25 years)
  Support and coordinate engineering work / information / data workflow
  Establish knowledge transfer: evolving staff, many short-term visitors
  A full description of the machine and its components through their life cycle must be constantly available to all concerned parties
  Help trace solutions to all problems occurring in the machine
  Provide an efficient search tool to support the requirements above - our choice: Oracle Text
  Life-cycle phases (diagram): Design, Installation, Operation, Dismantling

23 Our needs / Oracle Text - our choice
  Our needs:
    Index metadata and files: the first-line search is done on metadata, but the possibility to index files is essential
    Bilingual: the official CERN languages are English and French, and we have to support both
    Performance: response time is very important
    Simplicity: simple for users, simple to develop, simple to maintain
  Oracle Text - our choice:
    Oracle Text supports most document formats
    Oracle Text supports 39 languages
    Very efficient for searches within big collections of data
    Results come with a scoring methodology that helps to navigate through a result set
    Standard SQL statements
    Easy to maintain with ALTER INDEX or the CTX_DDL package
    Oracle Text comes as an option in the RDBMS - no additional costs

24 Content: Oracle Text

25 Oracle Text
  Enables the building of a text query application and a document classification application
  Takes care of:
    indexing
    searching: word and theme
    viewing text
  Uses standard SQL

26 CONTEXT Index Creation
  CREATE INDEX index_name ON table_name(column_name)
    INDEXTYPE IS CTXSYS.CONTEXT
    PARAMETERS('parameters string');
  The parameters string may contain:
    [datastore datastore_pref]
    [filter filter_pref]
    [charset column charset_column_name]
    [format column format_column_name]
    [lexer lexer_pref]
    [language column language_column_name]
    [wordlist wordlist_pref]
    [storage storage_pref]
    [stoplist stoplist]
    [section group section_group]
    [memory memsize]
    [populate | nopopulate]
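
A minimal sketch of the syntax above in use. The table docs(id, body), the preference name demo_lexer and the index name docs_body_ctx are hypothetical and not from the presentation; the lexer setting is only one plausible choice for mixed English and French text.

```sql
-- Hypothetical table: docs(id NUMBER PRIMARY KEY, body CLOB).
-- Older releases require a primary key on the indexed table.
BEGIN
  -- BASIC_LEXER tokenizes whitespace-delimited languages such as English and French.
  ctx_ddl.create_preference('demo_lexer', 'BASIC_LEXER');
  -- Index accented letters as their base letter, so "méthode" and "methode" match.
  ctx_ddl.set_attribute('demo_lexer', 'BASE_LETTER', 'YES');
END;
/

-- Only the lexer, stoplist and memory clauses are given; the rest use defaults.
CREATE INDEX docs_body_ctx ON docs(body)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('lexer demo_lexer stoplist CTXSYS.DEFAULT_STOPLIST memory 32M');
```

Any clause left out of the parameters string falls back to the system default preference for that category.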

27 Types of indexes
  CONTEXT
    Column types: CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML
    Query operator: CONTAINS
    Characteristics: on a text column; the most complete of all 3 types
    Used for: large coherent documents
  CTXCAT
    Column types: CHAR, VARCHAR2
    Query operator: CATSEARCH
    Characteristics: combined index on a text column and one or more other columns; transactional, so no synchronization is needed after DML; creation can take longer because of the sub-indexes; supports INDEX SET, LEXER*, STOPLIST, STORAGE, WORDLIST*; has its own query language
    Used for: indexing small text fragments and related information; improving mixed query performance
  CTXRULE
    Column types: VARCHAR2, CLOB
    Query operator: MATCHES
    Characteristics: on a column containing a set of queries; supports LEXER (only BASIC); does not support a number of operators
    Used for: building a document classification application

28 Index Maintenance & Optimization
  ALTER INDEX index_name REBUILD [ONLINE] [PARAMETERS('parameters string')];
  Examples:
    ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize fast');
    ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize full maxtime 10');
    ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize full');

29 Index Maintenance & Optimization
  ALTER INDEX index_name REBUILD [ONLINE] [PARAMETERS('parameters string')];
  CTX_DDL package:
    CTX_DDL.OPTIMIZE_INDEX
    CTX_DDL.SYNC_INDEX
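
The two CTX_DDL procedures named on this slide can be called as follows. This is a sketch that reuses the index name cdi_text_ctx from the previous slide; the 10-minute limit mirrors the maxtime example there.

```sql
BEGIN
  -- Process the rows waiting in the pending queue so they become searchable.
  ctx_ddl.sync_index('cdi_text_ctx');

  -- FAST compacts fragmented index rows but does not remove data of deleted rows.
  ctx_ddl.optimize_index('cdi_text_ctx', 'FAST');

  -- FULL also removes data of deleted rows; maxtime limits the run, in minutes.
  ctx_ddl.optimize_index('cdi_text_ctx', 'FULL', maxtime => 10);
END;
/
```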

30 DML processing
  INSERT: a new row is placed in the DR$PENDING queue; it is not available for query before synchronization
  UPDATE: the existing ROWID is placed in DR$PENDING; neither the new nor the old content is available for query before synchronization
  DELETE: the row is immediately unavailable for query (marked as invalid), but it is only removed when optimization completes
  CTX_USER_PENDING (CTX_PENDING) view: to check records waiting for synchronization
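
A short sketch of how the pending queue can be inspected through the CTX_USER_PENDING view mentioned above. The aggregation is an illustration, not a query taken from the presentation.

```sql
-- One line per index owned by the current user that has unsynchronized rows.
SELECT pnd_index_name,
       COUNT(*)           AS pending_rows,
       MIN(pnd_timestamp) AS oldest_change
FROM   ctx_user_pending
GROUP  BY pnd_index_name;
```

An empty result means every change has already been synchronized into the index.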

31 Scoring
  "To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole." (Oracle Text Reference, Release 9.0.1)
  Example:
    In the data set: M occurrences of TERM1 and N occurrences of TERM2, with M >> N
    A document has an equal number (n) of occurrences of TERM1 and TERM2
  Result:
    SCORE for querying TERM1 < SCORE for querying TERM2
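
The result on this slide can be made concrete with the word-query formula given in the scoring appendix of the Oracle Text Reference. The symbols are named here for illustration and do not come from the slide: f is the number of times the term occurs in the document, D is the number of indexed rows, and d is the number of rows containing the term. The formula counts rows containing the term, whereas the slide speaks of occurrences in the data set, so the sketch assumes that the more frequent term also appears in more rows.

```latex
\mathrm{score} = 3f\left(1 + \log\frac{D}{d}\right)
```

The value is reported as an integer between 0 and 100. With the same f for both terms, d_1 > d_2 gives log(D/d_1) < log(D/d_2), so the score for TERM1 is lower than the score for TERM2.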

32 Query Operators
  Boolean: AND, OR, NOT, MINUS
    Example: lhc AND magnet AND NOT cryogenic
  Linguistics: ABOUT, STEM, SYNonym, Translation Term; Broader, Narrower, Preferred, Related Term
    Examples: SYN (science), ABOUT (particle)
  Others: FUZZY, NEAR, SOUNDEX, WITHIN, SQE
    Example:
      begin
        ctx_query.store_sqe('particle', 'atom, molecule proton');
      end;
      'SQE (particle)'
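
A few of the operators listed above, shown inside complete queries. The table docs(id, body) with a CONTEXT index on body is hypothetical, and the search terms are only illustrations.

```sql
-- Proximity: both words within a span of at most 10 words.
SELECT id FROM docs
WHERE  CONTAINS(body, 'NEAR((lhc, magnet), 10)') > 0;

-- Stemming ($) and fuzzy matching (?) in their symbol form.
SELECT id FROM docs
WHERE  CONTAINS(body, '$install OR ?cryogenic') > 0;

-- Theme search; the label 1 ties SCORE(1) to this CONTAINS call.
SELECT id, SCORE(1) AS relevance
FROM   docs
WHERE  CONTAINS(body, 'ABOUT(particle physics)', 1) > 0
ORDER  BY relevance DESC;
```

ABOUT relies on the knowledge base shipped for the index language, so its results depend on the installation.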

33 CTX packages
  CTX_ADMIN: administer servers and the data dictionary (only the ctxsys user)
  CTX_DDL: create and manage preferences, section groups and stoplists; manage indexes
  CTX_DOC: document presentation features (only for CONTEXT indexes)
  CTX_OUTPUT: manage logs for the indexes
  CTX_QUERY: generate query feedback, count hits, and create SQEs (stored query expressions)
  CTX_THES: manage and browse the thesaurus

34 Content: How we profit from this technology

35 EDMS metadata index preferences

36 EDMS search for both languages
  Text visible on the slide: Version 1.5; accelerateur lhc méthode

37 EDMS metadata index preferences

38 Escaping characters to query them
  To be able to query on reserved words or symbols such as "minus", "-" or "near", they must be escaped.
  There are 2 methods to escape the character: using "{}" or "\".
  When using: (example not preserved in this transcript)
  We had to hardcode it for each symbol and word.
  A standard "dictionary table" with the reserved characters would be useful.
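
The two escaping methods can be sketched as follows. The table docs(id, body) is hypothetical, and how an escaped symbol is finally tokenized still depends on the lexer settings of the index.

```sql
-- Braces escape everything between them, so NEAR is searched as a plain word
-- instead of being parsed as the proximity operator.
SELECT id FROM docs
WHERE  CONTAINS(body, '{near}') > 0;

-- A backslash escapes only the single character that follows it;
-- here the hyphen is no longer read as the MINUS operator.
SELECT id FROM docs
WHERE  CONTAINS(body, 'high\-voltage') > 0;
```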

39 EDMS metadata index preferences
  It is important to know how users will search the data and what kind of data you are going to index before you actually do it.

40 EDMS Index Maintenance & Optimization
  Environment:
    Hardware & system: two-node cluster based on two Sun SPARC 450, running Solaris 2.6 + Sun Cluster 2.1
    RDBMS: 8.1.7.4
    ~500 MB SGA size
    60-80 concurrent users (during working hours)
  Meta data and files:
    Drawings: 266 500
    Test documents: 74 000
    New documents (monthly): 4 000
    Document updates (monthly): 4 000
    Files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / total: 88 800 / 148 000
    Files (GB) / total: 45 / 78

41 EDMS Index Maintenance & Optimization
  Index synchronization: every 10 min, takes a few seconds
  Index optimization: every weekend, takes ~30 min
  PROCEDURE rebuild_metedata_ctx IS
  BEGIN
    EXECUTE IMMEDIATE 'alter index CDI_TEXT_CTX rebuild online parameters(''sync'')';
  END;
  PROCEDURE optimize_metedata_ctx IS
  BEGIN
    EXECUTE IMMEDIATE 'alter index CDI_TEXT_CTX rebuild online parameters(''optimize full'')';
  END;
  (Environment and data volumes as on slide 40)
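
The slide gives the synchronization interval but not the scheduling mechanism. One possible way to run the synchronization every 10 minutes is a DBMS_JOB entry, sketched below. This is an assumption, not the EDMS setup: it calls CTX_DDL.SYNC_INDEX directly instead of the procedures above, and it assumes the job owner has been granted EXECUTE on CTX_DDL.

```sql
DECLARE
  l_job BINARY_INTEGER;  -- receives the job number assigned by the queue
BEGIN
  DBMS_JOB.SUBMIT(
    job       => l_job,
    what      => 'ctx_ddl.sync_index(''CDI_TEXT_CTX'');',
    next_date => SYSDATE,              -- first run as soon as possible
    interval  => 'SYSDATE + 10/1440'   -- then every 10 minutes (expressed in days)
  );
  COMMIT;  -- the job is picked up by the job queue only after the commit
END;
/
```

A weekly optimization job would follow the same pattern with a different what string and interval expression.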

42 EDMS Index Maintenance & Optimization
  Synchronize every 24h?
  Optimize (fast, full) every month?
  (Environment and data volumes as on slide 40)

43 Scoring
  SQL> SELECT c_id, score(10) FROM compound_doc_info
       WHERE CONTAINS(c_text, 'lhc', 10) > 0 AND c_id = 1738594907;

        C_ID  SCORE(10)
  ----------  ---------
  1738594907          9

  SQL> SELECT c_id, score(10) FROM compound_doc_info
       WHERE CONTAINS(c_text, 'evolution', 10) > 0 AND c_id = 1738594907;

        C_ID  SCORE(10)
  ----------  ---------
  1738594907         15

44 Using the thesaurus
  Propose the RT (Related Term) if nothing is found with the original term(s).
  DECLARE
    xtab ctx_thes.exp_tab;
    ….
  BEGIN
    ctxsys.ctx_thes.rt(xtab, p_term, 'edms_thes');
    FOR i IN 1..xtab.COUNT LOOP
      IF xtab(i).xrel = C_RELETED_TERM THEN
        htp.anchor(
          L_DOC_SEARCH
            || '?cookie=' || cookie
            || '&p_search_type=' || p_search_type
            || '&p_free_text=' || LOWER(xtab(i).xphrase),
          LOWER(xtab(i).xphrase)
        );
      END IF;
    END LOOP;
  END;
  A spelling corrector based on the existing tokens would be nice to have.

45 Using the thesaurus - example

46 Using the thesaurus - example

47 Querying with Oracle Text versus standard SQL
  … WHERE CONTAINS (c_text, p_free_text) > 0;
  Total 83 ms

48 Querying with Oracle Text versus standard SQL
  … WHERE UPPER(c_text) LIKE '%'||UPPER(p_free_text)||'%'
  Total 03.98s

49 Querying with Oracle Text versus standard SQL
  p_free_text is a single word or an exact sentence
  Oracle Text
    Characteristics: Fast.
    Statement: WHERE CONTAINS (c_text, p_free_text) > 0
    Time*: 83 ms (821ms)
  Standard SQL
    Characteristics: Underperforming.
    Statement: WHERE UPPER(c_text) LIKE '%'||UPPER(p_free_text)||'%'
    Time*: 03.98s (39.14s)
  * Tests done with TOra 1.3.8 (in parentheses: repeated 10x)

50 Querying with Oracle Text versus standard SQL
  p_free_text is an expression with the OR operator
  Oracle Text
    Characteristics: Fast.
    Statement: WHERE CONTAINS (c_text, p_free_text) > 0
    Time*: 103ms (01:03)
  Standard SQL
    Characteristics: Underperforming.
    Statement: WHERE ( UPPER(c_text) LIKE '%'||UPPER(p_text_1)||'%' OR UPPER(c_text) LIKE '%'||UPPER(p_text_2)||'%' )
    Time*: 09:09 (1:22.09)
  * Tests done with TOra 1.3.8 (in parentheses: repeated 10x)

53 Querying with Oracle Text
  Totals shown: 02.36s, 02.31s, 02.25s

54 Mixed queries
  "LHC-Q-EI-0002" is a document number
  Search is done on:
    1) the document number column, using a standard index
    2) the context index
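
The slide does not show how the two searches are combined. A minimal sketch, assuming they are run as two branches of a UNION; the column c_doc_number is hypothetical, while compound_doc_info, c_id and c_text appear elsewhere in the presentation.

```sql
-- Branch 1: exact match on the document number, served by a standard B-tree index.
SELECT c_id
FROM   compound_doc_info
WHERE  c_doc_number = 'LHC-Q-EI-0002'
UNION
-- Branch 2: full-text match through the CONTEXT index; the braces stop the
-- hyphens from being parsed as query operators.
SELECT c_id
FROM   compound_doc_info
WHERE  CONTAINS(c_text, '{LHC-Q-EI-0002}') > 0;
```

UNION removes the duplicates that arise when a document matches on both branches.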

55 Indexing various file formats
  Formatted documents such as Microsoft Word or PDF have to be filtered.
  The File_format column stores the value "TEXT" or "BINARY".
  INSO_FILTER skips filtering for all rows with "TEXT" in the format column.
  NULL_FILTER is for plain text and HTML formats.
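
A sketch of how the filter and the format column fit together at index creation. The table edms_files, its columns and the preference name demo_filter are hypothetical; INSO_FILTER is the filter type of the releases discussed here.

```sql
-- Hypothetical table:
--   edms_files(file_id NUMBER PRIMARY KEY, file_format VARCHAR2(10), content BLOB)
-- file_format holds 'TEXT' for rows that need no filtering and 'BINARY' for
-- formatted documents such as Word or PDF.
BEGIN
  ctx_ddl.create_preference('demo_filter', 'INSO_FILTER');
END;
/

CREATE INDEX edms_files_ctx ON edms_files(content)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('filter demo_filter format column file_format');
```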

56 Some indexing problems we have
  The creation of an interMedia Text index (with URL_DATASTORE) fails with ORA-4030 (out of process memory).
    In June 2002 this was identified as a memory leak, fixed in 8.1.7.4.0.
    We now observe the same ORA-4030 error with 8.1.7.4.0 OPS.
  After apparently successful indexing of the PDF files (using INSO_FILTER), some are indexed only "partially", without any error being recorded in the error table.
  Result: it is very difficult to verify whether a document is correctly indexed.
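
For reference, indexing errors are normally read from the CTX_USER_INDEX_ERRORS view, assuming that is the "error table" meant above. The index name in this sketch is hypothetical.

```sql
-- Most recent indexing errors for one index owned by the current user.
SELECT err_timestamp,
       err_textkey,   -- identifies the row that failed
       err_text       -- error message
FROM   ctx_user_index_errors
WHERE  err_index_name = 'EDMS_FILES_CTX'
ORDER  BY err_timestamp DESC;
```

As the slide points out, a partially indexed document leaves no row here, so an empty result does not prove that every file was indexed completely.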

57 Content: Conclusion

58 Conclusion
  Oracle Text is worth using because of:
    Performance
    Simplicity of the code (integrated with Oracle, no external search engine)
    Simplicity of the index maintenance
    Functional features: bilingual support, special query operators, thesaurus
    Document presentation features

59 Thank you
  EDMS service: https://edms.cern.ch
  This presentation: https://edms.cern.ch/file/402581/1/Oracle_Text_OracleWorld2003.ppt
  Contact: Anna.Suwalska@cern.ch

