One Tool, Many Industries: Text Mining with Oracle. Omar Alonso, Chuck Adams, Oracle Corp. Text Mining Summit, Boston, 2005.

1 One Tool, Many Industries: Text Mining with Oracle
Omar Alonso, Chuck Adams
Oracle Corp.
Text Mining Summit, Boston, 2005

2 Agenda
Introduction
Text mining
Define problems
Present solutions
A look at Oracle's technology stack
Oracle's roadmap
A case study
Conclusions

3 Data mining and Text mining
[Diagram labels: OLTP, OLAP, DM, Keyword search, BK, TM, Classification, Clustering, Ontologies, NLP, Inexact match; Structured Data, Unstructured Data]

4 An analogy
RFID and robot vision
– Put tags on everything instead of having the robot do the vision
Similar approach for text mining
– Language is very social, not technical
– Instead, start with a unified storage model
– Then do mining

5 What about text mining?
Text mining is one of many features in text technology
Real future of text technology is business intelligence (BI)
What is BI?
– Ability to make better decisions
What are the obstacles today?
– Structured data is well understood
– Unstructured data is different

6 Text and XML
[Diagram: Increased exploitation of structure. Labels: Plain Old File System; File System on Steroids (WinFS); Records Mgmt, ECM; Dynamic Doc Generation; Traditional Content Mgmt; XML Content Mgmt.]

7 First problem: access
No uniform access over all sources
Each source has separate storage and algebra
Examples
–
– Databases
– Applications
– Web

8 Second problem: management
Management of unstructured data is very poor compared with structured data
Cleaning
Noise is larger than in structured data
Security
Multilingual

9 Third problem – user needs
Perception with current search engines
Large data -> 80/20 rule
Doesn't provide uniform information
Two users type the same query and get the same results
– Cricket the game or cricket the bug?

10 Foundations
XML as the common model
XML allows:
– Manipulating data with standards
– Mining becomes more like data mining
– RDF emerging as a complementary model
The more structure you can explore, the better you can do mining
Integration use cases

11 Foundations - II
Unstructured data is too AI
Too easy to get fooled by the complexity
Hybrid solution
Domain knowledge
– You know your domain
– You own the content
– You can do better

12 Remember?

13 Personalization problem
Lack of personalization
You own the content, you own the user
Two users type the same query: financials
– Sales rep looks for customers and other deals
– Tech guy looks for bugs, architecture, etc.
LDAP shows who they are
Combination with query logs shows patterns in the same peer group
Recommendation systems
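
The peer-group idea on this slide can be illustrated with a minimal sketch. It assumes a directory lookup that maps each user to a group (the slide suggests LDAP; here it is a plain dictionary) and a query log of clicked documents. The names USER_GROUP, QUERY_LOG, and personalize, and all the sample data, are hypothetical and not taken from the presentation or from any Oracle API.

```python
from collections import Counter

# Hypothetical stand-in for a directory lookup (the slide suggests LDAP).
USER_GROUP = {"alice": "sales", "bob": "engineering"}

# Hypothetical query log entries: (peer group, query, clicked document).
QUERY_LOG = [
    ("sales", "financials", "customer-deals"),
    ("sales", "financials", "customer-deals"),
    ("sales", "financials", "quarterly-report"),
    ("engineering", "financials", "bug-tracker"),
    ("engineering", "financials", "architecture-notes"),
    ("engineering", "financials", "bug-tracker"),
]


def personalize(user: str, query: str, results: list[str]) -> list[str]:
    """Reorder results so documents clicked by the user's peer group come first."""
    group = USER_GROUP.get(user)
    clicks = Counter(
        doc for g, q, doc in QUERY_LOG if g == group and q == query
    )
    # sorted() is stable, so documents with equal click counts keep
    # the order the search engine returned them in.
    return sorted(results, key=lambda doc: -clicks[doc])


engine_order = ["quarterly-report", "bug-tracker",
                "customer-deals", "architecture-notes"]
print(personalize("alice", "financials", engine_order))
print(personalize("bob", "financials", engine_order))
```

With this sample data the sales user sees customer-deals first and the engineer sees bug-tracker first, although both typed the same query. A user missing from the directory gets the engine order unchanged.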

14 Better Answers: Beyond Keywords
Noise theory
– As you cast your nets ever wider, you catch disproportionately more junk
Must develop new models of Quality in the face of comprehensiveness
– Combine Link-Analysis with Context-sensitive relevance
– Personalization
Must summarize information
– Theme Maps, Gists
Show patterns in information vs. many pages of hit-lists
– Tree Maps, Stretch Viewer
Ability to post-process and refine search hit lists
– Dynamic categories for navigation
– Reorder by date
Progressive query relaxation
– Nearest inexact match
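
Progressive query relaxation, the last item on this slide, can be sketched as follows: try the strictest form of the query first and fall back to looser forms only until enough hits are found. The three levels used here (exact phrase, all terms, any term), the DOCS corpus, and the function names are assumptions made for illustration; the slide does not say which relaxation steps Oracle uses.

```python
# Hypothetical in-memory corpus; a real system would query a text index.
DOCS = {
    1: "oracle text mining summit in boston",
    2: "text mining with oracle database",
    3: "mining equipment for coal",
    4: "boston weather report",
}


def phrase_match(query: str, text: str) -> bool:
    """Strictest level: the query occurs as a substring of the text."""
    return query in text


def all_terms(query: str, text: str) -> bool:
    """Middle level: every query term occurs somewhere in the text."""
    words = set(text.split())
    return all(term in words for term in query.split())


def any_term(query: str, text: str) -> bool:
    """Loosest level: at least one query term occurs in the text."""
    words = set(text.split())
    return any(term in words for term in query.split())


LEVELS = [("phrase", phrase_match), ("all terms", all_terms), ("any term", any_term)]


def relaxed_search(query: str, wanted: int) -> list[tuple[int, str]]:
    """Return up to `wanted` hits, each tagged with the level that matched it."""
    hits: list[tuple[int, str]] = []
    seen: set[int] = set()
    for name, matcher in LEVELS:
        for doc_id, text in sorted(DOCS.items()):
            if doc_id not in seen and matcher(query, text):
                seen.add(doc_id)
                hits.append((doc_id, name))
                if len(hits) == wanted:
                    return hits  # stop relaxing once enough hits are found
    return hits


print(relaxed_search("text mining", 3))
# [(1, 'phrase'), (2, 'phrase'), (3, 'any term')]
```

Stricter matches always rank ahead of looser ones, and the looser levels are never evaluated when the strict levels already supply enough results, which is how the approach limits the junk described under noise theory.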

15 Technology Stack
[Diagram labels: Better Answers, Relevance, Toward BI, Progressive Relaxation, Multi-Criterion Support, Visualization, Classification, Personalization, Direct Answers, Link Analysis, Query Log Analysis, Metadata Extraction, Keyword Ranking, Intelligent Match, Duplicate Elimination]

19 Oracle's position
Text mining is one of many tools for information retrieval and discovery in many assets
Text mining is best used in the context of other techniques
– Personalization
– Search query logs
– Visualization
Product: one integrated platform

20 Oracle platform
Integrated platform vs. niche technology
Full-text searching: Google, FAST
XML: Tamino
Classification: Autonomy
Clustering: Vivisimo
Visualization: Inxight
Application search: SAP/TREX
One platform, low cost, low complexity
Several products, different APIs, performance, maintenance cost, etc.

21 Oracle platform
"If I can see further than anyone else, it is only because I am standing on the shoulders of giants" – Isaac Newton
Oracle provides you all the functionality
– Plus you get backup, recovery, scalability, and other benefits
You build the mining application

22 Case study
Federal customer
High Performance Text Information Mining and Entity Extraction

23 Business Need
Enterprise Search Capability
Information Fusion
Profiles and alerting
Security – user need to know
Entity identification and extraction
High Performance ingestion, search, and indexing
Scalability

24 Challenges
Search quality
Performance
Scalability
Document formats
Integration
Operations and maintenance

25 Solutions Architecture
Oracle 10g Integrated Framework, 10g release 2
– Oracle Real Application Clusters
– Oracle Text
    Full text and rule based indexing
    Extensible thesauri
    Document classification
    Document filters
– Oracle Partitioning
– Oracle Virtual Private Database
– Oracle Advanced Security
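
The rule based indexing and document classification items above can be illustrated with a small sketch: each category is defined by stored rules, and an incoming document is assigned every category whose rule it satisfies. This is plain Python and does not use Oracle Text; the RULES table, its categories, and the classify function are hypothetical.

```python
# Hypothetical rules: a category matches when ALL words of at least one
# alternative appear in the document.
RULES = {
    "finance": [{"quarterly", "earnings"}, {"revenue"}],
    "security": [{"access", "control"}, {"encryption"}],
}


def classify(text: str) -> list[str]:
    """Return the sorted list of categories whose rules match the text."""
    # Simplification: split on whitespace only, so punctuation attached
    # to a word would prevent a match.
    words = set(text.lower().split())
    return sorted(
        category
        for category, alternatives in RULES.items()
        if any(required <= words for required in alternatives)
    )


print(classify("Quarterly earnings rose while encryption costs fell"))
# ['finance', 'security']
print(classify("access denied"))
# []
```

The point of the design is that the rules are stored once and each new document is tested against them, which suits the profile and alerting requirement in the case study.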

26 Technical Architecture

27 Scalable load and indexing

28 Real world results
Single search for user
Profiles and alerts
Query response within a couple of seconds
80,000,000+ documents indexed
1.2 TB raw text and growing
700 GB index size
Incremental index 1-2 GB / day
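
As a rough sanity check on these figures, the following sketch derives the average document size and the index-to-text ratio. It assumes decimal units (1 TB = 10**12 bytes) and treats the quoted document count as exact, although the slide gives it as a lower bound.

```python
# Figures quoted on the slide; decimal units are an assumption.
DOCUMENTS = 80_000_000
RAW_TEXT_BYTES = 1_200 * 10**9   # 1.2 TB of raw text
INDEX_BYTES = 700 * 10**9        # 700 GB index


def average_document_bytes() -> float:
    """Average raw text per document, in bytes."""
    return RAW_TEXT_BYTES / DOCUMENTS


def index_to_text_ratio() -> float:
    """Index size as a fraction of the raw text size."""
    return INDEX_BYTES / RAW_TEXT_BYTES


print(f"average document: {average_document_bytes() / 1000:.0f} KB")
print(f"index size relative to raw text: {index_to_text_ratio():.0%}")
```

Under these assumptions the average document is about 15 KB and the index is a little under 60% of the raw text size. Because the document count is stated as "80,000,000+", the true average is at most that value.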

29 Next Steps
Entity Extraction and Relationship Awareness

30 Oracle Database 10g Release 2
Enterprise Search Capability
Information Fusion
Profiles and alerting
Security – user need to know
Entity identification and extraction
High Performance ingestion, search, and indexing
Scalability

31 Conclusions
Text mining is one of many features needed for BI on unstructured data
– Not a silver bullet in itself
Must exploit other approaches
– metadata (XML, RDF), personalization, classification, entity extraction, full-text search, …
– Hybrid solution
Focus on an integrated platform that gives you all the functionality
Drive the platform for your information need
