Information Interoperation versus Integration. Gio Wiederhold, Stanford University, April 2002. www-db.stanford.edu/people/gio.html


1 April 2002 Gio ArtInt 1 Information Interoperation versus Integration Gio Wiederhold Stanford University April 2002 www-db.stanford.edu/people/gio.html Thanks to Jan Jannink, Prasenjit Mitra, Stefan Decker.

2 Layout. Objectives and definitions (information, knowledge, actions): slides 3-8. Integration, articulation, interoperation: 9-13. Semantic problems: 14-20. Effect of the Internet: 21-26. Precision: 27-31. SKC project (ontology algebra & tools): 31-52. Summary: 53-59.

3 Today's sources and needs. Many resources: computing (databases, analytical software, simulations); people (organizations, domains → semantics). Many objectives: operations (reports, routine actions, optimization); decision-making (crisis response, allocations, conflict resolution). All require information.

4 Objective of Information Systems. Initiate actions that are: Optimal (highest in delivering benefits for the costs and risks); Effective (achieve the goal); Satisfactory (move towards the goal); Feasible (valid and possible). Optimality requires perfect (complete and error-free) data and knowledge, which is not always attainable. The best is often the enemy of the good.

5 Sources of Information. Recall of what you have forgotten. Information from external sources: newspapers, books. Induction from data: experience. Processing of data: statistics. Combining information that was previously disjoint. Processing by intermediaries.

6 Data + Knowledge → Information. [Diagram: two loops. Knowledge loop = rules, process. Data loop = facts. Labels: Action, Experience, Storage, Education, Selection, Integration, Abstraction, Decision-making, State changes, Recording, Projection, Information, Assimilation.]

7 Four Information Technology Tasks. Actionable information is created at the confluence of data (the state) and knowledge (the ability to select, integrate, abstract, and project the state into the future). Selection: SQL, a one-verb language. Integration: middleware, mediation*. Abstraction: visualization, summarization. Projection: simulation, spreadsheets. (* Mediation is the focus of this talk.)

8 Encoded Knowledge. Lexicons: collect terms used in information systems. Taxonomies: categorize, abstract, classify terms. Schemas of databases: attributes, ranges, constraints. Data dictionaries: systems with multiple files, owners. Object libraries: grouped attributes, inheritance, methods. Symbol tables: terms bound to implemented programs. Domain object models (XML DTD): interchange terms. Ontologies: networks of terms and relationships. Metadata formalizes knowledge.

9 Combining data the obvious way. [Diagram: Data A with MetaData A and Data B with MetaData B are integrated into Data A ∪ B with MetaData A ∪ B, which the application accesses.] 1. Integration: create a large database that incorporates all material in the sources. 2. Operation: access the integrated data; update the integrated data when sources change.

10 or linkages to the data. [Diagram: the application uses A ∩ B, articulated from Data A / MetaData A and Data B / MetaData B through rules for A ∩ B.] 1. Articulation: create links to needed data, perhaps with rules. 2. Interoperation: exploit the linkages and rules.

11 Benefits and problems. Integration: (-) long time to establish common metadata; (+) easy for applications, everything is there; (-) massive replication; (-) difficult to maintain as sources are updated. Articulation: (+) metadata defined by application need; (+) rapid implementation and modification; (-) interoperation via indirect access, (+) mitigated by caching; (-) many articulations, up to one per application; (+) affected by selected source changes.

12 Alternatives for Applications. [Diagram: the two alternatives side by side. Articulate: rules for A ∩ B link Data A / MetaData A and Data B / MetaData B to the application, which interoperates. Integrate: the sources are merged into Data A ∪ B / MetaData A ∪ B, which is operated and maintained for the application.]

13 Central Solutions do not Scale. What works with 7 modules (apps and DBs) and one person in charge fails when we have 100 and need a committee. Any changes in sources affect the integrated module.

14 Semantic Heterogeneity: A Major Cause of Problems. Integration / interoperation joins domains. Domains have their own terminologies: need autonomy to deal with knowledge growth. The usage of terms in a domain is efficient: appropriate granularity (mechanic working on a truck vs. logistics manager); shorthand notations (PS = power steering vs. PS = partial shipment). Functions differ in scope: Payroll versus Personnel (getting paid vs. available, which includes contract staff).

15 Why semantic heterogeneity now. Heterogeneity has been addressed: Platform and Operating Systems; Representation and Access Conventions; Naming and Ontology. Whenever interoperation involves distinct domains, mismatch ensues. Autonomy conflicts with consistency: local needs have priority; outside uses are a byproduct.

16 Domains and Consistency. A domain will contain many objects. The object configuration is consistent within a domain: all terms are consistent, and relationships among objects are consistent. Context is implicit. No committee is needed to forge compromises* within a domain. (* Compromises hide valuable details.) Domain Ontology.

17 SKC grounded definition. Ontology: a set of terms and their relationships. Term: a reference to real-world and abstract objects. Relationship: a named and typed set of links between objects. Reference: a label that names objects. Abstract object: a concept which refers to other objects. Real-world object: an entity instance with a physical manifestation.
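
The definitions on this slide map naturally onto a small data structure. The following is a minimal sketch, not SKC's actual representation: the class names `Term`, `Relationship`, and `Ontology`, the field names, and the truck/vehicle example are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A term, identified here by its reference (a label that names objects)."""
    label: str


@dataclass(frozen=True)
class Relationship:
    """A named and typed set of links between objects."""
    name: str
    kind: str             # the relationship's type (hypothetical field)
    links: frozenset      # set of (Term, Term) pairs


@dataclass
class Ontology:
    """A set of terms and their relationships."""
    terms: set = field(default_factory=set)
    relationships: set = field(default_factory=set)


# Illustrative content only; not taken from the talk.
truck = Term("truck")
vehicle = Term("vehicle")
subclass_of = Relationship("subclass_of", "taxonomic", frozenset({(truck, vehicle)}))
onto = Ontology(terms={truck, vehicle}, relationships={subclass_of})
```

Frozen dataclasses are used so that terms and relationships are hashable and can be stored in sets, which matches the slide's "set of terms" wording.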

18 Examples of Semantic Mismatches. Information comes from many autonomous sources. Differing viewpoints (by source): differing terms for similar items: {lorry, truck}; same terms for dissimilar items: trunk (luggage, car); differing coverage: vehicles (DMV, AIA); differing granularity: trucks (shipper, manuf.); different scope: student (museum fee, Stanford). This hinders use of information from disjoint sources: missed linkages cause loss of information and opportunities; irrelevant linkages cause overload on the user or application program. Poor precision when merged. Still OK for web browsing, poor for business & science.

19 Structural Semantic Heterogeneity. Same concept? If incorporated in differing structures?

20 Effect on Information Integration. Big integrated systems? Cost, delay; obsolescence; inflexibility; unmaintainable. Composed systems? No or limited control; operational inefficiency; understanding heterogeneity; distributed maintenance.

21 Integration up-to-dateness. [Chart: percentage up-to-date (0% to 100%) against frequency of visits relative to frequency of source change, ranging from never through 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, and 1/second to as often as possible. Curves: F(capability), given 2.2M public sites with 288M pages in Feb. 2000, and F(user need). Annotation: effort, methods.]

22 Web capabilities: more domains. World-wide connectivity: internet, world-wide web. More databases: public & corporate. Faster communication: digital; packeting (TCP-IP, ATM). Disintermediation: ubiquitous publishing. Information overload. Data starvation.

23 Global consistency: Hope, but... Common semantic assumptions in assembling and integrating distributed information resources: the language used by the resources is the same; sublanguages used by the resources are subsets of a globally consistent language. These assumptions are provably false. Working towards the goal of global consistency is 1. naïve: the goal cannot be achieved; 2. inefficient: languages are efficient in local contexts; 3. unmaintainable: terminology evolves with progress.

24 Knowledge from the Web. Knowledge about the data sources: where is what; contents, semantics; scope of contents. Knowledge to apply the data sources: mapping to application terms; mapping to application scope; integration from multiple sources.

25 Availability of Knowledge. XML-based schemas (DTDs, extended semantics): a very small fraction now. Embedded or related knowledge (text, dictionaries, context ontologies, DB schemas): the majority of sources. Must exploit both flexibly in order to make progress with KB-processing; costly to develop.

26 Search engines need knowledge. Yahoo: humans catalog and organize useful web sites into hierarchies with crosslinks; about 300 topic experts in 2001. Scalable? dmoz: public service classification with 3000 editors. Quality? AltaVista: worms (surfs) and indexes all of the web. Frequency/volatility? Excite: also tracks queries and classifies customers. 1D, privacy? Firefly: provides customer control over their profiles. Ease/effective? Junglee: integrates diverse wrapped sources, compares. Semantics? Alexa: collects webpages and their usage, keeps old pages (massive), suggests pages with similar access reference pattern. Value? Google: PageRank gives the reference importance of web pages, using weighted votes, iterated closure. Find new stuff? And others, used by most of the e-commerce sites.

27 Problems for integrating the web. Unsuitable source representations: part classification (HTML, XML); print formats (PostScript, Adobe PDF); non-text (images, video, sound [Audio Mining (Dragon)]); hidden in databases behind CGI scripts. Inconsistent semantics: context distinct / scope / view. Naïve modeling of customers/apps: personalization, versus context, roles & growth. Search engines, trying to deal with all of the web, cannot solve the problems precisely. Being improved. Rate? [Each item carries a progress bar on the slide.]

28 Historical note. "Ich weiss nicht was soll es bedeuten, ..." ("I do not know what it is supposed to mean, ..."): an early complaint about semantics [Heinrich Heine: Die Lorelei].

29 Precision of Language. Our languages must serve our needs. Efficiency of expression: minimal symbols (carpenter versus householder domain). Commitment (surgeon versus pathologist). Locality provides context: geographical locality ("downtown") is irrelevant on the web; conceptual locality ("5 pounds") remains important. Business requires precision, including that terms used map consistently to instances. Knowledge controls use of data.

30 Need for precision (adapted from Warren Powell, Princeton Univ.). More precision is needed as data volume increases: a small error rate still leads to too many errors. False positives have to be investigated (an attractive-looking supplier that makes toys; an apparent drug target with poor annotation). False negatives cause lost opportunities, suboptimal to some degree. False positives (poor precision) typically cost more than false negatives (poor recall). [Chart: data errors against information quantity, with a human limit, an acceptable limit, and "human with tools?"; labelled "Information Wall".]
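
The point that a small error rate still produces too many errors at scale can be checked with simple arithmetic. The sketch below assumes a fixed false-positive rate and a hypothetical limit on how many items a person can investigate; the rate, the volumes, and the limit are illustrative numbers, not figures from the talk.

```python
def false_positives(volume: int, error_rate: float) -> int:
    """Expected number of wrongly retrieved items at a fixed error rate."""
    return round(volume * error_rate)


def exceeds_limit(volume: int, error_rate: float, human_limit: int) -> bool:
    """True when the false positives to investigate exceed the assumed human limit."""
    return false_positives(volume, error_rate) > human_limit


# Hypothetical 1% error rate and a limit of 50 items a person can check.
for volume in (1_000, 100_000, 10_000_000):
    print(volume, false_positives(volume, 0.01), exceeds_limit(volume, 0.01, 50))
```

With the error rate held constant, the number of items to investigate grows linearly with volume, so the only way to stay under a fixed human limit is to improve precision as volume grows.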

31 Relationships among search parameters. Precision: p = v.relevant / v.retrieved. Recall: r = v.relevant / v.available. [Chart: percentage actually relevant (0% to 100%) over the space of methods, relating volume retrieved to volume available, with perfect precision and perfect recall as the two extremes.]
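
The two ratios on this slide can be computed directly from sets of item identifiers. This is a minimal sketch using the standard information-retrieval reading, in which "v.relevant" in the numerator means relevant items that were retrieved and "v.available" means the relevant items available; that reading, the function name, and the document identifiers are assumptions.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a retrieved set against the relevant set."""
    hits = len(retrieved & relevant)                   # relevant items actually retrieved
    p = hits / len(retrieved) if retrieved else 0.0    # share of retrieved that is relevant
    r = hits / len(relevant) if relevant else 0.0      # share of relevant that was retrieved
    return p, r


# Hypothetical document identifiers.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Retrieving more items can only keep recall the same or raise it, but tends to lower precision, which is the trade-off the chart describes.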

32 Large scale demands articulation. Many source ontologies. Diverse application domains, with their ontologies. Articulation = application-specific intersection of sources: keeps and exploits connection to sources; many articulations are possible; articulation results are small (∩, not ∪); reliable articulations require consistent sources; modest amount of knowledge needed. SKC project: Scalable Knowledge Composition.

33 Sample Intersections. [Diagram: source ontologies and their articulation. Shoe Store: Shoes {...}, Customers {...}, Employees {...}. Shoe Factory: Material inventory {...}, Employees {...}, Machinery {...}, Processes {...}, Shoes {...}. Articulation ontology matching rules: size = size, color = table(colcode), style = style. Other labels: Department Store, Anatomy {...}, Hardware, foot = foot, Employees, Nail (toe, foot), Nail (fastener).]

34 System setting: mediation. Application Layer: decision-making support applications. Mediation Layer: value-added services. Foundation Layer: data and simulation resources.

35 Sample Operation: INTERSECTION. Source Domain 1: owned and maintained by Store. Source Domain 2: owned and maintained by Factory. Articulation: result contains shared terms, useful for purchasing.

36 Sample Processing in HPKB. Question: What is the most recent year an OPEC member nation was on the UN Security Council? Related to the DARPA HPKB Challenge Problem. SKC resolves 3 sources: CIA Factbook '96 (nation); OPEC (members, dates); UN (SC members, years). SKC obtains the correct answer: 1996 (Indonesia). Other groups obtained more, but factually wrong, answers. Problems resolved by SKC: Factbook has out-of-date OPEC & UN SC lists (Indonesia not listed; Gabon, left OPEC 1994); different country names (Gambia => The Gambia); historical country names (Yugoslavia); UN lists future Security Council members (Gabon 1999); intent of original question; temporal variants.

37 Many ontologies → Interoperation. Have devolved maintenance onto domain-specific experts / authorities. Many articulations. But applications often span multiple domains. Can't make domains bigger and keep them consistent. Need an algebra to compute composed domains, whose ontologies are limited to the terms defined by their articulations. SKC.

38 An Ontology Algebra. A knowledge-based algebra for ontologies. The Articulation Ontology (AO) consists of matching rules that link domain ontologies. Intersection: create a subset ontology, keep sharable entries. Union: create a joint ontology, merge entries. Difference: create a distinct ontology, remove shared entries.
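
The three operations can be sketched over plain sets of terms, with the articulation ontology reduced to a set of matching pairs. This is an illustrative simplification, not the SKC implementation: representing an ontology as a set of strings, the rule format, the function names, and the store/factory terms (loosely based on the shoe example in this talk) are all assumptions.

```python
def intersection(a: set, b: set, rules: set) -> set:
    """Sharable entries: the (term_a, term_b) pairs linked by a matching rule."""
    return {(x, y) for (x, y) in rules if x in a and y in b}


def union(a: set, b: set, rules: set) -> set:
    """Joint ontology: all of a, plus the terms of b not already matched to a."""
    matched_b = {y for (_, y) in intersection(a, b, rules)}
    return set(a) | {y for y in b if y not in matched_b}


def difference(a: set, b: set, rules: set) -> set:
    """Distinct ontology: the terms of a that no rule shares with b."""
    matched_a = {x for (x, _) in intersection(a, b, rules)}
    return {x for x in a if x not in matched_a}


# Hypothetical term sets and matching rules.
store = {"store:size", "store:color", "store:style", "store:customers"}
factory = {"factory:size", "factory:colcode", "factory:style", "factory:machinery"}
rules = {
    ("store:size", "factory:size"),
    ("store:color", "factory:colcode"),   # stands in for color = table(colcode)
    ("store:style", "factory:style"),
}

shared = intersection(store, factory, rules)
joint = union(store, factory, rules)
local = difference(store, factory, rules)
```

Here the intersection stays as small as the rule set, while the union grows with both sources, which is why articulation by intersection scales better than full integration.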

39 Other Basic Operations (typically prior intersections). UNION: merging entire ontologies. DIFFERENCE: material fully under local control. [Diagram label: Articulation ontology.]

40 Features of an algebra. Operations can be composed. Operations can be rearranged. Alternate arrangements can be evaluated. Optimization is enabled. The record of past operations can be kept and reused when sources change.

41 INTERSECTION support. [Diagram: Store Ontology and Factory Ontology linked by the Articulation ontology: matching rules that use terms from the 2 source domains. Result: terms useful for purchasing.]

42 Knowledge Composition. [Diagram: knowledge resources A, B, C, D, and E, linked by articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E). Composed knowledge for applications using A, B, C, E: (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E). Legend: ∪ = union, ∩ = intersection.]

43 Tools to create articulations. Combine ontology graphs with expert selection, based on spelling, graph matching, and a nexus derived from a dictionary (O.E.D.). [Example graphs: vehicle registration ontology, vehicle sales ontology.]

44 1. Initialize by word matching. [Screenshot: graph matcher for the articulation-creating expert, showing a vehicle ontology, a transport ontology, and suggestions for articulations.]

45 2. Continue from initial point. Also suggest similar terms for further articulation: by spelling similarity, by graph position, by term match repository. Expert response: 1. Okay; 2. False; 3. Irrelevant to this articulation. All results are recorded. Okays are converted into articulation rules.
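
The suggest-and-confirm loop on this slide can be sketched with the spelling-similarity step alone. The code below is a hedged illustration rather than the SKC tool: it uses Python's standard `difflib` as a stand-in similarity measure, the function names and the 0.8 cutoff are assumptions, the term lists are made up, and graph position and the term match repository are not modelled.

```python
import difflib


def suggest_matches(terms_a, terms_b, cutoff=0.8):
    """Propose (a, b) candidate pairs whose spelling similarity reaches the cutoff."""
    candidates = sorted(terms_b)
    suggestions = []
    for a in sorted(terms_a):
        for b in difflib.get_close_matches(a, candidates, n=3, cutoff=cutoff):
            suggestions.append((a, b))
    return suggestions


def record_responses(suggestions, expert):
    """Ask the expert about each suggestion; log every verdict, keep the okays as rules.

    `expert` is any callable returning "okay", "false", or "irrelevant".
    """
    log = [(a, b, expert(a, b)) for a, b in suggestions]
    rules = [(a, b) for a, b, verdict in log if verdict == "okay"]
    return log, rules


# Hypothetical vehicle and transport terms, and an expert who accepts everything.
vehicle_terms = ["truck", "car", "owner"]
transport_terms = ["trucks", "cargo", "driver"]
suggestions = suggest_matches(vehicle_terms, transport_terms)
log, rules = record_responses(suggestions, lambda a, b: "okay")
```

Keeping the full log, including rejected suggestions, follows the slide's note that all results are recorded, so rejected pairs need not be proposed again.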

46 3. A Term Net to Propose Match Candidates. Term linkages automatically extracted from the 1912 Webster's dictionary* (* free; other sources have been processed). Based on processing headwords → definitions using algebra primitives. Notice presence of 2 domains: chemistry, transport.

47 4. Using the term net

48 5. Navigating the term net

49 Primitive Operations (Model and Instance). Unary: Summarize (structure up); Glossarize (list terms); Filter (reduce instances); Extract (circumscription). Binary: Match (data corroboration); Difference (distance measure); Intersect (schema discovery); Blend (schema extension). Constructors: create object, create set. Connectors: match object, match set. Editors: insert value, edit value, move value, delete value. Converters: object - value, object indirection, reference indirection.

50 Matching rules (in process). A equals U. A equals {u, ...}. A equals Union (U, V). A equals Union (U, v...). A equals Intersection (U, W). A equals Intersection (U, w...). A equals Difference (U, X). A equals Difference (U, x...). Rules must obey algebraic properties.

51 Future: exploiting the result. Processing & query evaluation is best performed within source domains and by their engines. Result has links to source. Avoid the n² problem of interpreter mapping, as stated by Swartout as an issue in HPKB year 1.

52 Innovation in SKC. No need to harmonize full ontologies. Focus on what is critical for interoperation. Rules specific for articulation. Potentially many sets of articulation rules. Maintenance is distributed: to n sources; to m articulation agents. Is m < n²? Depends on architecture density; a research question.
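
The comparison between m and n² can be made concrete with a small count. This is only a sketch of the arithmetic under stated assumptions: it treats the n² case as one interpreter per ordered pair of sources, and it treats m as the number of distinct source combinations that applications actually articulate; both readings, the function names, and the example applications are assumptions rather than definitions from the talk.

```python
def pairwise_mappings(n: int) -> int:
    """Interpreters needed if every ordered pair of n sources is mapped: n * (n - 1)."""
    return n * (n - 1)


def articulations_needed(app_source_sets) -> int:
    """Count the distinct source combinations that some application articulates."""
    return len({frozenset(sources) for sources in app_source_sets})


# Hypothetical architecture: 10 sources, four applications over a few of them.
apps = [{"A", "B"}, {"B", "C"}, {"A", "B"}, {"C", "D", "E"}]
n_squared_case = pairwise_mappings(10)
m = articulations_needed(apps)
print(n_squared_case, m)
```

In a sparse architecture like this one, m stays far below the pairwise count; as more applications combine more sources, m grows, which is the density dependence the slide leaves open as a research question.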

53 Integration at higher Levels. Application: informal, pragmatic; client control; use up to 7 ± 2 mediators. Mediation: formal, reliable service; domain-expert control; use up to 7 ± 2 sources.

54 Domain Specialization. Knowledge Acquisition (20% effort) & Knowledge Maintenance (80% effort*) to be performed by: domain specialists; professional organizations; field teams of modest size. Empowerment: autonomously maintainable. (* Based on experience with software.)

55 Summary. To sustain the growth of web usage: 1. The value of the results has to keep increasing: precision, relevance, not volume. 2. Value is provided by mediating experts, encoded as models of diverse resources, diverse customers (clear models). Problems being addressed: redundancy, mismatches, quality, maintenance. Need to develop flexible tools for these tasks.

56 Integration Science? [Diagram: Integration Science alongside Artificial Intelligence (knowledge mgmt, linguistic models, uncertainty), Systems Engineering (analysis, documentation, costing), and Databases (access, storage, algebras, scalability).]

57 E-commerce: Growth and Perception. Gartner: 2000 prediction for 2004: 7.3 T$. Revision: 2001 prediction for 2004: 5.9 T$. Drastic loss? [Chart: perception level over time (0 to 15), with curves for realistic growth, perceived growth, invisible growth, extrapolated growth, perceived initial growth, and combinatorial growth, and a region marked disappointment. Examples: Artificial Intelligence, Databases, Neural networks, E-commerce.] 50 companies, each after 20% of the market. Failures.

58 E-Commerce Trends 1998/1999/... Users of the Internet: 40% → 52% of U.S. population. Growth of net sites (now 2.2M public sites with 288M pages). Expected growth in e-commerce by Internet users [BW, 6 Sep. 1999], by segment, 1998 → 1999: books 7.2% → 16.0%; music & video 6.3% → 16.4%; toys 3.1% → 10.3%; travel 2.6% → 4.0%; tickets 1.4% → 4.2%; overall 8.0% → 33.0% = $9.5 billion. Will change all retail businesses; when? An unsustainable trend cannot be sustained [Herbert Stein]. [Chart: e-penetration for toys in % (axis 0 to 90) by year, 1998 to 2004: 0.3, 1, 3, 9, 27, 81, **. Centroid in 1999 at about 1% of total market. Annotation: → new services.]

59 Some Groups at Stanford University. KSL: Feigenbaum, Fikes et al. Logic: Genesereth, Koller, McCarthy, Nielsen et al. Industrial: Petrie et al. Medical Informatics: Musen et al. Infolab: Garcia-Molina, Manning, Ullman, Wiederhold. www-db.stanford.edu/people/gio.html The configurations are not fixed; projects specifically can include a variety of combinations.

60 Final Panel. Agreement with Colette Rolland and Arne Solvberg: consider the research question as a hypothesis (Hx), noting that the job is to convert a hypo- (less than) thesis to a thesis; justify the Hx in terms of a real-world problem. Computer Science is broadening into all areas of human endeavor: most new researchers will interact with other disciplines; many new entrants will contribute competencies from other areas. This means that we all will need to develop an understanding for the differences among these areas.

61 Differences among Sciences. Incompatibilities among sciences have often been characterized as differences in language (perhaps, more formally, differences in ontology) and differences in problem decomposition: alternate high-level objectives; divide-and-conquer paradigms. One person's (or discipline's) optimal approach is viewed by others as a bureaucracy to be fought. There are more fundamental differences...

62 Paradigms of Science. Disciplines use distinct scientific models. 1. Mathematical sciences: formal proofs, to hold in all cases; a single exception invalidates the proof; complexity is dealt with explicitly (big Oh). 2. Engineering sciences: recognition of alternative approaches to a goal; a thesis shows the best (cost-benefit) case in some range. 3. Social sciences: surveys can measure truth of a Hx to some degree of belief; having some exceptions does not invalidate the Hx. 4. Biological sciences: cases provide distinct examples, and prediction assumes a linear world. One should know in which paradigm one's research is performed.

