A Portal for Access to Complex Distributed Information about Energy Jose Luis Ambite, Yigal Arens, Eduard H. Hovy, Andrew Philpot DGRC Information Sciences.


1 A Portal for Access to Complex Distributed Information about Energy
Jose Luis Ambite, Yigal Arens, Eduard H. Hovy, Andrew Philpot
DGRC, Information Sciences Institute, University of Southern California
Walter Bourne, Peter T. Davis, Steven Feiner, Judith L. Klavans, Samuel Popper, Ken Ross, Ju-Ling Shih, Peter Sommer, Surabhan Temiyabutr, Laura Zadoff
DGRC, Columbia University

2 The Vision: Ask the Government...
– How have property values in the area changed over the past decade?
– How many people had breast cancer in the area over the past 30 years?
– Is there an orchestra? An art gallery? How far are the nightclubs?
– We’re thinking of moving to Denver... What are the schools like there?
[Figure labels: Census Labor Stats; Query results]

3 The Energy Data Collection project
EDC research team:
– Information Sciences Institute, USC
– Dept of CS, Columbia University
Government partners:
– Energy Information Admin. (EIA)
– Bureau of Labor Statistics (BLS)
– Census Bureau
Research challenge:
– Make accessible in standardized way the contents of thousands of data sets, represented in many different ways (webpages, pdf, MS Access, Excel, text…)

4 Data Integration
[Architecture diagram; labels as extracted]
– Heterogeneous Data Sources: Trade, EPA, Census, EIA, Labor
– User Interface
– Information Access
– Data Access and Query Processing
– Metadata and Terminology Management
– User Evaluation
– Interface Design and Task-based Evaluation
– Concept Ontology
– Terminology Sources

5 Data access using SIMS
‘Hide’ from user details of data sources:
1. ‘Wrap’ each source in software that handles access to its data
2. Record the types of info in each source in a ‘Source Model’
3. Arrange all source models together in the same space: the Domain Model
SIMS data access planner transforms user’s request into individual access queries
SIMS extracts the right data from the appropriate sources
Current databases and models:
Sources:
– Databases: 58,000+ series (EIA OGIRS and others)
– Webpages: 60+ (BLS, CEC tables)
Models:
– SENSUS ontology: 90,000 nodes (from ISI’s NLP technology)
– Domain model: 500 nodes (manual; for database access planner)
– LKB: 6000 nodes (NL term/info extraction from glossaries)
(Ambite et al., ISI)
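
The three steps on this slide (wrap each source, record a source model, place all source models in one domain model) can be illustrated with a minimal sketch. This is not the SIMS implementation: the names `WrappedSource`, `DomainModel`, `plan`, `answer`, and `constant_wrapper`, the attribute names, and the greedy first-source-wins planning rule are all assumptions made for illustration, and the returned values are toy data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class WrappedSource:
    """A data source, its wrapper, and its 'source model' (hypothetical)."""
    name: str
    attributes: frozenset                      # source model: info types held
    fetch: Callable[[List[str]], List[Row]]    # wrapper: handles data access


class DomainModel:
    """Holds all source models in one space so requests can be planned."""

    def __init__(self, sources):
        self.sources = list(sources)

    def plan(self, requested):
        """Split a request into per-source queries (greedy: first source wins)."""
        plan, remaining = {}, list(requested)
        for src in self.sources:
            hits = [a for a in remaining if a in src.attributes]
            if hits:
                plan[src.name] = hits
                remaining = [a for a in remaining if a not in hits]
        if remaining:
            raise KeyError(f"no source provides: {remaining}")
        return plan

    def answer(self, requested):
        """Run each per-source query through that source's wrapper."""
        by_name = {s.name: s for s in self.sources}
        return {name: by_name[name].fetch(attrs)
                for name, attrs in self.plan(requested).items()}


def constant_wrapper(value):
    """Toy wrapper that returns one row with a made-up constant value."""
    return lambda attrs: [{a: value for a in attrs}]


# Toy sources; attribute names and values are invented for the example.
eia = WrappedSource("EIA", frozenset({"gasoline_price"}), constant_wrapper(1.0))
bls = WrappedSource("BLS", frozenset({"cpi"}), constant_wrapper(2.0))
model = DomainModel([eia, bls])
```

Calling `model.plan(["gasoline_price", "cpi"])` yields one attribute list per source, so the user never names a source directly; `model.answer(...)` then collects the rows each wrapper returns.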

6 Data access using in-memory query processing
How can you provide fast access to millions of data values?
– Cache data that doesn’t change much in data warehouse
– Create rich multidimensional index structures; keep in memory
– Adapt index depending on user’s patterns of use
Technical details:
– Same engine for many data sets
– Client/server parallel
– Branch Misprediction
– SIMD
– Asynchronous work
Use: Real-time interactive data exploration: ‘fly’ over the data
(Ross et al., Columbia)
[Diagram labels: Mediator; Data Request; Unified Results; User; Web...; Graphical User Interface; Dynamic Query; Data Files, e.g., PUMS; Dynamic Query Engine]
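
As a rough illustration of keeping a multidimensional index in memory for interactive range queries, the sketch below builds one sorted index per dimension and intersects the matching row ids. It is not the Columbia engine and does not attempt the adaptive indexing, SIMD, or branch-misprediction techniques listed on the slide; the class name `InMemoryIndex`, the `query` signature, and the toy rows are assumptions.

```python
from bisect import bisect_left, bisect_right


class InMemoryIndex:
    """One sorted (value, row id) index per dimension, all held in memory."""

    def __init__(self, rows, dims):
        self.rows = rows
        self.sorted = {}
        for d in dims:
            pairs = sorted((r[d], i) for i, r in enumerate(rows))
            values = [v for v, _ in pairs]
            row_ids = [i for _, i in pairs]
            self.sorted[d] = (values, row_ids)

    def query(self, **ranges):
        """ranges maps a dimension to an inclusive (lo, hi) pair."""
        ids = None
        for d, (lo, hi) in ranges.items():
            values, row_ids = self.sorted[d]
            # Binary search finds the slice of row ids within [lo, hi].
            hit = set(row_ids[bisect_left(values, lo):bisect_right(values, hi)])
            ids = hit if ids is None else ids & hit
        if ids is None:                 # no constraints: return everything
            ids = range(len(self.rows))
        return [self.rows[i] for i in sorted(ids)]


# Toy rows; the field names and values are invented for the example.
rows = [
    {"age": 30, "income": 40},
    {"age": 45, "income": 80},
    {"age": 60, "income": 55},
]
index = InMemoryIndex(rows, dims=["age", "income"])
matches = index.query(age=(25, 50), income=(50, 100))
```

Each slider movement in a dynamic-query interface would translate into one `query` call, which is why the lookup has to avoid scanning the full data set.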

7 The Heart of EDC
[Diagram labels]
– Large ontology (SENSUS)
– Data sources
– Domain-specific ontologies (SIMS models)
– Logical mapping
– Linguistic Mapping (semi-automated)
– Concepts from glossaries (by GlossIT)
(Hovy et al., ISI)

8 SENSUS and DINO browser
http://edc.isi.edu:8011/dino
– Taxonomy, multiple superclass links
– Approx. 90,000 concepts
– Top level: Penman Upper Model (ISI)
– Body: WordNet 1.6 (Princeton), rearranged
– New information added by text mining
– Used at ISI for machine translation, text summarization, database access
(Knight et al., ISI)

9 Extracting term info from online sources
GetGloss: given a URL, find all the glossary files
ParseGloss: given a set of NL glossary definitions, extract and format the important information
GetGloss:
– Glossary identification rules consider format tags, etc.
– F-score: 0.68 (2nd after SVM at 0.92)
ParseGloss:
– Identify term, def, head noun, etc.
– Evaluation underway
(Klavans et al., Columbia)
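
The slide reports F-scores without saying which variant was used. Assuming the common balanced F-measure (F1, the harmonic mean of precision and recall), it could be computed as in the sketch below; the function name `f_score` and its `beta` parameter are illustrative, and the slide's underlying precision and recall values are not given, so none are reproduced here.

```python
def f_score(precision, recall, beta=1.0):
    """F-measure; beta=1.0 gives the balanced F1 (assumed, not confirmed)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For a glossary detector, precision is the share of files flagged as glossaries that really are glossaries, and recall is the share of real glossary files that were flagged.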

10 Term-to-ontology alignment
How to link new concepts into the Ontology (or Domain Model) in the right places?
Manual approach expensive: NxM steps
Approach: try to automatically propose links, then hand-check only the best proposals
– Created and tested various match heuristics (NAME, DEF, TAXONOMY, DISPERSAL)
– Tried various clustering methods: CLINK, SLINK, Ward’s Method…, new version of k-Means (Euclidean and spherical distance measures)
– Tested numerous parameter combinations (stemming, etc.) in EDC and NHANES domains; see http://edc.isi.edu/alignment/
Results not great
(Hovy et al., ISI)
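
The "propose links automatically, then hand-check only the best" idea can be sketched with a simple name-based match. The slide does not define the NAME heuristic, so the word-overlap (Jaccard) score used here is a stand-in, and `name_score`, `propose_links`, `top_k`, and the example terms are all hypothetical.

```python
def name_score(term, concept):
    """Jaccard overlap of lowercase word sets (stand-in for a NAME match)."""
    a, b = set(term.lower().split()), set(concept.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def propose_links(terms, concepts, top_k=2):
    """For each term, keep only the top_k scoring concepts for hand-checking."""
    proposals = {}
    for term in terms:
        # Sort by descending score, then alphabetically to break ties.
        ranked = sorted(concepts, key=lambda c: (-name_score(term, c), c))
        proposals[term] = [c for c in ranked[:top_k] if name_score(term, c) > 0]
    return proposals


# Invented example terms and concepts.
proposals = propose_links(
    ["crude oil price"],
    ["oil price", "crude oil", "natural gas", "price"],
)
```

A reviewer then inspects at most `top_k` candidates per term rather than every term-concept pair, which is the saving over the NxM manual approach.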

11 User interface testbed
Ontology entry shown in beam for selected item
– Located as near as possible
– Color coding shows parental and semantic relationships
Fisheye magnification of region of interest
– Magnified group laid out to avoid internal overlap
Menu presented as grid of alternating rows and columns
(Feiner et al., Columbia)

12 AskCal: User requests in English
ATN:
– 341 nodes
– 14 question types
Automated paraphrase to confirm
Dialogue continues via menus for detailed selection
(Philpot et al., ISI)

13 Interface/usage evaluation
Evaluation study, started late 2001
What to evaluate?
Variables
– Category display
– Magnifying columns
– Fisheye proximity & magnification
– Searchlight
– Synonyms
Methods
– Observe cognitive styles
– Examples in other domains
Research on content
– Energy vs. Census domains
Task evaluation
Process
– Task scenario
– Interview
– Observation
Goal
– User behaviors
– User intuitiveness for different groups of users
– Strengths and weaknesses of the design
Participants
– Content experts
– Government agency workers
– Faculty and students
(Sommer et al., Columbia)

14 Thank you! Please come see our demos this afternoon!

