Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation

Presentation transcript:

HPEC10-0 DMS 1/15/2015 MIT Lincoln Laboratory
Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation
Delsey Sherrill, Jonathan Kurz, Craig McNally, and Will Smith {dsherrill, jonkurz, cmcnally,
High Performance Embedded Computing Workshop, September 2010

Attempted Terrorist Attack 12/25/09

- 11 November / UK cable to US: “pledge to jihad”, “Umar Farouk”
- 19 November / CIA: UFA’s father: “son in Yemen”, “extreme religious views” (U.S. Embassy, Nigeria)
- 25 December / DHS: cash ticket, no luggage checked (NWA flight 253, Amsterdam to Detroit)
- August / US Intel: “meeting to plan operation”, “Nigerian” (Al Qaida of the Arabian Peninsula / Yemen)
- People: Anwar al-Awlaki, Umar Farouk Abdulmutallab

Key breakdowns:
- Dissemination and access
- Name ambiguity
- Structured/unstructured data correlation

Challenges

Dissemination and access
- “Silos of excellence”
- Coarse-grained classification (default to “system high”)
- Varying levels of clearance among DoD, IC, Coalition partners

Name ambiguity
- Aliases, common names
- Spelling variation (foreign names, typos)
- Partial name references
- Lack of structured data context

Structured / unstructured data correlation
- Data volumes overwhelm capacity for human review
  - Structured: 10^2 passengers x 10^4 daily flights into US = 10^6 reservations / day
  - Unstructured: 10^4 new reports per day; years of archives
- Variations in dates, times, locations, etc. expressed in free text
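The name-ambiguity items above (spelling variation, partial name references) can be illustrated with a small sketch. It scores string similarity with Python's standard `difflib`; the helper names, the 0.8 threshold, and the misspelled sample query are illustrative assumptions, not the matching method used by SKS.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_partial_reference(partial: str, full: str) -> bool:
    """True if every token of `partial` also appears as a token of `full`."""
    tokens = set(normalize(partial).split())
    return bool(tokens) and tokens <= set(normalize(full).split())


def candidate_matches(query, known_names, threshold=0.8):
    """Return (name, score) pairs that are close spellings or partial references.

    The threshold is an arbitrary illustrative value (assumption).
    """
    matches = []
    for name in known_names:
        score = similarity(query, name)
        if score >= threshold or is_partial_reference(query, name):
            matches.append((name, round(score, 2)))
    return sorted(matches, key=lambda pair: -pair[1])


# Names taken from the slides; the queries below are illustrative.
KNOWN = ["Umar Farouk Abdulmutallab", "Anwar al-Awlaki"]
partial_hits = candidate_matches("Umar Farouk", KNOWN)
variant_hits = candidate_matches("Omar Faruk Abdulmutalab", KNOWN)
```

Both the partial reference and the variant spelling surface the same full name, which an exact keyword match would miss.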

Outline

- Introduction
- Structured Knowledge Space (SKS) Overview
- SKS-on-Cloud Integration
- SKS-on-Cloud Benchmarking
- Future Work & Summary

Structured Knowledge Space (SKS)

SKS can address key intelligence challenges by enriching unstructured documents and supporting discovery over the network to users at multiple classification levels.

Components: Document Ingest, Secure Data Store, Search Engine, Web-based Search, Real-time Alerts

Challenges and SKS capabilities:
- Dissemination and sharing: secure multilevel access, web search
- Name ambiguity: named entity recognition, query expansion
- Structured/unstructured data correlation: geo/time extraction, alerting

Keyword searches are limited to exact or near matches, precluding fundamental document discovery use cases.

Target document (indexed text, from the document collection):
Target Folder 1A
Target is moving from Leander 15RWQ1545 back into JOA Bear on the evening of 17 April 2006 will OBJ1A between 2000 and 2200 hours.
Significance: AQI Leader
Sources of INTEL: OGI
Description Extremist Operatives Rabah Muhtadi Alrhu Oldegi Umar Nawaf

Document discovery use cases (keyword search over indexed text):
- Search for “AQI Leader”
- Search at “15RWQ1545”
- Search on “17 April 2006”
- Search for “Umar Nawaf”

Avoiding Keyword Search Pitfalls (indexed text and extracted entities):
- Find people associated with AQI in Apr 2006 near 15RVQ9050 (PEOPLE, RELATIONSHIPS)
- Search within 30km of 15RVQ9050 (GEOSPATIAL COORDINATE)
- Search for “Al Qaeda in Iraq” (ORGANIZATION)
- Search between 4/12/06 - 4/18/06 (DATE)

Entity extraction enables geospatial, temporal, and entity category searches for documents.
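The contrast between plain keyword search and search over extracted entities can be sketched as follows. The `Document` structure and the entity values are hypothetical: they assume an upstream extractor has already normalized "17 April 2006" to a date and "AQI" to the organization name, which is not something the slides specify in detail.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Document:
    doc_id: str
    text: str
    dates: list = field(default_factory=list)        # extracted DATE entities
    organizations: set = field(default_factory=set)  # extracted ORGANIZATION entities


def keyword_search(docs, phrase):
    """Exact (case-insensitive) phrase match against the raw text."""
    return [d.doc_id for d in docs if phrase.lower() in d.text.lower()]


def date_range_search(docs, start, end):
    """Match documents with at least one extracted date inside [start, end]."""
    return [d.doc_id for d in docs if any(start <= x <= end for x in d.dates)]


def organization_search(docs, organization):
    """Match documents whose extracted organizations include the given name."""
    return [d.doc_id for d in docs if organization in d.organizations]


# Sample text is taken from the slide; the entity fields are assumed extractor output.
docs = [
    Document(
        doc_id="target-folder-1a",
        text="Target is moving from Leander 15RWQ1545 back into JOA Bear on the "
             "evening of 17 April 2006. Significance: AQI Leader",
        dates=[date(2006, 4, 17)],
        organizations={"Al Qaeda in Iraq"},
    )
]

by_keyword = keyword_search(docs, "Al Qaeda in Iraq")
by_org = organization_search(docs, "Al Qaeda in Iraq")
by_date = date_range_search(docs, date(2006, 4, 12), date(2006, 4, 18))
```

The keyword query for the full organization name finds nothing because the text only says "AQI", while the entity and date-range queries both return the target document.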

Web-Based Search Capabilities

- Query by keyword, phrase, fuzzy match, wildcard, geo, date, source, format, and Arabic name variant
- “Facets” reveal the top 20 people, organizations, etc. within documents matching search
- Search hits sorted by relevance with highlighted snippets, attributes, and download links
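A minimal sketch of how "top 20" facets could be computed over the documents matching a search. The dictionary layout of a hit and the choice to count each entity once per document are assumptions made for illustration; the slides do not say how SKS counts.

```python
from collections import Counter


def compute_facets(matching_docs, category, top_n=20):
    """Return the top_n (entity, document_count) pairs for one entity category."""
    counts = Counter()
    for doc in matching_docs:
        # Count each entity once per document, so values are document counts.
        counts.update(set(doc.get(category, [])))
    return counts.most_common(top_n)


# Hypothetical search hits with already-extracted entities.
hits = [
    {"people": ["Umar Nawaf", "Rabah Muhtadi"], "organizations": ["Al Qaeda in Iraq"]},
    {"people": ["Umar Nawaf", "Umar Nawaf"]},
    {"organizations": ["Al Qaeda in Iraq"]},
]

people_facets = compute_facets(hits, "people")
top_person = compute_facets(hits, "people", top_n=1)
```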

To Cloud or Not to Cloud?

Dimensions compared: standards, development, performance

Traditional
- Scale up: costly high-end HW, proprietary RDBMS*
- Centralized (move data to computation nodes)
- Relational store: defined in advance, natural data representation
- Low-level data integrity guaranteed by database
- Structured Query Language (SQL): standard, cross-platform
- Well-established technology, large pool of expertise

Cloud
- Scale out: commodity hardware, FOSS**/GOTS
- Decentralized (move computation to data nodes)
- Key-value store: free-form, add columns on the fly, app-dependent model
- Data integrity left to application logic
- Non-standard APIs: every cloud for itself
- Still novel technology; specialized skill set

* Relational Database Management System
** Free Open Source Software
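Two of the cloud-side points (columns added on the fly, integrity left to application logic) can be shown with a toy in-memory store. `SparseRowStore` is a hypothetical stand-in written for this sketch, not the API of any particular cloud store.

```python
class SparseRowStore:
    """Toy in-memory stand-in for a key-value / column-family store (assumption)."""

    def __init__(self, required_columns=()):
        self._rows = {}
        # The store itself enforces nothing; this check is application logic.
        self._required = set(required_columns)

    def put(self, row_key, **columns):
        """Merge new columns into a row; any column name is accepted."""
        merged = {**self._rows.get(row_key, {}), **columns}
        missing = self._required - merged.keys()
        if missing:
            raise ValueError(f"row {row_key!r} is missing columns: {sorted(missing)}")
        self._rows[row_key] = merged

    def get(self, row_key):
        """Return a copy of the row, or an empty dict if it does not exist."""
        return dict(self._rows.get(row_key, {}))


store = SparseRowStore(required_columns=["source"])
store.put("doc1", source="report", title="Target Folder 1A")
# A new column appears without any schema change.
store.put("doc1", mgrs="15RWQ1545")
```

In a relational store the missing-column check would typically be a `NOT NULL` constraint enforced by the database; here it lives in the application.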

Integration Plan

Diagram components:
- Existing path: Documents, Parsers & Processors, Secure Relational Store, Search Engine, Services & Interfaces, Users
- Side-car: Secure Cloud Store, Distributed Search Engine

Side-car approach mitigates risk of exploring new technologies; proven critical path remains intact.

Search Components: “SKS Classic”

Diagram components:
- Analysts; Web Search Interface (example query: “Mullah Omar”)
- Lucene 2.4 Multi-Searcher over local indexes
- Oracle 10g DDM*
- Local File System
- Facet Computation, Facet Retrieval, Facet Results
- Metadata Retrieval, Text Content Retrieval, Results Formatter, Search Results
- PL-3 Accredited System

* Dimensional Data Model
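The idea of one searcher fanning out over several local indexes and returning a single ranked list can be sketched as a merge of per-index hit lists. This is a simplified illustration, not Lucene's implementation: it assumes each index returns `(score, doc_id)` pairs already sorted by descending score, and that scores are comparable across indexes.

```python
import heapq
from itertools import islice


def merge_hits(per_index_hits, top_k=10):
    """Merge per-index hit lists (each sorted by descending score) into one top-k list."""
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*per_index_hits, key=lambda hit: -hit[0])
    return list(islice(merged, top_k))


# Hypothetical hits from two local indexes.
index_a = [(0.9, "a1"), (0.4, "a2")]
index_b = [(0.7, "b1"), (0.1, "b2")]

top_three = merge_hits([index_a, index_b], top_k=3)
```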

Search Components: SKS-on-Cloud

Diagram components:
- Analysts; Web Search Interface (example query: “Mullah Omar”)
- Solr RESTful Search API
- Solr nodes, each with Lucene indexes (L)
- “Bigtable-like” Store
- Facet Computation, Facet Retrieval, Facet Results
- Metadata & Text Content Retrieval, Results Formatter, Search Results
- PL-3 Accredited System (in process)
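As an illustration of what a request to a Solr HTTP search endpoint might look like, the sketch below builds a query URL with faceting enabled. The parameters `q`, `rows`, `wt`, `facet`, `facet.field`, and `facet.limit` are standard Solr request parameters; the host name and the field names `person` and `organization` are hypothetical, since the slides do not describe the SKS schema. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query(base_url, text, facet_fields, rows=10, facet_limit=20):
    """Build a Solr select URL that returns hits plus facet counts as JSON."""
    params = [
        ("q", text),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.limit", facet_limit),
    ]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", name) for name in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical endpoint and field names (assumptions).
url = build_solr_query(
    "http://solr.example.org:8983/solr",
    '"Mullah Omar"',
    ["person", "organization"],
)
decoded = parse_qs(urlsplit(url).query)
```

The `facet.limit` of 20 mirrors the "top 20" facets described for the web search interface.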

Cloud Hardware: MIT LL Compute Clusters

Diagram components: LAN Switch, Network Storage, Resource Manager, Configuration Server, Compute Nodes, Service Nodes, Cluster Switch, To Lincoln LAN

Cluster(s): TX-2500 | TX-3D | TX-X
Classification: Unclassified | Classified | External
Compute Nodes:
Processors:
Total RAM: 4,056 GB | 1,800 GB | 960 GB
Central Storage: 36.0 TB | 4.3 TB
Total Local Disk Space: 817.6 TB | 90.0 TB | 40.3 TB

MIT-LL owns and operates multiple state-of-the-art computing clusters for information technology and application development research.

Benchmarking Method

Diagram components:
- JMeter (request bot) issuing queries (e.g. “Mullah Omar”*)
- Web Search Interface
- Lucene Multi-Searcher over Lucene indexes
- Oracle; File System
- RLS Secure Access (Accredited)
- Facet Computation, Facet Retrieval, Facet Results
- Metadata Retrieval, Text Content Retrieval, Results Formatter, Search Results
- Timestamps recorded: t0, t1, t2, t3

* Repeat for 200 different keywords

SKS-Classic Benchmarking Results

All three subcomponents contribute significantly to total timing, so all are worthwhile scaling targets.

[Plot: timing intervals max(t2,t3)-t0, t1-t0, t2-t1, and t3-t1]
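The four plotted intervals can be derived from the recorded timestamps as sketched below. The transcript does not preserve which pipeline stage each timestamp marks, so the dictionary keys are simply the expressions from the slide; the sample timestamps are made-up values for illustration, not measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Timestamps:
    """Timestamps (in seconds) recorded for one benchmark request."""
    t0: float
    t1: float
    t2: float
    t3: float


def intervals(ts):
    """Compute the four intervals named on the slide for one request."""
    return {
        "max(t2,t3)-t0": max(ts.t2, ts.t3) - ts.t0,  # end-to-end time
        "t1-t0": ts.t1 - ts.t0,
        "t2-t1": ts.t2 - ts.t1,
        "t3-t1": ts.t3 - ts.t1,
    }


def mean_intervals(samples):
    """Average each interval over many requests (e.g. one per keyword)."""
    rows = [intervals(s) for s in samples]
    return {name: mean(row[name] for row in rows) for name in rows[0]}


# Made-up timestamps for two requests (assumption, not benchmark data).
samples = [Timestamps(0.0, 0.5, 1.25, 1.0), Timestamps(0.0, 0.5, 0.75, 1.5)]
averages = mean_intervals(samples)
```

Taking the maximum of t2 and t3 for the total implies the two later stages finish independently after t1, so the slower one determines the end-to-end time.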

NOTIONAL Comparison Results

[Plot: search time versus number of documents loaded, with curves for SKS-Classic and SKS-on-Cloud. Annotations: max acceptable search time (5 sec); crossover point; 10M??]

Goal: sufficient samples at escalating loads to estimate crossover point (if exists) and extrapolate to billion-documents regime
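One simple way to estimate a crossover point from timing samples is to fit a trend line per system and intersect the two lines. The sketch below assumes search time grows linearly with document count, which the notional plot does not guarantee, and every number in it is a PLACEHOLDER invented for the example, not a benchmark result.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossover(line_a, line_b):
    """Positive x where the two fitted lines intersect, or None if there is none."""
    (a_intercept, a_slope), (b_intercept, b_slope) = line_a, line_b
    if a_slope == b_slope:
        return None  # parallel trends never cross
    x = (b_intercept - a_intercept) / (a_slope - b_slope)
    return x if x > 0 else None


# PLACEHOLDER samples: documents loaded (millions) and search time (seconds).
docs_millions = [1, 2, 3]
classic_seconds = [1.0, 1.5, 2.0]   # invented values
cloud_seconds = [2.0, 2.25, 2.5]    # invented values

classic_line = fit_line(docs_millions, classic_seconds)
cloud_line = fit_line(docs_millions, cloud_seconds)
crossover_millions = crossover(classic_line, cloud_line)
```

With real measurements, more samples and a check of the linearity assumption would be needed before extrapolating to the billion-document regime.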

What Might Have Been

- 11 November / UK cable to US: “pledge to jihad”, “Umar Farouk”
- 19 November / CIA: UFA’s father: “son in Yemen”, “extreme religious views” (U.S. Embassy, Nigeria)
- 25 December / DHS: cash ticket, no luggage checked (NWA flight 253, Amsterdam to Detroit)
- August / US Intel intercept: “meeting to plan operation”, “Nigerian” (Al Qaida of the Arabian Peninsula / Yemen)
- People: Anwar al-Awlaki, Umar Farouk Abdulmutallab

With SKS:
- Analyst searching for “Umar Farouk Abdulmutallab” finds connections to Awlaki, Nigerian, planned operation
- Father’s warnings plus other derogatory evidence enough to take preventive action (revoke visa, no-fly list)
- Correlation engine alerts authorities that person of interest has suspicious reservation and is about to board plane bound for US

Future Work

- Develop Analytics Engine to leverage cloud processing capabilities
  - Correlating structured with unstructured data (e.g. Entity Track Analysis)
  - Clustering of entity mentions within documents to improve name disambiguation
- Operationalize SKS-on-Cloud system
- Complete comparative search benchmarks to at least 10 million documents
- Scale to 1 billion, 10 billion, …

Summary

- MIT LL has developed the Structured Knowledge Space system to extract entities and relationships from weakly structured intelligence reporting formats
  - Web services and browser-based user interfaces support discovery and access over the network
- To explore the feasibility and desirability of migrating the full SKS application suite to a cloud-based distributed storage & processing platform, we integrated cloud storage as a data storage sidecar on the existing system
- Early benchmarks indicate that the existing system performs adequately up to 3M documents (< 2 sec for simple searches), but timings show an upward trend
  - Too early to predict cloud-based system performance; however, theoretical benchmarks are promising

Acknowledgements

- Gary Condon
- Jason Hepp
- Jeremy Kepner
- Ben Landon
- Bob Piotti
- Chuck Yee
- The LLGrid team
- The SKS-RTRG development team

Contact: Delsey Sherrill, Jonathan Kurz, Craig McNally, and Will Smith {dsherrill, jonkurz, cmcnally,