April 15, 2014 Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache Solr. Advisor: Prof. Sonia Bergamaschi Co-Advisor: Prof.


April 15, 2014 Faceted Browsing: Analysis and implementation of a Big Data Solution using Apache Solr. Advisor: Prof. Sonia Bergamaschi Co-Advisor: Prof. H.V. Jagadish Dott. Ing. Francesco Guerra Paolo Malavolta University of Modena and Reggio Emilia

Big Data visualization
"90% of the data in the world today was created in the last 2 years alone." (IBM)
"Why should we be interested in visualization? Because the human visual system is a pattern seeker of enormous power and subtlety. The eye and the visual cortex of the brain form a massively parallel processor that provides the highest-bandwidth channel into human cognitive centers. At higher levels of processing, perception and cognition are closely interrelated, which is the reason why the words “understanding” and “seeing” are synonymous." (Colin Ware, 2000)
50% of our brain is involved in visual processing.
70% of all of our sensory receptors are in our eyes.

Apache Solr
Solr (pronounced “solar”) is the popular open source enterprise search platform from the Apache Lucene project.
Its major features include:
  Full-text search
  Dynamic clustering
  Near real-time indexing
  Traditional DB and rich document (e.g., Word, PDF, JPEG) handling
  NoSQL DB support
  RESTful HTTP requests

Big Data solution
Jetty features:
  Full-featured and standards-based.
  Embeddable and asynchronous.
  Open source and commercially usable.
  Dual licensed under Apache and Eclipse.
  Flexible and extensible, enterprise scalable.
  Low maintenance cost.
  Small and efficient.
Tomcat features:
  Famous open source project under Apache.
  Easier to embed Tomcat in your applications.
  Implements Servlet 3.0, JSP 2.2 and JSP-EL 2.2.
  Strong, commercially usable and widely used.
  Easily integrated with other applications such as Spring.
  Flexible and extensible, enterprise scalable.
  Faster JSP parsing.
  Stable.

Big Data solution
Solr
Tomcat
Apache ZooKeeper: Sharding, Replication

Big Data solution over Apache Solr
Tomcat
Apache ZooKeeper: Sharding, Replication
Hadoop Distributed File System (HDFS)

Faceted Browsing
Solr working over Big Data → DONE
Big Data visualization over Solr → NOT YET! WE NEED IT

Faceted Browsing
Faceted Browsing over Solr:
  AND between facets
  OR within a facet
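The AND/OR rule can be illustrated without Solr at all. The following is a minimal sketch in plain Python, assuming documents are flat dictionaries of field values; the `matches` helper and the three sample rows are hypothetical and only show the selection semantics, not how Solr evaluates filters internally.

```python
from typing import Dict, List

Doc = Dict[str, str]
Selections = Dict[str, List[str]]


def matches(doc: Doc, selections: Selections) -> bool:
    """Return True if the document passes every facet that has a selection.

    AND between facets: every restricted facet must be satisfied.
    OR within a facet: any one of the ticked values is enough.
    """
    return all(
        doc.get(field) in values
        for field, values in selections.items()
        if values  # a facet with nothing ticked does not restrict the result
    )


# Hypothetical sample rows, loosely modelled on the mushroom fields used later.
docs = [
    {"CapColor": "brown", "Odor": "pungent"},
    {"CapColor": "gray", "Odor": "none"},
    {"CapColor": "red", "Odor": "pungent"},
]
selections = {"CapColor": ["brown", "gray"], "Odor": ["pungent"]}
hits = [d for d in docs if matches(d, selections)]
```

Only the first row survives here: the second fails the Odor facet and the third fails the CapColor facet.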

Faceted Browsing
Faceted Browsing concerns: Checkbox, Scale, Efficiency
CAPCOLOR (79)
  [x] Brown (23)
  [x] Gray (45)
  [ ] Red (11)
CAPSURFACE (18)
  [ ] Scaly (7)
  [ ] Smooth (11)
ODOR (183)
  [ ] None (34)
  [x] Pungent (80)
  [ ] Spicy (65)
  [ ] Anise (4)
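With checkboxes, the count next to an unticked value is only useful if it ignores the selection already made in that same facet; otherwise ticking Brown would drop Red to zero. The sketch below shows one way to compute such counts in plain Python. It assumes flat dictionary documents, and the `facet_counts` helper and the sample rows are hypothetical; the counts it produces are unrelated to the numbers on the slide.

```python
from collections import Counter
from typing import Dict, List

Doc = Dict[str, str]
Selections = Dict[str, List[str]]


def facet_counts(docs: List[Doc], selections: Selections, field: str) -> Counter:
    """Count values of `field` among documents that pass every OTHER facet.

    The field's own selection is deliberately left out, so unticked
    checkboxes in the same facet still show how many rows they would add.
    """
    others = {f: v for f, v in selections.items() if f != field and v}
    counts: Counter = Counter()
    for doc in docs:
        if all(doc.get(f) in v for f, v in others.items()) and field in doc:
            counts[doc[field]] += 1
    return counts


# Hypothetical sample rows.
docs = [
    {"CapColor": "brown", "Odor": "pungent"},
    {"CapColor": "gray", "Odor": "pungent"},
    {"CapColor": "red", "Odor": "pungent"},
    {"CapColor": "red", "Odor": "none"},
]
selections = {"CapColor": ["brown", "gray"], "Odor": ["pungent"]}
color_counts = facet_counts(docs, selections, "CapColor")
odor_counts = facet_counts(docs, selections, "Odor")
```

In this toy data Red still shows a count of 1 under CapColor even though it is not ticked, because only the Odor filter is applied when counting colours.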

Faceted Browsing
How to make Faceted Browsing over Solr:
  Analysis and learning of a multi-faceted query
  Modified Solr front-end
  Added JavaScript
http://localhost:8081/solr/collection1/browse?
  &q=&fq={!tag=dt}CapColor:brown%22OR%22{!tag=dt}CapColor:gray
  &q=&fq={!tag=dt}Odor:pungent
  &facet=on
  &facet.field={!ex=dt}CapColor
  &facet.field={!ex=dt}CapSurface
  &facet.field={!ex=dt}Odor
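A request of this shape can be assembled programmatically. The sketch below builds the query string with Python's standard `urllib.parse.urlencode`, using Solr's `{!tag=...}` and `{!ex=...}` local parameters for tagging and excluding filters. Two points are assumptions rather than the slide's method: the `build_browse_url` helper is hypothetical, and it uses one tag per field (named after the field) where the slide uses a single tag `dt` for every filter. Values containing spaces or special characters would need quoting, which is not handled here.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_browse_url(
    base_url: str,
    selections: Dict[str, List[str]],
    facet_fields: List[str],
) -> str:
    """Build a multi-select faceting request URL (sketch, one tag per field)."""
    # q is left empty to mirror the slide's request.
    params = [("q", ""), ("facet", "on")]
    for field, values in selections.items():
        if values:
            clause = " OR ".join(values)  # OR within a facet
            # Separate fq parameters are intersected by Solr: AND between facets.
            params.append(("fq", f"{{!tag={field}}}{field}:({clause})"))
    for field in facet_fields:
        # Exclude the field's own filter when counting its facet values.
        params.append(("facet.field", f"{{!ex={field}}}{field}"))
    return base_url + "?" + urlencode(params)


url = build_browse_url(
    "http://localhost:8081/solr/collection1/browse",
    {"CapColor": ["brown", "gray"], "Odor": ["pungent"]},
    ["CapColor", "CapSurface", "Odor"],
)
```

Tagging per field means each facet ignores only its own filter when counting, so the CapColor counts still respect the Odor selection and vice versa.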

Bayesian Networks
Issues:
  Huge volume of facets
  User search experience
How:
  Bayesian networks
  OpenMarkov
  Open source
  GUI available
  API available

Bayesian Networks
How to:
  Added JavaScript
  Added Servlet
  Modified OpenMarkov API

Bayesian Networks
Aware User Search
Dynamic Summary

Keywords Search
Keywords Search over Relational Database:
  Very popular method
  Allows non-expert users to formulate queries
  No need to know how the data is represented inside the DB
Current techniques:
  A-priori instance analysis
  Construct an index
  Retrieve information from the index
Experimented approach:
  Learn a Bayesian Network from the DB
  Nodes represent columns
  Retrieve information from the BN

Keywords Search
How it works:
  Learning the Bayesian Network: automatic Hill Climbing algorithm
  Compute probabilities: Variable Elimination algorithm
  Print results: Quick Sort algorithm
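The slides name the algorithms but do not show them, so the following is only an illustrative sketch of the "compute probabilities" and "print results" steps on a toy two-node network (CapColor → Odor). It is independent of OpenMarkov. The factor representation, the helper names and all probability values are hypothetical; structure learning by hill climbing and evidence handling are left out, and ranking uses Python's built-in `sorted` rather than a hand-written Quick Sort.

```python
from itertools import product


def multiply(f1, f2, domains):
    """Pointwise product of two factors; a factor is (variables, table)."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = tuple(dict.fromkeys(vars1 + vars2))  # ordered union
    table = {}
    for assignment in product(*(domains[v] for v in out_vars)):
        row = dict(zip(out_vars, assignment))
        k1 = tuple(row[v] for v in vars1)
        k2 = tuple(row[v] for v in vars2)
        table[assignment] = t1[k1] * t2[k2]
    return out_vars, table


def sum_out(factor, var):
    """Marginalise `var` out of a factor."""
    vars_, table = factor
    idx = vars_.index(var)
    out = {}
    for assignment, p in table.items():
        key = assignment[:idx] + assignment[idx + 1:]
        out[key] = out.get(key, 0.0) + p
    return vars_[:idx] + vars_[idx + 1:], out


def eliminate(factors, order, domains):
    """Variable elimination: multiply the factors touching each variable, then sum it out."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        rest = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f, domains)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result


# Hypothetical network: nodes stand for table columns, CapColor -> Odor.
domains = {"CapColor": ["brown", "gray"], "Odor": ["none", "pungent"]}
p_color = (("CapColor",), {("brown",): 0.6, ("gray",): 0.4})
p_odor_given_color = (
    ("CapColor", "Odor"),
    {
        ("brown", "none"): 0.5, ("brown", "pungent"): 0.5,
        ("gray", "none"): 0.25, ("gray", "pungent"): 0.75,
    },
)

# P(Odor), obtained by eliminating CapColor.
_, marginal = eliminate([p_color, p_odor_given_color], ["CapColor"], domains)

# Rank the answers by probability, highest first.
ranked = sorted(marginal.items(), key=lambda kv: kv[1], reverse=True)
```

In a keyword-search setting the ranked list would correspond to candidate column values ordered by how probable they are under the learned network.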

Keywords Search
Test:
  Portion of a real DB
  2-way path
  Full outer join: first 1000 rows
  50 runs
Execution time: ~3.0 sec.

Conclusions and Future work
I have provided over Apache Solr:
  Big Data solution
  Faceted browsing
  Integrated and ready-to-use solution
  Aware user search using Bayesian Networks
  Dynamic summary using Bayesian Networks
I have provided over Keywords Search:
  A different approach for keyword search over a DB
  Software that allows users to formulate queries
Future work:
  For Solr I can: use a Machine Learning algorithm to automate the learning of the facets
  For Keywords Search I can: learn the BN from more database rows for better results

Thank you for your time.