807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.

Presentation transcript:

Clustering
Many packages
– CLUTO
– Weka
– MALLET
MAHOUT
– Supported by the Apache Software Foundation
– Industrial strength (builds on top of Hadoop)
– Includes libraries for reading input in different formats, including Weka .arff files and Lucene index files
– We’ll use SOLR to produce Lucene index files

This Lab
– Clustering with Mahout
– Clustering with indices produced using Lucene: brief review of SOLR

MAHOUT
A machine learning framework
Built to be usable on top of Hadoop – scalability
What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering: Canopy / K-Means / Fuzzy K-Means / Mean-shift / Dirichlet
– Classifiers: Naïve Bayes, Complementary NB
– Evolutionary: integration with Watchmaker for fitness function

Basic format
bin/mahout
bin/mahout kmeans
bin/mahout seqdirectory

INPUT FORMAT
IMF p. 155: ‘for clustering Mahout relies on data in org.apache.mahout.matrix.Vector format’
– Vector = a tuple of floats
– SparseVector vs DenseVector
Several libraries for creating Vectors from other formats
– Weka
– Apache Lucene
– programmatic
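The sparse versus dense distinction above can be sketched in a few lines of Python. This is only an illustration of the idea, not Mahout's Java Vector classes; the function names and the toy document vector are assumptions made for the example.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the non-zero entries, keyed by their position."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Expand a sparse mapping back into a full-length list of floats."""
    return [sparse.get(i, 0.0) for i in range(size)]


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product that only visits the entries stored in the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


# Hypothetical document vector: most term weights are zero, as is typical for text.
dense_doc = [0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
sparse_doc = to_sparse(dense_doc)
print(sparse_doc)
```

For text, where each document uses a small fraction of the vocabulary, the sparse form stores far fewer numbers than the dense one.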

K-MEANS CLUSTERING
The Federalist Papers example
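As a reminder of what k-means does, here is a minimal pure-Python sketch of the assign/update loop. It is not Mahout's implementation (which runs on Hadoop); the toy points, the choice of initial centroids, and the use of squared Euclidean distance are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points: List[Point], centroids: List[Point],
           max_iter: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Run k-means from the given initial centroids until assignments stop changing."""
    centroids = [list(c) for c in centroids]
    assignments: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[k] = mean(members)
    return centroids, assignments


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, labels = kmeans(points, [points[0], points[2]])
print(final_centroids, labels)
```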

CONVERSION The Reuters example

For more sophisticated indexing …
… can use SOLR for preprocessing; Mahout knows how to read in Lucene-style indices

What is Solr?
– Solr is an open source enterprise search server based on the Lucene Java search library.
– Solr runs in a Java servlet container such as Tomcat or Jetty.
– Solr is free software and a project of the Apache Software Foundation.
– Solr is a sub-project of Lucene and can be found at
By Mick England

Key Features
– Advanced Full-Text search
– Optimized for High Volume Web Traffic
– Standards Based Open Interfaces – XML and HTTP
– Comprehensive HTML Administration Interface
– Server statistics exposed over JMX for monitoring
– Scalability through efficient replication
– Flexibility with XML configuration and Plugins
– Push vs Crawl indexing method

Solr Clients
Solr can be integrated with, among others…
– Ruby
– PHP
– Java
– Python
– JSON
– Forrest/Cocoon
– C# or Deveel Solr Client or solrnet
– Coldfusion
– Drupal or apacheSolr project for Drupal

Why SOLR?
It can be used to preprocess documents and produce an index for them, which can then be used as the document representation for clustering
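To make "index as representation" concrete, the sketch below turns a toy inverted index into one sparse term-frequency vector per document, together with a term dictionary. It is a conceptual illustration in plain Python, not the Lucene index format or Mahout's reader; the index contents and document ids are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical inverted index: term -> {document id -> term frequency}.
inverted_index: Dict[str, Dict[str, int]] = {
    "liberty": {"doc1": 2, "doc2": 1},
    "senate": {"doc2": 3},
    "union": {"doc1": 1, "doc3": 4},
}


def index_to_vectors(
    index: Dict[str, Dict[str, int]]
) -> Tuple[Dict[str, int], Dict[str, Dict[int, float]]]:
    """Return (dictionary, vectors): term -> column, and doc id -> sparse vector."""
    # Give every term a fixed column number, in sorted order.
    dictionary = {term: col for col, term in enumerate(sorted(index))}
    vectors: Dict[str, Dict[int, float]] = {}
    for term, postings in index.items():
        col = dictionary[term]
        for doc_id, tf in postings.items():
            # Each posting becomes one non-zero entry in that document's vector.
            vectors.setdefault(doc_id, {})[col] = float(tf)
    return dictionary, vectors


dictionary, vectors = index_to_vectors(inverted_index)
print(dictionary)
print(vectors)
```

The resulting vectors are exactly the kind of sparse float tuples that a clustering algorithm takes as input; a real pipeline would usually apply a weighting such as TF-IDF rather than raw counts.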

Indexing
Push vs Crawl
Schema.xml
Add documents
HTML interface
– Update
– Delete
– Commit
DataImportHandler
– For searching databases
By Mick England

SOLR: what you should do
(Installing SOLR on your laptop: see Section 0 of Lab script)
– Posting docs to SOLR
– Searching
– Getting the indexed docs
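For the searching step, Solr is queried over HTTP, so a search is just a URL with query parameters. The sketch below only builds such a URL and does not contact a server. The host, port, and path prefix are assumptions about a local install (the Lab script is the authority for your setup), and the field names `text` and `title` are hypothetical and depend on your schema.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, query, rows=10, fields=("id",)):
    """Build a Solr select URL: q = query, rows = max hits, fl = fields to return."""
    params = {
        "q": query,
        "rows": rows,
        "fl": ",".join(fields),
        "wt": "json",  # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL of a local Solr instance; adjust to your installation.
url = build_select_url("http://localhost:8983/solr", "text:liberty",
                       rows=5, fields=("id", "title"))
print(url)

# Decode the query string again to check what will be sent.
sent = parse_qs(urlsplit(url).query)
print(sent)
```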

Posting documents to SOLR
SOLR documents
– fields
schema.xml

SOLR Documents: fields
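A Solr document is a set of named fields, and documents are posted inside an XML `<add>` message. The sketch below builds such a message with Python's standard library; it does not send anything. The field names `id` and `title` and their values are hypothetical, and every field you use must be declared in your schema.xml.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build a Solr XML <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            # One <field name="..."> element per field of the document.
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document with two fields.
message = build_add_message([{"id": "fed-10", "title": "Federalist No. 10"}])
print(message)
```

The resulting string is what gets posted to Solr's update handler; a commit is still needed afterwards before the document becomes searchable.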

Importing Lucene indices into MAHOUT
Use the lucene.vector option