Video Analysis in Hadoop A Case Study Alex Gorbachev & Alan Gardner San Jose, CA


@AlexGorbachev Pythian Incubator of things Database geek Cloudera Champion of Big Solutions Pythian Founder, Ottawa Drones Polyglot Hacker Part-time Data Scientist © 2013 Pythian

Datafication Era
[Figure labels: Tier 1 Data; Tier 2 Data; Tier 3 Data; Insight from Big Data; Value of Data; Impact of an incident, whether it be data loss, security, human error, etc.; Profit; Loss; LOVE YOUR DATA]

Who is Pythian?
- 15 years of data infrastructure management consulting
- 170+ top brands, databases under management
- Over 200 DBAs in 26 countries
- Top 5% of DBA workforce
- Oracle, SQL Server, MySQL, Netezza, Hadoop, MongoDB, IT Infrastructure

Agenda
- Introducing Adminiscope
- The case for Video OCR
- Video processing in Hadoop
- Architecture
- MapReduce workflow details
- Solr Integration
- Optimizing Hadoop cluster for OCR
- Beyond text recognition and video processing

Trust but Verify in the physical world

We wanted surveillance capabilities over administrative access to data infrastructure

Adminiscope architecture simplified

Trust but Verify in the digital world

Can’t we do it more efficiently and reliably in the digital age?

DEMO

Hadoop as Data Reservoir
[Diagram labels: Adminiscope; Internal Systems; Files transfer audit; Sessions metadata; Video, keystrokes; Ticketing & monitoring; Knowledge base]

What is Run-Length Encoding?
[Figure: time axis t with the terms dog, cat, elephant]
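The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming each frame has already been reduced to a set of recognized terms; the function name `rle_terms` and the `(start_frame, run_length)` tuple layout are assumptions made for the example, not the format used in Adminiscope.

```python
from collections import defaultdict

def rle_terms(frames):
    """Run-length encode term appearances.

    frames: list of sets of terms, one set per frame index.
    Returns {term: [(start_frame, run_length), ...]}.
    """
    runs = defaultdict(list)
    open_runs = {}  # term -> frame index where its current run started
    for i, terms in enumerate(frames):
        # Close the run of every term that is no longer on screen.
        for term in list(open_runs):
            if term not in terms:
                start = open_runs.pop(term)
                runs[term].append((start, i - start))
        # Open a run for every term that has just appeared.
        for term in terms:
            open_runs.setdefault(term, i)
    # Close runs that are still open at the end of the session.
    for term, start in open_runs.items():
        runs[term].append((start, len(frames) - start))
    return dict(runs)

# Hypothetical five-frame session using the terms from the slide.
frames = [{"dog"}, {"dog", "cat"}, {"cat"}, {"cat", "elephant"}, {"dog"}]
encoded = rle_terms(frames)
```

Each term is stored once per continuous appearance rather than once per frame, which is where the volume saving comes from.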

Screen text processing options
One page per frame
- Store text of each frame in a stream
- Large volume
- Contextual analysis
- Detect Personally Identifiable Information (PII)
- Detect credit card patterns
Run-Length Encoded
- Store term appearance in a stream
- Small volume
- Termed search
- Find when “DROP TABLE” was on the screen
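A termed search over run-length encoded data reduces to a lookup followed by a frame-to-time conversion. The sketch below assumes runs are stored as `(start_frame, run_length)` pairs per term and that the capture frame rate is known; the helper name `when_on_screen`, the sample runs, and the 10 fps rate are all hypothetical.

```python
def when_on_screen(runs, term, fps):
    """Convert a term's (start_frame, run_length) pairs into (start_s, end_s) intervals."""
    return [(start / fps, (start + length) / fps)
            for start, length in runs.get(term, [])]

# Hypothetical runs for one session, captured at an assumed 10 frames per second.
runs = {"DROP TABLE": [(120, 30), (600, 10)]}
intervals = when_on_screen(runs, "DROP TABLE", fps=10)
```

Here `intervals` holds the two time windows, in seconds from the start of the session, during which the term was visible.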

Ingest Architecture Now
[Diagram labels: .bmp; Encoder]
- Encoder writes directly to HDFS using libhdfs
- Custom serialization format: binary, compressed, splittable
- Chosen over Avro for simplicity on the C side
- Wrote custom InputFormat, RecordReader
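The slides do not describe the layout of the custom format. As a rough illustration of how a binary, compressed format can still be splittable, the sketch below uses the sync-marker approach familiar from Hadoop SequenceFiles and Avro container files: each compressed block is preceded by a fixed marker, so a reader handed an arbitrary byte range can skip ahead to the next marker and start there. The marker value, the 4-byte length prefix, and the function names are assumptions for the example only.

```python
import struct
import zlib

# Assumed fixed marker; real formats typically use a random per-file marker
# to make accidental collisions with payload bytes negligible.
SYNC = b"\x00SYNC-MARKER-16B"

def write_blocks(blocks):
    """Serialize payloads as: SYNC, 4-byte big-endian length, zlib-compressed body."""
    out = bytearray()
    for payload in blocks:
        body = zlib.compress(payload)
        out += SYNC + struct.pack(">I", len(body)) + body
    return bytes(out)

def read_split(data, start, end):
    """Yield payloads of every block whose sync marker begins in [start, end)."""
    pos = data.find(SYNC, start)
    while pos != -1 and pos < end:
        header = pos + len(SYNC)
        (size,) = struct.unpack(">I", data[header:header + 4])
        body_start = header + 4
        yield zlib.decompress(data[body_start:body_start + size])
        pos = data.find(SYNC, body_start + size)

# Split the file at an arbitrary byte offset, as an input split would.
blocks = [b"frame-0", b"frame-1", b"frame-2"]
data = write_blocks(blocks)
mid = len(data) // 2
first = list(read_split(data, 0, mid))
second = list(read_split(data, mid, len(data)))
```

Because a block belongs to whichever split contains its marker, every block is read exactly once no matter where the split boundary falls. A custom RecordReader would apply the same rule to its assigned split.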

Flume Ingest Architecture
[Diagram labels: Video source; Archive; .bmp; Encoder]
Support in Cloudera Search for binary files in the directory spooler and REST endpoint.

Video Processing Architecture

[Diagram labels: OCR Mapper; RLE; .bmp]

RLE and Secondary Sort
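Only the title of this slide survives in the transcript, so the following is one plausible reading rather than the presenters' implementation. In a MapReduce secondary sort, the shuffle orders records by a composite key while the reducer groups on only part of it. Applied to RLE, the mapper could emit `(term, frame)` pairs; sorting on both fields and grouping on the term hands each reducer call a term's frames in ascending order, so runs can be built in a single pass. The Python below simulates that flow in memory; all names are hypothetical.

```python
from itertools import groupby

def secondary_sort_rle(observations):
    """observations: iterable of (term, frame) pairs, in any order.

    Returns {term: [(start_frame, run_length), ...]}.
    """
    # Shuffle phase: sort on the composite key (term, frame), dropping duplicates.
    ordered = sorted(set(observations))
    result = {}
    # Grouping comparator: group on the term only, so frames arrive sorted.
    for term, group in groupby(ordered, key=lambda kv: kv[0]):
        runs = []
        for _, frame in group:
            if runs and frame == runs[-1][0] + runs[-1][1]:
                # Frame directly follows the current run: extend it.
                runs[-1] = (runs[-1][0], runs[-1][1] + 1)
            else:
                # Gap in the frame sequence: start a new run.
                runs.append((frame, 1))
        result[term] = runs
    return result

# Hypothetical mapper output, deliberately out of order.
observations = [("cat", 2), ("dog", 0), ("cat", 1), ("dog", 4), ("dog", 1), ("cat", 3)]
rle = secondary_sort_rle(observations)
```

In a real job the sort and grouping would be done by the framework through a custom partitioner, sort comparator, and grouping comparator, so the reducer never has to buffer a term's frames in memory.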

Avro Serialization
- Second MapReduce job to aggregate all terms per session
- Separate from RLE for modularity and parallelism
- Output records include a bag of words for indexing and a JSON representation for the web UI
- Avro chosen for Cloudera Search support
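The aggregation step can be sketched as a group-by on the session. The field names `session_id`, `bag_of_words`, and `json_rle` are taken from the Morphlines slide later in the deck; everything else here is an assumption, including the input tuple layout and the use of plain Python dictionaries in place of real Avro records.

```python
import json
from collections import defaultdict

def aggregate_sessions(rle_records):
    """rle_records: iterable of (session_id, term, start_frame, run_length).

    Returns one record per session with a bag of words for indexing and a
    JSON string of the term runs for display.
    """
    per_session = defaultdict(lambda: defaultdict(list))
    for session_id, term, start, length in rle_records:
        per_session[session_id][term].append([start, length])

    out = []
    for session_id in sorted(per_session):
        terms = per_session[session_id]
        out.append({
            "session_id": session_id,
            "bag_of_words": sorted(terms),  # distinct terms seen in the session
            "json_rle": json.dumps({t: sorted(terms[t]) for t in sorted(terms)}),
        })
    return out

# Hypothetical RLE output covering two sessions.
records = [("s1", "dog", 0, 2), ("s1", "cat", 1, 3), ("s2", "dog", 5, 1), ("s1", "dog", 4, 1)]
sessions = aggregate_sessions(records)
```

Keeping this as a separate job means the RLE step can be rerun or replaced without touching the indexing output schema.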

Morphlines
- Part of the Cloudera Development Kit; provides a quick way to transform data and index it in Solr
- Common ETL operations are supplied and can be extended with user-defined functions
- Can be run as MapReduce, or in a low-latency configuration consuming Flume output

Morphlines - Example

    morphlines : [
      {
        id : morphline1
        importCommands : [ "com.cloudera.**", "org.apache.solr.**" ]
        commands : [
          # Some commands go here
        ]
      }
    ]

Morphlines – Avro Commands

    readAvroContainer {
      readerSchemaFile : /path/to/json_schema.avsc
    }

    extractAvroPaths {
      flatten : false
      paths : {
        id : /session_id
        bag_of_words : /bag_of_words
        json_rle : /json_rle
      }
    }

Morphlines – Solr Commands

    sanitizeUnknownSolrFields {
      solrLocator : ${SOLR_LOCATOR}
    }

    loadSolr {
      solrLocator : ${SOLR_LOCATOR}
    }

Optimizing task trackers for OCR
Nodes running OCR don’t utilize much memory, disk, or network, so optimize:
- Move OCR to a separate, CPU-oriented Hadoop cluster or to the cloud
- Schedule OCR MR jobs using task trackers on non-data nodes
- Move OCR outside of Hadoop
- But then unable to do other types of processing that need to combine multiple data sources

Adminiscope initial use cases
- Full text search
- Automatic recognition of text patterns
  - CC#
  - SSN
  - Suspicious activity ( DROP TABLE )
- Similar video sessions
- Related tickets / knowledge base articles
- Keystroke / mouse movement analysis
- User working tired or under influence?
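Pattern recognition over OCR output could start with regular expressions plus a checksum to cut false positives. The sketch below is an illustration only: the regular expressions, labels, and function names are assumptions, not the detectors used in Adminiscope. Candidate card numbers are validated with the Luhn checksum, which real card numbers satisfy.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
# US SSN in its common written form, e.g. 123-45-6789.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DROP_RE = re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan(text):
    """Return (label, matched_text) pairs for each pattern found in text."""
    hits = []
    for m in CARD_RE.finditer(text):
        if luhn_ok(re.sub(r"\D", "", m.group())):
            hits.append(("credit_card", m.group()))
    hits += [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    hits += [("suspicious_sql", m.group()) for m in DROP_RE.finditer(text)]
    return hits

# Hypothetical OCR text; 4111 1111 1111 1111 is a widely used test card number.
hits = scan("card 4111 1111 1111 1111 ssn 123-45-6789 then DROP TABLE users")
```

OCR noise (misread digits, broken spacing) would need more tolerant matching in practice; this sketch assumes clean text.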

Beyond Adminiscope
- Online video analytics
- Security camera analytics
- Beyond text
  - Faces on the screen
  - License plates
  - Brain activity scans
- Other time series data
  - audio
  - geo-location data

Thank you – Q&A
To contact @alanctgardner