Video Analysis in Hadoop: A Case Study
Alex Gorbachev & Alan Gardner
San Jose, CA
@AlexGorbachev Pythian Incubator of things Database geek Cloudera Champion of Big Solutions Pythian Founder, Ottawa Drones Polyglot Hacker Part-time Data Scientist © 2013 Pythian
Datafication Era
[Figure: Tier 1, Tier 2, and Tier 3 data. Labels: Value of Data; Insight from Big Data; Profit; Loss; Impact of an incident, whether it be data loss, security, human error, etc.]
LOVE YOUR DATA
Who is Pythian?
15 Years of Data infrastructure management consulting
170+ Top brands databases under management
Over 200 DBAs in 26 countries
Top 5% of DBA work force
Oracle, SQL Server, MySQL, Netezza, Hadoop, MongoDB, IT Infrastructure
Agenda
Introducing Adminiscope
The case for Video OCR
Video processing in Hadoop
  Architecture
  MapReduce workflow details
  Solr Integration
Optimizing Hadoop cluster for OCR
Beyond text recognition and video processing
Trust but Verify in the physical world
We wanted surveillance capabilities over administrative access to data infrastructure
Adminiscope architecture, simplified
Trust but Verify in the digital world
Can’t we do it more efficiently and reliably in the digital age?
DEMO
Hadoop as Data Reservoir
[Diagram. Sources: Adminiscope; Internal Systems. Data: file transfer audit; session metadata; video, keystrokes; ticketing & monitoring; knowledge base]
What is Run-Length Encoding?
[Figure: appearance of the terms "dog", "cat", and "elephant" along a time axis t]
Screen text processing options
One page per frame
  Store text of each frame in a stream
  Large volume
  Contextual analysis
    Detect Personally Identifiable Information (PII)
    Detect credit card patterns
Run-Length Encoded
  Store term appearance in a stream
  Small volume
  Term search
    Find when “DROP TABLE” was on the screen
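To make the run-length encoded option concrete, here is a minimal Python sketch. It assumes each frame has already been OCR'd into a set of terms. The function name `rle_terms` and the `(first_frame, last_frame)` run layout are illustrative assumptions, not the format Adminiscope actually uses.

```python
from collections import defaultdict

def rle_terms(frames):
    """Collapse per-frame term sets into runs of consecutive frames.

    frames: list of sets of terms, index = frame number (hypothetical input).
    Returns {term: [(first_frame, last_frame), ...]}.
    """
    runs = defaultdict(list)
    open_runs = {}  # term -> frame where its current run started
    for i, terms in enumerate(frames):
        # Close runs for terms that disappeared in this frame.
        for term in list(open_runs):
            if term not in terms:
                runs[term].append((open_runs.pop(term), i - 1))
        # Open runs for terms that just appeared.
        for term in terms:
            open_runs.setdefault(term, i)
    # Close whatever is still on screen at the end of the stream.
    last = len(frames) - 1
    for term, start in open_runs.items():
        runs[term].append((start, last))
    return dict(runs)

frames = [{"dog"}, {"dog", "cat"}, {"cat"}, {"cat", "elephant"}, {"dog"}]
encoded = rle_terms(frames)
```

With this layout, a question such as "when was DROP TABLE on the screen" becomes a lookup of one term's runs rather than a scan over the text of every frame.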
Ingest Architecture Now
[Diagram: .bmp, Encoder]
Encoder writes directly to HDFS using libhdfs
Custom serialization format
  Binary, compressed, splittable
  Chosen over Avro for simplicity on the C side
  Wrote custom InputFormat, RecordReader
Flume Ingest Architecture
[Diagram: Video source, Archive, .bmp, Encoder]
Support in Cloudera Search for binary files in the directory spooler and REST endpoint.
Video Processing Architecture
[Diagram: .bmp, OCR Mapper, RLE]
RLE and Secondary Sort
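The slide names the technique without detail, so the following is only a sketch of how secondary sort could feed run-length encoding, written in plain Python rather than as a MapReduce job. It assumes the OCR mapper emits a composite key of (session, term, frame number): records are grouped on the natural key (session, term) and sorted on the frame number, so the reducer sees frames in order and can build runs in one pass. All function and field names here are hypothetical.

```python
from itertools import groupby

def map_frame(session_id, frame_no, terms):
    """Mapper stand-in: emit one composite key per term seen in a frame."""
    for term in terms:
        yield (session_id, term, frame_no)

def reduce_runs(sorted_frames):
    """Reducer stand-in: merge ascending frame numbers into (start, end) runs."""
    runs = []
    for f in sorted_frames:
        if runs and f <= runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], f)   # adjacent or repeated frame: extend
        else:
            runs.append((f, f))           # gap: start a new run
    return runs

def run_job(ocr_output):
    """ocr_output: iterable of (session_id, frame_no, terms)."""
    keys = [k for rec in ocr_output for k in map_frame(*rec)]
    # Sorting the full composite key plays the role of the shuffle:
    # the natural key is (session, term), the secondary key is frame_no.
    keys.sort()
    result = {}
    for natural_key, group in groupby(keys, key=lambda k: k[:2]):
        result[natural_key] = reduce_runs(k[2] for k in group)
    return result

ocr_output = [
    ("s1", 2, ["DROP", "TABLE"]),
    ("s1", 0, ["SELECT"]),
    ("s1", 1, ["SELECT", "DROP"]),
    ("s1", 5, ["SELECT"]),
]
runs = run_job(ocr_output)
```

In Hadoop the same split is usually expressed with a partitioner and grouping comparator on the natural key and a sort comparator on the full key. The benefit is that the reducer never has to buffer and sort a term's frames itself.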
Avro Serialization
Second MapReduce job to aggregate all terms per session
Separate from RLE for modularity and parallelism
Output records include a bag of words for indexing and a JSON representation for the web UI
Avro chosen for Cloudera Search support
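As a rough illustration of what one output record of this second job might hold, the sketch below builds a plain Python dict instead of an Avro record. The field names `session_id`, `bag_of_words`, and `json_rle` are taken from the Morphlines slides in this deck. The JSON layout and the function name `build_session_record` are assumptions.

```python
import json

def build_session_record(session_id, term_runs):
    """Combine all RLE output for one session into a single index record.

    term_runs: {term: [(first_frame, last_frame), ...]} (hypothetical layout).
    """
    terms = sorted(term_runs)
    return {
        "session_id": session_id,
        # Each distinct term once, space-separated, for full-text indexing.
        "bag_of_words": " ".join(terms),
        # Per-term runs serialized as a JSON string for the web UI.
        "json_rle": json.dumps(
            {t: [list(r) for r in term_runs[t]] for t in terms}
        ),
    }

record = build_session_record(
    "s1", {"SELECT": [(0, 1), (5, 5)], "DROP": [(1, 2)], "TABLE": [(2, 2)]}
)
```

Keeping the bag of words and the JSON side by side lets the search index stay small while the UI can still show where in the session each term appeared.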
Morphlines
Part of the Cloudera Development Kit; provides a quick way to transform data and index it in Solr
Common ETL operations are supplied, and can be extended with user-defined functions
Can be run as MapReduce, or in a low-latency configuration consuming Flume output
Morphlines - Example
morphlines : [
  {
    id : morphline1
    importCommands : [ "com.cloudera.**", "org.apache.solr.**" ]
    commands : [
      # Some commands go here
    ]
  }
]
Morphlines – Avro Commands
readAvroContainer {
  readerSchemaFile : /path/to/json_schema.avsc
}
extractAvroPaths {
  flatten : false
  paths : {
    id : /session_id
    bag_of_words : /bag_of_words
    json_rle : /json_rle
  }
}
Morphlines – Solr Commands
sanitizeUnknownSolrFields {
  solrLocator : ${SOLR_LOCATOR}
}
loadSolr {
  solrLocator : ${SOLR_LOCATOR}
}
Optimizing task trackers for OCR
Nodes running OCR don’t utilize much memory, disk, or network, so optimize:
  Move OCR to a separate, CPU-oriented Hadoop cluster, or to the cloud
  Schedule OCR MR jobs using task trackers on non-data-nodes
  Move OCR outside of Hadoop
    But then unable to do other types of processing that need to combine multiple data sources
Adminiscope initial use cases
Full text search
Automatic recognition of text patterns
  CC#
  SSN
  Suspicious activity (DROP TABLE)
Similar video sessions
Related tickets / knowledge base articles
Keystroke / mouse movement analysis
  User working tired or under the influence?
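The pattern-recognition use case can be sketched in a few lines of Python run over OCR'd screen text. The talk does not say how detection is done, so everything here is an assumption: the regular expressions, the function names, and the use of the Luhn checksum (a standard check digit scheme for card numbers) to cut down false positives.

```python
import re

# Hypothetical patterns; real OCR output would likely need looser matching.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")   # 13 to 19 digits
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DROP_RE = re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)

def luhn_ok(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def scan_text(text):
    """Return (label, matched_text) pairs for each suspicious pattern."""
    findings = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_ok(digits):
            findings.append(("CC#", m.group()))
    findings += [("SSN", m.group()) for m in SSN_RE.finditer(text)]
    findings += [("DROP TABLE", m.group()) for m in DROP_RE.finditer(text)]
    return findings

hits = scan_text("card 4111 1111 1111 1111; ssn 078-05-1120; drop  table users")
```

The card number used above is a widely published test number, not a real account. A checker like this needs the full text of a frame, which is why it fits the "one page per frame" storage option rather than the run-length encoded one.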
Beyond Adminiscope
Online video analytics
Security camera analytics
Beyond text
  Faces on the screen
  License plates
  Brain activity scans
Other time series data
  audio
  geo-location data
Thank you – Q&A To contact @alanctgardner