Unicage Development Method Big Data Case Studies


Unicage Development Method Big Data Case Studies Universal Shell Programming Laboratory, Ltd. February 2013

Big Data Case Studies using Unicage
① Replacement of Batch Processing System (Major Credit Card Company)
② Complex ETL (Investment Bank)
③ Complex ETL (Electric Power Utility)
④ Search of Large Data Set (Korean Search Engine)
⑤ Summary

① Replacement of Batch Processing System (Major Credit Card Company)
A large data set is currently processed on the host, and this processing is to be ported to Unicage. The Unicage server receives the data to be processed from the host, performs the processing, and the results are then compared. Only one part of the large dataset processing is compared.
[Diagram: on the host, Large Dataset Processing steps (1), (30) and (50) run against the host database and the result is uploaded to a database. The Unicage server receives the large dataset as flat files, runs Large Dataset Processing (30) on them, and its flat-file output is compared with the host output.]

Processing Speed
COBOL (host): 929.69 mins. (15 hrs. 29 mins.). Initial investment over $1M; maintenance fee also high.
Unicage (single x86 server): 313.58 mins. (5 hrs. 13 mins.). Dual 6-core CPUs, 48GB RAM, 2 x HDD (SATA 2TB). Initial investment $10K; maintenance fee is low.
Unicage (five x86 servers): 116.00 mins. (1 hr. 56 mins.). Initial investment $50K.
Processing time was reduced to about 1/8 of the COBOL system (116.00/929.69 ≈ 12.5%). This figure was measured with Unicage running on five x86 servers (6-core CPU x 2, 48GB RAM each). If the number of servers is increased and processing is distributed further, even faster processing is possible.
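The ratios behind these figures can be checked with a few lines of shell. This is only a sketch of the arithmetic; the helper name `speedup` is made up for illustration, and the three timings are the ones quoted on the slide.

```shell
#!/bin/sh
# speedup BASE NEW: print BASE/NEW to two decimals (hypothetical helper)
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 929.69 313.58   # COBOL vs. single x86 server -> 2.96
speedup 929.69 116.00   # COBOL vs. five x86 servers  -> 8.01
speedup 313.58 116.00   # one server vs. five servers -> 2.70
```

Going from one server to five gives roughly 2.7x rather than 5x, so on these numbers the distributed run scales at a little over half of linear.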

Development Productivity
Using COBOL: 24 processes and 7 jobs were required, so development took 3 months.
Using Unicage: Coding 5 days, Testing 5 days, Performance Tweaking 3 days. Developed in 13 days by a Unicage engineer with 5 years of experience.

                      COBOL                    Unicage
Number of Processes   7 Jobs & 24 Processes    11 Shell Scripts
Development Time      3 Months                 13 days
Lines of Code         3,645                    981

② Complex ETL (Investment Bank)
Using the Unicage development method, we reformat the data so that it can be loaded into the transaction storage database, and then compare processing times. The input is a transaction log of hierarchical multi-layout data with approximately 100 record types; the layout resolves the Parent/Child/Grandchild relationships. Execution time using Java + PostgreSQL is about 90 minutes.
[Diagram: transaction log records A, B and C, each a Parent with Child 1, Child 2 and Grandchildren 1-1, 1-2 and 2-1, are flattened into rows of data to be loaded into the DB, such as Parent / Child1 / Grandchild1-1.]

Processing Speed
Development/Testing Environment
Computer: Desktop PC (Intel Core i7 processor, 16GB RAM)
Operating System: FreeBSD 9.0 Release#0
Shell Commands: USP Unicage Enterprise Version

Application      Details                  Records Processed      Lines of Code
PROCESS-MASTER   Top Shell                                       29
PROCESS-001      Exception Processing 1   8,327                  8
PROCESS-002      Exception Processing 2   117,838                9
PROCESS-003      Exception Processing 3   81                     11
PROCESS-004      Exception Processing 4   5,028                  19
PROCESS-005      Exception Processing 5   332                    14
PROCESS-006      Normal Processing        27,614,260             6
Total                                     29,015,393 (4.36 GB)   96

Execution Speed: Real: 91.58 sec, User: 132.85 sec, Sys: 22.53 sec
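A rough reading of these timings can be worked out in shell. The sketch below assumes that Real/User/Sys have their usual `time` meaning (wall-clock, user CPU and system CPU seconds) and takes the "about 90 minutes" Java + PostgreSQL figure from the previous slide as 5,400 seconds; the function name `etl_stats` is hypothetical.

```shell
#!/bin/sh
# etl_stats: derive simple ratios from the figures quoted on the slide
etl_stats() {
    awk 'BEGIN {
        real = 91.58; user = 132.85; sys = 22.53   # seconds, as reported
        records = 29015393                         # total records processed
        legacy = 90 * 60                           # about 90 minutes, in seconds
        printf "cpu_per_wall %.2f\n", (user + sys) / real
        printf "records_per_sec %.0f\n", records / real
        printf "speedup_vs_legacy %.1f\n", legacy / real
    }'
}

etl_stats
```

A CPU-to-wall-clock ratio above 1 means more than one core was busy on average, which is what one would expect when pipeline stages run side by side.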

③ Complex ETL (Electric Power Utility)
Character set conversion of host data (from native to SJIS).
The legacy system converts the character set of the meter data from the host's native format (EBCDIK, zoned, packed, binary and Kanji codes) to SJIS using a Java program on a UNIX server. We ported this process to Unicage, confirmed that the input and output files are the same, and measured the difference in processing speed.
[Diagram: Automatic Meter Reading Terminal → Mainframe (meter data, native) → UNIX Server, Code Conversion Processing (Java) → meter data (SJIS). In parallel, the Unicage Server receives the native meter data, runs Code Conversion Processing (Unicage) and produces meter data (SJIS), which is compared with the Java output.]

Processing Speed
We tested on 2GB, 5GB and 10GB data sets, using the following server environments:
Java: HP-UX, Itanium 1.60GHz 2core, 4GB
Unicage: FreeBSD, Core i7 4core, 16GB, SATA (2TB)

Data Amount                 Java                Unicage              Difference
2GB (7,240,555 records)     3hrs 7mins 53secs   43.411secs           11273/43.411 = 259x faster
5GB (18,095,303 records)    7hrs 30mins         1min 49.085secs      27000/109.085 = 247x faster
10GB (36,178,437 records)   15hrs               4mins 16.906secs     54000/256.906 = 210x faster
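The "Difference" figures are the Java time in seconds divided by the Unicage time in seconds. A minimal sketch of that calculation follows; `to_seconds` and `ratio` are hypothetical helper names, and the slide appears to truncate the results rather than round them.

```shell
#!/bin/sh
# to_seconds H M S: convert hours, minutes and seconds to whole seconds
to_seconds() {
    echo $(( $1 * 3600 + $2 * 60 + $3 ))
}

# ratio A B: print A/B to one decimal place
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio "$(to_seconds 3 7 53)" 43.411     # 2GB  -> 259.7
ratio "$(to_seconds 7 30 0)" 109.085    # 5GB  -> 247.5
ratio "$(to_seconds 15 0 0)" 256.906    # 10GB -> 210.2
```

Note that the two systems ran on different hardware (Itanium on HP-UX versus Core i7 on FreeBSD), so the ratios compare whole environments, not just the software.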

④ Search of Large Data Set (Korean Search Engine)
Analysis of search logs from a major search engine site, based on text search and user IP address search.
【Configuration】
Expected data: 10.8GB/day (27,610,000 records) x 365 days x 5 years = 19.2TB (50 billion records)
Front-end terminals → Web server (distribution; Shell Script + Pompa) → Unicage server cluster (0.5TB x 40 servers, scale out)
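The sizing on this slide can be reproduced with a short calculation. The sketch below assumes 1TB = 1024GB, which is the convention under which the slide's 19.2TB figure comes out; the function name `log_volume` is hypothetical.

```shell
#!/bin/sh
# log_volume: reproduce the five-year sizing from the daily figures
log_volume() {
    awk 'BEGIN {
        gb_per_day = 10.8; records_per_day = 27610000
        days = 365 * 5
        total_gb = gb_per_day * days
        printf "total_gb %.0f\n", total_gb
        printf "total_tb %.2f\n", total_gb / 1024           # assumes 1024 GB per TB
        printf "total_records_billion %.1f\n", records_per_day * days / 1e9
        printf "cluster_capacity_tb %.1f\n", 0.5 * 40       # 0.5TB x 40 servers
    }'
}

log_volume
```

On these numbers the 40-server cluster holds about 20TB, which leaves little headroom over the expected five-year volume and is presumably why the configuration is marked as scale-out.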

SQL and Shell Programming (1/2)
B3: Count the number of records for each C_QUERY_NOSP, C_USER
B4: Count the number of records for each C_USER; output counts over 30
B5: Output the C_QUERY_NOSP list using conditions on C_DATE and C_USER
B6: Count the number of records for each C_REQ_FRM; output row counts in descending order
B7: Count the number of records for each C_CONNECTION
B8: Count the number of records for each C_QUERY_NOSP using conditions on C_DATE and C_CONNECTION
B9: Count the number of records for each C_QUERY_NOSP with C_CONNECTION = 'X' over 500
B10: Count the number of records for each C_QUERY_NOSP with unique C_SESSION1 over 3
B11: Count the number of records for each C_QUERY_NOSP that do not occur on a specific date
B12: Count the number of records with C_IP of 3 or higher and count the number of records with unique C_QUERY_NOSP

SQL and Shell Programming (2/2)
2. Shell Programming Example
Shows the equivalent shell script for each SQL statement.

B3 【SQL】:
    select C_QUERY_NOSP, C_USER, count(*)
    from SEARCHLOG
    where C_DATE='2006-09-18'
    group by C_QUERY_NOSP, C_USER;

B3 【USP】:
    cat ${lv3d}/L3.DAY | awk '$4=="20060918"' | self 23 16 | dsort key=1/2 | count 1 2

B9 【SQL】:
    select A.q1, A.cnt1 as a1, B.cnt2 as a2
    from (select C_QUERY_NOSP as q1, count(*) as cnt1
          from searchlog
          where C_DATE='2006-09-18' and C_CONNECTION='X'
          group by C_QUERY_NOSP
          having count(*)>500) A,
         (select C_QUERY_NOSP as q2, count(*) as cnt2
          from searchlog
          where C_DATE='2006-09-18' and C_CONNECTION<>'X'
          group by C_QUERY_NOSP) B
    where A.q1=B.q2
    order by a1 desc, a2 asc;

B9 【USP】:
    cat ${lv3d}/L3.DAY | awk '$4=="20060918"&&$14!="X"' | self 23 | dsort key=1 | count 1 1 > $tmp-b
    cat ${lv3d}/L3.DAY | awk '$4=="20060918"&&$14=="X"' | self 23 | dsort key=1 | count 1 1 | awk '$2>500' | join1 key=1 $tmp-b - | sort -k2,2nr -k3,3n
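For readers without the Unicage commands installed, the B3 query can be approximated with standard UNIX tools. This is a sketch under stated assumptions, not the USP implementation: it uses a hypothetical three-column log (date, query, user) instead of the real L3.DAY layout, and `awk`, `sort` and `uniq -c` stand in for `self`, `dsort` and `count`. The function name `b3` is made up.

```shell
#!/bin/sh
# b3 DATE: read "date query user" lines on stdin and print
# "query user count" for the given date, like a GROUP BY with count(*).
b3() {
    awk -v d="$1" '$1 == d { print $2, $3 }' |   # filter by date, keep the key columns
        LC_ALL=C sort |                          # bring equal keys together
        uniq -c |                                # count each distinct key
        awk '{ print $2, $3, $1 }'               # move the count to the last column
}

printf '%s\n' \
    '20060918 unix alice' \
    '20060918 unix alice' \
    '20060918 shell bob' \
    '20060917 unix alice' | b3 20060918
# prints:
# shell bob 1
# unix alice 2
```

The shape is the same as in the USP version: select rows, select columns, sort on the key, then count runs of equal keys.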

Processing Speed
Development/Testing Environment
Computer: Desktop PC (Intel Core i7 processor, 16GB RAM)
Operating System: FreeBSD 9.0 Release#0
Storage: SATA HDD (1)
Shell Commands: USP Unicage Enterprise Edition

No.   Corresponding SQL   Execution Time (MIN)   Execution Time (MAX)   Execution Time (AVG)
#01   B3                  1.132                  1.357                  1.235
#02   B4                  0.139                  0.140                  -
#03   B5                  0.002                  0.003                  -
#04   B6                  -                      -                      -
#05   B7                  1.154                  1.155                  -
#06   B8                  0.030                  -                      -
#07   B9                  2.673                  2.898                  2.748
#08   B10                 1.440                  -                      -
#09   B11                 4.760                  4.766                  4.763
#10   B12                 0.006                  -                      -

A dash marks a value that is missing from this transcript.

⑤ Summary
【Background】
As the precision of data and the amount of stored historical data increase, the volume of data grows to the point that legacy relational databases cannot handle it.
Challenge #1 【Reduced Performance】
As the amount of data increases and the business logic changes repeatedly, processing performance gradually decreases, causing problems for the business.
Challenge #2 【Cost】
Specialized high-performance hardware and advanced middleware are required, increasing both the initial investment and the ongoing maintenance cost.
【Legacy Methods】
Purchase and deploy the latest high-performance specialized hardware and advanced middleware: performance is improved, but costs skyrocket.
Rewrite the software using the latest techniques (Hadoop, etc.): high cost, and it is difficult to recruit and train engineers.

Why is Unicage Fast? (1/2)
We do not use middleware with huge overhead
We use only the core functions of the OS, without any database, runtime or middleware. In this respect, UNIX-like OSes such as FreeBSD are excellent, since they have compact kernel code and the required peripheral software can be selected from the Ports Collection.
USP Unicage commands have been precisely tuned
The commands used in the shell scripts are written in C and control memory and CPU directly. They are extensively tuned, for example by using inline SIMD instructions. For this reason, they are tens of times faster than equivalent commands written in Java. (This is clear from the difference in the size of the compiled assembler code.)
Parallel processing using pipelines
Shell scripts can easily use the "pipe", a characteristic feature of UNIX. Connecting USP Unicage commands with a pipeline achieves parallel processing, which improves processing speed. In one project for an investment bank, we utilized 95% of the CPU in a 16-core machine to process 30 million records 60 times faster than their legacy system.
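The point about pipelines can be seen with plain shell, independent of any Unicage command. In the sketch below, each stage is a hypothetical stand-in that pretends to work for two seconds before passing its input along. Because the shell starts every stage of a pipeline as its own process at the same time, three such stages finish in roughly two seconds rather than six.

```shell
#!/bin/sh
# stage: simulate 2 seconds of work, then pass stdin through unchanged
stage() {
    sleep 2
    cat
}

# timed_pipeline: run three stages as one pipeline and print elapsed whole seconds
timed_pipeline() {
    start=$(date +%s)
    echo "record" | stage | stage | stage > /dev/null
    end=$(date +%s)
    echo $(( end - start ))
}

timed_pipeline   # close to 2 when the stages overlap; running them one after another would take about 6
```

With real commands the overlap comes from streaming: a downstream stage starts consuming records while the upstream stage is still producing them, so the operating system can schedule the stages on different cores.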

Why is Unicage Fast? (2/2)
ush
In order to eliminate the overhead of the shell itself, we have created our own shell called "ush", which is based on "ash". The same shell script runs 1.7 times faster on "ush" than on standard "bash". We continue to improve the "ush" shell, for example by changing the implementation of pipes to "mmap" (kernel memory) with ID passing.
Pompa Technology
In order to search large datasets, we employ directory tree division and memory cache control. Our "Pompa Technology" embeds the search key in the path name, enabling two-layer search at the OS level and the Unicage level. Using this technology we were able to return search results from 10TB of log data (from a Korean search engine) in less than 0.1 second without using expensive appliances.
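The idea of embedding the search key in the path name can be illustrated with a small shell sketch. This is not the Pompa implementation, whose details the slides do not give; it only shows the general technique of directory tree division, with the IP address as the assumed search key, a hypothetical two-column input (IP, query), and made-up function names `index_logs` and `lookup`. Memory cache control is not modelled.

```shell
#!/bin/sh
# index_logs ROOT: read "ip query" lines on stdin and append each query to
# ROOT/<octet1>/<octet2>/<octet3>/<octet4>, so the key is part of the path.
index_logs() {
    root=$1
    while read -r ip query; do
        dir="$root/$(echo "$ip" | cut -d. -f1-3 | tr . /)"
        mkdir -p "$dir"
        echo "$query" >> "$dir/$(echo "$ip" | cut -d. -f4)"
    done
}

# lookup ROOT IP: print the queries for IP by opening its file directly,
# letting the file system resolve the key instead of scanning the log.
lookup() {
    f="$1/$(echo "$2" | tr . /)"
    if [ -f "$f" ]; then
        cat "$f"
    fi
}

root=$(mktemp -d)
printf '%s\n' '10.0.0.1 unix' '10.0.0.2 shell' '10.0.0.1 pipes' | index_logs "$root"
lookup "$root" 10.0.0.1   # prints "unix" then "pipes"
rm -rf "$root"
```

A lookup touches only the one small file for that key, so its cost depends on path resolution rather than on the total size of the log.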