Multithreaded ingestion of BUFR messages from the IDD John Caron Oct 8, 2008.

Overview
BUFR format
IDD HRS BUFR data stream
Multithreaded processing of IDD messages
Indexing data

BUFR data format
WMO standard for observational met data
  – circa 1988: “Table Driven Forms” (TDF)
  – Improvement over “character oriented codes” (e.g. METARs)
  – Migration from previous forms is still a large WMO focus
  – Today: Edition 4 format, Version 13 of the tables
Table driven (12000 entries in global tables)
  – Each record contains a set of data descriptors (dds)
  – Global WMO and local tables
Simple “compressed binary”
  – Packed bits, scale/offset convert to float
  – Fixed precision, no dynamic range
  – Difference from reference value
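The scale/offset packing described above can be sketched in a few lines. This is a minimal illustration, assuming the usual BUFR convention that a packed integer unpacks as (raw + refVal) / 10^scale. The class and method names are hypothetical, and the raw value 13012 is made up for the example.

```java
// Hypothetical helper illustrating BUFR-style scale/reference unpacking.
class BufrUnpack {
    // value = (raw + refVal) / 10^scale
    static double unpack(long raw, int scale, long refVal) {
        return (raw + refVal) / Math.pow(10.0, scale);
    }

    // Read nbits from a byte array starting at bitOffset, most significant bit first.
    static long readBits(byte[] data, int bitOffset, int nbits) {
        long value = 0;
        for (int i = 0; i < nbits; i++) {
            int bit = bitOffset + i;
            int b = (data[bit >> 3] >> (7 - (bit & 7))) & 1;
            value = (value << 1) | b;
        }
        return value;
    }

    public static void main(String[] args) {
        // A latitude field with scale=2, refVal=-9000 (the raw value is made up).
        System.out.println(unpack(13012, 2, -9000));
    }
}
```

Because the bit width and scale are fixed per table entry, the representable range and precision are fixed too, which is the "no dynamic range" point on the slide.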

: tableD
: tableD
: WMO_block_number units=Numeric scale=0 refVal=0 nbits=
: WMO_station_number units=Numeric scale=0 refVal=0 nbits=
: Type_of_station units=Code table scale=0 refVal=0 nbits=
: tableD
: Year units=Year scale=0 refVal=0 nbits=
: Month units=Month scale=0 refVal=0 nbits=
: Day units=Day scale=0 refVal=0 nbits=
: tableD
: Hour units=Hour scale=0 refVal=0 nbits=
: Minute units=Minute scale=0 refVal=0 nbits=
: tableD
: Latitude units=Degree scale=2 refVal=-9000 nbits=
: Longitude units=Degree scale=2 refVal= nbits=
: Height_of_station units=m scale=0 refVal=-400 nbits=
: Short_station_or_site_name units=CCITT IA5 nchars=
: Type_of_measuring_equipment_used units=Code table scale=0 refVal=
: tableC-operators
: tableC-operators
: Mean_frequency units=Hz scale=-8 refVal=0 nbits=
: tableC-operators
: tableC-operators
: Time_significance units=Code table scale=0 refVal=0 nbits=
: Time_period_or_displacement units=Second scale=0 refVal=-4096 nbits=
: replication
: Delayed_descriptor_replication_factor units=Numeric scale=0 refVal=
: Height_above_station units=m scale=0 refVal=0 nbits=
: Wind_profiler_quality_control_test_results units=Flag table scale=
: Wind_direction units=Degree true scale=0 refVal=0 nbits=
: Wind_speed units=m s-1 scale=1 refVal=0 nbits=
: tableC-operators
: Standard_deviation_of_horizontal_wind_speed units=m s-1 scale=1 refVal=0 nbits=
: tableC-operators
: w-component units=m s-1 scale=2 refVal=-4096 nbits=
: Standard_deviation_of_vertical_wind_speed units=m s-1 scale=1 refVal=0 nbits=8

BUFR problems (1)
BUFR format is too complex:
  – Looks like design by committee
  – Specification not exact
  – No coding/decoding reference implementation
  – Mixture of data model / data encoding / standard quantities
BUFR format is too simple:
  – Fixed length tables (64 categories, 256 entries) eventually run out
  – Fixed dynamic range (no exponents)

BUFR problems (2)
Table-driven parsing is brittle
  – No authoritative registry of local tables
  – WMO global table is not machine-readable
  – Past versions are not available
It seems that:
  – Each provider has their own set of software and tables
  – Often legacy Fortran

BUFR Table mismatch
No way to be sure that coder and decoder use the same table
If a table entry is missing, can't decode
If the wrong table entry is used
  – Bit size wrong: usually can detect with bit counting
  – Scale/Factor/Name/Units wrong = “silent failure” (expert/human may detect)
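One way to read "bit counting" is to check that the bit widths taken from the table add up to the size of the data section. The sketch below is an illustration under stated assumptions, not the decoder's actual check: it assumes an uncompressed message with no replication, so every observation has the same fixed fields, and `maxPadBits` is an assumed tolerance for trailing padding. All names are hypothetical.

```java
import java.util.List;

// Hypothetical consistency check: do the table's bit widths match the data section size?
class BitCountCheck {
    // Assumes fixed-width fields repeated nObs times (no compression, no replication).
    static boolean bitCountMatches(List<Integer> nbitsPerField, int nObs,
                                   int dataBytes, int maxPadBits) {
        long bitsPerObs = 0;
        for (int nbits : nbitsPerField) {
            bitsPerObs += nbits;
        }
        long expectedBits = bitsPerObs * nObs;
        long availableBits = 8L * dataBytes;
        long slack = availableBits - expectedBits;
        // Too few bits, or too many left over, suggests a wrong table entry.
        return slack >= 0 && slack <= maxPadBits;
    }
}
```

A check like this can only catch wrong bit widths. A wrong scale, reference value, name, or unit leaves the bit count unchanged, which is why those cases fail silently.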

Table mismatches
Each archive center has probably solved this coder/decoder matching internally
NCEP encodes the tables in BUFR messages, and stores them in the archive files
Others???

BUFR progress
As of 9/2008, WMO decided
  – Will make tables available in Microsoft Access format
  – Clarified versioning (sort of)
Progress in detecting/fixing encoding errors
Unidata nudge: group, validation web site
BritMet effort to map BUFR to ISO, define XML version of tables

BUFR data on IDD
177 K messages / day
6.7 M observations / day
1.2 Gbytes / day
Avg message size = 7227 bytes
Avg obs/message = 37
Unique WMO headers = 555
Unique dds = 125
WMO headers with multiple dds = 61

Originating Stations
CWAO Montreal
EDZW Offenbach (RSMC) (78.0)
EGRR UK Meteorological Office Bracknell (RSMC) (74.0)
EKMI Copenhagen (94.0)
EUMG EUMETSAT Operation Centre (254.0)
EUSR
KBOU The NOAA Forecast Systems Laboratory (59.0)
KKCI US National Weather Service (NCEP) (7.0)
KNES US NOAA/NESDIS (160.0)
KWBC US National Weather Service (NCEP) (7.0)
KWNH US National Weather Service (NCEP)
KWNO NCEP / Central Operations (7.3)
LFPW Toulouse (RSMC) (85.0)
RJTD Tokyo (RSMC), Japan Meteorological Agency (34.0)
RKSL Seoul (40.0)
SBBR Brazilian Space Agency - INPE (46.0)
VHHH Hong-Kong (110.0)

Data heterogeneity
Each BUFR record could in principle have its own data schema: 2M database schemas!
In reality, there is a much smaller number of groups of homogeneous records
  – WMO headers are not sufficient
  – Can’t use pqact FILE by matching the header
  – Only the dds itself is reliable
  – So must crack the message to reliably group the records
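Grouping by the descriptor list rather than by header might look like the sketch below. It assumes the dds has already been extracted from each message as a list of integers. The `Message` class and `groupByDds` method are hypothetical names, not part of any Unidata library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DdsGrouper {
    // Hypothetical message: WMO header plus the descriptor list (dds) cracked from it.
    static final class Message {
        final String wmoHeader;
        final List<Integer> dds;

        Message(String wmoHeader, List<Integer> dds) {
            this.wmoHeader = wmoHeader;
            this.dds = dds;
        }
    }

    // Group by descriptor list; List.equals/hashCode compare element by element,
    // so two messages land in the same group only if their dds are identical.
    static Map<List<Integer>, List<Message>> groupByDds(List<Message> messages) {
        Map<List<Integer>, List<Message>> groups = new HashMap<>();
        for (Message m : messages) {
            groups.computeIfAbsent(m.dds, k -> new ArrayList<>()).add(m);
        }
        return groups;
    }
}
```

With this keying, two messages that share a header but carry different descriptors end up in different groups, and messages from different headers with the same descriptors end up together.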

Multithreaded Processing of IDD Messages

Overview
Get messages from LDM pipe
Process in memory, write out to disk
Must be very fast, no blocking I/O
Use java.util.concurrent library for multithreading

LDM pqact
# Get all BUFR messages from HRS
HRS	^[IJ]	PIPE	-metadata	java -jar ldm.jar

Steps 1 and 2: Extract and dispatch
1. extract: pipeReadingThread (1) (io) reads the LDM stream from the pipe, breaks it into separate messages, and puts them on the message queue (ArrayBlockingQueue)
2. dispatch: messageThread (1?) (cpu) does a blocking take from the queue, reads the message contents, classifies the type by dds, and dispatches to the matching MessType processor
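The extract and dispatch steps follow the standard producer/consumer pattern from java.util.concurrent. The sketch below is a simplified illustration: messages are strings of the form `type:payload`, `classify` stands in for cracking the message and reading its dds, and the end-of-stream sentinel is an assumption made so the example terminates (a live LDM pipe would not end).

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ExtractDispatchSketch {
    // Sentinel compared by identity, marking the end of the (finite) test stream.
    private static final String EOF = new String("EOF");

    // Runs one reader thread and one message thread; returns message counts per type.
    static Map<String, AtomicInteger> run(String[] incoming) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
        Map<String, AtomicInteger> countsByType = new ConcurrentHashMap<>();

        // Step 1: the reader thread only splits the stream and enqueues messages.
        Thread pipeReadingThread = new Thread(() -> {
            try {
                for (String msg : incoming) {
                    queue.put(msg);              // blocks only if the queue is full
                }
                queue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Step 2: the message thread classifies each message and dispatches it.
        Thread messageThread = new Thread(() -> {
            try {
                while (true) {
                    String msg = queue.take();   // blocking take
                    if (msg == EOF) {
                        break;
                    }
                    countsByType.computeIfAbsent(classify(msg), k -> new AtomicInteger())
                                .incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        pipeReadingThread.start();
        messageThread.start();
        try {
            pipeReadingThread.join();
            messageThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return countsByType;
    }

    // Stand-in for cracking the message and identifying its type from the dds.
    static String classify(String msg) {
        return msg.substring(0, msg.indexOf(':'));
    }
}
```

The bounded queue decouples the I/O-bound reader from the CPU-bound classifier, so a slow classification step does not stall reading until the queue fills.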

Step 3: Write message
3. write: messageThread (1) (cpu) dispatches each message to its MessType processor, which adds it to the ConcurrentLinkedQueue of a MessageWriter
MessageWriter implements Callable and owns one file (e.g. a bufr file); it is submitted to an Executor / CompletionService backed by a threadPool (n) (io)
MessageWriter implements Callable: Result call() { write message(s) }
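A writer of this shape could be sketched as below. This is a hedged illustration, not the presenter's code: the return type (a count of messages written), the temp file, and the `demo` method are assumptions, and a plain `ExecutorService` is used here in place of the `CompletionService` named on the slide.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical writer: owns one file and drains its own queue when called.
class MessageWriter implements Callable<Integer> {
    private final Path file;
    private final Queue<byte[]> pending = new ConcurrentLinkedQueue<>();

    MessageWriter(Path file) {
        this.file = file;
    }

    // Called by the message thread; never blocks.
    void add(byte[] message) {
        pending.add(message);
    }

    // Runs on an I/O pool thread; returns the number of messages written.
    @Override
    public Integer call() throws IOException {
        int written = 0;
        try (OutputStream out = Files.newOutputStream(file,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            byte[] msg;
            while ((msg = pending.poll()) != null) {
                out.write(msg);
                written++;
            }
        }
        return written;
    }

    // Small demo: queue two messages, submit the writer, return the file size in bytes.
    static long demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Path file = Files.createTempFile("bufr-sketch", ".bufr");
            MessageWriter writer = new MessageWriter(file);
            writer.add(new byte[] {'B', 'U', 'F', 'R'});
            writer.add(new byte[] {'7', '7', '7', '7'});
            Future<Integer> result = pool.submit(writer);
            result.get();                       // wait for the write to finish
            long size = Files.size(file);
            Files.delete(file);
            return size;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch does not guard against the same writer being submitted twice at once; a real implementation would need to ensure only one pool thread writes to a given file at a time.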

Step 4: Index
MessageWriter implements Callable: IndexTask call() { write message(s) } writes the message(s) and returns an IndexerTask
indexThread (1?) (io) does a blocking take from the Executor queue and adds the result to the index
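The hand-off from writers to the index thread matches what `ExecutorCompletionService` provides: completed tasks are queued, and a consumer takes them as they finish. The sketch below assumes a made-up `IndexTask` holding a key and a file position, uses an in-memory sorted map as the index, and replaces the real write with a stand-in lambda.

```java
import java.util.Map;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class IndexThreadSketch {
    // Hypothetical result of a write: where one message landed in its file.
    static final class IndexTask {
        final String key;
        final long filePos;

        IndexTask(String key, long filePos) {
            this.key = key;
            this.filePos = filePos;
        }
    }

    // Submits nWrites stand-in write tasks, then indexes their results as they complete.
    static Map<String, Long> run(int nWrites) {
        ExecutorService ioPool = Executors.newFixedThreadPool(3);
        CompletionService<IndexTask> completed = new ExecutorCompletionService<>(ioPool);
        Map<String, Long> index = new ConcurrentSkipListMap<>();

        for (int i = 0; i < nWrites; i++) {
            final int n = i;
            // Stand-in for MessageWriter.call(): "write", then report the position.
            completed.submit(() -> new IndexTask("station-" + n, n * 1000L));
        }

        Thread indexThread = new Thread(() -> {
            try {
                for (int i = 0; i < nWrites; i++) {
                    IndexTask task = completed.take().get();   // blocking take
                    index.put(task.key, task.filePos);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        });
        indexThread.start();
        try {
            indexThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        ioPool.shutdown();
        return index;
    }
}
```

Because only the index thread touches the index on the write path, the index itself does not need to handle concurrent writers from the I/O pool.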

Step 5: Cleanup
messageThread (1) (cpu) dispatches to the MessType processor, which submits MessageWriters (each owning a bufr file and a ConcurrentLinkedQueue) to the Executor / CompletionService
cleanupThread (1) (io) closes files
Concurrent hashMap?

Step 6: Scour
indexThread (1?) (io): blocking take from the Executor queue, add to index
scourThread (1) (io): remove from index, delete file
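The scour step could be as simple as an age-based sweep over the index. The sketch below is an assumption about how that might work, not the presenter's design: the index is modeled as a map from file name to last-written time, and the actual file deletion is left as a comment so the example has no side effects.

```java
import java.util.Iterator;
import java.util.Map;

class ScourSketch {
    // Hypothetical index: file name -> time the file was last written (epoch millis).
    // Removes every entry older than maxAgeMillis and returns how many were scoured.
    static int scour(Map<String, Long> lastWrittenByFile, long nowMillis, long maxAgeMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastWrittenByFile.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > maxAgeMillis) {
                // Remove from the index first, so no query is pointed at a missing file.
                it.remove();
                // A real scourThread would then delete the whole file here,
                // e.g. with java.nio.file.Files.deleteIfExists.
                removed++;
            }
        }
        return removed;
    }
}
```

Removing the index entry before deleting the file keeps the index from ever referring to data that is already gone.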

Why isn’t scouring part of LDM?
LDM is message oriented – doesn’t know contents
Decoders know about the contents of the messages
Put scouring into the decoders

Threads
1. Read from LDM pipe
2. Read message content and dispatch
3. Write messages to files
4. Index
5. Cleanup / close MessageWriters
6. Scour

(Thought) Experiments with Indexing

Design prejudices
Keep data in original format
  – Data reliability
Aggregate homogeneous data into files
  – Data locality
Create external indices, with pointers into the files
  – Data recovery
Scour entire files, not parts of a file

Indexing
Need 1D indexes (B-trees)
Want 2D indices for spatial data
  – Rtree (areas)
  – Quadtree (points)
Index selectivity: seek vs. scan
  – Sequential access ~100x faster than random access
  – Index must select < 1% data to be useful
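The 1% threshold follows from the 100x ratio under a simple cost model. As an assumption for illustration, let a file hold N records, let a sequential read cost t_seq per record and a random read cost t_rand ≈ 100 t_seq, and let the index select k records, each fetched with one random read:

```latex
T_{\text{scan}} = N\, t_{\text{seq}}, \qquad
T_{\text{index}} = k\, t_{\text{rand}} \approx 100\, k\, t_{\text{seq}}
\quad\Longrightarrow\quad
T_{\text{index}} < T_{\text{scan}} \iff \frac{k}{N} < \frac{1}{100}
```

The model ignores the cost of reading the index itself and any clustering of the selected records, so the real break-even point may sit somewhat above or below 1%.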

Possible Open Source Indexers
Berkeley DB Java Edition
  – Btree, very fast, no SQL
  – Dual GPL/commercial license
Relational databases “SQL on Btrees”
  – Java (Derby, H2, many others)
  – C (MySQL, Postgres)
Object databases
  – Db4o (dual GPL/commercial license)

High performance
Embeddable in the decoder
  – Same process space
  – Not client/server
Access from server answering queries
  – Multiprocess access or client/server
  – Bdb must sync periodically (perf?)
Transactions probably too slow
  – Need recovery strategy

Test Assumptions
Process IDD messages in memory (vs) write to file then postprocess
Store in files, add external indexing (vs) store data in database
One database vs many?
Embedded vs client/server
SQL vs specific queries
  – SQL allows ad-hoc queries: performance?
2D indexing

Conclusions
Test/time various indexing strategies and technologies
  – Production
  – Scouring
Eventually part of IDD/TDS
  – Must be easy to maintain (Java)
  – Scale to large archives / data volumes