CERN – IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it. Working with Large Data Sets. Tim Smith, CERN/IT. Open Access and Research Data Session.

1 Working with Large Data Sets
Tim Smith, CERN/IT
Open Access and Research Data Session
CERN – IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

2 OAI8 [Jun 2013] - 2 Research Big Data Research

3 How big is BIG?
[Figure: size scale from 1 GB to 100 PB]
LHC Data: 40M collisions/sec, 150M readout channels
Data rates around the Trigger and Filter stages: 10 PB/s, 10 TB/s, 6 GB/s
100 PB in 3 years
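The figures on this slide can be related by simple arithmetic. The sketch below assumes the three rates form a chain (10 PB/s off the detector, 10 TB/s after the trigger, 6 GB/s after the filter), which is one reading of the slide rather than something it states. It also assumes decimal units and 365-day years; all variable names are illustrative.

```python
# Decimal (SI) units; the slide does not say whether SI or binary is meant.
GB, TB, PB = 10**9, 10**12, 10**15

# Assumed ordering of the rates quoted on the slide (bytes per second).
detector_rate = 10 * PB
after_trigger = 10 * TB
after_filter = 6 * GB


def reduction(rate_in, rate_out):
    """Factor by which a stage reduces the data rate."""
    return rate_in / rate_out


trigger_factor = reduction(detector_rate, after_trigger)
filter_factor = reduction(after_trigger, after_filter)

# Average rate implied by storing 100 PB over 3 years of 365 days.
seconds_in_3_years = 3 * 365 * 24 * 3600
average_rate_gb_s = 100 * PB / seconds_in_3_years / GB

print(f"trigger reduces the rate by a factor of {trigger_factor:,.0f}")
print(f"filter reduces the rate by a factor of {filter_factor:,.0f}")
print(f"100 PB in 3 years averages {average_rate_gb_s:.2f} GB/s")
```

Under these assumptions the three-year average comes out well below 6 GB/s, which would be consistent with 6 GB/s being a rate during data taking rather than a year-round average, although the slide does not say so.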

4 Primary Storage
6 GB/s
80,000 Disks
88,000 CPU Cores
18,000 1GB NICs
2,700 10GB NICs

5 Problems Size Brings
Few places can store it
  –Lots of Copies?
  –10% at 10 centres
Hard to transport
Aggregation easier for processing than storage
Models of Networked Analysis at Regional Centres
x 2 locations @ CERN

6 Problems Size Brings
Few places can store it
  –Lots of Copies?
  –10% at 10 centres
Hard to transport
Aggregation easier for processing than storage
Models of Networked Analysis at Regional Centres
Tier-1s: 11 Sites
Tier-2s: 140 Sites
350,000 Cores

7 Worldwide LHC Computing Grid
Distributed Data Management
  –Limited Network resources
  –Optimize / minimize movement
  –File placement logic
  –Deterministic / Static
Site Data Management
  –HSMs
  –Transparent file access and movement
Disk-Tape migration/recall

8 Research Data Infrastructure of today
Distributed Data Management
  –Network: a resource to schedule
  –Dynamic data placement
  –Data transfer services
  –Experiment replica management rules
Site Data Management
  –Independent technology choices
  –Decoupled tiers
  –Disk caches: managed by owners
  –Bulk 3rd party migration to tertiary by owners
AAA: any data, any time, anywhere
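To make "replica management rules" concrete, here is a minimal sketch of rule-driven placement. It is not the experiments' actual software: the `ReplicaRule` class, the `plan_replicas` function, the dataset naming scheme and the site names are all hypothetical, chosen only to show how a declarative rule can be turned into a list of transfers.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class ReplicaRule:
    """Hypothetical rule: datasets matching `pattern` get `copies` replicas on `tier`."""
    pattern: str
    copies: int
    tier: str


def plan_replicas(dataset, rules, sites_by_tier, current_sites):
    """Return the sites that should receive a new copy of `dataset`.

    `sites_by_tier` maps a tier name to an ordered list of candidate sites;
    `current_sites` is the set of sites that already hold the dataset.
    """
    to_add = []
    for rule in rules:
        if not fnmatch(dataset, rule.pattern):
            continue
        candidates = sites_by_tier.get(rule.tier, [])
        held = [s for s in candidates if s in current_sites]
        missing = rule.copies - len(held)
        # Fill the shortfall with the first candidate sites lacking a copy.
        spare = [s for s in candidates if s not in current_sites]
        to_add.extend(spare[:max(missing, 0)])
    return to_add


# Illustrative rules and sites (all names are made up).
rules = [
    ReplicaRule("raw/*", copies=2, tier="T1"),
    ReplicaRule("reduced/*", copies=3, tier="T2"),
]
sites = {
    "T1": ["T1-A", "T1-B", "T1-C"],
    "T2": ["T2-A", "T2-B", "T2-C", "T2-D"],
}

print(plan_replicas("raw/run001", rules, sites, {"T1-B"}))
```

A real system would also weigh free space, network cost and dataset popularity when choosing among the candidate sites; this sketch simply takes them in list order.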

9 Data Reduction / Analysis
[Diagram: Raw, Reconstructed, Reduced, Publication; axes: File Size, # Files]
Production Managers: T0, T1s
Analysis Coordinators: T1s
Researchers: T2s, T1s

10 Data Inflation
Static: Storing 100 PB was a good challenge
Dynamic: Analysing it means transformation, reduction, transport, replication, regeneration
WAN: 300 PB, LAN: 3 EB in 2012
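The gap between static and active data can be put as a ratio. The sketch below divides the 2012 traffic figures by the 100 PB archive size, assuming decimal units. Note that the archive was accumulated over several years while the traffic figures are for one year, so the ratios are only indicative.

```python
PB, EB = 10**15, 10**18  # decimal units (an assumption)

stored = 100 * PB        # archive size quoted on the slide
wan_traffic = 300 * PB   # WAN figure for 2012
lan_traffic = 3 * EB     # LAN figure for 2012


def inflation(moved, held):
    """How many times the stored volume was moved."""
    return moved / held


wan_factor = inflation(wan_traffic, stored)
lan_factor = inflation(lan_traffic, stored)

print(f"WAN traffic is {wan_factor:.0f}x the stored volume")
print(f"LAN traffic is {lan_factor:.0f}x the stored volume")
```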

11 More than Data
Papers, Tabular Data, Correlation Matrices, Internal Notes, Wikis, Presentations, Quality monitoring data, Filter / selection algorithms, Formatters, Calibration Data, Conditions Data, Log Books
Production Managers: T0, T1s
Analysis Coordinators: T1s
Researchers: T2s, T1s
Workflows
Contextual metadata
SW: 10M LoC

12 Data for Tomorrow
Digital Libraries
  –DOIs, ORCID
  –DataCite, OAIS
Custom systems
  –DBs
  –HEP formats
Mass Storage
  –GUIDs
  –Access
  –Bit Preservation
30 TBs!

13 Bit Access / Bit Preservation
Media Verification
  –Hot / Cold Data
  –Catching and correcting errors while you still can
  –10% of production drive capacity for 2.6 years
  –(0.000065% data loss)
Bit Rot
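Media verification amounts to reading stored data back and comparing it with what was recorded at write time. The slide does not describe the mechanism used at CERN, so the following is a generic sketch: the `checksum` and `verify` functions, the catalogue layout and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(catalogue):
    """Return the paths whose current checksum differs from the recorded one.

    `catalogue` maps a file path to the checksum recorded when the file
    was written; a missing file is reported as damaged too.
    """
    damaged = []
    for path, expected in catalogue.items():
        if not Path(path).is_file() or checksum(path) != expected:
            damaged.append(path)
    return damaged
```

Files reported by `verify` can then be restored from another replica while a good copy still exists, which is the point of catching errors early.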

14 Bit Access / Bit Preservation
Media Migration
  –Drive and Media obsolescence
  –50% of current drive capacity for 2 years

15 Little Big Data
Long tail of science
Big facilities
Data Size: x (a small number), x (a large number)

16 Concluding Remarks
Data Management best done by the data owner
  –Make data management services available
  –Empower the users!
Don't assume Researchers' behavioral patterns
  –Decouple the layers
  –Offer integratable building blocks, not an integrated solution
Static Data is one thing, planning for Active Data quite another
  –(The number you know) x (a large factor)
Housekeeping (media verification and migration) takes non-negligible resources
Preservation means combining expertise
  –Storage managers, SW engineers, Librarians and Researchers

18 Abstract
Whether large or small, data sets need to be managed. Scale, however, makes standard data operations more challenging, especially when the data sets expand beyond the capacity of a single data centre. Consequently, replication, migration and archival require optimized, sometimes domain-specific solutions. Tim will illustrate the HEP approach by describing how CERN collected 100 petabytes of research data and how it organizes storage and access by researchers across the globe. He will run through the data reduction and analysis chains up to paper production, and finish with the challenges of archiving these large data sets.

