Presentation is loading. Please wait.

Presentation is loading. Please wait.

Mark Holliman Wide Field Astronomy Unit Institute for Astronomy University of Edinburgh.

Similar presentations


Presentation on theme: "Mark Holliman Wide Field Astronomy Unit Institute for Astronomy University of Edinburgh."— Presentation transcript:

1 Mark Holliman Wide Field Astronomy Unit Institute for Astronomy University of Edinburgh

2 Summary Some Astronomical survey databases are becoming so large that some curation tasks are approaching unreasonable timeframes. This is evident in such surveys as the Galactic Plane Survey (GPS) for UKIDSS, and the VISTA Variables in the Via Lactea (VVV). The VVV alone contains >10 of TB of data and includes a detection table with >10 10 rows. While RDBMSs are capable of handling this much data, the execution times for curation activities can stretch into weeks at a time. To address this issue, there are two main approaches: 1. Switch from a RDBMS to a Key-Pair based model or Column oriented DB (i.e. Hadoop, MonetDB, etc) 2. Throw Hardware at it (i.e. SSDs, SAN, LUSTRE, GPFS, etc) As you can probably guess, this talk is about #2

3 The Big Database Problem In an RDBMS the main bottleneck on almost all operations is disk I/O. While parallel processing and increased RAM can address some performance issues, ultimate performance figures are dictated by how fast data can be moved to/from the storage medium. For large databases (>1TB) the most cost efficient storage medium are RAID5/6/10 arrays of spinning disks (HDD). These disks range in size from 1TB to 4TB at present. These arrays provide redundancy in case of disk failure, while at the same time speed up I/O by spreading disk operations across multiple devices simultaneously The rotational speed on the hard drives is the main factor in determining an HDDs performance. The faster the disk spins, the faster read/write operations can occur. Most enterprise disks run at 7200RPM, though 10000RPM and 15000RPM disks are available (at a serious jump in price) SSDs are just beginning to approach the size/price ratio necessary for large DBs.

4 Test Server Details Intel Xeon 2.8GHz, 24 cores 16GB RAM Disk Subsystem (6Gbps SAS RAID) HDD1,2: 7 x 1TB HDD RAID5 arrays SSD1: 1 x 512GB Crucial SSD SSD2: 6 x 512GB Crucial SSD RAID5 Array OS: Windows Server 2008 R2 DBMS: SQL Server 2008 R2

5 Full DB Tests The first tests involved placing an entire database on each particular disk subsystem and running a set of queries to measure performance. The queries were constructed to represent 3 specific use cases in order to identify exactly where the SSDs provide performance gains. Database: 2MASS Query 1, All Indexed Columns: Select * From twomass_psc Where ((ra> 240) AND (ra 120) AND (ra -47) AND (dec 120) AND (dec<123)) Query 2, All Nonindexed Columns: Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m 15.1 AND k_m<15.3) Query 3, Mixed Index Columns: Select * From twomass_psc Where (j_m>16) AND (h_m>15.6 AND h_m 15.1 AND k_m 230) AND (ra 110) AND (ra -47) AND (dec 120) AND (dec 1) OR (k_psfchi<0.4)) AND (dist_opt<.3) AND (k_msig_stdap<1)

6 Full DB Test Results

7 SSD Cost Issue While putting an entire Database on SSDs is certainly preferable, it is unfortunately cost prohibitive for very large archive databases (where 10s of TB are required). At recent pricing: 1TB on SSD = ~£500 vs 1TB on HDD = ~£87 So with this in mind we ran a second battery of tests to assess whether sufficient performance gains can be achieved with a hybrid system, whereby some database files are placed on SSDs while the rest remain on standard HDD arrays

8 SSD File Location Tests Database: UKIDSS DR5 4 different representative queries were run ( me if you want to see the actual SQL) #1: produce light curves #2: produce variable light curves #3: JOIN Source and Detection Tables (single index) #4: JOIN Source and Detection Tables (multi indices) 4 different file/disk configurations All files on HDD Indices on SSD (RAID) Source Tables on SSD (RAID) Detection Tables on SSD (RAID)

9 SSD File Location test results

10 Conclusions The biggest performance gains from SSDs are seen on queries involving non-indexed columns RAID arrays of SSDs do provide performance gains over individual disks, though these gains are not universal or consistently scaled Placing Source or Detection tables on SSDs both result in large performance gains Performance gains are heavily dependant on the query being run, and how much of the query uses the tables on the SSDs Due to SSD size constraints, Source tables are more likely to fit as they tend to be smaller Curation activities could be optimized by moving database files from HDD to SSD for intensive operations, then moved back to HDD once complete. This only works if the file movement speeds are high enough to overcome the extra time necessary for running operations on the HDD stored files Index file operations are actually slower on SSD than on HDD This is a serious surprise Further testing is necessary to determine if this is due to the inherent sequential access method of HDDs, or if this is due to some quirk of the MS SQL index file format


Download ppt "Mark Holliman Wide Field Astronomy Unit Institute for Astronomy University of Edinburgh."

Similar presentations


Ads by Google