Applications of UDFs in Astronomical Databases and Research
Manuchehr Taghizadeh-Popp, Johns Hopkins University

User Defined Functions (UDFs)
Motivation:
- Scientists need to execute their own code and functions where the data is stored (in the database).
- The code and algorithms must be fast: no more complex than O(N log N) and, if possible, parallelizable across $10^4$+ threads.
For astronomers:
- Basic astronomical UDFs bring a 3-dimensional and temporal view of the universe.
- Created a cosmological functions library (CfunBASE) written in C# (.NET Framework). The library is uploaded into SQL Server and its code is executed through CLR integration.
- Used in the CasJobs/SkyServer service hosting the SDSS data archive.
- Functions and stored procedures are executed with simple SQL commands.

Functions for SQL Server
- Cosmological functions:
  - Volume, distances and times as a function of redshift z, F = F(z).
  - Inverse functions $z = F^{-1}(F(z))$ are also implemented.
- Basic exploratory and statistical functions are also included:
  - Cumulative distribution and quantile functions (both scalar and aggregate).
  - Binning and grids (1-D streaming table-valued function, linear or log-scaled), for aggregation, table creation, etc.
  - N-dimensional weighted histogram.
- Numerical methods: integration, root finding, interpolation. Customizable for speed/precision.
- Many functions in astronomy contain integrals or sums, so many problems are parallelizable with CUDA/GPU (to be done…).

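To make the pairing of a forward function F(z) with its inverse concrete, here is a minimal Python sketch. It is not the CfunBASE code. It assumes a flat ΛCDM cosmology with illustrative parameters (H0 = 70 km/s/Mpc, Ωm = 0.3, ΩΛ = 0.7) and neglects radiation. The function names are hypothetical. The forward function is computed by numerical integration (composite Simpson's rule) and the inverse by root finding (bisection), the two numerical methods named on the slide.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def e_of_z(z, omega_m=0.3, omega_l=0.7):
    """Dimensionless Hubble parameter E(z); radiation is neglected (assumption)."""
    return math.sqrt(omega_m * (1.0 + z) ** 3 + omega_l)


def comoving_distance(z, h0=70.0, omega_m=0.3, omega_l=0.7, steps=512):
    """Line-of-sight comoving distance in Mpc via composite Simpson's rule.

    `steps` must be even; more steps trade speed for precision.
    """
    if z <= 0.0:
        return 0.0
    h = z / steps
    total = 1.0 / e_of_z(0.0, omega_m, omega_l) + 1.0 / e_of_z(z, omega_m, omega_l)
    for i in range(1, steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight / e_of_z(i * h, omega_m, omega_l)
    return (C_KM_S / h0) * total * h / 3.0


def redshift_from_distance(d_mpc, z_max=20.0, tol=1e-8, **cosmo):
    """Inverse function z = F^-1(F): bisection on the monotonic distance."""
    if d_mpc < 0.0 or d_mpc > comoving_distance(z_max, **cosmo):
        raise ValueError("distance outside the bracketed redshift range")
    lo, hi = 0.0, z_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if comoving_distance(mid, **cosmo) < d_mpc:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is chosen here because the distance is monotonic in z, so a bracket always converges. The `steps` and `tol` arguments play the role of the speed/precision knobs mentioned above.
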
Advanced Astronomical Examples
- Galaxy clusters from the Friends-of-Friends algorithm: 3D view of the large-scale structure.
- Luminosity function (1-D weighted histogram):
    SELECT dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1),
           sum(1/v.Vmax)/0.1,
           sqrt(sum(1/(v.Vmax*v.Vmax)))/0.1,
           count(*)
    FROM (SELECT dbo.fCosmfAbsMag(m_r, z) AS AbsMag_r, Vmax FROM DR7) AS v
    GROUP BY dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1)
    ORDER BY dbo.fMathBin(v.AbsMag_r, -25, -15, 100, 1, 1)
- Color-magnitude diagram (2-D weighted histogram):
    EXECUTE spMathHistogramNDim 'SELECT dbo.fCosmfAbsMag(m_r,z), Color_u_r, 1.0/Vmax FROM DR7', 2, '-25,0', '-15,5', '50,50', 1
- Use a query-parsing function to prevent SQL injection when functions run a user's query.

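The luminosity-function query above is a weighted histogram: each galaxy contributes 1/Vmax to its magnitude bin, and the sums are divided by 0.1, which matches the bin width of (−15 − (−25))/100 mag. The Python sketch below reproduces that arithmetic outside the database. The function names are hypothetical. The slide does not explain the last two arguments of `dbo.fMathBin`, so the sketch simply assumes plain linear binning on [lo, hi).

```python
import math


def bin_index(x, lo, hi, nbins):
    """Index of the linear bin containing x, or None if x is outside [lo, hi)."""
    if not (lo <= x < hi):
        return None
    # min() guards against floating-point round-up at the upper edge.
    return min(int((x - lo) / (hi - lo) * nbins), nbins - 1)


def luminosity_function(abs_mags, vmax, lo=-25.0, hi=-15.0, nbins=100):
    """1/Vmax-weighted histogram of absolute magnitudes.

    Returns (bin centre, density, error, count) for every non-empty bin,
    mirroring the four columns of the SQL query.
    """
    width = (hi - lo) / nbins
    weight_sum = [0.0] * nbins
    weight_sq_sum = [0.0] * nbins
    counts = [0] * nbins
    for mag, v in zip(abs_mags, vmax):
        k = bin_index(mag, lo, hi, nbins)
        if k is None:
            continue
        weight_sum[k] += 1.0 / v
        weight_sq_sum[k] += 1.0 / (v * v)
        counts[k] += 1
    return [
        (lo + (k + 0.5) * width,
         weight_sum[k] / width,
         math.sqrt(weight_sq_sum[k]) / width,
         counts[k])
        for k in range(nbins)
        if counts[k]
    ]
```

The error column is the square root of the summed squared weights, the same estimate as `sqrt(sum(1/(Vmax*Vmax)))` in the query.
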
Extreme Value Statistics (EVS) as a tool
- Used widely in risk calculations and in the study of the tails of distributions.
- EVS predicts the biggest/smallest value we will ever observe.
- The distribution φ(x) of extremes is known for the extremes of n i.i.d. random variables (with parent distribution P(x)) when n → ∞.
- ξ defines 3 universal distributions, depending on the tail of the parent distribution P(x):
  (1) ξ > 0: power-law tail [φ(x) is called the Fréchet distribution]
  (2) ξ = 0: exponential tail [φ(x) is called the Gumbel distribution]
  (3) ξ < 0: finite cutoff tail ($x_0 > x$) [φ(x) is called the Weibull distribution]
With large data sets, questions to answer:
- Are maximal galaxy luminosities really Gumbel distributed [P(L) ~ exp(-L)]?
- Having lots of galaxies, can we observe the finite-size correction of φ(x) due to having finite n?

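The formula for the limit distribution did not survive in this transcript. For reference, the standard textbook form of the generalized extreme value distribution is given below. Treating φ(x) as its density is an assumption, since the slide does not say whether φ denotes a density or a cumulative distribution. The normalizing constants $a_n$ and $b_n$ are the usual scale and location sequences and are not named in the slide.

```latex
% Limit law for the maximum M_n of n i.i.d. variables, after rescaling:
\lim_{n\to\infty} \Pr\!\left(\frac{M_n - b_n}{a_n} \le x\right) = G_\xi(x),
\qquad
G_\xi(x) = \exp\!\left[-(1+\xi x)^{-1/\xi}\right], \quad 1+\xi x > 0 .

% Gumbel case, obtained as the limit \xi \to 0:
G_0(x) = \exp\!\left(-e^{-x}\right).

% Density, if \phi is read as a probability density (assumption):
\phi_\xi(x) = \frac{dG_\xi}{dx}
            = (1+\xi x)^{-1/\xi-1}\,\exp\!\left[-(1+\xi x)^{-1/\xi}\right].
```

In this form, ξ > 0 gives the Fréchet class, ξ = 0 the Gumbel class, and ξ < 0 the Weibull class with a finite upper end point at x = −1/ξ.
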
Sampling luminosities from HEALPix cells
- HEALPix tessellation library uploaded into the database.
- Can be used for spatial indexing (use the tree schema and bit shifts on the HEALPix ID).
- Equal-area cells.
Applications for EVS:
- Build the HEALPix SDSS footprint on the sky. Use the HTM spatial indexing library.
- Each cell holds 1 “realization” of the random variable (luminosity).
- Sample the highest luminosity in each of the n cells.
- 3 different spatial resolutions: $N_{\rm side}$ = (16, 32, 64), giving n ~ (296, 1450, 6642).

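A short Python sketch of the cell bookkeeping may help here. It assumes the "tree schema" refers to HEALPix's NESTED numbering, in which each pixel splits into four children with IDs 4p to 4p+3, so a right shift by two bits yields the parent cell at half the resolution. The helper names are hypothetical, and the database-side implementation is not shown in the slides.

```python
import math


def npix(nside):
    """Total number of HEALPix pixels on the full sky: 12 * nside^2."""
    return 12 * nside * nside


def pixel_area_deg2(nside):
    """Area of one (equal-area) pixel in square degrees."""
    full_sky_deg2 = 4.0 * math.pi * (180.0 / math.pi) ** 2
    return full_sky_deg2 / npix(nside)


def parent_pixel(nested_id, levels=1):
    """Parent cell ID `levels` resolution steps coarser (NESTED scheme only)."""
    return nested_id >> (2 * levels)


def max_per_cell(pixel_ids, luminosities):
    """Highest luminosity found in each cell: one extreme value per cell."""
    best = {}
    for pix, lum in zip(pixel_ids, luminosities):
        if pix not in best or lum > best[pix]:
            best[pix] = lum
    return best
```

For example, `npix(16)` is 3072 and `npix(64)` is 49152. The quoted cell counts n ~ (296, 1450, 6642) are much smaller, presumably because only cells inside the SDSS footprint are used. Maxima at a coarser resolution can be obtained by mapping each ID through `parent_pixel` before calling `max_per_cell`.
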
RESULTS: tail classes and finite size correction
- Tail index ξ from the DEdH estimator (η = normalized order statistics).
- Test of 4 different galaxy samples: generally close to ξ = 0 [P(L) ~ $\exp(-L^{\beta})$].
- First observation of the finite-size correction:
  - x = standardized maximal luminosities.
  - Finite-size correction Δ due to finite n: Δ = P(x) - StandardGumbel.
  - Slow theoretical convergence: Δ(n) ~ 1/log n.
RESULT: the correction appears when n > 6000 (tradeoff between noise and convergence).

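The slide does not say how the DEdH estimator was computed. The sketch below uses the standard Dekkers-Einmahl-de Haan moment estimator, built from the logarithms of the k largest order statistics. The function name, the choice of k, and the requirement of positive data are assumptions of this example. It is not the code behind the results above.

```python
import math


def dedh_tail_index(sample, k):
    """Moment (DEdH) estimate of the tail index xi from the k largest values.

    With threshold X_(n-k) and log-excesses l_i = log(X_(n-i) / X_(n-k)):
        M1 = mean(l_i),  M2 = mean(l_i^2)
        xi = M1 + 1 - 0.5 / (1 - M1^2 / M2)
    """
    if not 0 < k < len(sample):
        raise ValueError("k must satisfy 0 < k < len(sample)")
    top = sorted(sample, reverse=True)[: k + 1]
    threshold = top[k]
    if threshold <= 0.0:
        raise ValueError("the k+1 largest values must be positive")
    logs = [math.log(x / threshold) for x in top[:k]]
    m1 = sum(logs) / k
    m2 = sum(v * v for v in logs) / k
    if m2 == 0.0 or m1 * m1 == m2:
        raise ValueError("degenerate upper tail (tied values)")
    return m1 + 1.0 - 0.5 / (1.0 - m1 * m1 / m2)
```

On a synthetic exponential sample the estimate should scatter around ξ = 0 (Gumbel class), and on a Pareto sample with index 2 around ξ = 0.5 (Fréchet class). The choice of k trades bias against variance, so in practice the estimate is inspected over a range of k.
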
Mining the space of Galaxy Properties
How to classify galaxies in the n-dimensional cloud of photometric/spectral properties?
- Use Principal Component Analysis (PCA) on the properties and consider the important eigenvectors.
- Build a PRINCIPAL CURVE: a smooth fit/projection to the cloud's spine. Complexity of ~$O(N^2)$.
- Explore diverse statistics as a function of arc length.
- Scalability for big N: streaming PCA (T. Budavari) and randomized sampling for the principal curve (the principal curve is not yet implemented in SQLCLR).

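To illustrate what a streaming-capable PCA step can look like, the Python sketch below accumulates the mean and covariance in a single pass (a Welford-style update) and then extracts the leading eigenvector by power iteration. This is a simple stand-in chosen for clarity. It is not the streaming PCA algorithm credited to T. Budavari on the slide, and the class and function names are hypothetical.

```python
import math


class RunningCovariance:
    """Single-pass mean and covariance accumulator (Welford-style update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [[0.0] * dim for _ in range(dim)]  # co-moment sums

    def update(self, x):
        """Fold one data vector into the running statistics."""
        self.n += 1
        before = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + b / self.n for mi, b in zip(self.mean, before)]
        after = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, b in enumerate(before):
            for j, a in enumerate(after):
                self.m2[i][j] += b * a

    def covariance(self):
        """Unbiased sample covariance matrix of everything seen so far."""
        if self.n < 2:
            raise ValueError("need at least two points")
        return [[v / (self.n - 1) for v in row] for row in self.m2]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration from a uniform start vector.

    Caveat: it fails if the start vector is orthogonal to the leading
    eigenvector; a random start would avoid that corner case.
    """
    dim = len(matrix)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(dim)) for row in matrix]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v
```

Memory use is O(d²) in the number of properties d and independent of the number of galaxies N, which is what makes the accumulation step suitable for data streamed out of a table.
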
Final remarks
- Algorithms are useful if randomized, ~O(N log N), streaming capable and parallelizable.
- For analysis, an astronomer would like:
  - A programming layer on the database (with the functionality of e.g. R),
  - implementing matrix algebra, calculus, statistics, etc.,
  - including data visualization.