1  Ameriflux Scientific Data Server Technology Overview
Catharine van Ingen, MSFT (vaningen@microsoft.com)
11 December 2006

2  Outline
- Overview
- Database schema
- Data cube structure
- Data staging pipeline
- Work in progress and future plans

3  Overview

4  Ameriflux Collaboration Overview
- 149 sites across the Americas
- Each site reports a minimum of 22 common measurements
- Communal science: each principal investigator acts independently to prepare and publish data
- Data published to and archived at Oak Ridge
- Total data reported to date on the order of 160M half-hourly measurements (includes 3 processing levels)
- http://public.ornl.gov/ameriflux/

5  Ameriflux Scientific Data Server - Goals
- Act as a local repository for data and metadata assembled by a small group of scientists from a wide variety of sources
- Simplify provenance by providing a common safe deposit box for assembled data
- Interact simply with existing and emerging Internet portals for data and metadata download and, over time, upload
- Simplify data assembly by adding automation
- Simplify name space confusion by adding explicit decode translation
- Support basic analyses across the entire dataset for both data cleaning and science
- Simplify mundane data handling tasks
- Simplify quality checking and data selection by enabling data browsing

6  Ameriflux Scientific Data Server - Non-Goals
- Replace the large Internet data source sites
  - The technology developed may be applicable, but the focus is on the group collaboration scale and usability
  - Very large datasets require different operational practices
- Perform complex modeling and statistical analyses
  - There are a lot of existing tools with established trust based on long track records
- Only part of a full LIMS (laboratory information management system)
- Develop a new standard schema or controlled vocabulary
  - Other work on these is progressing independently
  - Due to the heterogeneity of the data, more than one such standard seems likely to be relevant

7  Ameriflux Scientific Data Server - Workflows
- Staging: adding data or metadata
  - New downloaded or field measurements added
  - New derived measurements added
- Editing: changing data or metadata
  - Existing older measurements re-calibrated or re-derived
  - Data cleaning or other algorithm changes
  - Gap filling
- Sharing: making the latest acquired data available rapidly
  - Even before all the checks have been made
  - Browsing new data before more detailed analyses
- Private Analysis: supporting individual researchers (MyDB)
  - Stable location for personal calibrations, derivations, and other data transformations
  - Import/export to analysis tools and models
- Curating: data versioning and provenance
  - Simple parent:child versioning to track collections of data used for specific uses
(Diagram labels: Large Data Archives, Local measurements)

8  Scientific Data Server - Logical Overview

9  Databases: SQL Server 2005
- All data, descriptive ancillary data, and metadata held in relational databases
  - Access via SQL query
  - Metadata is important too, but handled differently than data
- While separate databases are shown, the datasets actually reside in a single database today
  - Mapping is transparent to the scientist due to dataset grouping and versioning
  - Separate databases used for performance; unified databases used for simplicity
- New ancillary data and data are staged with a temporary database
  - Minimal quality checks applied
  - All name and unit conversions
- Data may be exported to flat file, copied to a private MyDb database, directly accessed programmatically via ODBC connection, or ?
- Database may be replicated for availability and load balancing

10  Data cubes: SQL Server Analysis Services
- A data cube is a database specifically for data mining (OLAP)
  - Simple aggregations (sum, min, max, count) can be pre-computed for speed
  - Additional calculations (median) can be computed dynamically
  - Both operate along dimensions such as time, site, or datumtype
- Constructed from a relational database via a Visual Studio 2005 project
- A specialized query language (MDX) is used
- Client tool integration is evolving
  - Excel PivotTables allow simple data viewing
  - More powerful charting with Tableau or ProClarity (commercial mining tools)
- Management is similar to, but different from, SQL Server
  - Most tools are shared, but actual usage differs due to functional and organizational differences
- References: Professional SQL Server Analysis Services 2005 with MDX (Programmer to Programmer) by Sivakumar Harinath and Stephen Quinn; MDX Solutions: With Microsoft SQL Server Analysis Services by George Spofford

11  Simple plots: SQL Server Reporting Services
- Server-side rendering for simple tables and charts
  - Initially developed for static rendering of database reports such as quarterly sales
- Report layout via a Visual Studio 2005 project
  - Report description language is XML and stored in a relational database
- Accessed via ASP.NET executing in IIS
- Actual data source can be either a relational database or a data cube
- Reports can be downloaded in various formats (e.g., Excel, JPEG)
- Not intended for large quantities of data (>50,000 points) or highly customized graphics
- References: SQL Books Online (online product documentation)

12  Data Import/Export: SQL Server Integration Services
- Simple GUI-based tool generation for data migration, import, export
  - Good for exploration and/or rapid script development
  - Includes several format converters
  - Eliminates the need for much of the glue code
- We've had mixed luck with more complex operations such as custom selection for export to flat file and/or ingesting all data from the ORNL web site
- References: Extending SSIS 2005 with Script by Donald Farmer, or Professional SQL Server 2005 Integration Services (Programmer to Programmer) by Brian Knight et al.

13  bwc.berkeley.edu Today

14  Database Schema

15  Schema Overview
Subsets will be considered in turn.

16  L2 Data Decode
Discovered column headings are represented as: [Datumtype][repeat][_offset][_offset][extended datumtype][units]
- Datumtype: the measurement's primary datumtype. Example: TA, PREC, or LE.
- Repeat: an optional number indicating that multiple measurements were taken at the same site and offset. Example: TA2.
- [_offset][_offset]: major and minor parts of the z offset. Example: SWC_10 (SWC at 10 cm) or TA_10_7 (TA at 10.7 m).
- Extended datumtype: any remaining column text. Example: fir, E, sfc.
- Units: measurement units (should only be one per datumtype). Example: w/m2 or deg C.

17  L3/L4 European Data Decode
European discovered column headings can be represented as: [qualityprefix][Datumtype][count][extended datumtype][qualitypostfix]
- QualityPrefix: qf_ quality flags computed at level 3 processing
- Datumtype: measurement primary datumtype, although other text can be used (such as G for soil heat flux)
- Count: an optional number indicating a site-specific offset depth
  - Replaces the offset from the previous column decode
  - We've asked that this be changed; likely an artifact of the original ORNL column format
- Extended datumtype: a datumtype-specific combination of additional modifying text OR a derivation algorithm identifier
- QualityPostfix: qc quality flags computed at level 4 processing
Key issue: merging the quality data in columns with the correct data value column. Not all quality columns are 1:1 with data columns.

18  Data Table
- Data values are stored in a tall normalized (atomized) table
  - Only valid data values are retained
  - Each row contains exactly one value; all other columns are foreign keys to other tables
  - Data are stored as single-precision floating point values
- Horizontally partitioned tables are created for BigPlot
- Various views created for cube building and report generation
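A minimal T-SQL sketch of what such a tall table might look like; all table and column names here are assumptions for illustration, not the actual Ameriflux schema. The clustered unique index anticipates the common query pattern mentioned in the schema learnings slide.

```sql
-- Hypothetical sketch of the tall data table: every column except value
-- is a foreign key into a dimension table.
CREATE TABLE DataValue (
    sitesetID     int  NOT NULL,  -- FK -> Siteset
    timeID        int  NOT NULL,  -- FK -> Times (half-hour timeline)
    datumtypeID   int  NOT NULL,  -- FK -> Datumtype (TA, PREC, LE, ...)
    exdatumtypeID int  NOT NULL,  -- FK -> Exdatumtype
    offsetID      int  NOT NULL,  -- FK -> Offset (z offset)
    repeatID      int  NOT NULL,  -- FK -> Repeat
    value         real NOT NULL   -- single precision; only valid values stored
);

-- Clustered unique index laid out to match the common selection pattern.
CREATE UNIQUE CLUSTERED INDEX CI_DataValue ON DataValue
    (sitesetID, datumtypeID, exdatumtypeID, offsetID, repeatID, timeID);
```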

19  Times, repeat, offset
- Times table algorithmically generated on half-hour intervals
  - All times are site local
  - Considering adding UTC via a view with site-specific conversions, or an additional column
  - Multiple time representations support hierarchies
- Repeat table monotonically increases
- Offset table generated by discovery of staged data
  - Offsets ordered to simplify usage
- Views of offset and times used for cube and report menu generation
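One way the half-hour Times table could be generated algorithmically, sketched with a recursive CTE (available in SQL Server 2005); the table layout and column names are assumptions, not the deployed schema.

```sql
-- Sketch: generate half-hour timeline rows for one year.
CREATE TABLE Times (
    timeID    int IDENTITY(1,1) PRIMARY KEY,
    localTime datetime NOT NULL,  -- all times are site local
    yyyy      int NOT NULL,       -- year level of the time hierarchy
    doy       int NOT NULL,       -- day-of-year level
    halfHour  int NOT NULL        -- 0..47 within the day
);

DECLARE @start datetime, @stop datetime;
SELECT @start = '20020101', @stop = '20030101';

WITH HalfHours (t) AS (
    SELECT @start
    UNION ALL
    SELECT DATEADD(minute, 30, t) FROM HalfHours
    WHERE DATEADD(minute, 30, t) < @stop
)
INSERT INTO Times (localTime, yyyy, doy, halfHour)
SELECT t, YEAR(t), DATEPART(dayofyear, t),
       DATEPART(hour, t) * 2 + DATEPART(minute, t) / 30
FROM HalfHours
OPTION (MAXRECURSION 0);  -- the default recursion limit of 100 is far too low
```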

20  Datumtype and Exdatumtype
- Datumtype holds primary (not other) data datumtypes and all ancillary datumtypes (e.g., LAI)
- Columns for cube building and per-datumtype handling:
  - intoCube: determines whether the datumtype flows to the cube
  - dailyCalc: indicates whether the daily value should be an average or cumulative
  - repeatCalc: indicates whether repeats are averaged, ignored, or summed when building the default cube
  - siteCalc: determines how numeric site ancillary data should flow into the cube (e.g., LAI is max)
- Ordered views used for cube building and report generation
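A sketch of how those per-datumtype control columns might be declared; types and names are illustrative assumptions based on the descriptions above.

```sql
-- Hypothetical Datumtype table carrying the cube-building control columns.
CREATE TABLE Datumtype (
    datumtypeID int         NOT NULL PRIMARY KEY,
    name        varchar(32) NOT NULL,  -- e.g. TA, PREC, LE, LAI
    units       varchar(16) NULL,      -- one per datumtype
    intoCube    bit         NOT NULL,  -- flows into the cube?
    dailyCalc   varchar(16) NOT NULL,  -- 'average' or 'cumulative'
    repeatCalc  varchar(16) NOT NULL,  -- 'average', 'ignore', or 'sum'
    siteCalc    varchar(16) NULL       -- e.g. 'max' for LAI ancillary data
);
```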

21  Quality
- Holds datumtype-specific European quality indicators
  - Algorithmically generated given quality definitions
- Intended to be extensible
  - qualityID indicates remaining column applicability
  - Additional columns can be added as new quality metrics emerge
- Flows into the cube as a quality dimension
  - Will support selection and plotting by quality
  - Still being debugged

22  Site Ancillary Data
- Ancillary data holds site-specific properties such as biome, climate, canopy height
  - May be text or numeric
- Separate table necessary as multiple reported measurements exist due to different sources or changes over time
  - Source column tracks provenance
- To flow into the cube, need an algorithm to select one and only one value
  - Leaf area index uses max
  - Over time, could use start/stop times to generate values and treat as data

23  Site, Investigator
- Site and investigator initially created by a scrape of the ORNL site
  - Now maintained by manual query
- Spline connects the two and tracks the relevant start/stop time
- Investigator could be replaced with a GUID identifier into an alternate (membership) database
  - Deploying CasJobs or another workflow or authentication mechanism will necessitate this

24  Datasets, Sitesets
- Datasets used to expose data versions such as NEE_or_MDS or the September download
- Sitesets used to build datasets and support incremental data staging or other changes to existing data
- Original single-level versioning scheme replaced by this two-level scheme due to better mix/match ability with different data sources across datumtypes

25  Decoding Tables
- The conversion from column header to decoded data is table driven
- Sites and column headers are discovered, then decoded
  - Allows for multiple different representations of the same site/datumtype
- Checksums used to suppress load if a newly scraped data file has not changed
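A sketch of what a table-driven decode could look like: each discovered column header maps to the decoded foreign keys used in the tall table. All names, IDs, and example rows are hypothetical, following the L2 decode rules from slide 16.

```sql
-- Hypothetical decode-mapping table: one row per discovered column header.
CREATE TABLE ColumnDecode (
    siteID        int         NOT NULL,
    columnHeader  varchar(64) NOT NULL,  -- exactly as discovered in the file
    datumtypeID   int         NOT NULL,
    exdatumtypeID int         NOT NULL,
    offsetID      int         NOT NULL,
    repeatID      int         NOT NULL,
    PRIMARY KEY (siteID, columnHeader)
);

-- Illustrative mappings (IDs invented for the example):
INSERT INTO ColumnDecode VALUES (29, 'TA_10_7', 12, 1, 17, 1);  -- TA at 10.7 m
INSERT INTO ColumnDecode VALUES (29, 'SWC_10',  14, 1,  5, 1);  -- SWC at 10 cm
```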

26  Versioning: Sitesets and Datasets
- A siteset is a collection of similar data from a single site
  - May include actual observations, measurements derived from observations by a known algorithm, or measurements derived from observations by a private (investigator-specific) algorithm
  - Derived from a single source: actual measurements, download from a national data set, simulation, or other
  - The Ameriflux site is only one such source
  - Sitesets are primarily internal bookkeeping
- A dataset is a collection of sitesets
  - No restriction on sitesets in the collection, although some combinations won't make scientific sense
  - Datasets are the external, scientist-facing version
- Reduces older data churn, maps to (likely) scientist data usage, avoids even more complex queries and views

27  Server and Researcher Sets
Sitesets and datasets may be maintained by the server or maintained by a researcher.
- Server sets are maintained on behalf of all researchers in the collaboration
  - Accessible to all researchers
  - Set change policy decisions determined by the server (administrator) on behalf of the collaboration
- Researcher sets are maintained by a specific researcher or a small subset group of researchers in the collaboration
  - Accessible only to the owning researcher and/or collaborators selected by the owner
  - Set change decisions determined by the owner

28  Fixed, AppendOnly, and Dynamic Sets
- Fixed sets are used to define known good sets for long-term data curation
  - Contains data of a known quality as of a specific point in time
  - Analogous to a software release; once fixed, always fixed
  - A fixed dataset may only contain fixed sitesets
- AppendOnly sets are used to define latest good sets for common accesses
  - Contains data of a known quality, although additional data may be added at any time
  - May not be necessary, but offers a little protection against inadvertent staging side effects
  - An append-only dataset may contain both fixed and append-only sitesets
- Dynamic sets are used to define latest sets for rapid data access, active analysis, or cleaning
  - Contains data of a relatively well known quality; no restriction on changes
  - A dynamic dataset may contain any siteset

29  Dataset and Siteset Timestamps
Significant changes to a dataset or siteset are tracked by timestamps:
- CreateTime: dataset or siteset creation
- LastAppendTime: last addition of a new siteset to a dataset, or of new data to a siteset (includes new derived measurements, new measurement types, and recent data additions)
- LastModifyTime: last deletion or substitution of a siteset member of a dataset, or change to older previously existing data values of a siteset
- FixTime: dataset or siteset converted to fixed
- DeleteTime: all dataset or siteset data deleted
Notes:
- Intended to be used as "something has changed, query for more details" indicators only
- DeleteTime retained for some period to cover the "oops" cases and help scientists determine what happened in inadvertent or forgotten scenarios
- All timestamps are server local time
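As a concrete sketch, the timestamps could simply be columns on the Dataset table (Siteset would carry the same set); the table and column names are assumptions for illustration.

```sql
-- Hypothetical Dataset table with the change-tracking timestamps.
CREATE TABLE Dataset (
    datasetID      int         NOT NULL PRIMARY KEY,
    name           varchar(64) NOT NULL,
    createTime     datetime    NOT NULL DEFAULT GETDATE(),  -- server local time
    lastAppendTime datetime    NULL,  -- new siteset or new data appended
    lastModifyTime datetime    NULL,  -- older data deleted, replaced, or changed
    fixTime        datetime    NULL,  -- non-NULL once the set becomes fixed
    deleteTime     datetime    NULL   -- retained for a while for 'oops' cases
);
```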

30  Dataset and Siteset Provenance
- creatorid tracks the investigator that created the dataset or siteset
  - Includes any software automation, tracked via an investigator table entry
  - Prepares for investigator private sets in MyDB/MyCube
- name is a (relatively) short text string for simple identification
  - Commonly used to select a dataset when building a cube or in other user interactions
- description is a longer text string for reference
- Siteset howmade is a controlled-vocabulary text string to track data origins
  - Text used to simplify data viewing. Initial strings:
  - ORNLsite: non-gap-filled data downloaded from the ORNL website
  - ANN: gap-filled data using the ANN method
  - gap: gap-filled data using an unknown or unspecified method
- Siteset path is the URL (if available) that was used when downloading data
  - NULL used when the URL is unavailable
  - Treated as a text string only; no sanity checking or format enforcement

31  Siteset and Dataset Scenarios
Sitesets and datasets are just mechanisms; we also need policies to determine the actual behavior.
Usage cases of interest:
- New staged measurements
- Existing older measurements re-calibrated or re-derived
- Data cleaning or other algorithm changes
- New derived measurements added
- Gap filling
- Making the latest acquired data available rapidly
- Supporting individual researchers (MyDB)
- Data version provenance: parents and children

32  Example: data version provenance
When making a new version, it's good to track the original version. Full provenance is possible, but very easily leads to poor usability.
Proposal: track siteset parents.
- A siteset has exactly one parent
- A siteset may have any number of children due to gap filling, other algorithm changes applied, or new data reported
- When a new siteset is made by combining two different sitesets, the combination can be documented by annotation only
- When a siteset is deleted, the youngest surviving ancestor becomes the parent of any surviving children (see the sketch below)
- Relationships between datasets are more complex and should be determined by the constituent siteset relations
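A sketch of the parent-reassignment rule, assuming a Siteset table with sitesetID, parentID, and deleteTime columns (all names hypothetical): walk up past deleted ancestors, then re-point the children.

```sql
-- Sketch: on siteset deletion, children inherit the youngest surviving ancestor.
DECLARE @deletedID int, @ancestorID int;
SET @deletedID = 123;  -- hypothetical siteset being deleted

-- Walk up the parent chain until an undeleted ancestor (or NULL) is found.
SELECT @ancestorID = parentID FROM Siteset WHERE sitesetID = @deletedID;
WHILE @ancestorID IS NOT NULL
  AND EXISTS (SELECT * FROM Siteset
              WHERE sitesetID = @ancestorID AND deleteTime IS NOT NULL)
    SELECT @ancestorID = parentID FROM Siteset WHERE sitesetID = @ancestorID;

-- Re-point all surviving children of the deleted siteset.
UPDATE Siteset SET parentID = @ancestorID WHERE parentID = @deletedID;
```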

33  Database schema learnings
- The tall table has both advantages and disadvantages
  - Best for building a cube, and most extensible as new datumtypes, exdatumtypes, and the like are encountered
  - Queries are fairly complex to write and join multiple tables to specify siteset, site, datumtype, exdatumtype, offset, and repeat values
  - Views can simplify the query, but not the performance
  - Materialized views and pivot tables are too restricted for practical application here
  - Build infrastructure for machine schemas and use the right one for the right job
- 100+M rows in any table takes care
  - Clustered unique index to improve performance of common queries
  - Updates affecting many rows can fill the log; allocate 2-5x storage
  - Full backups are far simpler operationally, and footnetting is often a good alternative

34  Data cube structure

35  Sample Cube Construction
Visual Studio UI for construction, deployment, and browsing.

36  Datacube structure
- Structure follows the tall table organization
  - Each dimension corresponds to a foreign key
  - Time has multiple hierarchies
  - Dataset replaces siteset, as it is the scientist-facing concept
- Accumulating computed measures as we discover them
  - Median, standard deviation, variance can be expensive
  - HasDailyData, HasYearlyData, DailyValue, Average are cheap and handy
- Development uses smaller subset cubes
  - Harvard Forest for L2 data and SeptemberORNL development
  - VairaWillow for L2/L3/L4 data and NovemberORNL development
  - Spreadsheets for QA checking

37  Minimalist MDX Query

SELECT NON EMPTY { [Measures].[Average] } ON COLUMNS,
       NON EMPTY { [Timeline].[Year To Day].[Yydoy].&[2002-001] :
                   [Timeline].[Year To Day].[Yydoy].&[2002-365] } ON ROWS
FROM [NovemberORNL]
WHERE ( [Datumtype].[Datumtype].&[42],
        [Site].[Site].&[29],
        [Exdatumtype].[Exdatumtype].&[1],
        [Dataset].[Dataset].&[6],
        [Offset].[Offset].&[0 (cm)] )

Retrieves daily average GPP values for Tonzi Ranch for 2002.

38  Report Generation Cube Query

SELECT NON EMPTY { [Measures].[Average] } ON COLUMNS,
       NON EMPTY { ([Site].[Site].[Site].ALLMEMBERS *
                    [Datumtype].[Datumtype].[Datumtype].ALLMEMBERS *
                    [Timeline].[Year To Month].[Yymmdd].ALLMEMBERS) }
       DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS
FROM ( SELECT ( { [Offset].[Offset].&[0 (cm)] } ) ON COLUMNS
  FROM ( SELECT ( { [Exdatumtype].[Exdatumtype].&[1] } ) ON COLUMNS
    FROM ( SELECT ( STRTOSET(@TimelineYear, CONSTRAINED) ) ON COLUMNS
      FROM ( SELECT ( STRTOSET(@SiteSite, CONSTRAINED) ) ON COLUMNS
        FROM ( SELECT ( STRTOSET(@DatumtypeDatumtype, CONSTRAINED) ) ON COLUMNS
          FROM [SeptemberORNL] )))))
WHERE ( IIF( STRTOSET(@TimelineYear, CONSTRAINED).Count = 1,
             STRTOSET(@TimelineYear, CONSTRAINED),
             [Timeline].[Year].currentmember ),
        [Exdatumtype].[Exdatumtype].&[1],
        [Offset].[Offset].&[0 (cm)] )

Report: Site Daily Average Chart. Parameters: Datumtypes: GPP; Sites: CA - Tonzi Ranch; Years: 2004.
Query used for report generation; STRTOSET in the above retrieves the report parameters.
Still retrieves daily average GPP values for Tonzi Ranch, here for 2004.

39  Excel Cube Query

SELECT NON EMPTY HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(
    DrillDownLevel({[Datumtype].[Datumtype].[All]}))},
    {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39],
     [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36],
     [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31],
     [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29],
     [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26],
     [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23],
     [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21],
     [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18],
     [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60],
     [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7],
     [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2],
     [Datumtype].[Datumtype].&[1]}))},
    {[Datumtype].[Datumtype].&[56], [Datumtype].[Datumtype].&[40], [Datumtype].[Datumtype].&[39],
     [Datumtype].[Datumtype].&[38], [Datumtype].[Datumtype].&[37], [Datumtype].[Datumtype].&[36],
     [Datumtype].[Datumtype].&[33], [Datumtype].[Datumtype].&[32], [Datumtype].[Datumtype].&[31],
     [Datumtype].[Datumtype].&[59], [Datumtype].[Datumtype].&[55], [Datumtype].[Datumtype].&[29],
     [Datumtype].[Datumtype].&[58], [Datumtype].[Datumtype].&[57], [Datumtype].[Datumtype].&[26],
     [Datumtype].[Datumtype].&[25], [Datumtype].[Datumtype].&[24], [Datumtype].[Datumtype].&[23],
     [Datumtype].[Datumtype].&[22], [Datumtype].[Datumtype].&[54], [Datumtype].[Datumtype].&[21],
     [Datumtype].[Datumtype].&[20], [Datumtype].[Datumtype].&[19], [Datumtype].[Datumtype].&[18],
     [Datumtype].[Datumtype].&[17], [Datumtype].[Datumtype].&[13], [Datumtype].[Datumtype].&[60],
     [Datumtype].[Datumtype].&[11], [Datumtype].[Datumtype].&[9], [Datumtype].[Datumtype].&[7],
     [Datumtype].[Datumtype].&[6], [Datumtype].[Datumtype].&[5], [Datumtype].[Datumtype].&[2],
     [Datumtype].[Datumtype].&[1]}))
  DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON COLUMNS,
NON EMPTY HIERARCHIZE(Except({AddCalculatedMembers(Except({AddCalculatedMembers(
    DrillDownMember({DrillDownLevel({[Timeline].[Year To Day].[All]})},
                    {[Timeline].[Year To Day].[Year].&[2004]}))},
    {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005],
     [Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002],
     [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000],
     [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998],
     [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996],
     [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994],
     [Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992],
     [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]}))},
    {[Timeline].[Year To Day].[Year].&[2006], [Timeline].[Year To Day].[Year].&[2005],
     [Timeline].[Year To Day].[Year].&[2003], [Timeline].[Year To Day].[Year].&[2002],
     [Timeline].[Year To Day].[Year].&[2001], [Timeline].[Year To Day].[Year].&[2000],
     [Timeline].[Year To Day].[Year].&[1999], [Timeline].[Year To Day].[Year].&[1998],
     [Timeline].[Year To Day].[Year].&[1997], [Timeline].[Year To Day].[Year].&[1996],
     [Timeline].[Year To Day].[Year].&[1995], [Timeline].[Year To Day].[Year].&[1994],
     [Timeline].[Year To Day].[Year].&[1993], [Timeline].[Year To Day].[Year].&[1992],
     [Timeline].[Year To Day].[Year].&[1991], [Timeline].[Year To Day].[Year].&[1990]}))
  DIMENSION PROPERTIES PARENT_UNIQUE_NAME ON ROWS
FROM [NovemberORNL]
WHERE ([Measures].[Average], [Dataset].[Dataset].&[6], [Site].[Site].&[29],
       [Offset].[Offset].&[0 (cm)], [Exdatumtype].[Exdatumtype].&[1])

Query as generated by an Excel PivotTable. Still retrieves daily average GPP values for Tonzi Ranch, here for 2004.

40  Datacube learnings
- It's fairly simple to build a cube; we've built lots
  - The GUI tool is straightforward and the tutorials help
- It's not so simple to build a good cube
  - Incorrectly specified hierarchies can give incorrect answers, catastrophically poor performance, or both
  - Ordering all dimensions matters for good usability
  - Build times can be hours and/or time out
  - Aggregations are designed in and take extra configuration; once aggregations are built, still need to watch the performance trace to ensure that they're used
  - Data cubes are best for daily or larger time periods; use the database for half-hour (leaf) measurements
- It's easy to write a very slow MDX query
  - NON EMPTY is very important
- It's really easy to write an MDX query that is hard to understand
- Knowing more about what queries are of interest to the scientists will help us tune the existing cube and/or build targeted cubes

41  Data Staging Pipeline

42  Data Staging Pipeline
- Data can be downloaded from internet sites regularly
  - Sometimes the only way to detect changed data is to compare with the data already archived
  - The download is relatively cheap; the subsequent staging is expensive
- New or changed data discovered during staging
  - Simple checksum before load
  - Chunk checksum after decode (see the sketch below)
  - Comparison query if requested
- Decode stage critical to handle the uncontrolled vocabularies
  - Measurement type, location offset, quality indicators, units, and derivation methods are often encoded in column headers
  - Different sites use different units
- Incremental copy moves staged data to one or more sitesets
  - Automated via siteset:site:source mapping
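One way the post-decode chunk checksum could be computed, sketched against the hypothetical tall table from slide 18: per-year row counts and an aggregate checksum that can be compared against the same aggregates in the archive. The column choices are assumptions.

```sql
-- Sketch: per-year count and checksum over staged rows, for cheap
-- change detection before any exhaustive comparison.
SELECT t.yyyy,
       COUNT(*) AS nRows,
       CHECKSUM_AGG(BINARY_CHECKSUM(d.datumtypeID, d.exdatumtypeID,
                                    d.offsetID, d.repeatID, d.value)) AS chunkSum
FROM DataValue d
JOIN Times t ON t.timeID = d.timeID
GROUP BY t.yyyy
ORDER BY t.yyyy;
```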

43  Updates to Basic Data Checks
- Applied checks will be constantly increasing
  - Waiting for descriptions of current ORNL checks
- Investigations pending (a range-check sketch follows below):
  - Per-site, per-datumtype limits (e.g., summer is hotter in Florida than in Alaska)
  - Systematic time errors (e.g., GMT rather than local time)
  - Time-dependent checks for datumtypes: max, min, zero at night or day
  - Unit conversions or other simple conversions (e.g., relative humidity)
- Add a new table to log results of all checks
  - Enables standardized reporting longer term
  - Can be copied to the active database as part of provenance metadata for user access
  - Schema TBD
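A sketch of the per-site, per-datumtype limit check, logging offenders to the proposed check-results table. The DatumtypeLimits and CheckLog tables, their columns, and the staging table name are all assumptions (the slide itself says the schema is TBD).

```sql
-- Sketch: flag staged values outside per-site, per-datumtype limits.
INSERT INTO CheckLog (siteID, timeID, datumtypeID, checkName, value)
SELECT s.siteID, s.timeID, s.datumtypeID, 'range', s.value
FROM StagingData s
JOIN DatumtypeLimits l
  ON  l.siteID      = s.siteID
  AND l.datumtypeID = s.datumtypeID
WHERE s.value < l.minValue OR s.value > l.maxValue;
```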

44  Copying Only Changed Data
Proposed approach: throw processing at the problem.
- File checksum can be used for older, retired sites or slowly reporting sites
  - Only addresses the "no new or changed data" case
  - Recommended as a filter to suppress load rather than suppress copy
- Time (yearly?) chunked count and checksum can be used for older, unlikely-to-change data
  - Reduces the total data that must be explicitly compared
  - Recommended given the likely query time for a full comparison
- Exhaustive timestep-by-timestep comparison between data in the archive database and the staging database used to determine the amount of change (see the sketch below)
  - Limit the amount of returned data to bound query time
  - Compare only makes sense with like data (same processing and gap fill method)
- Desired behavior for each site:
  - If the target siteset is fixed, do nothing
  - If the target siteset is append only and the data are not changed, append only the new data
  - If the target siteset is flexible, append new data and replace any changed data
  - Otherwise, create a new siteset and copy all data
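The exhaustive comparison could be expressed with EXCEPT (new in SQL Server 2005), bounded with TOP to limit query time; database, table, and key names are assumptions.

```sql
-- Sketch: bounded comparison of one siteset's staged rows against the archive.
DECLARE @sitesetID int;
SET @sitesetID = 42;  -- hypothetical target siteset

SELECT TOP 10000 *    -- bound the returned data, and hence the query time
FROM ( SELECT timeID, datumtypeID, exdatumtypeID, offsetID, repeatID, value
       FROM StagingDB.dbo.DataValue WHERE sitesetID = @sitesetID
       EXCEPT
       SELECT timeID, datumtypeID, exdatumtypeID, offsetID, repeatID, value
       FROM ArchiveDB.dbo.DataValue WHERE sitesetID = @sitesetID ) AS diff;
-- Rows returned are new or changed relative to the archive;
-- an empty result means there is nothing to copy.
```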

45  L3/L4 European Quality and Gap-Filled Data
- Additional column headings denote quality flags, derived quantity algorithm, and gap fill algorithm
  - Example: Rg_f, Rg_fqc, qf_NEE_st, NEE_st_fMDS, NEE_st_fMDSqc
  - _f indicates fill; qc indicates quality; st, or, MDS, and ANN indicate algorithms
- Approach: load, decode, convert, delete (see the sketch below)
  - Load unchanged: quality flags treated as data values
  - Decode upgraded: additional column headers decoded and a flag/data indicator added
  - Convert: builds the quality flag row and connects it with the corresponding measurement
  - Delete: removes quality flag rows from the data table prior to the data copy from the staging database
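A speculative sketch of the convert and delete steps: the decoded flag/data indicator marks quality rows, each data row is joined to its flag row, and the flag value becomes the quality key before the flag rows are removed. The linkage columns (isQualityFlag, flagDatumtypeID) are assumptions, and as noted on slide 17 not all quality columns are 1:1 with data columns, so a real implementation needs more care.

```sql
-- Sketch of convert: attach each flag value to its measurement row.
UPDATE d
SET    d.qualityID = CAST(q.value AS int)   -- flag value becomes the quality key
FROM   StagingData d
JOIN   StagingData q
       ON  q.timeID      = d.timeID
       AND q.offsetID    = d.offsetID
       AND q.datumtypeID = d.flagDatumtypeID  -- e.g. NEE_st_fMDS <- NEE_st_fMDSqc
WHERE  q.isQualityFlag = 1
  AND  d.isQualityFlag = 0;

-- Sketch of delete: drop the flag rows before copying out of staging.
DELETE FROM StagingData WHERE isQualityFlag = 1;
```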

46  Work in Progress and Future Plans

47  Work in progress: improving current capabilities
- Debug and deploy quality flags on L3/L4 data
- Implement incremental data staging to enable speedy and simple data editing by an actual scientist (rather than a programmer)
- Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions
- Investigating integration with ArcGIS to get spatial data analyses
- Investigating MatLab, R, and S-Plus integration
  - Direct SQL queries available from MatLab and R
  - Given the query complexity, should we / can we build a wizard?
- Get BigPlot working again and invest in usability
  - Last summer's effort was really research into feasibility

48  Future Plans: transitioning from prototype
- Add data download
  - Current tabular report intended as a stopgap/technology demonstration only
  - Likely starting point: adapting the Euroflux collaboration download to an ftp zip file, given the good usability and compression
  - Alternative would be to write a CSV file converter via SSIS
- CasJobs and MyDb to support individual scientist analysis, including data editing for correction, recalibration, etc.
- Implement expanded metadata handling to enable scientists to add site characteristics and sort sites on those expanded definitions
- Automate MyCube construction to pair with MyDb
  - Internal Microsoft experience with real-life data suggests this is possible, and may even be fairly straightforward for cubes with common dimensions
- Review current deployment for true DMZ deployment

49  Longer Term Futures
- Handling imagery and other remote sensing information
  - Curating images is different from curating time series data
  - Using both together enables new science and new insights
- Graphical selection and display of data
- Support for user-specified calculations within the database
  - We've done LE_Pot, but there are clearly others
- Support for direct connections from analysis and statistical packages to the cube as well as the database
  - Leverage stored procedures (SQL) and nice wizards (MDX/SQL) to simplify query generation
- Linkage with models
  - Additional (emerging) data standards such as NetCDF
  - Handling just-in-time data delivery and model result curation
  - Need for workflow and single sign-on credentials
- Data mining subscription services
- Handling of a broader array of data types

