Presentation on theme: "Speeding Science Solutions for Data Curation from Microsoft (Research)"— Presentation transcript:
1 Speeding Science Solutions for Data Curation from Microsoft (Research)
Lee Dirks
Director, Education & Scholarly Communication
External Research Division
Microsoft Corporation
2 Microsoft External Research
Division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing
Supporting groundbreaking research to help advance human potential and the wellbeing of our planet
Developing advanced technologies and services to support every stage of the research process
Microsoft External Research is committed to interoperability and to providing open access, open tools, and open technology
3 Mission
Optimize and extend Microsoft software to meet the specific needs of the academic community
Our approach:
Conduct applied projects to enhance academic productivity by evolving Microsoft’s scholarly communication offerings
Microsoft External Research is uniquely positioned to drive this initiative across Microsoft
4 The Scholarly Communication Lifecycle
[Diagram: lifecycle stages and the Microsoft products and technologies shown alongside them]
Stages: Data Collection, Research & Analysis; Authoring; Publication & Dissemination; Storage, Archiving & Preservation; plus Collaboration and Discoverability
Products and technologies shown: Excel 2010; Windows Server HPC; “Astoria” / “Pop Fly”; SharePoint; LiveMeeting; Office Live; Office 2010 (Word, PowerPoint, Excel, OneNote); Tablet PC/UMPC; Office OpenXML; XPS Format; SQL Server & Entity Framework; Rights Management; Data Protection Manager; FAST; MSR Academic Search; “Bookweb”; SharePoint 2010; Word, PowerPoint 2010; WPF & Silverlight; “Sea Dragon” / “PhotoSynth” / “Deep Zoom”
5 Goal: Transform Scholarly Communication
Interoperability is essential
Actively lobby and drive for consensus around technical standards and standardized protocols proactively adopted by the community; enable broad community engagement
Customers have told Microsoft that interoperability is OUR responsibility
Leverage Existing Community Protocols, Practices, Guidelines, etc.
Example: metadata conventions / taxonomies / ontologies, a traditional strength for libraries and a critical component in enabling Web 2.0
Optimize for data-driven research
To both data (scientific) and to information (scholarly publications)
Reproducible research + computational science
Properly document / annotate scholarly output
Data preservation (and provenance) should be baseline
Documentation of the data’s provenance
Preservation needs to be like “accessibility” features, i.e., assumed as required
Semantic knowledge discovery & social networking
Harnessing collective intelligence must be a consideration, since accessing research is a core step in the life-cycle.
Enable knowledge discovery
Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily
7 Membership / Participation
DataCite is an international consortium to establish easier access to scientific research data on the Internet, increase acceptance of research data as legitimate, citable contributions to the scientific record, and to support data archiving that will permit results to be verified and re-purposed for future study.
The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium. OPF members benefit from the Planets results, new developments and the growing OPF community that includes experts at some of the most prestigious research, technology and memory institutions in Europe.
The Confederation of Open Access Repositories (COAR) is a not-for-profit association of repository initiatives launched in October. It aims to enhance greater visibility and application of research outputs through global networks of Open Access digital repositories.
The Coalition for Networked Information (CNI) is an organization dedicated to supporting the transformative promise of networked information technology for the advancement of scholarly communication and the enrichment of intellectual productivity. Membership includes some 200 institutions representing higher education, publishing, network and telecommunications, information technology, and libraries and library organizations.
ICSTI, the International Council for Scientific and Technical Information, offers a unique forum for interaction between organizations that create, disseminate and use scientific and technical information. ICSTI's mission cuts across scientific and technical disciplines, as well as international borders, to give member organizations the benefit of a truly global community.
CrossRef is a not-for-profit membership association whose mission is to enable easy identification and use of trustworthy electronic content by promoting the cooperative development and application of a sustainable infrastructure. CrossRef's general purpose is to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research.
9 GenePattern Reproducible Research Add-in
Services: Connects to GenePattern database
Relationships: Inline graphics are synchronized to dataset
Data: Control and execute query pipelines into GenePattern
Data: Resulting data (and provenance) stored within Word document
Source code and binary:
10 Creative Commons Add-in for Office 2007
Intent: Insert Creative Commons licenses from within Office 2007
Services: Integrates with Creative Commons Web API to create new licenses
Relationships: License information stored as RDF XML within the document OOXML
Source code and binary:
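A rough illustration of the "license information stored as RDF XML" point: the Python sketch below builds a small RDF/XML fragment that attaches a license URI to a work, using the `cc:Work` and `cc:license` terms from the Creative Commons vocabulary. This is not the add-in's actual code or storage layout; the function name `license_rdf` and the work URI are hypothetical, and how the add-in embeds the fragment inside the OOXML package is not described in the slides.

```python
import xml.etree.ElementTree as ET

# Standard RDF namespace and the Creative Commons rights vocabulary namespace.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"


def license_rdf(work_uri: str, license_uri: str) -> str:
    """Return an RDF/XML string stating that work_uri is under license_uri.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("cc", CC_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # cc:Work identifies the licensed document via rdf:about.
    work = ET.SubElement(root, f"{{{CC_NS}}}Work", {f"{{{RDF_NS}}}about": work_uri})
    # cc:license points at the license via rdf:resource.
    ET.SubElement(work, f"{{{CC_NS}}}license", {f"{{{RDF_NS}}}resource": license_uri})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "urn:example:my-document" is a placeholder identifier.
    print(license_rdf("urn:example:my-document",
                      "http://creativecommons.org/licenses/by/3.0/us/"))
```

Keeping the license as a machine-readable statement, rather than only as visible text, is what lets other tools detect the terms of reuse without parsing the document body.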
11 Ontology Add-in for Word 2007
Intent: Term recognition & disambiguation
Services: Ontology download web service
Relationships: Ontology browser
John Wilbanks, Phil Bourne, Lynn Fink
Source code and binary:
12 Article Authoring Add-in for Word 2007
Services: Repository deposit via SWORD
Structure: Read, convert, and author NLM XML documents
Relationships: ORE Resource Map creation
Relationships: Citation lookup and reference management
Structure: Client-side XML validation
Binary (version 2.0):
This work is licensed under a Creative Commons Attribution 3.0 United States License.
15 Project Trident: Scientific Workflow Workbench
Author, Execute and Monitor Workflows
View data products, performance metrics, and provenance data
Compose and modify workflows via drag & drop canvas
Organize collection of individual workflow activities
Available now:
17 The Windows Azure platform offers a flexible, familiar environment for developers to create cloud applications and services. With Windows Azure, you can shorten your time to market and adapt as demand for your service grows. Windows Azure offers a platform that is easily implemented alongside your current environment.
Offerings:
Windows Azure: operating system as an online service
Microsoft SQL Azure: fully relational cloud database solution
Windows Azure platform AppFabric: connects cloud services and on-premises applications
Microsoft Codename “Dallas”: information marketplace for data and web services
18 Azure – Project “Dallas”
Microsoft "Dallas" is a service allowing developers and information workers to easily discover, purchase, and manage premium data subscriptions in the Windows Azure platform.
Dallas is an information marketplace that brings data, imagery, and real-time web services from leading commercial data providers and authoritative public data sources together into a single location, under a unified provisioning and billing framework.
Dallas APIs allow developers and information workers to consume this premium content with virtually any platform, application or business workflow.
More:
20 Microsoft’s “OData” Initiative
What is it?
The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData does this by applying and building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores. The protocol emerged from experiences implementing AtomPub clients and servers in a variety of products over the past several years. OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional Web sites.
OData is consistent with the way the Web works: it makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web). This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools.
OData is released under the Open Specification Promise to allow anyone to freely interoperate with OData implementations.
Find out more & Contact Pablo Castro / Blog:
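To make the "URIs for resource identification" point concrete, here is a minimal Python sketch that composes an OData query URL from a service root, an entity set, and the standard system query options `$filter`, `$orderby`, `$top`, and `$format`. The service root `https://example.org/odata`, the entity set `Datasets`, and the helper name `odata_query_url` are assumptions made up for illustration; the sketch only builds the URL and does not contact any service.

```python
from urllib.parse import quote, urlencode


def odata_query_url(service_root, entity_set, filter_expr=None,
                    orderby=None, top=None, fmt=None):
    """Compose an OData query URL (hypothetical helper, illustration only)."""
    options = {}
    if filter_expr is not None:
        options["$filter"] = filter_expr   # e.g. "Year eq 2010"
    if orderby is not None:
        options["$orderby"] = orderby      # e.g. "Title desc"
    if top is not None:
        options["$top"] = top              # maximum number of entries
    if fmt is not None:
        options["$format"] = fmt           # e.g. "json" or "atom"

    url = f"{service_root.rstrip('/')}/{quote(entity_set)}"
    if options:
        # Keep "$" literal in option names; encode spaces as %20.
        url += "?" + urlencode(options, quote_via=quote, safe="$")
    return url


if __name__ == "__main__":
    print(odata_query_url("https://example.org/odata", "Datasets",
                          filter_expr="Year eq 2010", top=5, fmt="json"))
```

Because the whole query lives in the URL, any HTTP client can issue it with a plain GET, which is the uniform-interface property the slide emphasizes.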
21 Microsoft’s Open Government Data Initiative
The Open Government Data Initiative (OGDI) is a cloud-based collection of software assets that enables publicly available government data to be easily accessible. Using open standards and application programming interfaces (API), developers and government agencies can retrieve the data programmatically for use in new and innovative online applications, or mash-ups that can help:
Improve citizen services
Enhance collaboration between government agencies and private organizations
Increase government transparency
OGDI promotes the use of this data by capturing and publishing re-usable software assets, patterns, and practices. The data repository already holds over 60 different government datasets that are readily available for use in new applications, and is continuously updated with additional government datasets.
More:
22 Data Curation Add-in for Microsoft Excel (PROPOSED)
In partnership with the California Digital Library’s Curation Center
In collaboration with Tricia Cruse & John Kunze
Part of DataONE (an NSF DataNet project)
23 Data Curation Add-in for Microsoft Excel (PROPOSED)
Proposed functionality under consideration:
Support for versioning, so that revision history and the original raw data can be easily protected and recovered
Standardized date/time stamps, so that researchers can easily determine when the data were created and last updated
A “workbook builder” allowing researchers to select from globally shared standardized layouts for capturing data
Ability to export metadata in a standard format (e.g., a DataCite citation or an EML document that describes the dataset(s) in a workbook), so that researchers can readily share their data
Ability to select from a globally shared vocabulary of terms for data descriptions (e.g., column names), and as needed to add new terms to the globally shared vocabulary, to enable wide collaboration between researchers
Ability to import term descriptions from the shared vocabulary and annotate them locally to refine their definitions as used in the dataset
“Speed bumps” to discourage use of macros and customizations that would impede interoperation of data imported from Excel into other applications
Ability to deposit data and metadata directly into a data archive, to enable compliance with funding agency requirements to preserve and publish research data
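Two of the proposed features, standardized date/time stamps and metadata export as a citation, can be sketched in a few lines of Python. This is only an illustration of the idea, not the add-in's design: the `DatasetRecord` fields, the helper names, and the sample values (including the DOI) are hypothetical, and the citation layout loosely follows the "Creator (Year): Title. Publisher. Identifier" pattern that DataCite recommends.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class DatasetRecord:
    """Minimal descriptive metadata for one workbook (hypothetical fields)."""
    creators: List[str]
    title: str
    publisher: str
    year: int
    identifier: str
    created: datetime
    updated: datetime


def utc_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def citation(record: DatasetRecord) -> str:
    """Render a citation string: Creator (Year): Title. Publisher. Identifier."""
    names = "; ".join(record.creators)
    return (f"{names} ({record.year}): {record.title}. "
            f"{record.publisher}. {record.identifier}")


if __name__ == "__main__":
    record = DatasetRecord(
        creators=["Doe, J.", "Roe, R."],
        title="Example Tide Gauge Readings",
        publisher="Example Data Archive",
        year=2010,
        identifier="doi:10.1234/example",  # placeholder DOI
        created=datetime(2010, 3, 1, 9, 30, tzinfo=timezone.utc),
        updated=datetime(2010, 4, 15, 16, 0, tzinfo=timezone.utc),
    )
    print(citation(record))
    print("created:", utc_stamp(record.created))
    print("updated:", utc_stamp(record.updated))
```

Writing timestamps in a single UTC format avoids the ambiguity of locale-dependent spreadsheet dates, which is the problem the "standardized date/time stamps" item targets.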
24 Questions?
Lee Dirks
firstname.lastname@example.org
Director, Education & Scholarly Communication
Microsoft External Research
URL:
Facebook: Scholarly Communication at Microsoft