E-Science, the GRID and Statistical Modelling in Social Research Rob Crouchley Collaboratory for Quantitative e-Social Science University of Lancaster.



Contents: (1) The Problem/Motivation: Some Background on Statistical Methods and Social Research; (2) A Solution to Part of the Problem? GRID Enabling the Analysis of Multiprocess Random Effect Response Data; (3) Questions.

Part 1. Some Background on Statistical Methods and Social Research: Some Features of Social Science Research; Complications; A computationally demanding example; Sabre and Stata/MP.

Some Features of Quantitative Social Science Research: We often want to develop evidence-based substantive theory. We want to know “what determines what”, e.g. long-term unemployment and social exclusion. We also want to explore the consequences of policy changes on individual behaviour, e.g. the effect of encouragement to stay on at school on educational attainment, truancy and social exclusion. Our data sets are often very small (<10 GB), at least relative to those that can occur in particle physics, where data are now measured in petabytes (a petabyte is equivalent to the data in a pile of CDs, not in their cases, over 2.3 km high). Social science data sets are currently often less than 8 GB (13 CDs = 1.82 cm), though there may be exceptions, e.g. in geographical databases. Social science data sets may be small but they are very complex, which is one of the justifications for e-Science.

Some of the Complexities of non-experimental data: Cluster effects, random and fixed effects; Contextual effects; Measurement error; Missing data, dropout and selection; Parametric assumptions; Endogenous effects.

Some of the Consequent Issues: Disentangling the contributions of the different complexities to our results is computationally intensive; results can change substantially as our model becomes more comprehensive, e.g. some direct effects change sign and others become non-significant (NS); large-scale fixed effects analysis brings problems of its own, e.g. sparse matrices; to tackle these complexities we could use GRID-enabled tools, resources and services.

Social Science Research: Randomised experiments offer the most powerful tool for understanding social processes but, outside psychology, they are infeasible, unethical or inappropriate (e.g. we cannot allocate pupils to different levels of education); social scientists must therefore rely on observational data from longitudinal and other surveys, e.g. YCS, NCDS, BHPS. The analysis of non-experimental data involves complications.

Complication 1. Cluster Effects (CE): Most large-scale surveys use multi-stage sample designs to obtain 'representative' samples; this procedure often creates cluster effects, e.g. BHPS (households), YCS (schools); pupils in the same class are often more behaviourally alike than pupils in different classes (even in the same school). Cluster effects arise, for example, among students in the same class or people living in the same village.

Complication 1. Cluster Effects (CE), continued: Procedures have been developed to model cluster effects by means of shared random effects, e.g. in MLwiN, Stata (Gllamm), SAS, AML; the estimation of models with a non-identity link (and with non-nested CE), e.g. probit, can be computationally demanding. Endogenous: variation within the variable.

Complication 2. Measurement Errors (ME): In observational studies, it is rarely possible to measure all relevant covariates accurately, e.g. age, educational attainment; ignoring ME can seriously mislead the quantification of the link between explanatory and response variables; ME in one covariate can bias the association between other covariates and the response variable, even if those other covariates are measured without error. For example, some respondents misreport their age and some misreport their educational attainment.

Complication 2. Measurement Errors (ME), continued: Also, some important determinants of behaviour are either not measured (i.e. omitted) or are unmeasurable (e.g. motivation); repeated measures and longitudinal data provide the opportunity to deal with ME in explanatory variables, but this adds to the computational demands of the analysis. An example of repeated measures is the BHPS, where the same households are questioned every year.

Complication 3. Missing Data, Dropout and Selection: All of the major longitudinal data sets available to the British social science community (e.g. YCS, BHPS and NCDS) contain missing data and dropout; ignoring this could bias the model estimated on the data; we need to model, as realistically as possible, the process by which the observed subjects have been retained in the sample, otherwise we will not know how much bias is present in our results; also, some sample designs create selection effects of their own, e.g. by using a subset of locations, or oversampling the poor; all of these add to the computational demands of the analysis.

Complication 4. Parametric Assumptions: Our statistical tools are assumption rich: parametric linear predictors, parametric link functions and error structures. What if the assumed parametric relationships do not hold? BUT nonparametric statistical models are computationally intensive. Our tools assume a lot, e.g. that errors follow a normal (Gaussian) distribution.

Complication 5. Endogenous effects: The curse of endogenous effects is that everything seems to depend on everything else; we need multiprocess models (simultaneous equations) to disentangle this complexity, which adds to the computation; e.g. truancy depends on family background, background depends on wage, wage depends on educational attainment, educational attainment depends on truancy, …

Disentangling complexity with existing tools: an example This is the kind of example that got me interested in e-Science.

Disentangling complexity with existing tools: an example of endogenous effects. The YCS is a multi-stage stratified clustered random sample of individuals aged 16-17; I use YCS6, which covers young people eligible to leave school in 1990-91, who are then observed over the 1992-94 period.

Part-time work and truancy are potential determinants of educational attainment. A comprehensive model will allow us to disentangle the observable, direct effects of truancy on educational attainment from any effects that arise from correlation in the errors (unobserved effects).

Educational Attainment

Level of truancy

Part Time Work

Trivariate Ordered Probit Model (path diagram), independent errors (ep, et, eq). [Path diagram linking Part-time work, Truancy and Educational Attainment.]

Independent Errors (ep, et, eq): This model is quick (1-2 seconds) to estimate. It has 3 linear predictors: a probit for part-time (PT) work, and ordered probits for truancy and qualifications. We can use standard software, e.g. Stata.
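The reason the independent-errors model is so quick can be sketched in a few lines: with independent errors the joint likelihood factorises into three univariate terms, each needing only the normal CDF. The sketch below is illustrative only; the function names, cutpoints and linear-predictor values are hypothetical and are not taken from the YCS analysis.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(eta, cutpoints):
    """P(y = k), k = 0..len(cutpoints), for linear predictor eta.

    A binary probit is the special case with a single cutpoint at 0.
    """
    bounds = [-math.inf] + list(cutpoints) + [math.inf]
    return [norm_cdf(bounds[k + 1] - eta) - norm_cdf(bounds[k] - eta)
            for k in range(len(bounds) - 1)]

def independent_loglik(responses):
    """Log-likelihood for one individual under independent errors.

    responses: iterable of (eta, cutpoints, observed_category) triples,
    one per response (e.g. PT work, truancy, qualifications).
    """
    return sum(math.log(ordered_probit_probs(eta, cuts)[y])
               for eta, cuts, y in responses)

if __name__ == "__main__":
    # Hypothetical individual: binary PT work, 3-level truancy, 4-level attainment.
    person = [(0.2, [0.0], 1), (-0.1, [-0.5, 0.8], 0), (0.4, [-1.0, 0.0, 1.0], 2)]
    print(independent_loglik(person))
```

No multivariate integral appears anywhere in this calculation, which is why standard software handles it in seconds.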

Correlated Errors

Correlation Structure

Problems and Model Extensions: We cannot use standard software to fit the model via MLE; I used the NAG software library, which has special routines to evaluate high-dimensional multivariate normal integrals; even so, this model can take 2-3 weeks to estimate on a P4, with 3 linear predictors, 169 parameters and 8,496 trivariate integrals for each function evaluation; results from this model are quite different to those estimated under independence, e.g. one direct effect changes sign and another becomes non-significant (NS).
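To give a feel for what one of those trivariate integrals involves, here is a minimal sketch that estimates a trivariate normal rectangle probability by plain Monte Carlo. This is a swapped-in technique chosen for brevity, not the NAG routines used in the talk, and the correlation matrix, limits and number of draws are illustrative assumptions.

```python
import math
import random

def cholesky3(r):
    """Lower-triangular Cholesky factor of a 3x3 positive-definite matrix."""
    low = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(r[i][i] - s)
            else:
                low[i][j] = (r[i][j] - s) / low[j][j]
    return low

def mvn_rect_prob(lower, upper, corr, draws=20000, seed=1):
    """Monte Carlo estimate of P(lower < e <= upper) for e ~ N(0, corr)."""
    rng = random.Random(seed)
    low = cholesky3(corr)
    hits = 0
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        # Correlate the independent draws: e = L z.
        e = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(3)]
        if all(lo < x <= hi for lo, x, hi in zip(lower, e, upper)):
            hits += 1
    return hits / draws

if __name__ == "__main__":
    corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
    neg_inf = [-math.inf] * 3
    print(mvn_rect_prob(neg_inf, [0.0, 0.0, 0.0], corr))
```

Each likelihood evaluation in the correlated-errors model needs 8,496 such probabilities, and the optimiser needs many evaluations, so even a modest cost per integral multiplies up quickly.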

What is happening? Evaluating lots of 3-dimensional integrals in order to compute our likelihood functions is computationally demanding. We could: try other methods for evaluating integrals, such as Gibbs sampling and MCMC; use approximations (Laplace expansions with many terms, pseudo- and quasi-likelihood methods); estimate fixed effects versions of the models; use instruments for the endogenous covariates. All can be computationally demanding, and each approach has its own problems.

If we want to go this way, what can we do? Use parallel algorithms on the Grid; use faster hardware, e.g. HPCx (also part of the Grid); both.

In the education example I’ve assumed: particular directions for the direct effects; no non-ignorable dropout in the YCS; no school cluster effects present; an MVN error structure; a linear predictor, additive function; no measurement error in observed covariates. We do not yet have the computational power (on the GRID) to relax all the assumptions simultaneously in this model.

The Grid: some definitions. "…is the Web on steroids."; "…is distributed computing across multiple administrative domains" (Dave Snelling, senior architect of UNICORE); […provides] “flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources” (from “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”); "…enables communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals."

SABRE – Software for the Analysis of Binary Recurrent Events. What is it? A program for analysing multivariate binary, ordinal, count and recurrent events data. It employs fast numerical algorithms and uses Gaussian quadrature and NPMLE for the random effects (REs). Some typical application areas: infertility in humans; animal husbandry; voting, trade union membership, economic activity and migration; absenteeism studies.
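As an illustration of what "Gaussian quadrature for the random effects" means, the sketch below integrates a normally distributed random intercept out of a logistic model for one cluster using 5-point Gauss-Hermite quadrature. It is a minimal sketch, not SABRE's implementation: the function names are hypothetical and a real fit would typically use more quadrature points.

```python
import math

# 5-point Gauss-Hermite nodes and weights for the weight function exp(-x^2).
GH_NODES = [-2.02018287045609, -0.958572464613819, 0.0,
            0.958572464613819, 2.02018287045609]
GH_WEIGHTS = [0.0199532420590459, 0.393619323152241, 0.945308720482942,
              0.393619323152241, 0.0199532420590459]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def cluster_likelihood(ys, etas, sigma):
    """Marginal likelihood of one cluster's binary responses.

    ys: observed 0/1 responses; etas: fixed-effect linear predictors;
    sigma: standard deviation of the shared random intercept u ~ N(0, sigma^2).
    """
    total = 0.0
    for x, w in zip(GH_NODES, GH_WEIGHTS):
        u = math.sqrt(2.0) * sigma * x  # change of variable to match exp(-x^2)
        lik = 1.0
        for y, eta in zip(ys, etas):
            p = logistic(eta + u)
            lik *= p if y == 1 else 1.0 - p
        total += w * lik
    return total / math.sqrt(math.pi)

if __name__ == "__main__":
    # Hypothetical cluster: three repeated binary responses on one individual.
    print(cluster_likelihood([1, 0, 1], [0.3, -0.2, 0.5], sigma=1.0))
```

The integral over the random effect becomes a short weighted sum, which is what makes this approach fast when there is one random effect per cluster; the cost grows quickly as more correlated random effects are added.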

SABRE: Why use it? [Run times shown on the slide: >6 months; >1 week.] The data are administrative records covering duration in employment in the workforce of a major Australian state government, used to investigate the determinants of quits and separations amongst permanent and temporary workers. NP baseline hazard, quadrature for the REs.

An Alternative: Stata/MP

What about SABRE and Stata/MP? Stata/MP is 1.7 times faster on 2 processors, 2.8 times faster on 4 processors and 4 times faster on 8 processors. Sabre can achieve a slightly better speedup, but the big difference is probably the base from which Stata/MP starts. Using the previous example on our HPC we could have (in minutes)
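The Stata/MP speedups quoted above can be turned into parallel efficiencies, and into the parallel fraction implied by Amdahl's law, with a few lines of arithmetic. This is a sketch for interpretation only; applying Amdahl's law here is my own assumption and is not part of the original talk.

```python
# Speedups quoted on the slide for Stata/MP: processors -> speedup.
STATA_MP_SPEEDUP = {2: 1.7, 4: 2.8, 8: 4.0}

def efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors

def amdahl_parallel_fraction(speedup, processors):
    """Solve Amdahl's law S = 1 / ((1 - f) + f / p) for the parallel fraction f."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)

if __name__ == "__main__":
    for p, s in sorted(STATA_MP_SPEEDUP.items()):
        print(p, round(efficiency(s, p), 3), round(amdahl_parallel_fraction(s, p), 3))
```

Efficiency falls from 85% on 2 processors to 50% on 8, which is the usual pattern when part of the work stays serial.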

An empirical analysis of vacancy duration using micro data from Lancashire Careers Service over the period 1985–1992, NP base line hazard, quadrature for the REs

What have I said so far? That the estimation (via maximum likelihood) of some statistical models can be very computationally demanding and beyond what you can usefully do on your desktop.

Ways of running Sabre on the GRID: directly via the operating system, e.g. Globus; via a portal, e.g. Science Gateway; via a desktop application, like the tip of an iceberg (I’m going to concentrate on this for the rest of the talk).

Using the Grid Via a Desktop Application: Separation of client and server logic. Why? The implementation of the service logic may change to allow for improved algorithms, models, scheduling policies and so on. However, the user interface stays the same!

Using the Grid Via a Desktop Application: Take as an example: (1) SABRE; (2) using GROWL, the Grid Resources on a Workstation Library; (3) integration of SABRE functionality into statistics software (R and Stata).

Solution, and how: Host Sabre as a secure Web service. Difficult to do! The service needs to be secure; the service needs to be persistent; many services are provided via a single host on a single port; there are multiple clients. These features are easy to provide by employing the generic GROWL server, which allows the developer to concentrate on just the service logic (algorithms, scheduling etc.).

Web services A software system designed to support interoperable machine-to-machine interaction over a network. It has an interface that is described in a machine-processable format such as WSDL. Other systems interact with the Web service in a manner prescribed by its interface using messages, which may be enclosed in a SOAP envelope, or follow a RESTful approach. These messages are typically conveyed using HTTP, and normally comprise XML in conjunction with other Web-related standards. Software applications written in various programming languages and running on various platforms can use web services to exchange data over computer networks like the Internet in a manner similar to inter-process communication on a single computer. This interoperability (for example, between Java and Python, or Microsoft Windows and Linux applications) is due to the use of open standards. OASIS and the W3C are the primary committees responsible for the architecture and standardization of web services.
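As a concrete picture of a message "enclosed in a SOAP envelope", the sketch below builds a SOAP 1.1 envelope with Python's standard library. The operation name `fitModel`, its parameters and the `urn:example:sabre` namespace are hypothetical placeholders; they are not the actual Sabre or GROWL service interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

def build_envelope(operation, params, service_ns="urn:example:sabre"):
    """Return a SOAP 1.1 request as a string.

    operation and service_ns are illustrative; a real service publishes
    its own operation names and namespace in its WSDL.
    """
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

if __name__ == "__main__":
    print(build_envelope("fitModel", {"dataset": "ycs6", "processors": 4}))
```

Such a message would normally be sent in the body of an HTTP POST; the client only needs to know the interface, not how or where the service runs the job.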

Configuration (three-tier diagram): First tier: clients. Second tier: the GROWL server. Third tier: services, provided by agents. This is like an iceberg: you only see the bits on top, and the second and third layers are completely hidden from the user. The client is Stata, R, SPSS or Word on the desktop. The third layer is an agent service factory: if the four clients select different services, which agent is created depends on what each client requests. Sabre on the NGS, Sabre on 16 processors and Sabre on the desktop are all controlled through a common interface. The GROWL server is publicly available; there is no reason why a department cannot use a GROWL server to access all the PCs in its office, so it is not restricted to running on the GRID.

Example: Using Sabre on a GRID from Stata. The user gets a Stata plugin (unzip it in the user's ado directory). This adds some items to the Stata menus and provides a series of dialogue boxes.

GROWL SERVICES: Could contain lots of other software, e.g. MCMC software on the Grid. Could use lots of different systems, e.g. NGS, NWG etc.
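The three-tier idea described above (a client calls one entry point, and a service factory behind the server decides which agent runs the job) can be sketched as follows. All class, method and service names here are hypothetical and chosen for illustration; this is not the GROWL API.

```python
from typing import Callable, Dict

class AgentFactory:
    """Third-tier stand-in: maps a service name to a callable that does the work."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        self._services[name] = handler

    def create(self, name: str) -> Callable[[str], str]:
        if name not in self._services:
            raise KeyError(f"unknown service: {name}")
        return self._services[name]

class Server:
    """Second-tier stand-in: a single entry point that hides where the job runs."""

    def __init__(self, factory: AgentFactory) -> None:
        self._factory = factory

    def submit(self, service: str, job: str) -> str:
        agent = self._factory.create(service)
        return agent(job)

def build_demo_server() -> Server:
    factory = AgentFactory()
    # Placeholder back ends; real agents would launch jobs on real resources.
    factory.register("sabre-desktop", lambda job: f"desktop ran: {job}")
    factory.register("sabre-ngs", lambda job: f"NGS ran: {job}")
    return Server(factory)

if __name__ == "__main__":
    server = build_demo_server()
    # First tier: the client code is identical whichever back end is chosen.
    for service in ("sabre-desktop", "sabre-ngs"):
        print(server.submit(service, "fit trivariate model"))
```

Because the client only ever calls `submit`, the back ends can be swapped or improved without changing anything on the desktop, which is the point of separating client and server logic.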

Integration


Authentication required for a Fit

SABRE – Availability and Support. Web site: http://sabre.lancs.ac.uk. Full command documentation; tutorials; example data; publications. Downloads: “SabreR” binary R packages including documentation (end 06/2006); “SabreStata” Stata plugin including documentation (end 07/2006); Sabre source code.

What have I said in Part 2? There are beginning to be some tools that can make a lot more resources (Grid) available to you from within desktop applications.

Lancaster’s Statistical Software for e-Social Scientists. SABRE: Software for the Analysis of Binary Recurrent Events, www.sabre.lancs.ac.uk. GROWL: Grid Resources On Workstation Library, www.growl.org.uk. e-science.lancs.ac.uk/cqess/

SABRE: SABRE is a program specifically designed for the analysis of binary, ordinal and count recurrent events, as are common in many surveys. SABRE’s dedicated software ensures fast response times.

SABRE + R: Adding SABRE as a plug-in to R allows Sabre commands to be processed from the R user interface. Configuration of models and preparation of data are then undertaken using the extensive functionality of R.

SABRE + R + GROWL: Using GROWL components, SABRE commands invoked in R are executed in parallel on the GRID, making SABRE an excellent e-Social Science tool.

Application areas: Studies of voting behaviour, trade union membership, economic activity and migration. Demographic surveys. Studies of infertility in humans. Animal husbandry. Absenteeism studies. Clustered sampling schemes.

R Commander: The familiar R interface is being maintained by using SABRE as a plug-in.

Grid Resources on Work Stations: GROWL employs a client/server architecture that hides the complexity of GRID middleware from the user. Client access to GROWL employs a secure (PKI/SSL) connection to a single port on the host system, and clients are authenticated using the distinguished name extracted from their certificate. The use of a persistent server to access grid resources allows all of the service logic to be hosted by the server, making the client application, library or plugin extremely lightweight.

Acknowledgements: Sabre was originally developed by Lancaster University’s Centre for Applied Statistics; further development and use cases have been funded by the EPSRC and ESRC as part of the NCeSS CQeSS node.

Future developments: Course material for the use of Sabre is currently being developed. It is planned to launch a Sabre/GROWL service on the North West Grid within the coming year. This will provide a utility-based grid resource. Research into labour markets using Sabre/GROWL. SABRE will become available as a plug-in for Stata.

SABRE Specifications: Fits mover-stayer models, conventional logistic, logistic-normal and logistic-normal with end-points models to binary data. Ordered probit and logit random effect response models. Fits conventional log-linear, log-linear normal and log-linear normal with end-point models to count data. Substantial control is available over the parameters of the algorithm for the sophisticated user. Very long sequences of data. Multi-process data, where each response sequence is of a different type, limited to the simultaneous analysis of trivariate correlated sequences. Capable of running in a parallel computing environment. Further information: http://www.sabre.lancs.ac.uk

R Commander: Sabre can be added as a library to R so that R is menu driven, rather than command driven. This makes R easier to use.

Invoking a computationally intensive and parallelised method on a Grid (diagram): R program; OGSA client invoked as a method call; local O/S, e.g. workstation; OGSA; remote O/S, e.g. parallel computer; componentised parallel algorithm.

Middleware for e-Social Science: Development of a parallel, multilevel, multiprocess (OGSA) implementation of SABRE as an R object to enable social scientists to disentangle the full stochastic complexity of socio-economic processes.

SABRE and GROWL, SABRE development: GROWL provides a client-side lightweight library as a plug-in to R, providing easy, user-friendly access to Grid resources and computational power, providing

You can watch a more detailed presentation about GROWL by Dan Grose at the NCeSS conference online at http://redress.lancs.ac.uk/Workshops/Presentations.html

Any Questions?