C Q e S S 1 E-Science and Statistical Modelling in Social Research Daniel Grose Audrienne Cutajar Bezzina CQeSS University of Lancaster
C Q e S S 2 Contents Some Background on Statistical Methods and Social Research; Disentangling Complexity: Educational attainment, truancy and PT work (NCDS); ReDReSS; Reload and CopperCore; Demo of Sakai; Questions.
C Q e S S 3 Some Background on Statistical Methods and Social Research
C Q e S S 4 Objectives of Social Science Research To develop evidence-based substantive theory. We want to know “what determines what”, e.g. the (wage) returns to education; To explore the consequences of policy changes on individual behaviour, e.g. the impact of increasing the staying-on rate at school on educational attainment & wages.
C Q e S S 5 Objectives of Social Science Research Randomised experiments offer the most powerful tool to meet these objectives, but outside of psychology they are infeasible, unethical or flawed (e.g. we cannot allocate pupils to different levels of education); Social scientists must therefore rely on observational data from longitudinal and other surveys, e.g. YCS, NCDS, BHPS. This raises complications.
C Q e S S 6 Complication 1. Cluster Effects (CE) Most large scale surveys use multi-stage sample designs to obtain 'representative' samples; this procedure often creates cluster effects, e.g. BHPS (households), YCS (schools); Pupils in the same class are often more behaviourally alike than pupils in different classes (even in the same school); Some non-nested cluster structures can also be present, e.g. siblings (children of the same family) at different schools.
C Q e S S 7 Complication 1. Cluster Effects (CE) Procedures have been developed to take cluster effects into account by means of shared random effects in the model - MLwiN, Stata (gllamm, www.gllamm.org/); The estimation of non-identity link and non-nested CE models, e.g. probit, can be computationally demanding.
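A minimal sketch of what a shared random effect does, using a simplified linear (identity link) model rather than the probit models mentioned on the slide. Every member of a cluster receives the same draw u, which makes members of the same cluster more alike than members of different clusters. The function names, variances and sample sizes are illustrative assumptions, not part of the original presentation.

```python
import random
import statistics


def simulate_clusters(n_clusters, cluster_size, sigma_u, sigma_e, seed=0):
    """Simulate y_ij = u_j + e_ij, where u_j is shared by the whole cluster."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_clusters):
        u = rng.gauss(0.0, sigma_u)  # shared random effect for this cluster
        data.append([u + rng.gauss(0.0, sigma_e) for _ in range(cluster_size)])
    return data


def intraclass_correlation(data):
    """One-way ANOVA estimator of the intra-class correlation (balanced clusters)."""
    n = len(data)
    k = len(data[0])
    cluster_means = [statistics.fmean(cluster) for cluster in data]
    grand_mean = statistics.fmean(cluster_means)
    # Between-cluster and within-cluster mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in cluster_means) / (n - 1)
    msw = sum(
        (y - m) ** 2 for cluster, m in zip(data, cluster_means) for y in cluster
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Illustrative values: theoretical ICC = 0.5**2 / (0.5**2 + 1.0**2) = 0.2
data = simulate_clusters(n_clusters=2000, cluster_size=10, sigma_u=0.5, sigma_e=1.0)
icc = intraclass_correlation(data)
print(f"estimated intra-class correlation: {icc:.3f}")
```

An intra-class correlation above zero means the observations are not independent, so an analysis that ignores the clustering would treat the sample as more informative than it is.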
C Q e S S 8 Complication 2. Measurement Errors (ME) Ignoring ME can seriously mislead the quantification of the link between explanatory and response variables; In observational studies, it is rarely possible to measure all relevant covariates accurately, e.g. age, educational attainment; ME in one covariate can bias the association between other covariates and the response variable, even if those other covariates are measured without error;
C Q e S S 9 Complication 2. Measurement Errors (ME) Also some important determinants of behaviour are either not measured (i.e. omitted) or are unmeasurable (e.g. motivation); Repeated measures and longitudinal data provide the opportunity to deal with ME in explanatory variables, but this adds to the computational demands of the analysis.
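The effect of ME on an estimated association, and the way a repeated measure can help, can be sketched with a small simulation. This assumes classical (independent, additive) measurement error and a simple linear model; the variable names and parameter values are illustrative assumptions. The second measurement is used here as an instrument for the first, which is one of several possible corrections and not necessarily the one the authors used.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


rng = random.Random(1)
n = 20000
beta = 2.0  # true effect of the error-free covariate (illustrative value)

x = [rng.gauss(0.0, 1.0) for _ in range(n)]          # true covariate, never observed
y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]    # response
w1 = [xi + rng.gauss(0.0, 1.0) for xi in x]          # first error-prone measurement
w2 = [xi + rng.gauss(0.0, 1.0) for xi in x]          # repeated measurement

# Regressing y on w1 is attenuated towards beta * var(x) / (var(x) + var(error)).
naive_slope = cov(w1, y) / cov(w1, w1)

# Because the errors in w1 and w2 are independent, cov(w2, y) / cov(w2, w1)
# is a consistent estimate of beta.
corrected_slope = cov(w2, y) / cov(w2, w1)

print(f"naive slope: {naive_slope:.2f}, corrected slope: {corrected_slope:.2f}")
```

With equal variances for the true covariate and the error, the naive slope should settle near half the true value, while the corrected slope should settle near the true value.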
C Q e S S 10 Complication 3. Missing Data, Dropout and Selection All of the major data sets available to the British social science community (e.g. YCS, BHPS and NCDS) contain missing data and dropout; This creates bias in the data; We need to model, as realistically as possible, the process by which the observed subjects have been retained in the sample, otherwise we will not know how much bias is present in our results; Some sample designs create selection effects, e.g. by using a subset of locations, or oversampling the poor; These add to the computational demands of the analysis.
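A small simulation can illustrate why the retention process matters. In this sketch, subjects with higher outcomes are more likely to stay in the sample, so the mean of the retained subjects is biased; weighting each retained subject by the inverse of its retention probability removes the bias. The retention probabilities are assumed known here, which is a strong simplification: in practice they would have to be modelled and estimated. All names and values are illustrative assumptions.

```python
import math
import random


def retention_prob(y):
    """Hypothetical retention model: higher outcomes are more likely to be retained."""
    return 1.0 / (1.0 + math.exp(-y))


rng = random.Random(2)
n = 50000
outcomes = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true population mean is 0

# Keep each subject with a probability that depends on its own outcome.
retained = [
    (y, retention_prob(y)) for y in outcomes if rng.random() < retention_prob(y)
]

# Naive estimate: simple mean of whoever remains in the sample.
naive_mean = sum(y for y, _ in retained) / len(retained)

# Inverse probability weighting: up-weight subjects who were unlikely to be retained.
ipw_mean = sum(y / p for y, p in retained) / sum(1.0 / p for _, p in retained)

print(f"naive mean: {naive_mean:.3f}, weighted mean: {ipw_mean:.3f}")
```

The same logic applies to designs that oversample a subgroup, such as the poor: the selection probabilities are then set by the design rather than by dropout.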
C Q e S S 11 Complication 4. Parametric Assumptions Our statistical tools are assumption rich: –Parametric linear predictors, –Parametric link functions and error structures; What if the assumed parametric relationships do not hold (e.g. the errors are not Gaussian)? We need more robust alternatives; BUT - Nonparametric statistical models are usually computationally intensive.
C Q e S S 12 Complication 5. Endogenous effects The curse of endogenous effects: everything seems to depend on everything else; We need multiprocess models (simultaneous equations) to disentangle this complexity, which adds to the computation.
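As an illustration only, a two-process model of this kind might be written as follows. The notation is an assumption made for this sketch and does not come from the presentation: two binary responses for individual i (for instance truancy and an attainment indicator), covariate vectors x, and a shared unobserved individual effect u_i such as motivation.

```latex
\begin{align}
y^{*}_{1i} &= x_{1i}'\beta_{1} + \gamma\, y_{2i} + \lambda_{1} u_{i} + \varepsilon_{1i}, \\
y^{*}_{2i} &= x_{2i}'\beta_{2} + \lambda_{2} u_{i} + \varepsilon_{2i}, \\
y_{ki} &= \mathbf{1}\{\, y^{*}_{ki} > 0 \,\}, \qquad k = 1, 2.
\end{align}
```

Here y_{2i} appears as a regressor in the first equation but also depends on u_i, so it is correlated with the unobserved part of the first equation whenever both loadings are non-zero. Estimating the first equation on its own would then give a biased estimate of the effect of y_{2i}; estimating the two equations jointly, integrating over u_i, is what makes the analysis computationally heavier.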
C Q e S S 13 Disentangling complexity with existing tools: an example These are the kind of examples that got me interested in e-Science. As we start to more fully acknowledge the stochastic complexity of social processes, our results will change.
C Q e S S 14 Example 1: Allowing for Cluster effects Stata, e.g. dprobit with the cluster option (http://www.stata.com/help.cgi?dprobit); MLwiN (http://multilevel.ioe.ac.uk/index.html); aML; SAS. What happens if we have more than one response, e.g. training and promotion? Standard software can’t do it. What happens if we have previous outcomes in the model? Standard software can’t do it.
C Q e S S 15 Example 2: Allowing for Endogenous effects Simultaneous equation systems; Commands in Stata; Commands in aML.
C Q e S S 16 Some existing web based tools Nesstar allows 66 major datasets to be explored online (http://www.nesstar.com/); Only uses one data set at a time; Has very limited facilities for sub-setting and none for fusing; Restricted statistical facilities, e.g. descriptive analysis, linear regression; No facilities for handling missing data.
C Q e S S 17 Joining Up the Analysis Cycle Main ESDS Data Sets Select Data Set and Appropriate Variables: TTWA Data, NOMIS Merge Files: Add Variables Working Data Contextual Data Results
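The merge step in this cycle, adding contextual variables to individual-level records, can be sketched as a keyed join. The field names (`person_id`, `ttwa`, `unemployment_rate`) and all values below are invented toy data for illustration; they are not taken from ESDS, TTWA or NOMIS files.

```python
def merge_contextual(individuals, contextual, key):
    """Attach area-level (contextual) variables to each individual-level record."""
    merged = []
    for row in individuals:
        # Records whose area has no contextual data are kept unchanged.
        context = contextual.get(row[key], {})
        merged.append({**row, **context})
    return merged


# Toy individual-level records (hypothetical fields and values).
survey_rows = [
    {"person_id": 1, "ttwa": "A", "wage": 250},
    {"person_id": 2, "ttwa": "B", "wage": 310},
    {"person_id": 3, "ttwa": "A", "wage": 275},
]

# Toy contextual data keyed by area code (hypothetical values).
area_data = {
    "A": {"unemployment_rate": 6.1},
    "B": {"unemployment_rate": 3.4},
}

working_data = merge_contextual(survey_rows, area_data, key="ttwa")
for row in working_data:
    print(row)
```

Individuals in the same area share the same contextual values after the merge, which is one way the cluster structure discussed earlier enters the working data.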
C Q e S S 18 Portals make all our e-tools easier to use Portals provide a framework to deploy our e-tools (aka rectangles); they focus on how the user wants to arrange these “rectangles”; The portal allows component integration; the goal is for the tools to work together closely and seem to really be parts of a larger “tool”.
C Q e S S 19 SAKAI Provides our VRE Portal Sakai = Collaboration & Research/Learning Environment [Diagram labels: Portal; Discussion, Video Conf and VOIP; GE Resource Discovery; E-Collaboration Portlets; GE DBMS; GE Statistical Analysis; Quantitative Methods Portlets; Res 1 to Res 6]
C Q e S S 20 Sakai Sakai is open source; it is the hosting framework of choice for VLE and VRE (OGCE) development in the US; Big investment from Mellon Foundation and Ivy League Universities ($6.8M); Sakai 2.0 (release 10th June 05) will take WSRP compliant portlets. http://redress.lancs.ac.uk:8080/portal
C Q e S S 21 Using WSRP to federate across sites and provide extreme user flexibility in presentation [Diagram labels: Portal; Sakai tools; Non-Sakai tools; Non-Java tools; HTTP; WSRP]
C Q e S S 22 LDCue for Structuring Content LDCue integrates content created by most standard authoring systems (incl. video) that is visible on the web; A resource discoverer will be able to specify "where am I now" and "where do I want to be"; they are then supplied, by the LDCue tool, with a list of potentially suitable learning object URIs; The metadata on these URIs are then used to create learning designs that sequence material (read this first, then this, etc.).
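The sequencing idea can be sketched in a few lines. This is not LDCue's implementation or API: the function name, the `difficulty` metadata field and the example.org URIs are all hypothetical, chosen only to show how metadata could turn a list of candidate URIs into an ordered reading sequence between a starting level and a target level.

```python
def sequence_learning_objects(objects, current_level, target_level):
    """Return URIs between the two levels, ordered from easiest to hardest."""
    suitable = [
        obj for obj in objects
        if current_level <= obj["difficulty"] <= target_level
    ]
    return [obj["uri"] for obj in sorted(suitable, key=lambda obj: obj["difficulty"])]


# Hypothetical learning objects with a made-up difficulty scale (1 = introductory).
candidates = [
    {"uri": "http://example.org/random-effects", "difficulty": 3},
    {"uri": "http://example.org/intro-regression", "difficulty": 1},
    {"uri": "http://example.org/logit-probit", "difficulty": 2},
    {"uri": "http://example.org/multiprocess-models", "difficulty": 5},
]

reading_order = sequence_learning_objects(candidates, current_level=1, target_level=3)
for position, uri in enumerate(reading_order, start=1):
    print(position, uri)
```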
C Q e S S 23 Reload & CopperCore Just as a musician composes a piece, Reload is used to compose the structure of the learning design. The learner is the deejay who plays back the learning design created in Reload.
C Q e S S 24 Reload & CopperCore (cont) CopperCore is the medium used to play back the learning design created in Reload. CopperCore gives a structure to the learning modules, and keeps track of what has been covered by the learner.
C Q e S S 25 Reload Structure The IMS Learning Design package within Reload is made up of the following tabs: –General –Roles –Environment –Activities –Methods –Resources
C Q e S S 29 Roles (cont) This tab allows the user to define learner and staff roles, each with different characteristics. Further information can be added, such as the minimum and maximum size of a group.
C Q e S S 30 Environment This describes the environments in which the learning occurs.
C Q e S S 43 Advantages of LDCue over a search engine on the web Search engines do not sequence material by difficulty/complexity; With Learning Design you get semantically coherent content; Search engines (e.g. Google) typically give associative learning, which can be inefficient, especially when you get a lot of hits.
C Q e S S 44 Some of the VRE Tools we have written E-Collaboration: Distributed Whiteboard; Voice and Video over IP; Broadcast Display (e.g. Word and PPT). E-Discovery: LDCue for Structuring Content.
C Q e S S 45 ReDReSS ReDReSS is a joint project between Lancaster University and CCLRC Daresbury. It is a training and awareness project in eScience and eSocial Science. We are commissioning social scientists to write material for our portal http://redress.lancs.ac.uk
C Q e S S 46 Content Jan-May 2005 [Figure labels: ReDReSS; NCeSS; NCeSS Conference Paper; Other]
C Q e S S 47 Content Jan-May 2005 (cont) [Figure labels: Finished; NCeSS; Other; NCeSS/ReDReSS]
C Q e S S 48 Content May-Aug 2005 [Figure labels: ReDReSS; NCeSS; NCeSS Conference Paper; ReDReSS/NCeSS]
C Q e S S 49 Content May-Aug 2005 (cont) [Figure labels: ReDReSS; NCeSS; Other; ReDReSS/NCeSS]