IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing www.sasc.co.uk.


1 IT Issues for large surveys IASC / IASS Summer School Knowledge Discovery in Large Surveys June 2001 Andrew Westlake Survey & Statistical Computing

2 Structure of the Presentation
Motivation
  »What is the Problem?
Database Issues: Relational Databases
  »Things every Statistician should know
Modelling Issues
  »Useful additional tools for Projects
© Survey & Statistical Computing 2001

3 What is the problem?
Why should statisticians need to know about IT?
Is there anything important to know?
Why not leave it all to the IT specialists?

4 Statistical Analysis is Simple

5 Statistical Analysis: not so simple

6 Statistical Analysis: the Survey View

7 Statistical Analysis: the Statistical Office Process View

8 Why not use Packages?
Standard packages provide Functionality
  »Often adequate for particular tasks
    >Elements of analysis
    >Questionnaire design
    >Data Storage
  »Implements a particular view of a problem area
    >Often helpful, but can be limiting
Not sufficient for implementing Processes
  »Sequence, Control and Validation (Knowledge)
Potential as Components in overall System

9 Why get Involved?
We need good data
  »Cannot do good analysis with bad data
We need good access to data
  »Systems often designed inflexibly, to limited initial targets
It can save time in the long run
  »We’ll get involved anyway in cleaning up the mess
  »Easier to work with properly focussed system
IT people have useful tools and concepts
  »Can help us to do our ordinary tasks

10 Development Opportunities
Continuous Processes
  »Continuous or repeated surveys, business enquiries
Repeated Processes
  »Design and analysis service for small surveys
  »Regular reporting, with adjustments and estimation
Sharing
  »Secondary analysis of data (through Data Archive?)
  »Dissemination systems
Very Large Projects
  »E.g. Census

11 Applications to statistical problems
Statistics is not just about analysis
  »Need to be concerned about good data in and good use of conclusions
Examples of development areas
  »Processing
    >HIV and AIDS notification system
    >Processing and Manipulation of results from a Demographic Survey in Pakistan
    >Result processing from Construction Industry inquiries
  »Statistical Databases
    >Support for dissemination to users and policy makers, and for further analysis
  »Metadata
    >Initiatives to standardise concepts and structures
  »Integrated statistical analysis systems
    >Analysis tools seen as a component that integrates with other components, data store, metadata, dissemination, etc.
  »Distributed resource systems
    >Data owners retain control, but users can see distributed resources as an integrated whole

12 Developing new Software and Systems
Don’t do it unless it’s essential
  »Difficult to do well, Expensive, Time-consuming
Do it properly
  »Don’t leave it all to the IT experts
    >They don’t know what you want
    >They don’t know what is important
  »User-Centred design (HCI) is not enough
    >Important advance, but leaves the power with the IT people
  »Learn the Concepts and Jargon
    >Useful tools for thinking about structure and design
  »Take part in the development process
    >Not too much detail (but enough)
    >Concentrate on functionality
    >Use proper tools and a proper methodology

13 Tools IT People use
Concepts about Structure
  »Relational database ideas
  »Object-oriented concepts
  »Modelling of structure and process
Implementation standards
  »OLAP – handling aggregate data
  »XML – Interchange of complex structures
  »Component architecture – Cooperating tools, not monolithic structures
Design and Development
  »UML – for system design based on objects
  »Methodologies
    >Contextual Inquiry – for User Requirements
    >Feature-Driven Development, Rational Unified Process – for managing the development process

14 Database Issues: Relational Databases
Relational Model and SQL
  »Components
  »Operations
Data Warehouses
  »What is different?
Aggregate Data and Data Cubes
  »Structure
  »Functionality
Examples
  »Processing the PFFPS database in MS Access and SQL

15 Database Schema Levels

16 Relational databases
Relational structures and operations
  »Agreed model of components and behaviour (Codd)
  »Standardised implementations through SQL
Views
  »Definitions of extractions and combinations can be stored and used as though they were physical tables
Physical model choice depends on intended usage as well as logical structure
Normalisation: method to avoid duplication and dependencies
  »Important for transaction systems
  »Easier to be consistent, faster updates
ODBC standard for access to micro data from applications
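
The point about Views can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table `respondent` and the view `women_15_49` are hypothetical names invented for this illustration; they are not part of the PFFPS database.

```python
import sqlite3

# In-memory database; table and view names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE respondent (id INTEGER PRIMARY KEY, sex TEXT, age INTEGER);
    INSERT INTO respondent (sex, age) VALUES ('F', 23), ('M', 31), ('F', 52);
    -- The view stores the definition of the extraction, not a copy of the rows.
    CREATE VIEW women_15_49 AS
        SELECT id, age FROM respondent
        WHERE sex = 'F' AND age BETWEEN 15 AND 49;
""")

# The view is queried exactly as though it were a physical table.
before = conn.execute("SELECT id, age FROM women_15_49 ORDER BY id").fetchall()

# New rows in the base table show up in the view without redefining it.
conn.execute("INSERT INTO respondent (sex, age) VALUES ('F', 40)")
after = conn.execute("SELECT id, age FROM women_15_49 ORDER BY id").fetchall()
print(before, after)
```

Because only the definition is stored, the view stays consistent with the base table, which is why it suits repeated extractions from data that is still being updated.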

17 Structure of the PFFPS database

18 SQL Data Manipulation
SELECT {DISTINCT}
FROM
WHERE
GROUP BY
HAVING
ORDER BY
All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations.
The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement.
Retrieval is usually done through a Query interface, which generates the SQL.
Basis of ODBC standard for linking to relational databases.
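
As a rough illustration of how the clauses of a single SELECT statement fit together, the sketch below runs one query that uses WHERE, GROUP BY, HAVING and ORDER BY. The `notification` table and its values are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of notification counts by region and year.
conn.execute("CREATE TABLE notification (region TEXT, year INTEGER, cases INTEGER)")
conn.executemany(
    "INSERT INTO notification VALUES (?, ?, ?)",
    [
        ("North", 1999, 12), ("North", 2000, 15),
        ("South", 1999, 3), ("South", 2000, 4),
        ("East", 1998, 30), ("East", 2000, 9),
    ],
)

query = """
    SELECT region, SUM(cases) AS total   -- project and aggregate
    FROM notification                    -- source table
    WHERE year >= 1999                   -- restrict rows before grouping
    GROUP BY region                      -- one result row per region
    HAVING SUM(cases) > 8                -- restrict groups after aggregation
    ORDER BY total DESC                  -- sort the result table
"""
result = conn.execute(query).fetchall()
print(result)  # [('North', 27), ('East', 9)]
```

The result is itself a table, so it could be stored or used as the input to a further statement.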

19 Data Manipulation – Project and Restrict
Project operation chooses columns
  SELECT B, C FROM R1
  SELECT SEX, AGE_AT_DIAGNOSIS, DIAGNOSIS FROM CANCER_REGISTRATION
Restrict chooses rows (Where clause)
  SELECT * FROM R2 WHERE A < 6
  SELECT * FROM CANCER_REGISTRATION WHERE AGE_AT_DIAGNOSIS <= 65

R1
A  B   C   D
1  b1  c1  d1
2  b2  c2  d2
3  b3  c3  d3

R2
A  B   C   D
3  b3  c3  d3
4  b4  c4  d4
5  b5  c5  d5
6  b6  c6  d6
7  b7  c7  d7
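
The two operations can be tried directly on the R1 and R2 tables from the slide. This is a small sketch using sqlite3; the ORDER BY clauses are added only to make the output order deterministic and are not part of the slide's queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT);
    CREATE TABLE R2 (A INTEGER, B TEXT, C TEXT, D TEXT);
    INSERT INTO R1 VALUES (1,'b1','c1','d1'), (2,'b2','c2','d2'), (3,'b3','c3','d3');
    INSERT INTO R2 VALUES (3,'b3','c3','d3'), (4,'b4','c4','d4'), (5,'b5','c5','d5'),
                          (6,'b6','c6','d6'), (7,'b7','c7','d7');
""")

# Project: keep only columns B and C of R1.
projected = conn.execute("SELECT B, C FROM R1 ORDER BY A").fetchall()

# Restrict: keep only the rows of R2 where A < 6.
restricted = conn.execute("SELECT * FROM R2 WHERE A < 6 ORDER BY A").fetchall()

print(projected)   # [('b1', 'c1'), ('b2', 'c2'), ('b3', 'c3')]
print(restricted)  # rows with A = 3, 4, 5
```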

20 Data Manipulation - Join
The Join operation is the combination of a Product operation with Restrict to select the rows of the result
  SELECT R1.*, R3.* FROM R1, R3 WHERE A = X
or
  SELECT R1.*, R3.* FROM R1 INNER JOIN R3 ON A = X
This is an Equi-Join
Natural Join is based on columns with the same name
  SELECT R1.*, X, E FROM R1, R3 WHERE R1.D = R3.D
or
  SELECT R1.*, R3.* FROM R1 NATURAL JOIN R3

Join A=X
A  B   C   R1.D  X  R3.D  E
1  b1  c1  d1    1  d3    e1
2  b2  c2  d2    2  d5    e2
3  b3  c3  d3    3        e3

Natural Join
A  B   C   D   X  E
3  b3  c3  d3  1  e1
3  b3  c3  d3  3  e3
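
A sketch of both joins in sqlite3 follows. The contents of R3 are not fully recoverable from the transcript: the D value for the row with X = 3 is assumed here to be 'd3', which is consistent with the Natural Join result shown but remains an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT);
    CREATE TABLE R3 (X INTEGER, D TEXT, E TEXT);
    INSERT INTO R1 VALUES (1,'b1','c1','d1'), (2,'b2','c2','d2'), (3,'b3','c3','d3');
    -- Assumed contents of R3 (the third D value is a guess, see lead-in).
    INSERT INTO R3 VALUES (1,'d3','e1'), (2,'d5','e2'), (3,'d3','e3');
""")

# Equi-join: product of R1 and R3 restricted to rows where A equals X.
equi = conn.execute(
    "SELECT R1.*, R3.* FROM R1 INNER JOIN R3 ON A = X ORDER BY A"
).fetchall()

# Natural join: matches on the shared column name D, which appears once in the result.
natural = conn.execute(
    "SELECT * FROM R1 NATURAL JOIN R3 ORDER BY X"
).fetchall()

for row in equi:
    print(row)
for row in natural:
    print(row)
```

The equi-join keeps both D columns because they come from different tables, whereas the natural join merges them into one.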

21 RDBMS Strengths
Data Modelling
  »Useful tools for understanding data structures and flows
  »Entity – Relationship (ER) model widely used for structure
Relational Model
  »Precise, formal mathematical specification of structure and behaviour
SQL
  »International Standard (SQL/92), widely implemented
Current Implementations
  »Widely available, well supported, good implementations, integration with other products, add-on market for tools

22 Data warehouse
Most RDB systems are designed to support Transactions
  »Fast access to single (or few) records for updating
  »Not imposed by the Relational Model
Data Warehouse systems designed for analysis, not transactions
  »Different physical model for access, but can still follow relational principles
    >Many (selected) records, structured classification variables and measures
    >Different ideas about redundancy (normalisation)
  »Extensions for manipulating Aggregated Data
  »Tools for gathering and cleaning data from source databases
  »Analysis tools (Data Mining)
    >Hypothesis generation
    >Inference
    >Often weak on statistical principles (particularly over-fitting), but improving

23 Source: MS OLAP pages Star Schema

24 Aggregate data
Can produce with aggregation functions in SQL
  »Count, Sum, Group By
Induces new concepts, not in relational model
  »Data Cube (Multi-way table), Dimensions, Classifications, Levels, Measures
Requires new functionality
  »Exploration, Manipulation, Presentation
Commercial products developing
  »OLAP (anticipated by Codd), usually within Warehouse
  »Usually limited to simple aggregation
  »Some specialised Statistical products
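
To make the first point concrete, here is a small sketch of producing the cells of a two-way table from micro data with COUNT and GROUP BY. The `report` table, its columns and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical micro data: one row per report received.
conn.execute("CREATE TABLE report (disease TEXT, region TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?, ?)",
    [
        ("A00", "North", 2000), ("A00", "North", 2000), ("A00", "South", 2000),
        ("B20", "North", 2000), ("B20", "South", 2001), ("B20", "South", 2001),
    ],
)

# Each (disease, region) group becomes one cell of the multi-way table;
# COUNT(*) is the measure held in the cell.
cells = conn.execute("""
    SELECT disease, region, COUNT(*) AS reports
    FROM report
    GROUP BY disease, region
    ORDER BY disease, region
""").fetchall()

for disease, region, reports in cells:
    print(disease, region, reports)
```

SQL returns the cells as a flat list of rows; treating them as a cube with dimensions and levels is the extra layer of concepts that the relational model itself does not supply.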

25 Aggregate Results, as Multi-way Table
Dimensions (with levels)
  »Disease Classification (ICD): Detail, Minor Group, Major Group
  »Location: Country, Region, District
  »Period: Year, Week, Month, Day
Measures
  »Reports received
  »Population at risk
  »Estimated Incidence rate
  »SD of Incidence rate
This example has three dimensions (so that it can be visualised). In reality, for this application, we would need at least two more, Age and Gender.

26 Functionality for Cubes
Focussing on Subsets
  »Slice and Dice
Change level of Detail
  »Drill down, Roll up, implies a structure of levels over classifications for each dimension
Aggregation rules for measures
  »May not be sums, may have different base
Derived measures
  »What is compatible, sensible
Manipulation between cubes
Presentation issues
  »Layout on 2 dimensions
  »Annotations and descriptions
And so on – only basic issues addressed in OLAP products
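
Slice and roll-up can be sketched in a few lines of plain Python, treating a cube as a dictionary of cells. The district codes, the district-to-region mapping and the counts are invented for the example, and the measure is a simple count, which is the easy additive case.

```python
from collections import defaultdict

# Cells keyed by (district, month); the value is a count of reports (additive).
cells = {
    ("D1", "2000-01"): 4,
    ("D1", "2000-02"): 6,
    ("D2", "2000-01"): 1,
    ("D3", "2000-02"): 2,
    ("D3", "2001-01"): 5,
}

# Level structure over the Location dimension (hypothetical mapping).
district_to_region = {"D1": "North", "D2": "North", "D3": "South"}


def roll_up(cells, location_level, period_level):
    """Aggregate cells to coarser levels by summing the counts."""
    out = defaultdict(int)
    for (district, month), count in cells.items():
        out[(location_level(district), period_level(month))] += count
    return dict(out)


def slice_period(cells, month):
    """Fix one value of the Period dimension, leaving a one-way table."""
    return {district: n for (district, m), n in cells.items() if m == month}


by_region_year = roll_up(cells, district_to_region.get, lambda m: m[:4])
january_2000 = slice_period(cells, "2000-01")
print(by_region_year)
print(january_2000)
```

Summation is only valid because the measure is a count. A rate or a standard deviation would need its own aggregation rule, which is the concern raised under "Aggregation rules for measures".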

27 Relational Databases for Statistical Data
Can support complex structure
Can support complex processing
Can link easily to many statistical packages
Can do more data manipulation than most statistical packages
  »Examples from Pakistan Fertility survey

28 Structure of the PFFPS database

29 RDBMS: Pakistan FFPS in Access

30 PFFPS: Sex distribution for children by Age

31 PFFPS: Using the Query Interface

32 PFFPS: Generated SQL

33 PFFPS: Crosstab in Access

34 PFFPS: Joining Tables Parity for all women of Childbearing Age

35 PFFPS: Parity for all women of Childbearing Age

36 PFFPS: Fertility Rate Calculations
Numerators
  »Count Births by Period and Mother’s Age
Denominators
  »Years of ‘exposure’ by Period and Age at the time
  »Months give adequate precision

37 PFFPS: Births by Mother’s Current Age and Years Ago of Birth

38 PFFPS: Fertility Numerator Calculations
Years Ago: Int( ([interview cmc]-[child cmc]-1)/12 )
Women: Sum( IIF ( NZ([line_wbh])<2, [WT_EW], 0 ) )
Children: Sum( IIF ( IsNull([line_wbh]), 0, [WT_EW] ) )
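
The three Access expressions can be read as follows; this is a Python sketch of the same logic, not the original implementation. It assumes that "cmc" values are month counts, that each input row comes from a join of women to their births (so a woman with no births has a null `line_wbh`), and that `WT_EW` is the woman's weight. The sample rows are invented.

```python
def years_ago(interview_cmc, child_cmc):
    """Whole years between birth and interview, as in Int((i - c - 1) / 12).

    Floor division matches Int(), which rounds down.
    """
    return (interview_cmc - child_cmc - 1) // 12


def weighted_counts(rows):
    """rows: iterable of (line_wbh, wt_ew) pairs from the women-births join.

    Women: the weight is counted once per woman, on the row where line_wbh is
    null (no births) or below 2 (first birth line), mirroring NZ(...) < 2.
    Children: the weight is counted on every row that has a birth line.
    """
    women = sum(wt for line, wt in rows if (line or 0) < 2)
    children = sum(wt for line, wt in rows if line is not None)
    return women, children


# Hypothetical joined rows: a woman with two births (weight 1.5),
# a woman with no births (weight 2.0), a woman with one birth (weight 1.0).
rows = [(1, 1.5), (2, 1.5), (None, 2.0), (1, 1.0)]
women, children = weighted_counts(rows)
print(women, children)  # 4.5 4.0
print(years_ago(1200, 1188), years_ago(1200, 1187))  # 0 1
```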

39 Modelling Issues
ER Models for Databases
Objects and Object-Oriented Concepts
UML: Unified Modelling Language
XML: eXtensible Markup Language
Examples: Process analysis for HIV and AIDS Notifications

40 Models and Modelling
Development process for systems
  »Different from statistical models
Explicit sets of concepts and procedures
  »For identifying and describing the components of some system
  »Often have associated notation
  »Different levels and purposes: Semantic, conceptual, logical, physical
  »Can provide useful ways of thinking about and representing data and structure
ER models for database
  »Conceptual approach to entities and relationships, more than the relational model. Not just structure, but purpose (semantics).
Object models
  »Components of a system, what is their structure and behaviour, how do they work, how are they related, how do they interact, what are the sequences in which things happen?

41 Structure of the PFFPS database

42 Object-based methods
Representation of complex content
  »Process, not just structure
  »Central idea of a generic Class definition, with
    >Complex elements, including other objects and collections
    >Behaviour, implemented as functions (methods)
    >Interfaces, that control external access to attributes and behaviour
Object
  »Instance of a Class (Person, Account)
  »Has identity, that can be transmitted
    >Set CurrentCust = New Person
    >Set CurrentCust.Account = New Account
    >Set Operator.Customer = CurrentCust
  »Can be interrogated and asked to perform operations
    >CurrentCust.Age; CurrentCust.Account.Balance
    >CurrentCust.Print; CurrentCust.Account.Debit(£50)
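
The Person and Account fragments above translate fairly directly into Python. The sketch below is one possible rendering; the attribute names, the starting balance and the overdraft check are assumptions made for the illustration.

```python
class Account:
    def __init__(self, balance):
        self._balance = balance  # held privately, exposed through the property

    @property
    def balance(self):
        return self._balance

    def debit(self, amount):
        # Behaviour lives with the data it operates on.
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount


class Person:
    def __init__(self, age):
        self.age = age
        self.account = None  # a complex element: another object


class Operator:
    def __init__(self):
        self.customer = None


current_cust = Person(age=42)                # Set CurrentCust = New Person
current_cust.account = Account(balance=120)  # Set CurrentCust.Account = New Account
operator = Operator()
operator.customer = current_cust             # identity is passed on, not a copy

operator.customer.account.debit(50)          # CurrentCust.Account.Debit(50)
print(current_cust.age, current_cust.account.balance)  # 42 70
```

Because `operator.customer` and `current_cust` refer to the same object, the debit made through the operator is visible through either name.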

43 Some Statistical Objects
Dataset
  »Matrix of Cases and Fields
Scale
  »Set of codes and meanings for the values used in a Field
Variable
  »Combination of a Field and a Scale
Classification
  »Defined over a Scale, but allowing re-grouping
Classification Set
  »Hierarchy (or Tree) of Classifications, with mappings between levels
Measure
  »Single-valued expression derived from one or more fields, with the derivation formula
Dimension
  »Combination of a Variable and a Classification set
Summary Table (Data Cube)
  »Combination of Statistical Population, a set of Dimensions and a set of Measures

44 Object-Oriented Concepts
Inheritance
  »One class can be defined as an extension of another
  »Inherits all the structure and methods, but can extend or alter as required
    >Eligible Woman > Household Member
    >Head of Household > Household Member
Polymorphism
  »Behaviour of a method depends on the class for which it is invoked, eg Print
    >The Class is responsible for providing a suitable method (can be inherited)
Encapsulation
  »Attributes are private to the object, only exposed through methods
Rich way of thinking about structure
  »Pervasive for programming
  »Appearing in database systems (some special products)
  »Supported by modelling tools
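
The three concepts can be shown together using the household example from the slide. This is a minimal sketch; the attributes (`line_no`, `age`, `births`) and the `describe` method stand in for the slide's Print and are assumptions.

```python
class HouseholdMember:
    def __init__(self, line_no, age):
        # Encapsulation: attributes are kept private by convention.
        self._line_no = line_no
        self._age = age

    @property
    def age(self):
        return self._age

    def describe(self):
        return f"Member {self._line_no}, age {self._age}"


class EligibleWoman(HouseholdMember):
    """Inheritance: extends HouseholdMember with a birth count."""

    def __init__(self, line_no, age, births):
        super().__init__(line_no, age)
        self._births = births

    def describe(self):
        # Alters the inherited behaviour while reusing it.
        return f"{super().describe()}, {self._births} births"


class HeadOfHousehold(HouseholdMember):
    """Inherits everything unchanged, including describe()."""


members = [HeadOfHousehold(1, 45), EligibleWoman(2, 30, 3)]

# Polymorphism: the same call gives class-specific behaviour.
descriptions = [m.describe() for m in members]
print(descriptions)
```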

45 Unified Modelling Language (UML)
Standard (OMG) way to represent object models
  »Collection of diagram types and components to represent various types of object and behaviour
  »Formal specification with semantics and conventions for representation
Recognises complexity
  »Same objects can participate in multiple diagrams, with different emphasis or different level of detail or abstraction
  »Multiple Levels
    >From User Requirements (Use Case diagrams) down to coding and implementation (Statechart, Activity, Sequence, Component and Deployment diagrams).
Emphasis on software implementations
  »But much wider application for design
  »Rich and complex. Extensible.

46 Package Diagram Types of Information in a Statistical Database

47 Use Case Diagram Database Users

48 Extension in Use Case Diagram

49 Activity Diagram Similar to familiar Flowchart

50 Source: Chris Nelson Class Diagram ebXML Registry Model

51 IQML Registry Model Source: Chris Nelson

52 UML Summary
Rich system of elements and diagrams for expressing designs
  »Can be complex to learn, but rewarding
  »Formal definition and semantics of elements, so designs can be precise
Focussed on software development
  »Round-trip development tools
    >Rational Rose, Together, … (particularly for Java)
  »Wider application for any design or specification context
  »Lots of diagram software, from free to expensive
  »Model building can take time, but if done in detail, implementation in software can be fast

53 XML: Purpose
Designed to express complex structures of information in a way that can easily be moved between applications
Markup Language (based on SGML)
  »Text with Tags ( field contents )
  »Nested Tags => multiple hierarchies
Example (markup not preserved in this transcript): Sex, with codes 1 male and 2 female
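
Since the markup of the slide's example did not survive, the sketch below shows one way the Sex code list might be written as nested tags and read back with Python's standard xml.etree.ElementTree. The tag and attribute names (`variable`, `code`, `name`, `value`) are invented for the illustration and are not taken from the presentation or from any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a variable with a nested list of codes.
document = """
<variable name="Sex">
    <code value="1">male</code>
    <code value="2">female</code>
</variable>
"""

root = ET.fromstring(document)

# Walk the nested tags to recover the code list as a dictionary.
variable_name = root.get("name")
codes = {int(code.get("value")): code.text for code in root.findall("code")}

print(variable_name, codes)  # Sex {1: 'male', 2: 'female'}
```

The same generic parser would read any other well-formed document; only the agreed tag names make this one meaningful to a receiving application.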

54 XML: Structure
Generic language
  »Tags not defined, only the language structure
  »Generic tools to read and write XML
    >Interface tools for application developers (DOM, SAX)
    >Presentation and transformation tools, style sheets (XSL)
  »Tolerant applications
    >Can detect omissions and skip additions
Schema and DTD - Document Type Definition
  »Rules about the specific tags and structures allowed in a specific context
    >Can have generic tools to check conformance

55 XML: Uses and Standards
Linear text, solves problem of exchange between applications
Still have to define and agree on structures
  »Agreement can be complex, but then easy to generate XML Schema (eg from UML)
Many proposals and agreements on standard structures in XML
  »DDI, Triple-S, MathML, GML, ebXML, …
Only handles structure, not semantics or behaviour

56 HIV and AIDS Reporting System
System
  »Separate file for each type of Notification
Problems
  »Duplicate Notifications
  »People have records in more than one file
  »Difficult to identify and match individuals
  »Analysis based on People
Solution in Use
  »Cross-linking identifiers in all files
But
  »Difficult to maintain, not reliable

57 HAP System: Processes
Use Case diagram of system activities
What is really important, so should be efficient
Can afford to spend time getting information right

58 HAP System: Data Stores
Redesign database to focus on Patients
Need components for the various stages (may not appear in order)
Important to keep record of Notifications as Sources

59 HAP System: Process Notifications

60 Summary
Good systems produce better data
  »Easier to analyse
  »Better quality results
Good systems save work
  »Less effort on cleaning data
  »Better and more metadata
  »Can move repetitive tasks into the system
Get involved
  »Learn the conceptual ideas
  »Use the tools
  »Contribute to the design and development process

61 References: Databases
Date (2000). Introduction to Database systems, 7th Edition. Addison Wesley. ISBN:
  »This is the standard ‘bible’ for relational database systems, hard work, but important if you want a deep understanding of the strengths and limitations of relational systems.
Date & Darwen (1997). A guide to the SQL Standard, 4th Edition. Addison-Wesley.
Dowling. Database design and management using Microsoft Access. Letts. ISBN:
  »A cheap book that works through a development project using Access

62 References: Systems Design and UML
Booch, Jacobson, & Rumbaugh. The Unified Modelling Language User Guide. Addison Wesley. ISBN:
  »Further information about UML in general and in the context of Rational Software products is available at
Coad, Lefebvre & De Luca. Java Modelling in Colour with UML: Enterprise Components and Process. Prentice Hall. ISBN: X.
  »See also
Holtzblatt. Contextual Design: A Customer-Centered Approach to Systems Designs. Academic Press. ISBN:
Kruchten. The Rational Unified Process – an Introduction. Addison Wesley. ISBN:
McConnell. Rapid Development: Taming Wild Software Schedules. Microsoft Press. ISBN:
Reed. Developing Applications with Visual Basic and UML. ISBN:
Sheridan & Sekula. Iterative UML development using VB6. ISBN:
  »This book introduces the ideas of iterative development and works through some projects

63 References: UML Software
Microsoft Visio 2000 (Professional or Enterprise edition)
  »Contains UML and database modelling tools, as well as general diagram facilities. It links in with MS Visual Studio, which has further UML code design and development tools, based on Rational Rose. A free 60-day evaluation version of Visio is available from Microsoft. New version (2002) just released.
Rational Rose
  »The market leader in UML diagram and development support, and various suites are available (such as for Analysts or Developers) containing additional tools, including a requirements management database and code generation tools. Various presentations and evaluations are available from Rational Software Ltd, or
Together
  »Another UML modelling tool, though more focussed on Java and round-trip code generation.
ArgoUML
  »Open Source UML tool. argouml.tigris.org

64 References: Other Software
Beyond 20/20
  »A dissemination and manipulation tool for multi-way tables (data cubes) aimed at statistical users. There are versions for independent use with downloaded files, and for building web dissemination servers. It is being used by a number of statistical offices, including ONS, Unesco, Statistics Canada and the US Census Bureau. The UK distributors are Forvus, at and the developer site, at has various demos and descriptions, including downloads.
Bridge
  »A repository database for statistical metadata, developed under the Eurostat project IMIM (Integrated Metadata Information Management). See
Microsoft Office
  »The Pivot Table component in Excel 2000 is a good demonstration of the manipulation facilities developed by non-statisticians for data cubes (the earlier versions have less general functionality).
  »Access (all versions) is a good example of a relational database system, suitable for projects up to moderate scale (in terms of complexity and number of users, as well as physical size).
Microsoft Project
  »Provides the classic project management tools, including PERT and Gantt charts. Evaluation versions are available from Microsoft. There are also a number of heavyweight project management systems available from other companies

