Presentation on theme: "SIR UK Conference London, June 2007 Multiple Databases or Tabfiles? Frances Williams ISER, University of Essex."— Presentation transcript:
SIR UK Conference London, June 2007 Multiple Databases or Tabfiles? Frances Williams ISER, University of Essex
SIR UK Conference London, June 2007 Introduction Overview of problem Previous solution Proposed solution Reluctance to Change Why we took the plunge How things went Are tabfiles redundant? Conclusion
SIR UK Conference London, June 2007 Overview of problem British Household Panel Study (BHPS) –Longitudinal study –Interview same people every year 16 th wave –Two separate database structures Survey Database for cleaning User database for release to research community –Used SIR2K
SIR UK Conference London, June 2007 Overview of problem (cont) Survey database (SDB) –One for each wave –Organised for cleaning purposes –Highly structured –Reflects questionnaire Variable names reflect question numbers
SIR UK Conference London, June 2007 Overview of problem (cont) –User database (UDB) One database incorporating all waves Flattened structure –Easier for users Mnemonic variable names –Consistent across time –Year prefix indicates the wave »A – wave 1, B - wave 2 etc Lots of non-questionnaire variables –Derived variables –Imputed variables –Weighting variables
SIR UK Conference London, June 2007 Overview of problem (cont) Need to convert Survey database into User database –Done on annual basis –When data are clean –Large number of variables to transpose –40% variables change each year –Some variables merged Numeric and DK/Ref variables ~1500 variables on 40+ record types -> ~1300 variables on 7 record types –Some values changed DK, Refused, Missing and Inapplicable given negative values
SIR UK Conference London, June 2007 Previous solution Developed 15 years ago in SIR2.2 Uses intermediate tabfile Stage 1 –Variables from SDB converted to UDB vars (minus wave prefix) –Saved in table equivalent to UDB record Stage 2 –Tabfile rows converted to UDB database records
SIR UK Conference London, June 2007 Previous solution (cont) Disadvantages –Had to be done in 2 stages Stage 1 using SDB Stage 2 building UDB –Time consuming Each process took over an hour to run If SDB changed, both process had to be re-run –Error prone Two places for errors to occur –Extra maintenance Annual changes to be done in two places
SIR UK Conference London, June 2007 Proposed solution Change to SIR2002 –Allows multiple databases in a retrieval –Would allow one-stage transposition Quicker Easier Less error prone Less annual maintenance Less space intensive
SIR UK Conference London, June 2007 Reluctance to Change Lots of work to convert system –15 databases –28 tabfiles SIR 2K doesn’t export correctly –Numerics can get split over two lines –Need to hand-edit export files –Tedious process to fix Would I be able to get the multiple database code to work? –Solaris –Caseless databases Will existing code compile? –Conversion from SIR 2.2 to SIR 2K required lots of code revision
SIR UK Conference London, June 2007 Why we took the plunge SIR Oz would not fix SIR 2K export problem –Either stick with SIR 2K indefinitely or deal with problem –The longer we left it the more databases and tabfiles there would be to convert Multiple databases and other new features in SIR2002 would be very useful in other areas of our systems
SIR UK Conference London, June 2007 How things went It was very long and tedious converting all the databases and tabfiles –Got there in the end There were some compilation problems –Not as many as when we went from SIR 2.2 to SIR 2K –Do repeat var = A to B doesn’t work when A and B are the same
SIR UK Conference London, June 2007 How things went (cont) Some problems in getting multiple databases to work –Lots of help from SIR Oz Resultant code much better –One stage process –More maintainable –Much faster Well worth the effort –Wouldn’t want to do it again –But glad we did do it!
SIR UK Conference London, June 2007 Are tabfiles redundant? Three common situations where I have used tabfiles –Transposition of data from one database to another Worth updating to SIR 2002 or SIR XS and rewriting code –Static file containing data to be accessed in several different ways by different applications In the BHPS we have a large number of these In a new system I would use databases Not worth rewriting
SIR UK Conference London, June 2007 Are tabfiles redundant? (cont) Temporary file to hold external data –Read in from text file – e.g.csv file –Stored in tabfile and indexed as required –Dropped at end of application –Much easier to use tabfile than database No schema required Can be created and dropped ‘on the fly’
SIR UK Conference London, June 2007 Conclusion Multiple databases are great –Remove need for intermediate tabfiles when accessing data from more than one database Tabfiles have their uses –Temporary store for external data Horses for courses –SIR Oz - don’t ever get rid of tabfiles!
Your consent to our cookies if you continue to use this website.