Empirical Evaluation in End-User Software Engineering Janice Singer National Research Council Canada
Singer WEUSE IV
Outline
Summary of papers with respect to empirical evaluation, looking at:
– Research Question
– Domain
– Method
– Subjects/Objects of study
– Results
Themes/cross-cutting issues
Questions for Discussion

Spreadsheet debugging behaviour of expert and novice end-users - Bishop, McDaid
Research Question
– Four basic research questions concerning the performance of expert vs. novice users in detecting and correcting errors, debugging behaviour, and cell inspection coverage
Domain
– Spreadsheet
Method
– Experiment, qualitative inquiry
Subjects/Objects of study
– 13 professionals and 34 accounting and finance students (experts and novices)

Results
– Experts perform better than novices at detecting errors that require deep understanding
– Cell coverage correlates with performance; experts look at more cells than novices
– There is a specific pattern of cell inspection depending on the characteristics and place of the cell in the spreadsheet
– A tool whose aim was to increase cell inspection coverage showed a trend, but did not significantly improve performance

Gender in EUSE - Burnett, et al.
Research Question
– Do male and female end users employ different strategies in debugging?
Domain
– Spreadsheets
Method
– Experiment, qualitative study
Subjects/Objects of study
– Males, females, professionals, students

Results
– There are significant gender differences in strategies for approaching testing and debugging
– Some of the strategies preferred by females are not well supported in end-user environments
– Modeling of problem-solving behaviour may improve females' confidence, and therefore their performance on tasks
– Gender matters

End users as unwitting software developers - Costabile et al.
Research Question
– Are our theories/descriptions accurate?
Domain
– Group of companies who cooperate in candy distribution
Method
– Not clear…
Subjects/Objects of study
– Users of a web-based portal

Results
– In industrial practice, we see a variety of end-users with their own attendant needs

An EU oriented graph based visualization for spreadsheets - Kankuzi, Ayalew
Research Question
– Is our tool technically correct?
Domain
– Spreadsheet
Method
– Generate and test?
Subjects/Objects of study
– Generated visualizations

Results
– For the most part, the algorithm that drives the visualization is correct
– Need to evaluate whether the visualization is actually usable by and useful to end-users

Using two heads in practice - Karlsson
Research Question
– Are dyads more effective than singletons (or nominal dyads) in debugging spreadsheets?
– Is there a process loss for two people working together?
Domain
– Spreadsheets
Method
– Field experiment (an experiment conducted in the field)
Subjects/Objects of study
– Professionals

Results
– Dyads make fewer spreadsheet development errors than monads
– There is no significant difference in performance between nominal dyads and real dyads, so it was not possible to determine whether there is a process loss

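A nominal dyad is usually formed by pooling the results of two people who worked alone, so comparing it with a real dyad isolates the effect of working together. The following is a minimal sketch of that comparison, assuming the outcome measure is the set of seeded errors each participant detected. The error identifiers, the data, and the helper names `nominal_dyad_found` and `detection_rate` are hypothetical and are not taken from Karlsson's study.

```python
from typing import Set


def nominal_dyad_found(found_a: Set[str], found_b: Set[str]) -> Set[str]:
    """Pool the errors found by two people who worked separately."""
    return found_a | found_b


def detection_rate(found: Set[str], seeded: Set[str]) -> float:
    """Fraction of the seeded errors that were found."""
    if not seeded:
        raise ValueError("seeded must not be empty")
    return len(found & seeded) / len(seeded)


# Hypothetical data, for illustration only.
seeded = {"E1", "E2", "E3", "E4", "E5"}
alone_a = {"E1", "E2"}          # found by person A working alone
alone_b = {"E2", "E4"}          # found by person B working alone
real_dyad = {"E1", "E2", "E4"}  # found by a pair working together

nominal_rate = detection_rate(nominal_dyad_found(alone_a, alone_b), seeded)
real_rate = detection_rate(real_dyad, seeded)

# A positive value would suggest a process loss; zero suggests none.
process_loss = nominal_rate - real_rate
print(f"nominal={nominal_rate:.2f} real={real_rate:.2f} loss={process_loss:.2f}")
```

In this made-up example the real dyad finds exactly what the pooled individuals find, which mirrors the "no significant difference" outcome: the data alone cannot show a process loss.
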
TDD: can it work for spreadsheets? - McDaid, Rust, Bishop
Research Question
– Is test-driven development an appropriate process for spreadsheet development?
Domain
– Spreadsheet
Method
– Developed a tool to support TDD, then ran case studies (real professionals using the tool to develop spreadsheets)
Subjects/Objects of study
– Professionals

Results
– TDSD is easy to understand and use
– Development time increases
– Overall, participants seemed to believe TDSD is effective in reducing errors
– Some improvements to the tool were suggested

Software support for building EUP environments in the automation domain - Prähofer, et al.
Research Question
– Is our solution technically correct/feasible?
Domain
– Automation
Method
– Case study, reimplementation of existing systems
Subjects/Objects of study
– Developers of system/Two existing systems

Results
– The existing systems could be implemented in the framework, with a great reduction in code size

Patterns in mash-ups - Wong, Hong
Research Question
– Are there typical application domains for mash-ups?
Domain
– Web programming/mash-ups
Method
– Survey (in the sense of categorization) and qualitative analysis of mash-ups
Subjects/Objects of study
– Popular GreaseMonkey scripts and 22 recommended mash-ups

Results
– Mash-ups can be categorized according to their functionality
– These patterns include personalization, search, and aggregation, amongst others

Summary of empirical studies
Wide variety of research questions
– Not so much use of theory (is it necessary?)
Not so wide variety of domain
– Mostly spreadsheet
Wide variety of methods
– Experiments, surveys, case studies, tool correctness
Subjects/Objects of study
– Varied and related to research question

Themes and Questions
THEME: Domain
– What other domains should we be looking at in terms of empirical evaluation or tool support?
– Related to this, shouldn't we be doing more qualitative and observational work in real settings?

THEME: End-User Characteristics
– What other end-user characteristics do we need to be aware of when studying and designing tools for end-users?
– Are there general cognitive limitations in terms of abilities, or is it mostly poor tool support that limits the ability of end-users?

THEME: Technical Correctness
– How do we help end-users test, debug, and determine the technical correctness of their solutions?
– What methods can we use to test the technical correctness/quality of our solutions? Can experimentation with humans alone do this?

THEME: Software Engineering
– What can studies in EUSE tell us about SE, and vice versa?
– Given the changing world of SE (e.g., SOA, component and model based development, interoperability and integration issues), is there any longer a difference between EUSE and SE?

THEME: Building the Research Area
– What are the critical/BIG research questions?
– Is there enough information to start building meta-theories, do meta-analyses?

Discussion
We are already using theories, but many of them are implicit; what we need to do is be explicit about them.
The concept of theory has many meanings, but how can we generalize from individual cases to more general findings? So many of our findings are context dependent that it is very difficult to generalize to other contexts. In that case, how do we transfer knowledge? But we can't work with theories alone.
Do we really need qualitative studies first? Yes, they can provide much information.
Who are end-users? Can we characterize them? There is a lot of diversity, and those points of diversity matter in terms of how to help people.

In EUSE, there is a lot of qualitative work, and it is highly regarded.
The question bears on the difference between significance and meaning. We ask about two classes, e.g., male vs. female, but really there is a huge overlap between the two populations. Changing things for one population will often help the other. Is that true for the experiments reported here? Can tools for EUSE benefit software engineers?
There are huge individual differences: no typical male and no typical female. So when we find differences, what we are really finding is a barrier that affects performance.
Statistics lie: we must accept that the numbers only make sense in particular situations. We must be explicit about what the numbers mean and how we apply the generalizations to discover meaning.
External validity: even if you create a tool that is usable in a lab experiment and performs well on a set of tasks, it doesn't mean that the general effect is there. Is it possible to run field studies by releasing the tool and seeing if it works in context?

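The point about significance versus meaning can be made concrete with a small calculation. The sketch below assumes two normally distributed populations with equal variance and uses the standard overlap formula 2·Φ(−|d|/2), where d is the standardized mean difference (Cohen's d). The scores are invented for illustration and do not come from any study discussed here.

```python
import math


def cohens_d(mean_a: float, mean_b: float, pooled_sd: float) -> float:
    """Standardized difference between two group means."""
    return (mean_a - mean_b) / pooled_sd


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def overlap_coefficient(d: float) -> float:
    """Shared area of two equal-variance normal curves whose means differ by d SDs."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)


# Hypothetical group means and pooled standard deviation.
d = cohens_d(mean_a=55.0, mean_b=50.0, pooled_sd=10.0)
ovl = overlap_coefficient(d)
print(f"d = {d:.2f}, overlap = {ovl:.1%}")
```

With these invented numbers the means differ by half a standard deviation, yet the two distributions still share roughly 80% of their area. A difference of that size can reach statistical significance in a large enough sample while saying little about any individual member of either group.
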
Running field studies is difficult, though, because if you fail, you can't always be sure exactly why it went wrong. Other people have had experience trying to measure success in the field and can share it.
We may need to look at a certain kind of end-user application. Spreadsheets are mainly defined tasks, whereas tailoring is a different issue. These areas may not have comparable measures across instances of the case; here you need to use other methods and types of measures. By emphasizing one side of the method question, you are only asking certain kinds of questions.
One of the major difficulties in introducing a tool in the field is that users get annoyed when it doesn't work. You can't get good results with a semi-finished prototype; you need an almost product-quality prototype.

It is important to use real spreadsheets with real errors. Powell, et al. concluded that you shouldn't look at cell errors, but rather at cell error instances. EUSPRIG.com is more practitioner focused than this group.
Connection between SE and EUSE: there is a strong connection on two levels. Systems that are to be developed by non-professionals need to be modularized more, and it is very difficult to decide what the right decomposition is. In the future SE will need to watch what users do and how requirements evolve; techniques are needed there. SE can perhaps benefit from the methodological practices of EUSE, where there is a huge degree of expertise in looking at practice and in understanding how practices in software evolve.

Speculate about how the conclusions would be different if gender weren't a factor. We could not have found the same things if we hadn't thought about gender; it was a basic emphasis in all the work because of the theoretical perspective.
Professional and EUSE: a good case in point is curb cuts. We all now benefit from curb cuts even though they were designed for disabled people. Sometimes when you look at a case that you haven't looked at before, you can come back to the majority and improve their tools. Whyline is another case of this.
The differentiation between end-user and professional was used for many reasons. It was also postulated that there is a continuum. Are the empirical studies too broad? Perhaps we should try to figure out which part of this continuum we can apply our results to.

About 10K hours are necessary to distinguish someone as an expert. It may be that experts are really rare: a small percentage of people who are very good, and a large continuum of skills.
To point out some differences: EUSE is need-driven, which means that I am a user using technology, I encounter a problem or see an innovation problem, and I try to solve this problem using technology. It is not necessary for the end-user to implement it him/herself. A software engineer is bound to the artifact he or she is about to create. The considerations that lead to the choice are different in the two cases.
What type of technological structure is necessary for problems in practice?
84% of people surveyed by Umarji are self-taught. Perhaps educating people or creating a set of guidelines is one of the low-hanging fruits.