Searching peoples' digital footprints A new avenue in sociology and what are the problems with it János Kertész Budapest University of Technology and Economics.

1 Searching peoples' digital footprints A new avenue in sociology and what are the problems with it János Kertész Budapest University of Technology and Economics

2 Outline Problems with classical primary data collection in sociology: an example. Abundance of data: digital footprints, a new era in social sciences? Examples. Data availability. Ethical issues. Summary.

3 Primary Methods of Data Collection Interviewing People Designing a questionnaire Observing people Content analysis Designing an experiment to carry out Case study Focus group

4 Primary Methods of Data Collection Interviewing People Designing a questionnaire This method is best for discovering factual information about people … Observing people Content analysis Designing an experiment to carry out Case study Focus group Statistics about primary data collection: Papers over 10 years in American Sociological Review: Interpretative: 17% Survey: 80% Experiment: 3%

5 An example: The Add Health database
"The (US) National Longitudinal Study of Adolescent Health (Add Health) is a nationally representative study that explores the causes of health-related behaviors of adolescents in grades 7 through 12 and their outcomes in young adulthood. Add Health seeks to examine how social contexts (families, friends, peers, schools, neighborhoods, and communities) influence adolescents' health and risk behaviors."
Designed by J. R. Udry, P. S. Bearman, and K. M. Harris; started 1994, still going on; funded by the National Institute of Child Health and Human Development (P01-HD31921)
Contact:

6 DATA (cont.)
Data based on questionnaires and medical tests
~1700 publications (incl. dissertations)
We used the data from Wave I (1994-95): 75871 students in 84 high schools were asked 68 questions, including 10 friendship-related ones:
>> Name 5 best male and 5 best female friends.
>> For each friend, select from the list those that apply. During the last 7 days you
1. visited each other
2. met after school
3. spent time together during last weekend
4. talked with him/her about a problem
5. talked with him/her on the phone

7 Threshold analysis
Strength of ties characterized by discrete weights
Links are a priori directed, corresponding to the nominations
Strong asymmetry may occur: A → B with weight 5, but B → A with weight 1
$G/N$: order parameter of percolation
$\sum_s s^2 n_s$: "susceptibility"
Black line: $w = (w_{AB} + w_{BA})/2$, mutuality required
Red line: no mutuality required, a missing nomination is taken as 0
Gonzales et al. 2007
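The threshold analysis on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes that nominations are stored as a dictionary of directed weights and that the "susceptibility" is summed over all components except the largest one, which is a common percolation convention that the slide does not spell out. The function names and the toy data are hypothetical.

```python
from collections import Counter


def symmetrize(nominations, require_mutual):
    """Turn directed nomination weights into undirected link weights.

    nominations maps (a, b) -> weight of a's nomination of b.
    The undirected weight is the mean of the two directed weights.
    If require_mutual is False, a missing nomination counts as 0.
    """
    pairs = {frozenset(edge) for edge in nominations if edge[0] != edge[1]}
    weights = {}
    for pair in pairs:
        a, b = tuple(pair)
        w_ab = nominations.get((a, b))
        w_ba = nominations.get((b, a))
        if require_mutual and (w_ab is None or w_ba is None):
            continue  # drop one-sided nominations
        weights[pair] = ((w_ab or 0) + (w_ba or 0)) / 2
    return weights


def component_sizes(nodes, undirected_weights, threshold):
    """Sizes of connected components, keeping only links with w >= threshold."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, w in undirected_weights.items():
        if w >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)
    return sorted(Counter(find(n) for n in nodes).values(), reverse=True)


def percolation_observables(nodes, undirected_weights, threshold):
    """Return (G/N, sum of s^2 over components other than the largest)."""
    sizes = component_sizes(nodes, undirected_weights, threshold)
    order_parameter = sizes[0] / len(nodes)
    susceptibility = sum(s * s for s in sizes[1:])
    return order_parameter, susceptibility


# Hypothetical toy data: A nominates B strongly, B nominates A weakly, etc.
nodes = ["A", "B", "C", "D", "E"]
nominations = {("A", "B"): 5, ("B", "A"): 1,
               ("B", "C"): 4, ("C", "B"): 4,
               ("D", "E"): 2}

for require_mutual in (True, False):
    weights = symmetrize(nominations, require_mutual)
    for threshold in (1, 2, 3, 4):
        print(require_mutual, threshold,
              percolation_observables(nodes, weights, threshold))
```

Sweeping the threshold for both symmetrization rules gives the two kinds of curves the slide refers to (mutuality required or not).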

8 Other ways of finding data for scientific research: huge datasets due to IT
Official data collections (open or can be made available):
  Statistical institutes (e.g. P. Hedström's Stockholm data)
  Fiscal data (income distributions etc.)
  Medical data (e.g. Finnish diabetes data, mortality data)
  …
Work related:
  Commercial data (e.g. point collections, trading data of companies): secret, property of companies
  Financial data (e.g. stock and other markets, banks): partly open (free or for purchase)
  …
Science related (open):
  Human Genome Project
  Chemical data banks
  Archives
  Bibliographies
  …
These data are produced either for analysis, or we assume that they would be used for that purpose

9 Data generated in our everyday lives A new avenue for social sciences: Digital footprints

10 This collection of data raises legal and ethical issues (see later). At the same time it provides a gold mine for research!


12 Until now, social science has struggled to obtain tools that do more than scratch the surface of some of its questions. These range from identifying the driving forces behind violence, to the factors influencing how ideas, attitudes and prejudices spread through human populations. The available tools have largely remained in a time warp, consisting of analyses of national censuses, small-scale surveys, or lone researchers with a notebook observing interactions within small groups. Being able to automatically and remotely obtain massive amounts of continuous data opens up unprecedented opportunities for social scientists to study organizations and entire communities or populations. NATURE|Vol 449|11 October 2007

13 Communications leave detailed information about who communicates with whom, when and where:
phone (mobile and fixed line)
SMS, MMS
MSN
email
In a broader sense, all kinds of activities that leave electronic records can be used, including:
commercial activities (eBay, point collecting cards, credit cards, etc.)
open collaborative environments (Wikipedia, GNU, etc.)
e-communities (Facebook, MySpace, etc.)
e-games (role-playing, Where is George, etc.)

14 Enron Email Dataset (free: www.cs.cmu.edu/~enron/)
150 users (Enron management), 0.5M messages
Made public (including content!) by the Federal Energy Regulatory Commission
The presently available corpus does not include attachments, and some messages have been deleted (at the request of affected employees)
Triggered much interesting work, e.g.:
  Berkeley Enron Email Analysis (testing methods)
  J. Shetty and J. Adibi: The Enron Email Dataset: Database Schema and Brief Statistical Report
  Z. Eisler, I. Bartos and J.K.: Fluctuation scaling
Huberman et al.: HP data (not publicly available)
Related: Microsoft report MSR-TR-2006-186 (2007), on $30 \times 10^9$ MSN messages

15 J. Shetty and J. Adibi

16 Fluctuation scaling: $\sigma \sim \langle f \rangle^{\alpha}$ Eisler et al. 2008
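Fluctuation scaling relates the standard deviation of a node's activity to its mean activity through a power law with exponent α. One simple way to estimate α is a least-squares fit on log-log axes, sketched below. This is an illustrative estimator under stated assumptions, not the procedure of Eisler et al.: each node is assumed to have an activity time series (for example, emails sent per time window), and the function name and the synthetic data are hypothetical.

```python
import math
from statistics import fmean, pstdev


def fluctuation_scaling_exponent(series_by_node):
    """Estimate alpha in sigma ~ <f>^alpha by ordinary least squares
    of log(sigma) against log(<f>) across nodes."""
    xs, ys = [], []
    for series in series_by_node.values():
        mean, std = fmean(series), pstdev(series)
        if mean > 0 and std > 0:  # the logarithm needs positive values
            xs.append(math.log(mean))
            ys.append(math.log(std))
    if len(xs) < 2:
        raise ValueError("need at least two nodes with positive mean and std")
    x_bar, y_bar = fmean(xs), fmean(ys)
    denominator = sum((x - x_bar) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("all nodes have the same mean activity")
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return numerator / denominator


# Synthetic check: rescaling a series multiplies both its mean and its
# standard deviation by the same factor, so the fitted exponent is 1.
base = [1.0, 2.0, 3.0, 4.0]
synthetic = {c: [c * v for v in base] for c in (1, 10, 100)}
print(fluctuation_scaling_exponent(synthetic))
```

On real data the exponent typically has to be read off over a limited range of mean activity, so a plot of the points is worth checking before trusting the fitted slope.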

17 Constructing social network from mobile phone data
Data from a European mobile operator
Over 7 million private mobile phone subscriptions
Focus: voice calls within the home operator
Data aggregated from a period of 18 weeks
Require reciprocity (X → Y AND Y → X) for a link
Customers are anonymous (hash codes)
[Figure: calls X → Y (15 min) and Y → X (5 min) combined into one X-Y link (20 min)]
J.-P. Onnela, et al. PNAS 104, 7332-7336 (2007)
J.-P. Onnela, et al. New J. Phys. 9, 179 (2007)
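The construction rule on this slide (aggregate calls, keep a link only if both parties called each other) can be sketched as follows. The record layout `(caller, callee, duration_minutes)` and the function name are assumptions made for illustration. Using the total call duration in both directions as the link weight is also an assumption, chosen to match the 15 min + 5 min = 20 min figure on the slide.

```python
from collections import defaultdict


def build_reciprocal_network(calls):
    """Build an undirected weighted network from call records.

    calls: iterable of (caller, callee, duration_minutes) tuples, where
    caller and callee are anonymized identifiers (e.g. hash codes).
    A link X-Y is kept only if X called Y AND Y called X at least once.
    The link weight is the total duration in both directions.
    """
    directed = defaultdict(float)
    for caller, callee, minutes in calls:
        if caller != callee:
            directed[(caller, callee)] += minutes

    links = {}
    for (a, b), minutes in directed.items():
        # a < b ensures each reciprocated pair is stored exactly once
        if a < b and (b, a) in directed:
            links[(a, b)] = minutes + directed[(b, a)]
    return links


# Hypothetical records: X and Y call each other, X calls Z without reply.
calls = [("X", "Y", 15), ("Y", "X", 5), ("X", "Z", 7)]
print(build_reciprocal_network(calls))
```

The reciprocity filter removes one-way calls, which are more likely to be wrong numbers or calls to services than social ties.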

18 Huge network: proxy for network at societal level Largest connected component dominates 3.9M / 4.6M nodes 6.5M / 7.0M links

19 Possible to ask unprecedented questions and even find the answers to them. The study revealed the structure of the network, the interplay between weights and communities, and the relations between local, mesoscopic and global structure (see JP Onnela’s talk)

20 New data (continuously supplied): records of each call, SMS, MMS; information about subscribers (age, gender, ZIP code). New studies started on data from Belgium (+ information about location of the call), France, Hungary (fixed lines), India. With some effort individuals could be identified! No data sharing possible: confidentiality agreement with the provider. Contracts regulate publication rights like in an industrial R&D project.
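The remark that individuals could be identified with some effort can be made concrete with a small check: count how many subscribers are the only ones with their combination of age, gender and ZIP code. The sketch below is illustrative only. The record layout, the field names and the toy records are hypothetical, and being unique on these three attributes is just one simple indicator of re-identification risk.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers=("age", "gender", "zip")):
    """Fraction of records whose quasi-identifier combination is unique."""
    if not records:
        raise ValueError("no records given")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Hypothetical subscriber table; names are already replaced by hash codes.
subscribers = [
    {"id": "a91f", "age": 34, "gender": "F", "zip": "1111"},
    {"id": "07c2", "age": 34, "gender": "F", "zip": "1111"},
    {"id": "e5d0", "age": 71, "gender": "M", "zip": "1111"},
    {"id": "3b8a", "age": 34, "gender": "F", "zip": "1052"},
]
print(unique_fraction(subscribers))
```

A subscriber who is unique on these attributes can be matched to any outside source that lists the same three facts, even though the identifier itself is a hash code.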

21 eBay data
I. Yang, E. Oh, B. Kahng: Phys. Rev. E 74, 016121 (2006)
A: collectibles, B: clothing, sport, office, C: home decoration, electronics, D: art, hobby, E: books, toys, F: valuables (jewelry, stamps, …)
[Figure panels 1) and 2)]
The traditional classification scheme (2) can be improved by a hierarchical agglomeration algorithm (1)

22 Where is George? Zip code

23 („Where is George”) The scaling laws of human travel D. Brockmann, L. Hufnagel and T. Geisel Nature 439, 462-465 (26 January 2006) doi:10.1038/nature04292

24 Diapers and beer. Standard story in data mining courses: an investigation of 1.2M shopping baskets at Osco Drug showed that between 5 and 7 pm a significant number of customers bought diapers and beer together (suggesting that bored young fathers were sent to the shop). (It is an urban legend that, as a consequence, the management had diapers and beer placed closer to each other. But they could have…) One should have no illusions about the (mis)use of point collector cards, big prize promotions, etc.
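The kind of basket analysis behind this story can be illustrated with the lift of two items: how much more often they are bought together than expected if the purchases were independent. Lift is a standard association measure, used here as an assumption, since the slide does not say which statistic the original investigation used. The function name and the toy baskets are made up for illustration.

```python
def lift(baskets, item_a, item_b):
    """Lift of item_a and item_b: P(a and b) / (P(a) * P(b)).

    baskets: list of sets of purchased items.
    A value above 1 means the items co-occur more often than expected
    under independence.
    """
    n = len(baskets)
    n_a = sum(1 for basket in baskets if item_a in basket)
    n_b = sum(1 for basket in baskets if item_b in basket)
    n_ab = sum(1 for basket in baskets if item_a in basket and item_b in basket)
    if n_a == 0 or n_b == 0:
        raise ValueError("each item must appear in at least one basket")
    # Algebraically equal to (n_ab / n) / ((n_a / n) * (n_b / n))
    return n_ab * n / (n_a * n_b)


# Hypothetical toy baskets, not data from the study.
baskets = [
    {"diapers", "beer"}, {"diapers", "beer"}, {"diapers"}, {"beer"},
    {"milk"}, {"milk", "bread"}, {"bread"}, {"milk"},
]
print(lift(baskets, "diapers", "beer"))
print(lift(baskets, "diapers", "milk"))
```

In practice one would also restrict the baskets to a time window (such as 5 to 7 pm) and test whether the excess co-occurrence is statistically significant.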

25 LAMENTS OF A SORROWFUL MAN They've entered me in books of every kind, I'm registered and checked in every way. I'm kept in musty, ink-stained offices, in folders that are growing grizzly-grey. Oh, gnashing of teeth, oh, humiliation, that I am captive till my dying day, that they dispose of me from top to toe, that I am just a record, filed away. I'd much prefer to live in the Sahara or rot beneath a mound of heavy clay, for I am kept in books of every kind, and registered and checked in every way. D. Kosztolányi, 1924

26 Ethical issues
Google has all the tools to be Big Brother. It has control over your
clicks (interest, taste, purchases, pictures…)
mail
travel plans
etc.
These data would be of much interest for research, but they contain too much information. Google definitely uses them, e.g. for targeted advertising.
"When web provider AOL’s research division published an analysis of search behaviour on the Internet last year, it had what it thought was a bright idea: it would reach out to academics by making an anonymized version of the data freely available for download from its website. But within hours, it had to pull the site, after bloggers managed to infer many identities from the data and view the associated search histories." NATURE|Vol 449|11 October 2007

27 Two problems related to "computational social science":
i) Privacy issues. Data are not produced for scientific evaluation, in contrast to questionnaires, where the target person can decide about delivering data, or cases where data handling is expected. Moreover, in the latter cases the utilization of the data is strongly regulated by law and by organizations (e.g. Consortium for Political and Social Research).
ii) Controllability and reproducibility of research. Since data are not public (sometimes even the actual source must not be named), the general criterion of controllability of scientific research is violated. As seen in the AOL example, this is related to i), or to commercial interests. A good counterexample is the Enron Email Dataset, which can serve as a benchmark for related studies.

28 Measures? So far no real scandal… caused by scientific use of data. Institutional framework needed? Putting People on the Map: Protecting Confidentiality with Linked Social-Spatial Data (Natl Acad. Sci., Washington DC, 2007) concluded: "Institutional solutions involve establishing tiers of risk and access, and developing data-sharing protocols that match the level of access to the risks and benefits of the planned research." However, "Businesses seem more prone to misuse private data than scientists of any stripe." (Marshall Van Alstyne, BU) But "trust is of crucial importance to the contract between scientific expertise and the broader society that supports it" (NATURE editorial, October 2007)

29 If we are careless…

30 Summary:
Fantastic new possibilities for computational social science
Multidisciplinary efforts needed
More open, shared data needed. Benchmarking. Experiments??? Artificial data?
Ethical and legal issues: privacy, commercial interest and scientific reproducibility
Institutionalization?
Surveys cannot be substituted!
