Strategy for systematic anonymisation of multi-lingual interaction corpora. C. Reffay 1, F.-M. Blondel 1, E. Giguet 2 1 STEF – ENS-Cachan / IFÉ – ENS-Lyon.


1 Strategy for systematic anonymisation of multi-lingual interaction corpora
C. Reffay (1), F.-M. Blondel (1), E. Giguet (2)
(1) STEF – ENS-Cachan / IFÉ – ENS-Lyon
(2) GREYC, Université Caen Basse-Normandie, CNRS

2 IC' C Reffay, F-M Blondel, E Giguet
Outline
Introduction
Anonymisation process
–Marking process
–Finding new forms
–Replacement process
Testing the process on a Galanet session
What did we learn? What works?
Next step…

3 The corpus
Galanet session: Nômades...nomadi...nómades... des langues (Resp.: SandrineD)
4 teams: Italy, Brazil, France & Spain
During 3.5 months, 103 teenagers, 83 authors wrote…
915 messages containing (message body)
Volume: forms, characters
Lexicon: distinct forms

4 The objective is to share!
But anonymisation by hand is hard work
–The corpus may be enormous
–Subtleties: homonyms & synonyms
Personal data are not sharable
Anonymisation… the solution?
Software support is needed

5 Anonymisation purpose
Hide personal information systematically
–Names (first names, last names, usernames…)
–Identifiers (Passport, National Student Number, …)
–Locations (city, street, address, coordinates)
–Institution/Workplace (school, sport club, firm, …)
–Contact references (mobile, MSN, Skype, Twitter, telephone/fax)
–Explicit references (URL of homepages, blogs)
–Social media usernames (Facebook, MySpace, Hi5, Soundcloud, Badoo, Bebo, Friendster, Netlog, …)
Maintaining text coherence and consistency

6 Personal data: examples
{(f331s2970m2) T19:24 Gabibr Re: Quelques informations... answers SandrineD (f331s2970m1)}
Eu amo a língua Francesa! Quem sabe falar francês me adiconem no meu FACEBOOK;) J'aime parler français! Qui peut parler français? M'ajouter dans FACEBOOK;) Nom: GABRIELA MEDEIROS.
{(f333s3016m2) T09:25 Miche Re: Les stéréotypes culinaires answers SandrineD (f333s3016m1)}
inviate i vostri documenti alla mia mail grazie!!!;)
{(f330s2914m8) T19:52 PBS Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)}
Yo me llamo Peimikà Bibiana. Como mi madre es tailandesa y mi padre es italiano, mi primer nombre, Peimikà, es tailandés y significa " dueña del amor ", mientras mi según nombre, Bibiana, es italiano y procede del etrusco " vibius " que significa " vida ". Me gusta mucho tener dos nombres (en Italia es más usual tener un nombre) y sobre todo estoy orgullosa de los orígenes diferentes que tienen y que hacen mi nombre aún más particular (además Peimikà no es muy difundido en tampoco en Tailandia y tampoco Bibiana en Italia

7 Just google it!

8 Peimikà Bibiana… Google search (2)

9 Anonymisation Principles
Once anonymised, no participant may be identifiable by an external person
1. All identified lexical forms must be (computationally) marked even if not modified by a replacement form.
2. Any reference (e.g. name, institution or location) may be imprecise enough to encompass several hundred people.
[Diagram labels: Original lexical form; Mark; Replaced by; Replacement form]

10 Anonymisation
Before:
{(f330s2880m3) T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)}
Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
After:
{(f330s2880m3) T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)}
Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…

11 Hypotheses
A fully automated method does not exist for all corpora
Some decisions have to be taken by the researcher, not by the software
Accuracy of the method will be achieved only for a given context (e.g. Galanet)
Named entities do not occur randomly
Let's find the regularities
Interactively with the expert: the researcher

12 Concepts manipulated
Existing objects (real world): Institution, Participant, Public person, Relative, Street, City …
Named entities (reference): Name, Surname, Username, First name, Last name, Addresses, Tel. number, MSN…
Lexical forms (corpus): Pedro, KellyM, Eli, Elô, Kelly, Bergamo, Canet, Rosa Luxembourg

13 Anonymisation process
[Diagram] Corpus to anonymise → Marking process → Corpus with marked entities → Replacement process → Anonymised corpus
Other elements of the diagram: initial list of participants, usernames, institution…; process/rules; discovering new forms; named entities transformation table

14 Transformation table: example
Synonyms: the same entity has different forms
Homonyms: the same form refers to different entities

15 Marking one form: Example (Kelly)
A- List all occurrences (with their context) with a concordancer

16 Marking one form: Example (Kelly)
B- Update the transformation table (e.g. public person Gene Kelly)

17 Marking one form: Example (Kelly)
C- Associate each occurrence with the appropriate entity (=> in the corpus: surround the occurrence with XML tags)
Last name, normal form, unchanged: refers to the public person Gene Kelly
First name, normal form, to be changed: refers to the participant KellyM
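Step C can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the tag vocabulary, so the `<ne>` element, its `entity`, `type` and `action` attributes and the entity identifiers are all hypothetical, and the first half of the example sentence is invented. The per-occurrence decisions are supplied by hand, since resolving homonyms is left to the researcher.

```python
from xml.sax.saxutils import escape, quoteattr


def mark_occurrences(text, decisions):
    """Wrap each decided occurrence in a hypothetical <ne> element.

    decisions: (start, end, entity_id, form_type, action) tuples, one per
    occurrence, as chosen by the researcher while reading the concordance.
    """
    out, pos = [], 0
    for start, end, entity, form_type, action in sorted(decisions):
        out.append(escape(text[pos:start]))  # untouched text before the form
        out.append(
            f"<ne entity={quoteattr(entity)} type={quoteattr(form_type)} "
            f"action={quoteattr(action)}>{escape(text[start:end])}</ne>"
        )
        pos = end
    out.append(escape(text[pos:]))  # remaining text after the last form
    return "".join(out)


# Same lexical form, two different entities (homonyms).
text = "Gene Kelly danse; Galdric, Kelly et Antonhy"
first = text.index("Kelly")
second = text.index("Kelly", first + 1)
decisions = [
    (first, first + 5, "public:GeneKelly", "last_name", "keep"),
    (second, second + 5, "participant:KellyM", "first_name", "replace"),
]
marked = mark_occurrences(text, decisions)
print(marked)
```

Marking every occurrence, including those left unchanged, keeps the decision visible in the corpus and lets a later replacement pass act only on forms flagged for change.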

18 Detecting new forms: 2 strategies
Lexical rules: similar forms
–Eli -> Elô Ely ELY Seli
–Gabriela -> GABRIELA
–José -> Jose
Context rules: similar context
–First names: mi chiamo …, accord avec …
–Cities: Soy de …, vivo en …, j'habite à …

19 1st Strategy: Lexical variation rules
103 known forms, including: Adriana Alèxia Anthony Baptiste Cleissa Eli… Elouise Emmanuel Federica Ferran Gabriela Guillem Iñigo Jaqueline Jean José Kelly Léo Mariana Mary Michela Monica Olalla Oleguer
31 new forms: adriana Alexia Antonhy baptiste Cleisa Elô Ely ELY Seli Louise MAnuel Federiac fran Fran GABRIELA guillem iñigo Jacqueline jean Jose Kellly Leo léo MariAna mary May Miche michelina moni olalla oleguer

20 2nd Strategy: Context rules
103 known first names (Adrià, …, Veronica)
145 contexts: left/right
Total: more than 250 tested rules
47 rules approved
15 good new forms
Antonhy Belle Bet Christine Fede Federiac Kellly Leo Line Maria May Peimikà Regina fran jean léo
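The context-rule strategy can be illustrated with a minimal sketch. It assumes a one-token left context and a naive `\w+` tokeniser; the function names are hypothetical, and the third and fourth sample messages are invented for the illustration. Every token that precedes a known first name becomes a context, and any other form found after the same context is proposed as a candidate for the researcher to review.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Naive tokeniser: runs of word characters, so "m'appelle" -> "m", "appelle".
    return re.findall(r"\w+", text)


def left_context_candidates(messages, known_names):
    """Map each left context of a known name to the unknown forms it also precedes."""
    known = {name.lower() for name in known_names}
    followers = defaultdict(list)
    for body in messages:
        tokens = tokenize(body)
        for left, form in zip(tokens, tokens[1:]):
            followers[left.lower()].append(form)
    candidates = {}
    for context, forms in followers.items():
        # Keep only contexts that precede at least one known first name.
        if any(form.lower() in known for form in forms):
            candidates[context] = sorted(
                {form for form in forms if form.lower() not in known}
            )
    return candidates


messages = [
    "Bonjour, je m'appelle Kellly.",
    "Yo me llamo Peimikà Bibiana.",
    "Je m'appelle Guillem",  # invented
    "me llamo Adrià",        # invented
]
found = left_context_candidates(messages, ["Guillem", "Adrià", "Kelly"])
print(found)
```

On a real corpus most contexts also precede ordinary words, which is why each rule has to be tested and approved before its candidates are accepted.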

21 Replacing process
Before:
{(f330s2880m3) T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)}
Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…
After:
{(f330s2880m3) T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)}
Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
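The replacement step in the example above can be approximated with a simple table lookup. This is a sketch, not the authors' tool: the dictionary layout and the `anonymise` function are assumptions, and it works on plain text, whereas the process described in the slides replaces marked occurrences so that homonyms such as the Kelly of Gene Kelly can be left untouched.

```python
import re

# Hypothetical transformation table built from the before/after example:
# original lexical form -> replacement form (replaced text forms carry "*").
TABLE = {
    "KellyM": "FLG01",
    "CarlaN": "ILG02",
    "Kellly": "Kittty*",
    "Rosa Luxemburg": "Margherita Duras*",
    "Canet": "Aigues-Vives*",
}


def anonymise(text, table):
    """Replace every whole-word occurrence of a table form by its replacement."""
    # Longest forms first, so multi-word forms win over shorter ones.
    forms = sorted(table, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(form) for form in forms) + r")(?!\w)"
    )
    return pattern.sub(lambda match: table[match.group(1)], text)


print(anonymise("Bonjour, je m'appelle Kellly.", TABLE))
```

Note that the misspelt form Kellly maps to a replacement that keeps the same spelling variation (Kittty), which preserves the coherence of the text.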

22 Conclusion
1. A new process/algorithm for anonymisation
2. Hypotheses tested on a first corpus
–47 rules approved for first names => 15 new forms
–103 first names => 31 existing derivations
–Anonymisation not 100% automatic => confirmed
3. Is anonymisation possible in a world with Google?
–Use Google to evaluate the frequency of a first name!

23 Next steps…
Finalize concrete anonymisation of this corpus
–Discuss some choices with SandrineD for: usernames, cities, addresses,…
–Get feedback from SandrineD
Verify on a bigger (Galanet) corpus:
–The process
–The rules
Co-develop the tool:
–within the research community…
–in the (ANR) CORDIAL project?

24 Grazie !

25 More precisely

26 Discovering new forms: 2 strategies
103 known first names (Adrià, …, Veronica)
Lexical rules: 317 candidates
–50 auto: 34 frequent words, 16 known
–200 easy: 180 common words, 20 username
–67 tests: 5 common, 31 good new forms, 1 relative new (Maria), 30 public names
Context rules: 145 contexts, left/right
–Left, one form: 75 => occ.
–Left, 2 forms seq.: 123 => 1700 occ.
–Total: more than 250 tested rules
–47 rules approved
–15 good new forms

27 Contexts of 145 occ. of 103 first names (using TXM, case insensitive)

28 The corpus lexicon
A list of (lexical form, frequency) pairs
–de 1015
–que 965
–la 673
–…
–porque 48
–…
–Addams
unique forms
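A frequency lexicon of this kind can be built with a few lines of code. This is a minimal sketch assuming a case-sensitive tokeniser that keeps runs of letters only; the slides indicate that TXM was used, not this code, and `build_lexicon` is a hypothetical name. The two sample bodies are fragments of messages quoted elsewhere in the presentation.

```python
import re
from collections import Counter


def build_lexicon(messages):
    """Count every lexical form (run of letters, case-sensitive) in the message bodies."""
    counts = Counter()
    for body in messages:
        counts.update(re.findall(r"[^\W\d_]+", body))
    return counts


lexicon = build_lexicon([
    "Eu amo a língua Francesa!",
    "Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv",
])
# Forms sorted by decreasing frequency, as in the list above.
for form, frequency in lexicon.most_common(3):
    print(form, frequency)
```

Forms that occur only once or twice are the natural place to look for rare first names and other named entities.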

29 Who is concerned?
"Computer applications for pedagogical and educational purposes draw on data that make it possible to identify natural persons directly but also indirectly. Particular attention must be paid to the collection of sensitive data as well as to the procedures used to anonymise the data." (Mallet-Poujol 2004: p 21)
For more information, see the European Data Protection Directive (95/46/EC)

30 Legal context (95/46/EC)
(Art7) Member States shall provide that personal data may be processed only if: the data subject has unambiguously given his consent; …
(Art8) Member States shall prohibit the processing of personal data revealing sensitive information (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life)
(Art10) […] Inform the data subject on:
–The identity of the controller of the data collection
–The purposes of the processing
–The recipients or categories of recipients of the data
–The existence of the right of access to and the right to rectify the data concerning him

31 Text coherence and consistency
{(f330s2914m11) T16:43 M_Cavalcanti Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)}
aaah, o meu é uma homenagem a uma de minhas tias e minha avó que se chamam Ana e ao resto de minhas tias que se chamam Maria. Daí, Mariana:)
{(f330s2914m10) T21:06 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)}
Gostei da criatividade da sua mãe MariAna! Rsrsrs
{(f330s2914m3) T00:54 LineCosta Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)}
Ah meu nome é em homenagem a Jacqueline Kennedy, esposa do ex-presidente dos EUA, e também porque sempre foi um dos nomes preferidos do meu pai. :D
{(f330s2914m18) T20:36 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)}
Bem, minha mãe queria que meu nome começasse com a letra E (como o dela!), um certo dia ela viu o nome de uma atriz brasileira chamada Louise Cardoso. Gostou do " Louise ", mas queria com a letra E, então ficou " Elouise "! Só depois, quando eu cresci é que descobri que meu nome era de origem francesa.. Hahaha

32 TXM:

33 Named entities
A named entity is a lexical form identifying a precise object (first/last name, communication ref., city, institution, etc.)
Examples:
Names: Christophe, Blondel, Giguet, Paris
Communication ref.: …
Location: Grenoble, Paris, Parigi, …
Institution: ENS Cachan, CNRS, …

34 Managing named entities
Homonyms refer to different objects
–In the corpus we have 2 participants named Guillem: the same first name refers to different persons.
–In "Gene Kelly", Kelly is a public person's last name
–In "Galdric, Kelly et Antonhy", it's a participant's first name
Different synonyms refer to the same object
–Kellly & Kelly
–Anthony & Antonhy
–Elô & Elouise

35 Referring to global entities

36 Overall method and tools
1. Define a process/algorithm for anonymisation
2. Test the hypotheses against a first corpus
–Using existing tools (Excel, TXM/Calico, Notepad++)
–Doing much of the work by hand (with automation in mind)
–Facing/solving/avoiding problems
–Evaluating/suggesting (new) hypotheses
3. Discuss the result with the original researcher
4. Verify on a second (bigger) corpus
5. Co-develop the tool within the research community

37 Find Nei/nei with a concordancer
All occurrences refer to the Italian common word nei

38 Another example
{(f330s2914m5) T21:52 CR_Martins Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)}
Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv uma repórter chamada Cleisa e achou parecido com o nome dela, Cléia e Regina porque o nome do meu pai é Reginaldo. Assim como a PBS gosto muito de ter 2 nomes e Cleissa é bem raro, nunca conheci ninguém chamado assim.

39 Peimikà Bibiana… a unique case? No! Let's try Cleissa Regina…

40 How to detect new forms?
Lexical rules (look for similar forms):
–Ignoring accents (e.g. José, Jose)
–Ignoring case (e.g. José, jose, JOSÉ, …)
–Levenshtein distance between 2 forms: number of extra, missing or inverted characters
–For form length (graphy size) < 5: Dist <= 1
–For form length (graphy size) >= 5: Dist <= 2
Context rules: (e.g. mi chiamo …, merci …)
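The lexical rules above can be sketched as follows. Several points are assumptions rather than statements from the slides: the distance is the standard Levenshtein distance computed on upper-cased forms, the length threshold is applied to the known form, and the three rules are combined in a single hypothetical `is_variant` test.

```python
import unicodedata


def strip_accents(form):
    """Remove combining accents: 'José' -> 'Jose'."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def levenshtein(a, b):
    """Standard edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def upper_distance(known, candidate):
    """Case-insensitive distance: both forms are upper-cased first."""
    return levenshtein(known.upper(), candidate.upper())


def is_variant(known, candidate):
    """True if candidate looks like a spelling variant of the known form."""
    if known == candidate:
        return False
    if strip_accents(known).upper() == strip_accents(candidate).upper():
        return True  # differs only by case and/or accents
    limit = 1 if len(known) < 5 else 2  # assumption: length of the known form
    return upper_distance(known, candidate) <= limit


for known, new in [("Eli", "Seli"), ("Anthony", "Antonhy"), ("Eli", "nei")]:
    print(known, new, upper_distance(known, new), is_variant(known, new))
```

With this plain distance an inversion such as Anthony/Antonhy costs 2, so the looser threshold for longer forms is what lets it through, while the tight threshold for short forms rejects a common word such as nei as a variant of Eli.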

41 Lexical variations 1/2
Known | New | Levenshtein distance (UPPER) | Levenshtein distance (Exact) | nb differences (Case, accents, Add/Sup/Inv)
Adriana | adriana | 0 | 1 | 1
Alèxia | Alexia | 1 | 1 | 1
Anthony | Antonhy | 2 | 2 | 2
Baptiste | baptiste | 0 | 1 | 1
Cleissa | Cleisa | 1 | 1 | 1
Eli | Elô | 1 | 1 | 1
Eli | Ely | 1 | 1 |
Eli | ELY | 1 | 2 | 1
Eli | Seli | 1 | 2 | 1, 1
Elouise | Louise | 1 | 2 | 1, 1
Emmanuel | MAnuel | 2 | 4 | 2, 2
Federica | Federiac | 2 | 2 | 2
Ferran | fran | 2 | 3 | 1, 2
Ferran | Fran | 2 | 2 | 2

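42 Lexical variations 2/2
Known | New | Levenshtein distance (UPPER) | Levenshtein distance (Exact) | nb differences (Case, accents, Add/Sup/Inv)
Gabriela | GABRIELA | 0 | 7 | 7
Guillem | guillem | 0 | 1 | 1
Iñigo | iñigo | 0 | 1 | 1
Jaqueline | Jacqueline | 1 | 1 | 1
Jean | jean | 0 | 1 | 1
José | Jose | 1 | 1 | 1
Kelly | Kellly | 1 | 1 | 1
Léo | Leo | 1 | 1 | 1
Léo | léo | 0 | 1 | 1
Mariana | MariAna | 0 | 1 | 1, 2
Mary | mary | 0 | 1 | 1, 1
Mary | May | 1 | 1 | 1
Michela | Miche | 2 | 2 | 2
Michela | michelina | 2 | 3 | 1, 2
Monica | moni | 2 | 3 | 1, 2
Olalla | olalla | 0 | 1 | 1
Oleguer | oleguer | 0 | 1 | 1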

43 Some good context rules (1/3)
Context | Total | Known | New | New forms detected | Accuracy
sou | | | | | %
appelle | 9 | 4 | 1 | Kelly | 56%
Cara | 7 | 1 | 1 | May | 29%
Ciao | 6 | 1 | | | 17%
Merci | 9 | 1 | 2 | Belle, léo | 44%
soy | 5 | 2 | | | 40%
equipe | 5 | 1 | | | 20%
Hombre | 4 | 1 | | | 25%
dicho | 3 | 1 | | | 33%
llamo | 3 | 2 | 1 | Peimikà | 100%
appel | 3 | 1 | | | 33%
raison | 3 | 1 | | | 33%
choix | 3 | 1 | | | 33%
chamam | 2 | 1 | 1 | Maria | 100%
tampoco | 2 | 1 | | | 50%

44 Some good context rules (2/3)
Context | Total | Known | New | New forms detected | Accuracy
{BOM}, | 62 | 8 | 1 | Fede | 15%
je m'appelle | | | | | %
Accord avec | 9 | 4 | 1 | Bet | 56%
Concordo com a | 3 | 2 | 1 | Line | 100%
meu nome é | 3 | 2 | | | 67%
moi c'est | | | | | %
, ho | 8 | 2 | | | 25%
, j'habite | 2 | 2 | | | 100%
, je | 8 | 2 | | | 25%
je m'appel | 1 | 0 | 1 | jean | 100%
suis avec | 2 | 1 | | | 50%
a dit | 1 | 1 | | | 100%
dit el | | | | | %
diu el | | | | | %
nombre, | 2 | 1 | 1 | Peimikà | 100%

45 Generic context rules
Context | Total | Known | New | New forms detected | Accuracy
, | 15 | 2 | 1 | Regina | 20%
i | 3 | 1 | | | 33%
i | | | | | %
et | 6 | 2 | 2 | Antonhy, Leo | 67%
et | 3 | 2 | 1 | Federiac | 100%
e | 3 | 1 | | | 33%
e | 3 | 1 | | | 33%
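The slides do not define the Accuracy column. Rows such as appelle (total 9, known 4, new 1, 56%) and llamo (total 3, known 2, new 1, 100%) are consistent with the share of occurrences following a context that turn out to be first names, whether already known or new. The sketch below rests on that inferred reading, and `rule_accuracy` is a hypothetical helper.

```python
def rule_accuracy(total, known, new=0):
    """Share of a context's occurrences that are first names (known or new).

    Inferred from the figures in the tables; not a definition from the slides.
    """
    return (known + new) / total


# (context, total, known, new) taken from rows of the tables above.
rows = [("appelle", 9, 4, 1), ("llamo", 3, 2, 1), ("soy", 5, 2, 0), ("{BOM},", 62, 8, 1)]
for context, total, known, new in rows:
    print(f"{context}: {rule_accuracy(total, known, new):.0%}")
```

Under this reading, a rule with low accuracy can still be worth approving when it is the only one that surfaces a given new form.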

