Musings at the Crossroads of Digital Libraries, Information Retrieval, and Scientometrics. Guillaume Cabanac.

1 Musings at the Crossroads of Digital Libraries, Information Retrieval, and Scientometrics. Guillaume Cabanac. March 28, 2012.

2 Outline of these Musings
 Digital Libraries: collective annotations; social validation of discussion threads; organization-based document similarity.
 Information Retrieval: the tie-breaking bias in IR evaluation; geographic IR; effectiveness of query operators.
 Scientometrics: recommendation based on topics and social clues; landscape of research in Information Systems; the submission-date bias in peer-reviewed conferences.

3 Outline of these Musings (section: Digital Libraries).

4 Digital Libraries: collective annotations.
Question DL-1: How to transpose paper-based annotations into digital documents?
Guillaume Cabanac, Max Chevalier, Claude Chrisment, Christine Julien. "Collective annotation: Perspectives for information retrieval improvement." RIAO'07: Proceedings of the 8th Conference on Information Retrieval and its Applications, pages 529–548. CID, May 2007.

5 From Individual Paper-based Annotation…
Characteristics of paper annotation:
 Secular activity, older than 4 centuries: a Bible annotated in 1541 (Lortsch, 1910); Fermat's last theorem (Kleiner, 2000); annotations from Blake, Keats… (Jackson, 2001); annotations by US students (Marshall, 1998); Victor Hugo's Les Misérables.
 Numerous applicative contexts: theology, science, literature…
 Personal use: "active reading" (Adler & van Doren, 1972).
 Collective use: review process, opinion exchange…

6 … to Collective Digital Annotations
Paper annotations are hard to share: the hardcopy gets 'lost'. More than 20 digital annotation systems exist (Cabanac et al., 2005): ComMentor, iMarkup, Yawas, Amaya… Annotations are stored on annotation servers alongside Web servers (Ovsiannikov et al., 1999) and can grow into discussion threads. Annotations split between authors (87%) and readers (13%).

7 Digital Document Annotation: Examples
W3C Annotea / Amaya (Kahan et al., 2002): a reader's comment, a discussion thread. Arakne, featuring "fluid annotations" (Bouvin et al., 2002).

8 Collective Annotations
Reviewed 64 systems designed during 1989–2008. A collective annotation comprises:
 Objective data: owner, creation date, anchoring point within the document, granularity (whole document, words…).
 Subjective information: comments; various marks (stars, underlined text…); annotation types (support/refutation, question…); visibility (public, private, group…).
Purpose-oriented annotation categories (remark, reminder, argumentation) are kept in a personal annotation space.

9 Digital Libraries: social validation of discussion threads.
Question DL-2: How to measure the social validity of a statement according to the argumentative discussion it sparked off?
Guillaume Cabanac, Max Chevalier, Claude Chrisment, Christine Julien. "Social validation of collective annotations: Definition and experiment." Journal of the American Society for Information Science and Technology, 61(2):271–287, Feb. 2010, Wiley. DOI: /asi.21255

10 Social Validation of Argumentative Debates
Scalability issue: which annotations should I read? Social validation = the degree of consensus of the group.

11 Social Validation of Argumentative Debates
Before: an annotation magma. After: a filtered display informing readers about how validated each annotation is.

12 Social Validation Algorithms: Overview
Two proposed algorithms:
 The Empirical Recursive Scoring Algorithm (Cabanac et al., 2005).
 An extension of the Bipolar Argumentation Framework, based on Artificial Intelligence research (Cayrol & Lagasquie-Schiex, 2005).
Validity ranges from -1 (socially refuted) through 0 (socially neutral) to 1 (socially confirmed).
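The recursive idea behind such scoring can be sketched as follows. This is a hypothetical simplification, not the published algorithm: it only illustrates how a validity score in [-1, 1] can aggregate the recursively weighted opinions of replies (the function name, weighting, and thread data are assumptions).

```python
# Hypothetical sketch of recursive social validity; NOT the published
# Empirical Recursive Scoring Algorithm, only an illustration of the idea.
def social_validity(replies):
    """replies: list of (polarity, sub_replies) pairs, where polarity is
    +1 (confirmation) or -1 (refutation) of the parent statement.
    Returns a score in [-1, 1]; 0 means socially neutral (no replies)."""
    if not replies:
        return 0.0
    scores = []
    for polarity, sub in replies:
        # A reply counts more when it is itself socially confirmed:
        # map its own validity from [-1, 1] to a weight in [0, 1].
        weight = (1 + social_validity(sub)) / 2
        scores.append(polarity * weight)
    return sum(scores) / len(scores)

# A refuted annotation: two refutations, one of them itself confirmed.
thread = [(-1, [(+1, [])]), (-1, [])]
print(social_validity(thread))
```

An annotation with no replies thus stays neutral, while a refutation that is itself confirmed pushes the score further towards -1.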

13 Social Validation Algorithm: Example
Computing the social validity of a debated annotation.

14 Validation with a User Study
Aim: compare the social validation algorithms with human perception of consensus. Design:
 Corpus: 13 discussion threads = 222 annotations and answers.
 Task of a participant: label each opinion type; infer the overall opinion.
 Volunteer subjects.

15 Experimenting the Social Validation of Debates
Q1: Do people agree when labeling opinions? Inter-rater agreement among n > 2 raters is measured with the Kappa coefficient (Fleiss, 1971; Fleiss et al., 2003). Kappa per debate ranged from poor to fair-to-good: weak agreement with variability, suggesting a subjective task.
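Fleiss' kappa generalizes chance-corrected agreement to n > 2 raters. A minimal stdlib implementation of the standard formula (the three-category example data below are invented for illustration):

```python
# Fleiss' kappa (Fleiss, 1971): inter-rater agreement among n > 2 raters.
# ratings[i][j] = number of raters who assigned item i to category j;
# every row must sum to the same number of raters.

def fleiss_kappa(ratings):
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Expected agreement by chance, from the marginal category proportions.
    p_e = sum(
        (sum(row[j] for row in ratings) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

# Three opinion labels (e.g., support / refutation / neutral),
# 4 raters labeling 3 annotations:
print(fleiss_kappa([[4, 0, 0], [2, 2, 0], [1, 1, 2]]))
```

Values near 0 mean agreement barely above chance, which is the "weak agreement" regime reported on the slide.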

16 Experimenting the Social Validation of Debates
Q2: How well does SV approximate HP? HP = human perception of consensus; SV = the social validation algorithm.
1. Test whether HP and SV differ significantly (α = 0.05).
2. Correlate HP and SV with Pearson's coefficient of correlation: r(HP, SV) = 0.48, a weak correlation. The density of HP - SV shows that HP = SV in 24% of all cases.
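Pearson's r is computable directly from the paired samples; the HP and SV values below are hypothetical, for illustration only:

```python
# Pearson's correlation coefficient r between two paired samples,
# e.g. human perception (HP) and the social validation score (SV).
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

hp = [0.8, -0.2, 0.5, 0.0, -0.6]   # hypothetical perceived consensus
sv = [0.6, 0.1, 0.7, -0.3, -0.4]   # hypothetical algorithm output
print(round(pearson_r(hp, sv), 2))
```

An r of 0.48, as reported, sits well below such a strong toy correlation: SV tracks HP only loosely.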

17 Digital Libraries: organization-based document similarity.
Question DL-3: How to harness a quiescent capital present in any community: its documents?
Guillaume Cabanac, Max Chevalier, Claude Chrisment, Christine Julien. "Organization of digital resources as an original facet for exploring the quiescent information capital of a community." International Journal on Digital Libraries, 11(4):239–261, Dec. 2010, Springer.

18 Documents as a Quiescent Wealth
Personal documents hold filtered, validated, organized information relevant to activities in the organization. Paradox: profitable, but under-exploited.
 Reason 1: folders and files are private.
 Reason 2: little manual sharing.
 Reason 3: no automated sharing.
Consequences: people resort to resources available outside of the community, yielding a weak ROI. Why look outside when it is already there?

19 How to Benefit from Documents in a Community?
Mapping the documents of the community: SOM (Kohonen, 2001), Umap (Triviumsoft), TreeMap (Fekete & Plaisant, 2001)… These tools have limitations. What is wanted instead:
 Same topics: find the documents with the same topics as D.
 Usage: find the documents that colleagues use with D.
The concept of usage: grouping documents ⇆ keeping stuff in common.

20 How to Benefit from Documents in a Community?
Organization-based similarities: inter-folder, inter-document, inter-user.

21 How to Help People to Discover/Find/Use Documents?
Purpose: offering a global view of the community, its people and their documents, based both on document contents and on document usage/organization. Requirement: non-intrusiveness and confidentiality.
Operational needs: find documents with related or with complementary materials; seeking people ⇆ seeking documents.
Managerial needs: visualize the global and individual activity; match a work position with its required documents.

22 Proposed System: Static Aspect
4 views = {documents, people} × {group, unit}:
1. A group of documents: main topics; usage groups.
2. A single document: who to liaise with? what to read?
3. A group of people: community of interest; community of use.
4. A single person: interests; similar users (potential help).

23 Outline of these Musings (section: Information Retrieval).

24 Information Retrieval: the tie-breaking bias in IR evaluation.
Question IR-1: Is document tie-breaking affecting the evaluation of Information Retrieval systems?
Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment. "Tie-breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation." In M. Agosti, N. Ferro, C. Peters, M. de Rijke, and A. F. Smeaton (Eds.), CLEF'10: Proceedings of the 1st Conference on Multilingual and Multimodal Information Access Evaluation, volume 6360 of LNCS, pages 112–123. Springer, Sep. 2010.

25 Measuring the Effectiveness of IR Systems
User-centered vs. system-focused evaluation (Spärck Jones & Willett, 1997). Evaluation campaigns:
 1958: Cranfield, UK.
 1992: TREC (Text Retrieval Conference), USA.
 1999: NTCIR (NII Test Collection for IR Systems), Japan.
 2001: CLEF (Cross-Language Evaluation Forum), Europe.
The "Cranfield" methodology: a task; a test collection (corpus, topics, qrels); measures such as MAP, computed with trec_eval (Voorhees, 2007).

26 Runs are Reordered Prior to Their Evaluation
Qrels = ⟨qid, iter, docno, rel⟩. Run = ⟨qid, iter, docno, rank, sim, run_id⟩. trec_eval reorders each run by qid asc, sim desc, docno desc before scoring, so an effectiveness measure (MAP, MRR…) is f(intrinsic_quality, tie-breaking).
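That reordering can be reproduced with two stable sorts (a sketch; the record layout is simplified to (qid, docno, sim), and the document numbers are invented):

```python
# Sketch of trec_eval's reordering: runs are sorted by qid ascending,
# then sim descending, with ties on sim broken by docno DESCENDING;
# the rank field submitted by participants is ignored. Two stable sorts
# applied in reverse priority order implement this compound key.
run = [
    # (qid, docno, sim): two documents tie on sim for topic 301
    (301, "AP890101-0001", 0.8),
    (301, "WSJ900101-0001", 0.8),
    (301, "FT911-0001", 0.9),
]
run.sort(key=lambda r: r[1], reverse=True)  # docno desc (the tie-breaker)
run.sort(key=lambda r: (r[0], -r[2]))       # qid asc, sim desc (stable)
print(run)
```

Among the tied 0.8 documents, the WSJ one now precedes the AP one purely because of its docno, which is how the collection a document comes from can sway the measured effectiveness.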

27 Consequences of Run Reordering
Measures of effectiveness for an IRS s, all sensitive to document rank:
 RR(s, t): 1/rank of the 1st relevant document, for topic t.
 P(s, t, d): precision at document d, for topic t.
 AP(s, t): average precision for topic t.
 MAP(s): mean average precision.
The tie-breaking bias: is the Wall Street Journal collection more relevant than Associated Press?
 Problem 1: comparing 2 systems, AP(s1, t) vs. AP(s2, t).
 Problem 2: comparing 2 topics, AP(s, t1) vs. AP(s, t2).

28 What we Learnt: Beware of Tie-breaking for AP
Tie-breaking has little effect on MAP, but a larger effect on per-topic AP. Measure bounds: AP_Realistic ≤ AP_Conventional ≤ AP_Optimistic. Failure analysis of the ranking process (e.g., padre1, adhoc'94): the error bar reveals an element of chance, hence potential for improvement.
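The optimistic and pessimistic AP bounds can be sketched by reordering tied documents before computing AP (a simplified illustration, not the paper's exact procedure; the document names and scores are invented):

```python
# Simplified illustration of AP bounds under tie-breaking: among documents
# with equal sim, the optimistic ordering places relevant documents first,
# the pessimistic one places them last; realistic AP lies in between.
from itertools import groupby

def average_precision(ranking, relevant):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant)

def ap_bounds(run, relevant):
    """run: (docno, sim) pairs; returns (pessimistic AP, optimistic AP)."""
    ordered = sorted(run, key=lambda r: -r[1])
    best, worst = [], []
    for _, group in groupby(ordered, key=lambda r: r[1]):
        g = list(group)
        best += sorted(g, key=lambda r: r[0] not in relevant)  # relevant first
        worst += sorted(g, key=lambda r: r[0] in relevant)     # relevant last
    return (average_precision([d for d, _ in worst], relevant),
            average_precision([d for d, _ in best], relevant))

# d2, d3, d4 tie on sim; only d3 is relevant.
print(ap_bounds([("d1", 0.9), ("d2", 0.8), ("d3", 0.8), ("d4", 0.8)], {"d3"}))
```

The gap between the two bounds is the "element of chance" the slide's error bars visualize: any conventional tie-break lands somewhere inside it.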

29 Information Retrieval: geographic IR.
Question IR-2: How to retrieve documents matching keywords and spatiotemporal constraints?
Damien Palacio, Guillaume Cabanac, Christian Sallaberry, Gilles Hubert. "On the evaluation of geographic information retrieval systems: Evaluation framework and case study." International Journal on Digital Libraries, 11(2):91–109, June 2010, Springer.

30 Geographic Information Retrieval
Query = "Road trip around Aberdeen summer 1982". A standard search engine only sees the terms {road, trip, Aberdeen, summer}; a geographic engine also resolves a spatial facet {AberdeenCity, AberdeenCounty…} and a temporal facet [21-JUN-1982 to SEP-1982]. About 1 query in 6 is geographic: Excite (Sanderson et al., 2004), AOL (Gan et al., 2008), Yahoo! (Jones et al., 2008). A current issue worth studying.

31 The Internals of a Geographic IR System
3 dimensions to process (topical, spatial, temporal), with 1 index per dimension:
 Topical: bag of words, stemming, weighting, comparison with the VSM…
 Spatial: spatial entity detection, spatial relation resolution…
 Temporal: temporal entity detection…
Query processing uses sequential filtering, e.g., priority to the theme, then filtering according to the other dimensions. Issue: how effective are GIRSs compared with state-of-the-art IRSs? Hypothesis: GIRSs perform better.

32 Case Study: the PIV GIR System
Indexing, with one index per dimension: topical = Terrier IRS; spatial = tiling; temporal = tiling.
Retrieval: identification of the 3 dimensions in the query; routing towards each index; combination of results with CombMNZ (Fox & Shaw, 1993; Lee, 1997).

33 Case Study: the PIV GIR System
Principle of CombMNZ and Borda Count.
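CombMNZ itself is simple: sum a document's normalized scores across the fused lists and multiply by the number of lists that retrieved it. A sketch (the three lists and their scores are invented):

```python
# Sketch of CombMNZ (Fox & Shaw, 1993): fuse several result lists by
# summing each document's normalized scores and multiplying by the
# number of lists that retrieved it.
def comb_mnz(result_lists):
    """result_lists: list of dicts mapping docno -> score in [0, 1].
    Returns (fused_score, docno) pairs, best first."""
    fused = {}
    for results in result_lists:
        for doc, score in results.items():
            total, hits = fused.get(doc, (0.0, 0))
            fused[doc] = (total + score, hits + 1)
    return sorted(
        ((total * hits, doc) for doc, (total, hits) in fused.items()),
        reverse=True,
    )

topical  = {"d1": 0.9, "d2": 0.4}
spatial  = {"d1": 0.5, "d3": 0.8}
temporal = {"d1": 0.2, "d2": 0.7}
print(comb_mnz([topical, spatial, temporal]))
```

The multiplication rewards documents retrieved along several dimensions, which is exactly what a geographic query needs; Borda count fusion is similar but sums rank-based points instead of scores.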

34 Case Study: the PIV GIR System
Gain in effectiveness.

35 Information Retrieval: effectiveness of query operators.
Question IR-3: Do operators in search queries improve the effectiveness of search results?
Gilles Hubert, Guillaume Cabanac, Christian Sallaberry, Damien Palacio. "Query Operators Shown Beneficial for Improving Search Results." In S. Gradmann, F. Borri, C. Meghini, H. Schuldt (Eds.), TPDL'11: Proceedings of the 1st International Conference on Theory and Practice of Digital Libraries, volume 6966 of LNCS, pages 118–129. Springer, Sep. 2011.

36 Search Engines Offer Query Operators
Various operators: quotation marks, must-appear (+), boosting operator (^), Boolean operators, proximity operators… Information need: "I'm looking for research projects funded in the DL domain", expressed either as a regular query or as a query with operators.

37 Our Research Questions
Q: Do query operators lead to improved search results?
 Q1: What is the maximum gain in effectiveness when enriching a query with operators?
 Q2: Do users succeed in formulating better queries involving operators?

38 Our Methodology in a Nutshell
Each regular query is enriched into query variants with operators: V1, V2, V3, V4, … VN.

39 Effectiveness of Query Operators
TREC-7 per-topic analysis, with boxplots for the '+' and '^' operators.

40 Effectiveness of Query Operators
Per-topic analysis over 32 topics: box plots compare, per topic, the AP (Average Precision) of TREC's regular query against the highest- and lowest-AP query variants.

41 Effectiveness of Query Operators
TREC-7 per-topic analysis for the '+' and '^' operators: comparison of the MAP of the regular queries with the MAP of the operator-enriched variants.

42 Outline of these Musings (section: Scientometrics).

43 Scientometrics: recommendation based on topics and social clues.
Question SCIM-1: How to recommend researchers according to their research topics and social clues?
Guillaume Cabanac. "Accuracy of inter-researcher similarity measures based on topical and social clues." Scientometrics, 87(3):597–620, June 2011, Springer.

44 Recommendation of Literature (McNee et al., 2006)
 Collaborative filtering: mining the preferences of researchers ("those who liked this paper also liked…"). Drawbacks: snowball effect and fads; what about innovation and topical relevance?
 Cognitive filtering: mining the contents of articles to build profiles of resources (researchers, articles) and exploit the citation graph.
 Hybrid approaches combine both.

45 Foundations: Similarity Measures Under Study
Model: a coauthorship graph (authors ↔ authors) and a venues graph (authors ↔ conferences/journals).
Social similarities:
 Inverse degree of separation: length of the shortest path.
 Strength of the tie: number of shortest paths.
 Shared conferences: number of shared conference editions.
Thematic similarity: cosine on the Vector Space Model, d_i = (w_i1, …, w_in), built on titles (per document / per researcher).
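The thematic similarity can be sketched as a cosine over term counts drawn from publication titles (a minimal VSM with raw counts; the study's actual weighting scheme may differ, and the two title sets below are invented):

```python
# Cosine similarity between two researchers' term-weight vectors built
# from the titles of their publications (raw term counts as weights).
from collections import Counter
from math import sqrt

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    return dot / (sqrt(sum(v * v for v in a.values()))
                  * sqrt(sum(v * v for v in b.values())))

r1 = Counter("social validation of collective annotations".split())
r2 = Counter("social clues for recommending collective annotations".split())
print(round(cosine(r1, r2), 3))
```

Researchers sharing many title terms score close to 1; disjoint vocabularies score 0, which is what the social clues are then used to re-rank.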

46 Computing Similarities with Social Clues
Task: literature review. Requirement: topical relevance. Preference: social proximity (meetings, projects…). Topical results are therefore re-ranked with social clues: the degree-of-separation, strength-of-ties, and shared-conferences lists are fused with CombMNZ (Fox & Shaw, 1993) into a social list, which is in turn fused with the topical list into the final TS list of recommended researchers.

47 Evaluation Design
Comparison of recommendations with the researchers' perception:
 Q1: How effective are topical-only recommendations?
 Q2: What gain comes from integrating social clues?
IR experiments follow the Cranfield paradigm (TREC…): does the search engine retrieve relevant documents for an input topic over a corpus? An assessor judges relevance (binary {0, 1} or graded [0, N]), producing qrels; trec_eval then computes effectiveness measures (Mean Average Precision, Normalized Discounted Cumulative Gain), and average improvements (e.g., +12.3%) are tested for significance (p < 0.05, paired t-test).

48 Evaluating Recommendations
Adaptation of the Cranfield paradigm (TREC…): the search engine becomes a recommender system, retrieved documents become researchers, and the input topic becomes the name of a researcher. Each subject assessed the top 25 topical and topical + social recommendations, answering "With whom would you like to chat for improving your research?"

49 Experiment Features
 Data: dblp.xml (713 MB = 1.3M publications by 811,787 researchers).
 Subjects: 90 researchers contacted by mail; 74 began to fill in the questionnaire, 71 completed it.
 An interface let subjects assess the recommendations.

50 Experiments: Profile of the Participants
Experience (seniority) of the 71 subjects: Mdn = 13 years. Productivity: Mdn = 15 publications.

51 Empirical Validation of our Hypothesis
Strong baseline: an effective VSM-based topical approach. Social clues significantly improved topical recommendations (p < 0.05; n = 70): NDCG gains of +8.49%, +10.39%, +7.03%, +6.50%, and +10.22% across the productivity and experience (years) groups of participants.

52 Scientometrics: landscape of research in Information Systems.
Question SCIM-2: What is the landscape of research in Information Systems from the perspective of gatekeepers?
Guillaume Cabanac. "Shaping the landscape of research in Information Systems from the perspective of editorial boards: A scientometric study of 77 leading journals." Journal of the American Society for Information Science and Technology, 63, to appear in 2012, Wiley. DOI: /asi.22609

53 Landscape of Research in Information Systems: the gatekeepers of science.

54 Landscape of Research in Information Systems: the 77 core peer-reviewed IS journals in the WoS.

55 Landscape of Research in Information Systems: exploratory data analysis.

56 Landscape of Research in Information Systems: exploratory data analysis (continued).

57 Landscape of Research in Information Systems: topical map of the IS field.

58 Landscape of Research in Information Systems: most influential gatekeepers.

59 Landscape of Research in Information Systems: number of gatekeepers per country.

60 Landscape of Research in Information Systems: geographic and gender diversity.

61 Scientometrics: the submission-date bias in peer-reviewed conferences.
Question SCIM-3: What if submission date influenced the acceptance of conference papers?
Guillaume Cabanac. "What if submission date influenced the acceptance of conference papers?" Submitted to the Journal of the American Society for Information Science and Technology, Wiley.

62 Conferences Affected by a Submission-Date Bias? Peer review.

63 The Submission-Date Bias: dataset from the ConfMaster conference management system.

64 The Submission-Date Bias: influence of submission date on bids.

65 The Submission-Date Bias: influence of submission date on average marks.

66 Conclusion
 Digital Libraries: collective annotations; social validation of discussion threads; organization-based document similarity.
 Information Retrieval: the tie-breaking bias in IR evaluation; geographic IR; effectiveness of query operators.
 Scientometrics: recommendation based on topics and social clues; landscape of research in Information Systems; the submission-date bias in peer-reviewed conferences.

67 Thank you

