Australian Document Computing Conference Dec 3 2011 Information Retrieval in Large Organisations Simon Kravis Copyright 2010 Fujitsu Limited.

1 Australian Document Computing Conference Dec 3 2011 Information Retrieval in Large Organisations Simon Kravis Copyright 2010 Fujitsu Limited

2 FUJITSU CONFIDENTIAL Large Organisations
Can’t rely on personal contacts to obtain information
Have difficulty in storing and retrieving information
Often use multiple systems for storing information
–Paper Files
–Shared Filesystems
–Document Management Systems
–Intranets (SharePoint)
–Specialised Systems (eg TRIM, Documentum, Alfresco)
Are only interested in Internet style search to meet legal challenges

3 Paper files
Well understood
Easy to manage
Can be stored over hundreds of years
Expensive to store and search
Most documents now ‘born digital’
Copyright 2011 Fujitsu Limited

4 Electronic Documents
Cheap to create, exchange and store in the short term
Price of powerful applications is poor management

5 Filesystems
Files are building blocks of
–Operating Systems
–Applications
Desktop applications commonly store electronic documents as files
Hardware costs of storage have become very low
Difficult to model statistically
–many attributes follow power laws (files/folder, file size, subfolders, file types)

6 Why shared filesystems?
Cheap & simple
Access to documents from different computers
Support collaborative work

7 Shared Filesystem Organisation
Multiple volumes, often based on organisational structure
Tree structure of folders and files
User and Group areas
Permissions based on user ID and group membership
Higher levels of folder trees usually controlled by administrators

8 Are shared filesystems unstructured?
Folder tree represents a high degree of structure created by users
Local but not global consistency
Users structure folder trees to facilitate their own work
Structures are usually highly efficient information stores
Small survey of users in an IT service company in 2005 showed that only 1 user out of 12 had spent more than 15 mins/day looking for files on share drives over past week

9 Filesystem volume growth & effect of quotas
3000 users, 90 volumes
–Basically linear with small acceleration
–Linear component = 190 Gbytes/month
–600 Mbytes/month/user
–Growth acceleration = 7 Gbytes/month²
22,000 users, 328 user and group volumes
–Quadratic fit to cleaned data before quotas
–Linear component = 160 Gbytes/month
–7 Mbytes/month/user
–Growth acceleration = 0.07 Gbytes/month²
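The fitted figures on this slide can be read as a simple quadratic growth model. The sketch below is only an illustration: it assumes that "growth acceleration" means the second derivative of volume with respect to time (so the quadratic coefficient is half of it), and `start_gb` is a hypothetical starting volume that the slide does not give.

```python
def projected_volume_gb(months, start_gb, linear_gb_per_month, accel_gb_per_month2):
    """Volume after `months`, assuming V(t) = V0 + b*t + 0.5*a*t**2.

    Treating the acceleration as a second derivative is an assumption;
    the slide does not say how the quadratic term was defined.
    """
    return (start_gb
            + linear_gb_per_month * months
            + 0.5 * accel_gb_per_month2 * months ** 2)


# Slide figures for the two sites; start_gb=0.0 is a placeholder,
# so the results are growth over 12 months rather than absolute volume.
growth_site_a = projected_volume_gb(12, 0.0, 190.0, 7.0)    # 3000-user site
growth_site_b = projected_volume_gb(12, 0.0, 160.0, 0.07)   # 22,000-user site
print(growth_site_a, growth_site_b)
```

Under these assumptions the acceleration term contributes about 500 Gbytes of the first site's twelve-month growth but only about 5 Gbytes of the second's, which matches the slide's description of growth as basically linear.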

10 Volume and count profiles (Financial Services)

11 File Size and Count Profile
Size range covers 5 orders of magnitude
50% of volume used by 3% of files
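A figure such as "50% of volume used by 3% of files" can be computed from a list of file sizes by sorting largest first and counting how many files are needed to reach half the total. The sketch below is illustrative only; the function name and the sample sizes are invented, not taken from the presentation.

```python
def fraction_of_files_for_volume_share(sizes, share=0.5):
    """Smallest fraction of files (largest first) holding `share` of total volume."""
    if not sizes:
        return 0.0
    target = share * sum(sizes)
    running = 0
    # Walk the files from largest to smallest until the target volume is reached.
    for count, size in enumerate(sorted(sizes, reverse=True), start=1):
        running += size
        if running >= target:
            return count / len(sizes)
    return 1.0


# Invented sample: one large file dominates the volume.
sample_sizes = [1000, 500, 10, 10, 10, 10, 10, 10, 10, 10]
print(fraction_of_files_for_volume_share(sample_sizes))
```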

12 Why filesystems are like poorly sorted soil
Most of volume taken up by large particles

13 Duplication by count and volume
Volume and count spectra usually different – vol savings seldom > 20% from de-duplication
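The difference between duplication by count and by volume can be illustrated by grouping files on a content hash and comparing the two savings. This is a minimal sketch, not the method used in the presentation: it assumes exact duplicates are identified by SHA-256 of the file content, and the file names and contents are invented.

```python
import hashlib
from collections import defaultdict


def duplication_savings(files):
    """Return (count_saving, volume_saving) as fractions of the totals.

    `files` maps a path to its content as bytes. One copy of each
    distinct content is kept; every other copy counts as a saving.
    """
    groups = defaultdict(list)
    for data in files.values():
        groups[hashlib.sha256(data).hexdigest()].append(len(data))
    total_count = len(files)
    total_volume = sum(len(data) for data in files.values())
    if total_count == 0 or total_volume == 0:
        return 0.0, 0.0
    kept_count = len(groups)
    kept_volume = sum(sizes[0] for sizes in groups.values())
    return ((total_count - kept_count) / total_count,
            (total_volume - kept_volume) / total_volume)


# Invented sample: three small identical documents and one large unique file.
sample_files = {
    "a.doc": b"x" * 10,
    "b.doc": b"x" * 10,
    "c.doc": b"x" * 10,
    "big.iso": b"y" * 970,
}
count_saving, volume_saving = duplication_savings(sample_files)
print(count_saving, volume_saving)
```

In this sample half of the files are redundant copies, yet removing them frees only 2% of the volume, because the duplicated files are small.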

14 File Use Profiles – 6500 accesses to 3.5 million files over 21 days by 145 users
2 accesses per user per day
About 3 read accesses for every modification
Files on share drives not frequently shared between users
Files accessed many times by many users are applications
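The per-user rate follows directly from the totals in the slide title. The short calculation below reproduces it and also derives an upper bound on the share of files touched, which is not stated on the slide but follows from the same numbers (each access touches at most one new file).

```python
accesses = 6500
files_total = 3_500_000
days = 21
users = 145

# Average accesses per user per day over the observation period.
per_user_per_day = accesses / (users * days)

# At most one distinct file per access, so this bounds the fraction of files used.
max_fraction_touched = accesses / files_total

print(round(per_user_per_day, 2), f"{max_fraction_touched:.2%}")
```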

15 Text Documents in Large Organisations
Mainly created by desktop applications (Office)
Usually comprise 15-20% of file count, 10-15% of volume
Collections used by different parts of the organisation
Small collections often very intensively used
Collateral for service companies

16 Duplication in 12,00 text documents from software development project
Exact
Near (Document Vector Comparison)
Similar cluster spectra for 40,000 text documents from Govt. Department

17 Evaluating Measures of Near-Duplication
Very large parameter space to test
–Document vector generation, matching algorithm, matching level
False positives detected by sampling cluster
Very difficult to detect false negative clustering
–Do documents with similar names have similar content?
Trigram matching – very compute-intensive
Most clusters are versions of documents
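One common form of trigram matching compares the sets of three-word sequences in two documents. The sketch below is an assumption about what such a measure could look like, not the presentation's actual algorithm: it uses word trigrams and Jaccard similarity, and the function names and sample sentences are invented.

```python
def word_trigrams(text):
    """Set of consecutive three-word sequences, lower-cased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two documents' word-trigram sets."""
    ta, tb = word_trigrams(a), word_trigrams(b)
    union = ta | tb
    if not union:
        # Both texts have fewer than three words: nothing to compare.
        return 0.0
    return len(ta & tb) / len(union)


v1 = "the system shall produce monthly reports for all users"
v2 = "the system shall produce weekly reports for all users"
similarity = trigram_similarity(v1, v2)
print(similarity)
```

Comparing every pair of documents this way takes a number of comparisons that grows with the square of the collection size, which is one reason such matching is compute-intensive. The "matching level" would then be a threshold on the similarity score, chosen by experiment.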

18 Example of correct clustering
10 versions of the same file, all in same folder

19 Example of incorrect clustering
RfA Diagram2.rtf
UI navigation diagrams 010210.RTF
Same 3 words – different pictures

20 Information Retrieval by Search for Internal Collections
Few or no hyperlinks
Composite documents are common
Documents frequently have implicit content
High level of near duplication
Search terms are often commonly occurring words or phrases
-> Poor search results when compared to Internet search
Users prefer to ask people or browse

21 Is tagging the answer?
Sparse access means that common tags don’t emerge

22 What might help?
Automated tagging
–Training sets
–Synonym groups
–Learning required to adapt to rapidly changing vocabulary
Extraction of document headings & captions
–“Find a good paragraph on reporting capability”
Clustering of similar documents
–“Find the most recent version of this document” is a very common requirement
Using a document management system with version control
–Presence of a capability doesn’t mean it will be used
–Cluster spectra of documents in DMS very similar to filesystem for software development docs
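Once similar documents have been clustered, the "most recent version" request reduces to picking the newest member of a cluster. The sketch below is hypothetical: the `Doc` record, the cluster identifiers, and the use of modification time as the measure of recency are all assumptions, and modification times can mislead when files are copied or restored.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    path: str
    modified: float   # modification time as a POSIX timestamp
    cluster_id: int   # assigned by some earlier clustering step (assumed)


def latest_per_cluster(docs):
    """Map each cluster id to its most recently modified document."""
    latest = {}
    for doc in docs:
        current = latest.get(doc.cluster_id)
        if current is None or doc.modified > current.modified:
            latest[doc.cluster_id] = doc
    return latest


# Invented sample: two versions of a report and one unrelated document.
sample_docs = [
    Doc("reports/status_v1.doc", 1_300_000_000.0, 1),
    Doc("reports/status_v2.doc", 1_310_000_000.0, 1),
    Doc("design/overview.doc", 1_305_000_000.0, 2),
]
latest = latest_per_cluster(sample_docs)
print(latest[1].path)
```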

