Australian Document Computing Conference, Dec 3 2011
Information Retrieval in Large Organisations
Simon Kravis
Copyright 2010 Fujitsu Limited.

FUJITSU CONFIDENTIAL
Large Organisations
Can’t rely on personal contacts to obtain information
Have difficulty in storing and retrieving information
Often use multiple systems for storing information
  – Paper files
  – Shared filesystems
  – Document management systems
  – Intranets (SharePoint)
  – Specialised systems (e.g. TRIM, Documentum, Alfresco)
Are only interested in Internet-style search to meet legal challenges

Filesystems
Files are building blocks of
  – Operating systems
  – Applications
Desktop applications commonly store electronic documents as files
Hardware costs of storage have become very low
Difficult to model statistically
  – Many attributes follow power laws (files per folder, file size, subfolders, file types)
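One way to see why power-law attributes are awkward to model is to estimate the exponent directly. The sketch below uses the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x / x_min)). The file sizes are synthetic, and the exponent 2.5 and cut-off `x_min` are illustrative assumptions, not values from the talk.

```python
import math


def power_law_exponent(values, x_min):
    """Maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha), x >= x_min.

    Assumes at least one value is strictly greater than x_min.
    """
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


def synthetic_sizes(alpha, x_min, n):
    """Deterministic sample: evenly spaced quantiles of a continuous power law."""
    return [
        x_min * (1.0 - (i + 0.5) / n) ** (-1.0 / (alpha - 1.0))
        for i in range(n)
    ]


# Made-up file sizes in bytes (hypothetical, not data from the talk).
sizes = synthetic_sizes(alpha=2.5, x_min=1024.0, n=10_000)
alpha_hat = power_law_exponent(sizes, x_min=1024.0)

# In a heavy-tailed sample the mean sits well above the median,
# so a single "typical file size" is a poor summary.
mean_size = sum(sizes) / len(sizes)
median_size = sorted(sizes)[len(sizes) // 2]
```

On real filesystem data the choice of `x_min` matters a great deal, and discrete attributes such as files per folder would need a discrete estimator instead.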

Shared Filesystem Organisation
Multiple volumes, often based on organisational structure
Tree structure of folders and files
User and Group areas
Permissions based on user ID and group membership
Higher levels of folder trees usually controlled by administrators

Are shared filesystems unstructured?
Folder tree represents a high degree of structure created by users
Local but not global consistency
Users structure folder trees to facilitate their own work
Structures are usually highly efficient information stores
  – A small 2005 survey of users in an IT service company found that only 1 user out of 12 had spent more than 15 minutes a day looking for files on shared drives over the past week

Filesystem volume growth & effect of quotas
3000 users, 90 volumes
  – Basically linear with small acceleration
  – Linear component = 190 GBytes/month (600 MBytes/month/user)
  – Growth acceleration = 7 GBytes/month²
22,000 users, 328 user and group volumes
  – Quadratic fit to cleaned data before quotas
  – Linear component = 160 GBytes/month (7 MBytes/month/user)
  – Growth acceleration = 0.07 GBytes/month²
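A quadratic fit of this kind can be sketched with ordinary least squares. The model assumed here is V(t) = V0 + r*t + 0.5*a*t², so that r is the linear component and a is the acceleration (the second derivative). Whether the slide's "growth acceleration" means a or the raw t² coefficient is not stated, so that reading is an assumption. The monthly volumes below are synthetic and only borrow the 190 and 7 figures to make the example recognisable.

```python
def fit_quadratic(ts, vs):
    """Least-squares fit of v = c0 + c1*t + c2*t**2; returns (c0, c1, c2)."""
    # Power sums that make up the 3x3 normal equations.
    s = [sum(t ** k for t in ts) for k in range(5)]
    b = [sum(v * t ** k for t, v in zip(ts, vs)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [b[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return tuple(a[i][3] / a[i][i] for i in range(3))


# Synthetic monthly totals in GBytes (hypothetical, not the talk's data).
months = list(range(24))
volumes = [5000 + 190 * t + 0.5 * 7 * t * t for t in months]

c0, c1, c2 = fit_quadratic(months, volumes)
linear_component = c1      # GBytes/month
acceleration = 2 * c2      # GBytes/month^2, under the assumed model
```

Real volume data would first need the cleaning step the slide mentions, for example removing discontinuities caused by volume migrations or the introduction of quotas.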

File Use Profiles: 6500 accesses to 3.5 million files over 21 days by 145 users
2 accesses per user per day
About 3 read accesses for every modification
Files on shared drives are not frequently shared between users
Files accessed many times by many users are applications
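The per-user rate follows directly from the totals in the slide title. The short check below also derives an upper bound on the share of files touched during the period, assuming each access touches one file; the function names are illustrative.

```python
def accesses_per_user_per_day(accesses, users, days):
    """Average number of file accesses per user per day."""
    return accesses / (users * days)


def max_fraction_touched(accesses, files):
    """Upper bound on the fraction of files accessed at least once.

    Assumes each access touches one file; repeated accesses to the
    same file only make the true fraction smaller.
    """
    return min(1.0, accesses / files)


rate = accesses_per_user_per_day(accesses=6500, users=145, days=21)
touched = max_fraction_touched(accesses=6500, files=3_500_000)
```

With these figures `rate` rounds to the 2 accesses per user per day quoted on the slide, and `touched` is below 0.2%, so the great majority of stored files were not accessed at all during the three weeks.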

Text Documents in Large Organisations
Mainly created by desktop applications (Office)
Usually comprise 15-20% of file count, 10-15% of volume
Collections used by different parts of the organisation
Small collections often very intensively used
Collateral for service companies

Duplication in 12,00 text documents from software development project
[Figure panels: Exact; Near (Document Vector Comparison)]
Similar cluster spectra for 40,000 text documents from Govt. Department

Evaluating Measures of Near-Duplication
Very large parameter space to test
  – Document vector generation, matching algorithm, matching level
False positives detected by sampling clusters
Very difficult to detect false negative clustering
Do documents with similar names have similar content?
Trigram matching is very compute-intensive
Most clusters are versions of documents
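The three parameters named above (vector generation, matching algorithm, matching level) can be made concrete with a small sketch. It is not the method used in the talk: it assumes word-count vectors, cosine similarity, a fixed threshold `level`, and single-link clustering, all chosen only for illustration. The document names and texts are made up.

```python
import math
import re
from collections import Counter


def doc_vector(text):
    """Document vector generation: term counts over lowercase word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Matching algorithm: cosine similarity of two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def cluster_near_duplicates(docs, level=0.9):
    """Single-link clusters of documents whose similarity reaches `level`."""
    names = list(docs)
    vecs = {n: doc_vector(docs[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Every pair is compared, so cost grows with the square of the collection.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vecs[a], vecs[b]) >= level:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


docs = {
    "spec_v1.doc": "The reporting module exports monthly usage figures to the finance system",
    "spec_v2.doc": "The reporting module exports monthly usage figures to the finance systems",
    "minutes.doc": "Minutes of the project steering committee meeting",
}
clusters = cluster_near_duplicates(docs, level=0.9)
```

Here the two specification drafts fall into one cluster and the minutes stay separate. Lowering `level` merges more documents and raises the false-positive rate, which is the trade-off that sampling clusters is meant to expose.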

Information Retrieval by Search for Internal Collections
Few or no hyperlinks
Composite documents are common
Documents frequently have implicit content
High level of near duplication
Search terms are often commonly occurring words or phrases
-> Poor search results when compared to Internet search
Users prefer to ask people or browse

What might help?
Automated tagging
  – Training sets
  – Synonym groups
  – Learning required to adapt to rapidly changing vocabulary
Extraction of document headings & captions
  – “Find a good paragraph on reporting capability”
Clustering of similar documents
  – “Find the most recent version of this document” is a very common requirement
Using a document management system with version control
  – Presence of a capability doesn’t mean it will be used
  – Cluster spectra of documents in DMS very similar to filesystem for software development docs
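Once similar documents have been clustered, the "most recent version" request reduces to picking the newest member of a cluster. The sketch below assumes the cluster and the modification times are already available; the paths and dates are hypothetical.

```python
from datetime import datetime


def latest_version(cluster, modified):
    """Return the member of `cluster` with the newest modification time."""
    return max(cluster, key=lambda name: modified[name])


# Hypothetical cluster of near-duplicate drafts and their modification times.
# On a real filesystem the times could come from os.path.getmtime.
cluster = ["proposal_draft.doc", "proposal_final.doc", "proposal_final_v2.doc"]
modified = {
    "proposal_draft.doc": datetime(2011, 3, 1, 9, 30),
    "proposal_final.doc": datetime(2011, 3, 14, 16, 5),
    "proposal_final_v2.doc": datetime(2011, 3, 9, 11, 45),
}
newest = latest_version(cluster, modified)
```

In this made-up example the file named `proposal_final_v2.doc` is not the newest one, which illustrates why file names alone are an unreliable guide to versions. Modification time is itself only a proxy, since copying or migrating files can reset it.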
