Document Data Mining Design Review
November 18, 2010
Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips
Advisor: Gregory Donohoe, Ph.D.



The Problem
The State Board collects meeting minutes and other documents that record its decisions
Board members want to retrieve text from old documents that relates to current issues
– May not recall when an issue was discussed
– May not know the exact keywords to search for

The Existing Solution
Currently, all files sit on a large, unorganized shared network drive.
Finding information recorded in a document requires knowing when it was recorded and which document holds it.

Requirements / Design Decisions

Multiple File Types
System limited to the major file types
– Word documents (.doc, .docx)
– PDF files (.pdf)
– Excel (.xls, .xlsx)
– Text (.txt)
Lacking
– WordPerfect (.wpd)
– PDF files that were scanned in
– Open Office document types
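A minimal sketch of how the supported-type list above could be checked before a file is indexed. It is written in Python for brevity and is not the team's code; the names `SUPPORTED` and `is_supported` are hypothetical.

```python
from pathlib import PurePath

# Extensions listed as supported on the slide (assumed to be matched case-insensitively).
SUPPORTED = {".doc", ".docx", ".pdf", ".xls", ".xlsx", ".txt"}

def is_supported(filename):
    """Return True when the file extension is one of the supported types."""
    return PurePath(filename).suffix.lower() in SUPPORTED

print(is_supported("Minutes.PDF"))  # supported extension
print(is_supported("old.wpd"))      # WordPerfect is listed as lacking
```

An extension check alone cannot tell a scanned PDF from a text PDF, so the "scanned in" case would still need separate handling.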

Multi-User Access
Web Based
Pros:
– Information searchable anywhere
– Only one index required
– Index on a regular basis without interruption
Cons:
– File permissions
Individual User Application
Pros:
– Can be programmed to learn user behavior
– Applies more emphasis to files the user has used before (looks at search history to aid new searches)
Cons:
– Software package installed on each user's machine

Search Collection of Documents Efficiently
Real Time Searching
– Pros: Easy; no initial overhead
– Cons: Time consuming (> 100,000 words); unable to find non-exact search results
Reverse Indexing
– Pros: Fast and efficient; able to find useful information without the exact search text known
– Cons: Large initial overhead (pre-analyze all documents); index file must be kept up to date; storage space necessary
Results displayed in less than a second
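The reverse indexing option (often called an inverted index) can be illustrated with a small Python sketch. This is not the team's implementation; the function names, the tokenizer, and the sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Pre-analyze all documents: map each word to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(position)
    return index

def lookup(index, word):
    """Return the sorted names of documents that contain the word."""
    return sorted(index.get(word.lower(), {}))

# Hypothetical sample documents.
docs = {
    "minutes_a.txt": "The board approved the budget",
    "minutes_b.txt": "Budget review was postponed",
}
index = build_index(docs)
print(lookup(index, "budget"))
```

The up-front pass over every document is the "large initial overhead" named in the cons; after that, a query is a dictionary lookup rather than a scan of every file.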


Find Useful Information Without Exact String Specification (A: Stemming)
Create our own
– Pros: Pay attention to details that may be lacking in existing algorithms (aglet vs. readable); more efficient; define special cases
– Cons: Requires a lot of time
Use existing algorithm
– Pros: Readily available; spend more time on other important details
– Cons: Special cases incorrect; some root words are truncated

Porter Stemming Algorithm
Large set of steps, based on English natural language, that determine the root of a word
Extensively used in programs
Outdated: results not always correct
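To show what stemming does and why "special cases" go wrong, here is a deliberately naive suffix stripper in Python. It is not the Porter algorithm, which applies a much larger, ordered set of rules; the suffix list and the `min_stem` threshold are assumptions chosen for illustration.

```python
# Hypothetical suffix list, far shorter than Porter's rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("approved", "approves", "meetings", "policies", "policy"):
    print(w, "->", naive_stem(w))
```

"approved" and "approves" collapse to the same truncated stem, which is what lets a search match both. "policies" and "policy" do not meet, which is the kind of special case a hand-built stemmer would have to define.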

Find Useful Information Without Exact String Specification (B: Thesaurus)
Own Model
– Pros: Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
– Cons: Very time consuming and complex
Using pre-built Thesaurus
– Pros: Quick and easy to use; very extensive
– Cons: Has irrelevant search term results; unnecessary terms for State Board
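One way to combine the two options is to start from a pre-built thesaurus and keep only synonyms that occur in the index file. The Python sketch below assumes the thesaurus is a plain dictionary and the index vocabulary is a set; both structures and the sample data are hypothetical.

```python
def expand_query(word, thesaurus, index_terms):
    """Return the word plus any of its synonyms that actually occur in the index."""
    synonyms = thesaurus.get(word, [])
    return [word] + [s for s in synonyms if s in index_terms and s != word]

# Hypothetical data: a broad thesaurus entry and the words present in the index.
thesaurus = {"budget": ["funds", "allocation", "purse"]}
index_terms = {"budget", "funds", "allocation", "minutes"}

print(expand_query("budget", thesaurus, index_terms))
```

Filtering against the index drops "purse" here, so the search never spends time on a term that cannot match any document.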

Searching
User types in search criteria
– Determine whether they want narrow or broad search results
– Broad search may retrieve too many results
Search algorithm converts each typed word into a list of possible stems and synonyms
Tries all possible permutations of the words to find the closest match to the search
Calculates the standard deviation of the distance between all of the words
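The slide does not spell out how the permutations and the standard deviation are combined, so the Python sketch below is only one possible reading: enumerate the orderings of the query words, and measure how evenly the matched words are spaced in a document. The function names and sample positions are assumptions.

```python
from itertools import permutations
from statistics import pstdev

def orderings(words):
    """All orderings of the query words, original order first."""
    return list(permutations(words))

def gap_spread(positions):
    """Population standard deviation of the gaps between consecutive sorted positions."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pstdev(gaps) if gaps else 0.0

print(len(orderings(["state", "board", "budget"])))
print(gap_spread([3, 4, 5]))   # adjacent words, evenly spaced
print(gap_spread([3, 4, 20]))  # one word far from the others
```

Under this reading a low spread suggests the words appear together as a phrase, while a high spread suggests they are scattered through the document.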

Searching (cont.)
Each file is ranked based on the number of matches it contains
– Exact matches rank highest
– Reordered exact matches rank next
– Stems, synonyms, partial matches, and large spacing between searched words rank lowest
All rank values found inside a file are summed
Highest-ranked files are considered most relevant
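The ranking rule can be sketched in Python as a weighted sum. The three weights and the match labels are hypothetical; the slide gives only the ordering (exact, then reordered, then everything else), not numeric values.

```python
# Hypothetical weights that respect the ordering on the slide.
WEIGHTS = {"exact": 3, "reordered": 2, "loose": 1}

def score_file(match_kinds):
    """Sum the rank value of every match found inside one file."""
    return sum(WEIGHTS[kind] for kind in match_kinds)

def rank_files(matches_by_file):
    """Return file names ordered from highest to lowest summed score."""
    return sorted(
        matches_by_file,
        key=lambda name: score_file(matches_by_file[name]),
        reverse=True,
    )

# Hypothetical match lists per file.
matches = {
    "minutes_2008.doc": ["exact", "loose"],
    "minutes_2009.doc": ["reordered"],
    "agenda.pdf": ["loose"] * 5,
}
print(rank_files(matches))
```

Because the values are summed, a file with many low-ranked matches can outrank a file with a single exact match, as "agenda.pdf" does here. Whether that is desirable depends on the weights chosen.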


Unit Testing

Unit Testing
Benefits
Goal
Facilitates change
Limitations
Not omnipotent
Low cost performance

Unit Testing
DocumentTest:
    /// Returns the document location
    public void getFileLocationTest()
    {
        convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
        string expected = "D:\\Class\\test.pdf";
        string actual = converpdf.getFileLocation();
        // Assert.AreEqual takes the expected value first, then the actual value.
        Assert.AreEqual(expected, actual);
    }

Unit Testing
    /// Creates a word count, in alphabetical order, for all words located inside the PDF
    public void createDictionaryTest()
    {
        convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
        string toDictionary = "this is test code code code";
        converpdf.createDictionary(toDictionary);
        int actual;
        // "code" appears three times in the input string.
        converpdf.WordCounts.TryGetValue("code", out actual);
        Assert.AreEqual(3, actual);
    }

End of Semester Status
Goals:
– Working, tested prototype
– Documentation for future teams
Plenty of areas open for extension or improvement

Future Possibilities: File Types
Currently supported file types
– Microsoft Word
– Microsoft Excel
– PDF
No optical character recognition
Our system will allow for easy extension


Future Possibilities: Indexing
We have a relatively simple indexing scheme
More complex indexing would lead to decreased search time
Our indexing scheme is very general
– Could be specific to the State Board
– Could lead to more relevant results

Future Possibilities: Searching
Search time increases quickly as search terms are added
Thesaurus is broad
– Large number of synonyms can slow search
– Could be trimmed to fit domain
Porter stemming algorithm could be replaced
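A rough Python estimate of why search time grows so quickly. It assumes, as one reading of the Searching slides, that every ordering of every combination of stems and synonyms is tried; the function name and the figure of five variants per term are hypothetical.

```python
from math import factorial, prod

def candidate_count(variants_per_term):
    """Combinations of one variant per term, times the orderings of the terms."""
    return prod(variants_per_term) * factorial(len(variants_per_term))

# Hypothetical: each search term expands to 5 stems and synonyms.
for terms in (2, 3, 4):
    print(terms, "terms ->", candidate_count([5] * terms), "candidates")
```

Under these assumptions both factors grow with each added term, which is why trimming the thesaurus to the domain reduces the work per query.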

Future Possibilities: Correlation
Related documents should be correlated
– By date?
– Using a tagging system?

Future Possibilities: Decision Database
A client need that is not addressed by our software
Many board decisions have been passed, with varying lifetimes
A database could track all board decisions and their lifespans
Possible connection to our search engine?

Future Possibilities: Web-Based Interface
Software will be installed on each user's computer
GUI could be web based, with access restricted to State Board employees
Users could search from home or while on the road, not just in the office
Indexing would be simplified

Questions?