Final Project of Information Retrieval and Extraction by d93921022 吳蕙如.



Working Environment
OS: Linux 7.3
CPU: C800MHz
Memory: 128 MB
Tools used:
–stopper
–stemmer
–trec_eval
–sqlite
Languages used:
–shell script: controls the inverted-file indexing procedures
–AWK: extracts the needed parts from the documents
–SQL: used when adopting the file-format database sqlite

First Indexing Trial
1.FBIS Source Files
2.Document Separation 18'51" + 55'13"
3.Documents Pass Stemmer 33'52" + 1:00'58"
4.Documents Pass Stopper 33'23" + 1:09'29"
5.Words Sort by AWK 44'07" + 1:19'09"
6.Term Frequency Count and Inverted File Indexing (one file per word) > 9 hours, never finished
(The two times given for each step are for FBIS 3 and FBIS 4.)
The most direct way to organize the indexing procedure is step by step, so in this first trial I ran each step separately and saved its result as the input to the next step. However, as the directory grew, the time cost of writing each new file increased out of control. The time to generate the index files looked unacceptable, and the run was stopped after 9 hours.
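The step-by-step flow above can be sketched as a short shell script. This is only an illustration: the stemmer and stopper are hypothetical stand-ins (tr and grep), and all file names are assumptions, since the original scripts are not shown.

```shell
#!/bin/sh
# Sketch of the first trial: every stage writes its whole output to disk
# before the next stage reads it, and the final stage creates one index
# file per word.
set -e
mkdir -p work
printf 'Running runs runner\nthe a an\n' > work/doc1.txt   # toy "document"

tr ' ' '\n' < work/doc1.txt    > work/doc1.sep    # document separation
tr 'A-Z' 'a-z' < work/doc1.sep > work/doc1.stem   # stand-in "stemmer"
grep -v -w -e the -e a -e an work/doc1.stem > work/doc1.stop  # stand-in "stopper"

sort work/doc1.stop > work/doc1.sorted            # words sort

# One file per word: each new file forces the filesystem to search a
# growing directory, which is where the trial stalled.
mkdir -p work/index
while read w; do
  echo doc1 >> "work/index/$w"
done < work/doc1.sorted
ls work/index
```

At toy scale this is instant; the directory search cost only shows up with hundreds of thousands of word files.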

Second Indexing Trial
1.FBIS Source Files
2.Document Separation 23'29" + 58'36"
3.Documents Pass Stemmer 30'05" + 1:07'26"
4.Documents Pass Stopper 22'34" + 52'29"
5.Words Sort by AWK 22'44" + 48'27"
6.Words Count and Indexing
–Two-Character Subdirectory Separation 5"
–Word Files Indexing 12:41'00" + break
Generating the index took too much time, which seemed to be caused by the number of files in a single directory. So I set up 26*26 subdirectories based on the first two characters of each word and spread the index files across them. It still took too long, and this trial was stopped after finishing FBIS 3, at almost 13 hours.
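The 26*26 subdirectory layout can be sketched as follows (directory names and the routing step are assumptions about how it was scripted):

```shell
#!/bin/sh
# Spread index files over 26*26 subdirectories named by the first two
# letters of each word, so no single directory grows too large.
set -e
letters='a b c d e f g h i j k l m n o p q r s t u v w x y z'
mkdir -p index2
for c1 in $letters; do
  for c2 in $letters; do
    mkdir -p "index2/$c1$c2"
  done
done

# Route a word to its subdirectory by its two-letter prefix.
word=retrieval
prefix=$(printf '%s' "$word" | cut -c1-2)
echo doc7 >> "index2/$prefix/$word"
ls "index2/re"
```

Each directory now holds roughly 1/676 of the word files, which shortens the per-write directory search.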

Third Indexing Trial
1.FBIS Source Files
2.Document Separation 20'15" + 1:09'38"
3.Documents Pass Stemmer 29'25" + 55'42"
4.Documents Pass Stopper and Sort 34'17" + 1:05'48"
5.Words Count and Indexing
–Subdirectory Separation 6"
–Word Files Indexing (stopped after 11 hours)
Before I could solve the time cost of the indexing step itself, the earlier steps also cost a lot of time. I tried to combine them with a pipeline, which only worked when the system sort command replaced the AWK sort. Piping the stopper output directly into sort saved at least an hour, but the total time was still far from acceptable.
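The pipelined version of the early steps might look like this (the stemmer and stopper are again hypothetical stand-ins; only the final `sort` is the real system command the trial relied on):

```shell
#!/bin/sh
# Chain the stages in one pipe so no intermediate file is written;
# the system sort command terminates the pipeline.
set -e
printf 'Indexing indexes the index\n' |
  tr ' ' '\n'    |   # document separation
  tr 'A-Z' 'a-z' |   # stand-in "stemmer"
  grep -v -w the |   # stand-in "stopper"
  sort           > sorted_words.txt
cat sorted_words.txt
```

Because the stages run concurrently and nothing but the final result touches disk, the stopper-into-sort step alone saved at least an hour in the trial.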

Fourth Indexing Trial
1.FBIS Source Files 33'51" + 1:00'38"
–Document Separation
–Documents Pass Stemmer
–Documents Pass Stopper and Sort
2.Words Count and Indexing
–Subdirectory Separation 2"
–Word Files Indexing 13:14'23" + 14:15'12"
I finally found that most of the time was spent searching for the location of the next write, a space-allocation characteristic of Linux file systems. So I combined the former steps into a single run from each source file to its sorted word list, removing every intermediate file as soon as the next step had consumed it. The time cost dropped remarkably, to about one third of the previous trial, and indexing finished for the first time, after 29 hours.
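The per-source-file run with immediate cleanup can be sketched like this (the loop structure and file names are assumptions; the point is that no intermediate file outlives the step that consumes it):

```shell
#!/bin/sh
# Process one source file end to end, removing each intermediate as
# soon as the next stage has read it, so the directory never fills up.
set -e
mkdir -p src out
printf 'alpha beta alpha\n' > src/fbis_sample.txt

for f in src/*.txt; do
  base=$(basename "$f" .txt)
  tr ' ' '\n' < "$f" > "out/$base.sep"        # document separation
  sort "out/$base.sep" > "out/$base.sorted"
  rm -f "out/$base.sep"                       # drop the middle file at once
  uniq -c "out/$base.sorted" > "out/$base.tf" # term frequency count
  rm -f "out/$base.sorted"
done
cat out/fbis_sample.tf
```

Keeping the working directory nearly empty is what avoids the growing write-location search described above.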

Fifth Indexing Trial
1.For Each FBIS Source File 1:10'26" + 1:19'29"
–Document Separation
–Documents Pass Stemmer
–Documents Pass Stopper and Sort
–Words Count and Database Indexing
Indexing still took too long, and I really wanted a way to cut the time cost further. A file-format database looked like a solution, so I adopted sqlite and wrote all the index lines as table rows into a single database file. The total time immediately dropped to about two and a half hours.
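Loading the index lines into sqlite might look like this (the table name, columns, and file names are assumptions; the report does not show the actual schema):

```shell
#!/bin/sh
# Write every index line as a row in one sqlite database file instead
# of thousands of small per-word index files.
set -e
# term <TAB> doc <TAB> tf, as produced by the counting step
printf 'index\tdoc1\t3\nretrieval\tdoc1\t2\n' > rows.tsv

sqlite3 index.db <<'EOF'
CREATE TABLE IF NOT EXISTS inverted(term TEXT, doc TEXT, tf INTEGER);
.mode tabs
.import rows.tsv inverted
EOF

# A per-term lookup then becomes a single SELECT.
sqlite3 index.db "SELECT doc, tf FROM inverted WHERE term = 'index';"
```

All writes now go into one file that sqlite manages internally, which is why the directory-search cost of the earlier trials disappears.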

Indexing - Level Analysis
1.For Each FBIS Source File 1:08'53" + 1:16'39" (separate) vs. 2:22'57" (combined)
document count: same; file size: same
–Document Separation
–Documents Pass Stemmer
–Documents Pass Stopper and Sort
–Words Count and Database Indexing
Since the whole indexing could now be done in about 2.5 hours, I tried to measure the influence of collection level. I indexed FBIS 3 and then FBIS 4 separately, then combined them into one set and indexed again. The time costs were nearly the same, and the document counts and file sizes were equal. This is not at all surprising, because the procedure adds no outside information.

Sixth Indexing Trial
1.For Each FBIS Source File 35'49" + 39'47" / 33'04" + 35'43"
–Document Separation
–Documents Pass Stemmer
–Documents Pass Stopper and Sort
–Words Count and Write to a Single Index File
Revisiting the fourth and fifth trials, I figured the remaining problem might be the number of index files, so I tried writing all the index lines into a single file. Two variants were tried:
–Write each word's line right after counting its term frequency.
–Append all of a document's lines after computing all its frequencies.
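The second variant, appending all of a document's lines at once, can be sketched as follows (the line format `term doc tf` is an assumption):

```shell
#!/bin/sh
# Compute all term frequencies of a document first, then append them to
# one shared index file in a single pass.
set -e
printf 'b a b c a b\n' | tr ' ' '\n' | sort | uniq -c |
  awk '{ print $2, "docA", $1 }' >> all.idx   # term doc tf
cat all.idx
```

One append per document means far fewer file opens than one write per word, which is the difference between the two sub-variants timed above.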

Seventh Indexing Trial
1.For Each FBIS Source File 44'38" + 50'32" (file number 646 → 655)
–Document Separation
–Documents Pass Stemmer
–Documents Pass Stopper and Sort
–Words Count and Write into 26*26 Index Files
Considering querying as well as indexing, a single index file is just too large and would take a long time to search for the wanted terms. So I modified the final step to write the index lines into different files based on the first two characters of each word.

Indexing Time (FBIS 3 / FBIS 4 / total)
–trial 1: 18'51" + 33'52" + 33'23" + 44'07" + ? >> 2:10'13" / 55'13" + 1:00'58" + 1:09'29" + 1:19'09" + ? >> 4:24'49" / >> 6:35'02"
–trial 2: 23'29" + 30'05" + 22'34" + 22'44" + 5" + 12:41'00" = 14:19'57" / 58'36" + 1:07'26" + 52'29" + 48'27" + ? >> 3:46'58" / >> 18:06'55"
–trial 3: 20'15" + 29'25" + 34'17" + 6" + ? >> 1:24'03" / 1:09'38" + 55'42" + 1:05'48" + ? >> 3:11'08" / >> 4:35'11"
–trial 4: 33'51" + 13:14'23" = 13:48'14" / 1:00'38" + 14:15'12" = 15:15'50" / 29:04'04"
–trial 5: 1:10'26" / 1:19'29" / 2:29'55"
–trial 6-1: 35'49" / 39'47" / 1:15'36"
–trial 6-2: 33'04" / 35'43" / 1:08'47"
–trial 7: 44'38" / 50'32" / 1:35'10"

First Topic Query
1.Extract Topics from Source Files and Pass Stemmer and Stopper 1"
2.Select Data for Each Keyword from the Index Database or Index File
3.Weight Computing
4.Ranking and Filtering
5.Evaluation
Five query topics, 15 keywords in total. Total query time:
–index database: 13'38" → 31'27"
–single index file: 9'00" → 18'39"
–separated index files: 2'04"
This does not seem efficient enough. If several terms were examined together, more time should be saved.

Second Topic Query
1.Extract Topics from Source Files and Pass Stemmer and Stopper
2.Generate One Query String per Topic
3.Select Data from the Index Database or Index File
4.Weight Computing
5.Ranking and Filtering
6.Evaluation
Total query time:
–index database: 2'30" → 5'19"
–single index file: 2'26" → 4'55"
–separated index files: not much progress expected, since each queried file must still be checked separately. But as the number of query terms grows, separated index files would save much more search time.
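The change from the first scheme is one SELECT per topic instead of one per keyword; against the sqlite index it might look like this (the table layout and sample rows are assumptions):

```shell
#!/bin/sh
# Batch all of a topic's keywords into a single query string.
set -e
sqlite3 topic.db <<'EOF'
CREATE TABLE IF NOT EXISTS inverted(term TEXT, doc TEXT, tf INTEGER);
INSERT INTO inverted VALUES
  ('information','FBIS3-1',4), ('retrieval','FBIS3-1',2),
  ('retrieval','FBIS3-9',5), ('extraction','FBIS4-2',1);
EOF

# One query string for the whole topic, instead of one SELECT per term.
sqlite3 topic.db \
  "SELECT term, doc, tf FROM inverted
   WHERE term IN ('information','retrieval','extraction')
   ORDER BY term, doc;"
```

Scanning the table once for all terms is what cut the database query time from 13'38" to 2'30" in the measurements above.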

Updated Topic Query
1.Extract Topics from Source Files and Pass Stemmer and Stopper
2.Generate Query Strings Based on the Frequency of Each Term
3.Select Data from the Index Database or Index File
4.Weight Computing
5.Ranking and Filtering
6.Evaluation
Some terms in the topics returned far too many documents and seemed not to work at all. I checked the document frequency of each term and removed the high-frequency (> 10%) terms. This did not help; more related terms are needed for better precision.

Frequency Term Query
1.Select Terms from the Descriptions, Narratives and Web Queries for Each Topic
2.Order the Terms by Document Frequency
3.Decide the Number of Terms to Use and Generate the Query Strings
4.The Following Steps Are the Same as Before
Term counts from five to 100 were tried. Precision increases only for the first few added terms, while query time rises proportionally with the number of terms. High-frequency terms were removed with thresholds of 10% and 20%; the stricter limit (10%) seems to help.
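The document-frequency threshold can be sketched with a small awk filter (the file format, collection size, and sample counts are made up for illustration):

```shell
#!/bin/sh
# Given "term df" pairs and the collection size, keep only terms that
# appear in at most 10% of the documents.
set -e
total_docs=1000
printf 'the 950\nretrieval 42\nindex 97\nquery 120\n' > df.txt

awk -v N="$total_docs" '$2 <= N * 0.10 { print $1 }' df.txt > kept.txt
cat kept.txt
```

With a 20% threshold the filter would also keep 'query' (df 120); tightening it to 10% drops such mid-frequency terms, which is the comparison made above.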

Query: Topic

Query: Updated Topic

Query: Terms

Query Time
(table comparing per-term and per-topic query times against the index database and the index files for FBIS; the numeric values did not survive transcription)

Conclusion
Examining the index files and term frequencies I generated, I found many terms that seem useless. Some are meaningless, like "aaaf", and some are misspellings, like "internacion". Some terms have a frequency count of less than three. If these terms were removed, querying would be even faster, I suppose. I could have spent more time sorting and indexing the inverted file, but when I tried part of this, the time cost made me question whether it was worthwhile. Perhaps a cache of recent queries would be better than a full sort process. This ends my project report.