A comparison of two web-based document management systems. Shaoxin Yu, Columbia University, March 31, 2009.

1 A comparison of two web-based document management systems. Shaoxin Yu, Columbia University, March 31, 2009

2 Index
I. Description of the problem
II. Google Scholar
III. CiteSeer
IV. Comparison of Google Scholar and CiteSeer

3 Description of the problem
As the quantity of online text grows rapidly, automatic text summarization plays an increasingly important role in the information industry. Online resources often contain similar content but exist separately, so it is worthwhile to find efficient ways to manage this information.

4 Description of the problem
Background of multi-document summarization techniques. Four types:
1. Free-style summarization
2. Sentence-extraction summarization
3. Axis (type of main topic)
4. Table-style summary

5 Description of the problem
How to summarize documents about the same topic manually?
1. Use a marker to mark the important phrases or sentences.
2. Identify the main topics in the marked sentences, or make a list to get an overview of the documents.
3. Connect these main topics.

6 Google Scholar
1. Released in November 2004
2. Search engine for scholarly literature
3. Covers a wide range of subject areas

7 Google Scholar
Unlike Google, Google Scholar does not search all publicly available web pages. It gets its records from three sources:
1. A proprietary algorithm identifies web documents that “look scholarly”: full-text documents and citations with abstracts.
2. Content is provided by its partners: journal publishers, scholarly societies, database vendors, and academic institutions.
3. Citations are extracted from the reference lists of documents found through the first two methods.

8 Google Scholar
Google File System architecture

9 Google Scholar
1. Chunk: a fixed-size fragment of a file, 64 MB in size, a value chosen by statistical optimization.
2. Metadata (stored in the master):
   a. file and chunk namespaces
   b. mapping from files to chunks
   c. locations of each chunk’s replicas
3. Master: a single process, running on one machine, that stores all metadata.
4. Communication between the master and the chunk servers: if a chunk is corrupted, the master also instructs the chunk servers to delete existing chunks and create new ones.
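The three kinds of master metadata listed above can be illustrated with a small sketch. This is a toy model, not Google's implementation: the class name `MasterMetadata`, the handle format `path#index`, and the server names are all hypothetical, and only the 64 MB chunk size comes from the slide.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size named on the slide


@dataclass
class MasterMetadata:
    """Toy in-memory model of the metadata the master keeps (hypothetical)."""
    # (a) namespace and (b) file -> ordered list of chunk handles
    file_to_chunks: dict = field(default_factory=dict)
    # (c) chunk handle -> chunk servers holding a replica
    chunk_locations: dict = field(default_factory=dict)

    def add_file(self, path, size_bytes, servers):
        """Register a file and split it into 64 MB chunks."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        handles = [f"{path}#{i}" for i in range(n_chunks)]
        self.file_to_chunks[path] = handles
        for handle in handles:
            self.chunk_locations[handle] = list(servers)
        return handles

    def locate(self, path, offset):
        """Return (chunk handle, replica servers) for a byte offset."""
        handle = self.file_to_chunks[path][offset // CHUNK_SIZE]
        return handle, self.chunk_locations[handle]


meta = MasterMetadata()
meta.add_file("/papers/a.pdf", 150 * 1024 * 1024, ["cs1", "cs2", "cs3"])
handle, replicas = meta.locate("/papers/a.pdf", 70 * 1024 * 1024)
print(handle, replicas)
```

A client would ask the master only for the handle and replica list, then read the bytes from a chunk server directly; that split is why the master can remain a single process.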

10 CiteSeer
1. Public search engine for academic papers
2. Created by Steve Lawrence, Kurt Bollacker and Lee Giles
3. Developed at NEC Research Institute, Princeton, New Jersey, USA
4. Hosted by Pennsylvania State University
5. Over 700,000 documents, primarily in computer science and engineering

11 CiteSeer
CiteSeer features:
1. Autonomous citation indexing system
2. Indexes academic literature in PostScript or PDF files
3. Literature retrieval by following citation links
4. Evaluation and ranking of papers, authors and journals
5. Creates up-to-date databases that are not limited to preselected journals or restricted by journal publication delays
6. Autonomous operation with a corresponding reduction in cost
7. Powerful interactive browsing of the literature using the context of citations
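Citation indexing and retrieval by citation links can be pictured as inverting each paper's reference list. The sketch below is only an illustration under simple assumptions: the function names and the paper identifiers are hypothetical, references are assumed to be already parsed and matched to papers, and ranking by raw citation count stands in for whatever evaluation CiteSeer actually applies.

```python
from collections import defaultdict


def build_citation_index(references):
    """Invert {paper: [papers it cites]} into {paper: set of papers citing it}."""
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by


def rank_by_citations(cited_by):
    """Order papers by how many papers cite them, most cited first."""
    return sorted(cited_by, key=lambda p: (-len(cited_by[p]), p))


# Hypothetical reference lists for three papers.
references = {"A": ["C"], "B": ["C", "D"], "C": []}
index = build_citation_index(references)
print(sorted(index["C"]))        # papers that cite C
print(rank_by_citations(index))  # most cited first
```

Following a citation link forward is a lookup in `references`; following it backward ("who cites this paper?") is a lookup in the inverted index.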

12 CiteSeer
Methods CiteSeer uses for computing similarity:
1. Word vectors: use only the top 20 components, since the truncation may not have a large effect on the distance measures.
2. String distance: use the “LikeIt” string distance to measure edit distance.
3. Citations: use common citations to find the research papers most closely related to a document.
4. Combination of methods: CiteSeer combines the document similarity methods above.
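The four methods above can be sketched in a few lines. This is not CiteSeer's code, and several stand-ins are assumptions: raw term frequency replaces whatever word weighting CiteSeer uses, plain Levenshtein distance replaces the “LikeIt” distance, Jaccard overlap replaces its common-citation measure, string distance is applied to titles, and the combination weights are arbitrary. Only the top-20 truncation comes from the slide.

```python
import math
from collections import Counter


def top_terms(text, k=20):
    """Term-frequency vector truncated to the k most frequent words."""
    return dict(Counter(text.lower().split()).most_common(k))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def edit_distance(a, b):
    """Levenshtein distance (stand-in for the LikeIt distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def citation_overlap(refs_a, refs_b):
    """Jaccard overlap of two reference lists (stand-in for common citations)."""
    a, b = set(refs_a), set(refs_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_similarity(doc_a, doc_b, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three similarities; the weights are hypothetical."""
    word = cosine(top_terms(doc_a["text"]), top_terms(doc_b["text"]))
    longest = max(len(doc_a["title"]), len(doc_b["title"])) or 1
    string = 1 - edit_distance(doc_a["title"], doc_b["title"]) / longest
    cites = citation_overlap(doc_a["refs"], doc_b["refs"])
    return weights[0] * word + weights[1] * string + weights[2] * cites


# Hypothetical documents.
d1 = {"title": "Digital libraries", "text": "citation indexing of papers", "refs": ["x", "y"]}
d2 = {"title": "Digital library", "text": "indexing citation links", "refs": ["y", "z"]}
print(combined_similarity(d1, d2))
```

Each measure is normalized to the range 0 to 1 before combining, so the weights express how much each signal is trusted rather than compensating for different scales.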

13 Comparison of Google Scholar & CiteSeer
Different positioning
The core purpose of CiteSeer is to let users find complete academic papers with complete citations, free of hefty fees.
Google Scholar is Google’s product for offering a complete solution for search and other academic needs; its strategy focuses on completeness, so that it can serve as a final solution.

14 Comparison of Google Scholar & CiteSeer
Coverage and performance
Google Scholar uses only the first 100-120 KB of a document’s text for searching, and the linked full texts often require payment.
With CiteSeer, an informative paper can be traced within CiteSeer itself, and the citation links among papers are a great help in academic work.

15 Comparison of Google Scholar & CiteSeer
Clicking any of the informative links leads to a single linked page.

16 Comparison of Google Scholar & CiteSeer
Results are provided only through topic extraction.

17 Comparison of Google Scholar & CiteSeer
As for staleness, Google Scholar seems to lose out to CiteSeer. This effect was more obvious in Google Scholar’s early days. Nowadays, for the majority of uses, staleness is no longer a big problem for either of them.
