ELISQ Systems Demonstration
Sagnik Ray Choudhury
Doha, May 2015
SeerQ: SeerSuite for Qatar
SeerSuite: A digital library management system developed at Penn State
Key features:
- Crawls the web to gather scholarly documents
- Extracts metadata from PDFs (title, author name, citation) using machine learning
- Stores extracted metadata in a database and allows metadata and full-text search
Differences from Google Scholar:
- Stores the metadata and exposes it through OAI-PMH
- Stores the citation graph, which can be used later to measure scholarly impact
- Collects and stores the PDFs, which can be used later for advanced processing such as table/figure extraction and understanding the semantics
SeerQ: The instance of SeerSuite running at Qatar University, crawling scholarly content from the Qatari Web
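As an illustration of how a stored citation graph can feed an impact measure, here is a minimal sketch that counts incoming citations per paper. The paper IDs and the edge list are hypothetical, and raw citation count is only one possible impact measure, not necessarily the one SeerSuite uses.

```python
from collections import Counter

# Hypothetical citation edges: (citing_paper_id, cited_paper_id).
citations = [
    ("p1", "p2"),
    ("p1", "p3"),
    ("p2", "p3"),
    ("p4", "p3"),
]


def citation_counts(edges):
    """Count incoming citations per paper (in-degree in the citation graph)."""
    return Counter(cited for _citing, cited in edges)


counts = citation_counts(citations)
# Papers ordered from most cited to least cited.
ranking = counts.most_common()
print(ranking)
```

Papers that are never cited simply do not appear in the counter, so `counts["p1"]` evaluates to 0.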
SeerQ: Search Results
SeerQ: Details from Search Results
SeerQ: Components and Statistics
System running at (available from within Qatar University, from outside use VPN).
Components:
- Heritrix 3 and OAI based crawler (PSU uses Heritrix 1.2)
- Solr 3.6 (PSU just moved from Solr 1.2)
- MySQL and front end (same as PSU)
Document collections:
- Documents crawled from QScience
- Documents crawled from the Web: seedlist provided by QNL
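The Solr index listed among the components is queried over HTTP through its `/select` handler. The sketch below only builds such a query URL. The base URL and the `title` field are assumptions for illustration, since the actual SeerQ endpoint and schema are not given in these slides.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, rows=10):
    """Build a Solr /select URL that asks for JSON results."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local Solr instance and hypothetical field name "title".
url = build_solr_query_url("http://localhost:8983/solr", "title:qatar")
print(url)
```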
Some Statistics from SeerQ
Total documents in the repository (as of May 2015): 3900
Documents from QScience: 2000
Main sources: QScience, RAND, Doha Institute, Doha Film Institute
What can we do with the system:
- Scholarly analysis: How many authors are from Qatar/Doha/Qatar University?
- Citation analysis: QScience papers have an inter-journal citation rate of only 0.15%.
- Use the stored PDFs to extract valuable information (Research: PSU RA).
- Expose the metadata through OAI-PMH.
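The slides do not define how the inter-journal citation rate is computed. The sketch below assumes one plausible definition: the share of all citations made by papers in the collection that point to a paper in a different journal of the same collection. The paper IDs, journal names, and the definition itself are assumptions.

```python
def inter_journal_citation_rate(citations, journal_of):
    """Fraction of citations that land in a different journal of the collection.

    citations:  list of (citing_id, cited_id) pairs
    journal_of: maps paper id -> journal, only for papers inside the collection
    """
    if not citations:
        return 0.0
    inter = 0
    for citing, cited in citations:
        source = journal_of.get(citing)
        target = journal_of.get(cited)
        # Count only citations that stay inside the collection but cross journals.
        if source is not None and target is not None and source != target:
            inter += 1
    return inter / len(citations)


# Hypothetical data: "x" and "y" are papers outside the collection.
journal_of = {"a": "J1", "b": "J2", "c": "J1"}
citations = [("a", "b"), ("a", "c"), ("b", "x"), ("c", "y")]
rate = inter_journal_citation_rate(citations, journal_of)
print(f"{rate:.2%}")
```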
SeerQ: Exposing Extracted Metadata through OAI-PMH
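An OAI-PMH repository answers a `ListRecords` request with XML records, typically in Dublin Core (`oai_dc`). The sketch below builds such a request URL and parses a response. The base URL and the sample response are hypothetical stand-ins, not actual SeerQ output, and resumption tokens are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url):
    """Build a ListRecords request for Dublin Core metadata."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})


def parse_records(xml_text):
    """Extract identifier, title, and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


# Hypothetical response, hand-written for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical paper</dc:title>
          <dc:creator>Author One</dc:creator>
          <dc:creator>Author Two</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

request_url = list_records_url("http://example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
print(request_url)
print(records)
```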
A searchable database for handwritten documents (both in English and Arabic)
Motivation:
- Retrieve handwritten documents matching the search term
- Compare the difference in handwriting for Arabic words (recognize the writer)
- Demonstrate handling of images + text (in both languages)
Arabic handwriting project interface: Arabic/English Bilingual Handwriting Database
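One way to picture the two motivations together is a search that returns matching images grouped by writer, so the same word can be compared across hands. This is only a sketch: the record fields, file paths, and writer IDs are hypothetical, and the actual database schema is not described in these slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HandwritingSample:
    image_path: str      # scanned image of the handwritten text
    writer_id: str       # who wrote it
    language: str        # "ar" or "en"
    transcription: str   # typed text of what the image shows


def search_by_word(samples, term):
    """Return image paths whose transcription contains the term, grouped by writer."""
    by_writer = defaultdict(list)
    for sample in samples:
        if term in sample.transcription.split():
            by_writer[sample.writer_id].append(sample.image_path)
    return dict(by_writer)


# Hypothetical samples: two writers wrote the same Arabic word.
samples = [
    HandwritingSample("img/001.png", "w1", "ar", "دولة قطر"),
    HandwritingSample("img/002.png", "w2", "ar", "قطر"),
    HandwritingSample("img/003.png", "w1", "en", "Qatar University"),
]
matches = search_by_word(samples, "قطر")
print(matches)
```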
Handwriting Project: Search Results
Handwriting Project: Image with Metadata
Fusion: A Search Eco System
Fusion is a free search eco-system developed by LucidWorks.
Includes crawler, Solr for indexing, tools for query log analysis and error reporting
Advantages over simple Solr:
- Enhanced Admin UI
- Security
- Data Enrichment
- Machine Learning
- Advanced Relevancy Tuning
- Reporting
- Admin
- Signal Processing
- Recommendations
- API (Configuration, History, Node, System, Usage)
- Connector Framework
Using Fusion to collect Qatari Digital Content
Around 2 million English & Arabic documents related to Qatar have been crawled and are accessible using Fusion.
Specific collections:
- Qatari Newspapers: >1 million documents from Al-Raya, Gulf-Times, Qatar-tribune
- Sports: QA domain sports sites, 5000 documents
- Government: government websites in Qatar, documents
- Arabic News Articles Templates Summary: 120,000 newspaper articles along with their summaries, generated automatically (Research from VT RA)
- Qatar University
Fusion can help in providing a data curation service: users request a collection, the curator creates it and exposes the curated content to the user through an interface.
archive-it provides some similar functionality, on a broader scope.
Fusion: for Curators
Fusion: Creating a New Collection
Fusion: How to Combine Multiple Datasources
Fusion: How to Combine Multiple Datasources: 2
Fusion: Two Step Web Crawling: Step 1
Fusion: Two Step Web Crawling: Step 2
Search Interface for Fusion: End User
Designed by the ELISQ team for demonstrations.
Search Result on Newspaper Summary Collection