HathiTrust Research Center Architecture User-facing services.


1 HathiTrust Research Center Architecture User-facing services

2 What is the HathiTrust Research Center?
Enables computational access for nonprofit and educational users to published works stored within HathiTrust
HathiTrust is an extensive collaborative digital library of more than 10 million volumes and 3.5 billion pages of archived material
Helps researchers meet the technical challenges of dealing with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure

3 End-to-End Context
Goals:
– User should be able to authenticate via the web portal
– User selects an algorithm to execute, a collection to run it against, and the argument(s) to the algorithm
– User should see the status of the algorithms that have recently run under their account
– User should be able to view the results of algorithm runs

4 Architecture diagram (component labels): Agent framework; Page/volume tree (file system); Authoritative volume store (Cassandra); SEASR analytics service; Web portal; Desktop SEASR client; Task deployment; WSO2 registry (services, collections, data capsule images); Solr indexes; HathiTrust corpus; rsync; WSO2 Enterprise Service Bus; FutureGrid; NCSA local resources; Penguin on Demand; Replicated volume stores; Programmatic access (e.g., Bamboo); CILogon (NCSA); Access control (e.g., Grouper); University of Michigan; Meandre Orchestration; Agent instance; Non-consumptive data capsules; NCSA HPC resources

7 HTRC Portal: About Lift
Implemented using Lift, a web application framework for Scala
Lift is cited as being resistant to common vulnerabilities such as XSS, XSRF, and injection
Scalable to high traffic levels
Interactive by way of Comet and Ajax support
Easy Java library integration

8 HTRC Portal: Authentication
The portal uses CILogon for authentication
CILogon provides identity verification for a large number of US academic institutions

10 About SEASR
SEASR is a research and development environment used for leading-edge humanities research
Provides workflow capabilities that allow users to produce tag clouds, readability analyses, examinations of n-gram distributions, and more
Examples shown: tag cloud; readability analysis; extracting location entities for map display

12 HTRC agent
Agent: accesses and uses resources on behalf of the user
Diagram labels: Portal; Solr index; Cassandra NoSQL; WSO2 Governance Registry; Firewall; Computation resources

13 Background
Agent code is written in Scala, an object-functional JVM language
Akka is a feature-rich library for designing cloud applications using actors
– Heavily influenced by Erlang’s approach to distributed systems
– Actor: a lightweight process that communicates only through message passing
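
The actor idea on this slide can be illustrated with a minimal sketch. It is written in Python with only the standard library, not in Scala with Akka, and it is not the HTRC agent code. The names `Actor`, `Counter`, `tell`, and `demo` are assumptions chosen for illustration. Each actor here owns a thread, which is far heavier than an Akka actor; the point is only that state stays private and all interaction goes through a mailbox.

```python
import queue
import threading


class Actor:
    """A tiny actor: private state, a mailbox, and one thread draining it."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Send a message asynchronously; the only way to reach the actor."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish its queued messages, then wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, in arrival order.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Counts 'increment' messages; answers ('get', reply_queue) messages."""

    def __init__(self):
        self.count = 0  # only ever modified by the actor's own thread
        super().__init__()

    def receive(self, message):
        if message == "increment":
            self.count += 1
        elif isinstance(message, tuple) and message[0] == "get":
            message[1].put(self.count)  # reply by message, not by shared state


def demo():
    counter = Counter()
    for _ in range(3):
        counter.tell("increment")
    reply = queue.Queue()
    counter.tell(("get", reply))
    value = reply.get(timeout=5)
    counter.stop()
    return value
```

Because the mailbox is first-in, first-out, the `get` message is handled after the three increments, so `demo()` returns 3 without any locks in the caller.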

14 What about executing an algorithm?
REST layer: run algorithm X (provide algorithm and arguments; report execution status)
AgentActor: ask the registry for the algorithm “executable”; spawn a ComputeChild
ComputeChild: manages a computation
Other diagram labels: Computation; Executable JAR; Web service calls; Result; Solr, Cassandra; Registry; Launches and monitors
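
The flow on this slide can be sketched as follows. This is a simplified, synchronous illustration in Python, not the Scala/Akka agent itself: the registry is modeled as a plain dictionary, the "executable" as a Python callable rather than a JAR, and `Agent`, `run_algorithm`, `statuses`, and `word_count` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeChild:
    """Manages one computation and records its execution status."""
    algorithm: str
    executable: object   # stand-in for the executable fetched from the registry
    arguments: dict
    status: str = "created"
    result: object = None

    def run(self):
        self.status = "running"
        try:
            self.result = self.executable(**self.arguments)
            self.status = "finished"
        except Exception as exc:
            # Record the failure so it can be reported instead of crashing the agent.
            self.result = str(exc)
            self.status = "failed"


@dataclass
class Agent:
    """Looks up algorithms in a registry and spawns a ComputeChild per run."""
    registry: dict
    children: list = field(default_factory=list)

    def run_algorithm(self, name, arguments):
        executable = self.registry[name]            # ask the registry for the executable
        child = ComputeChild(name, executable, arguments)
        self.children.append(child)                 # spawn and keep track of the child
        child.run()
        return child

    def statuses(self):
        """Status of every run started through this agent, oldest first."""
        return [(child.algorithm, child.status) for child in self.children]


def word_count(text):
    """Hypothetical algorithm used only to exercise the sketch."""
    return len(text.split())


agent = Agent(registry={"word_count": word_count})
ok = agent.run_algorithm("word_count", {"text": "to be or not to be"})
bad = agent.run_algorithm("word_count", {"wrong_argument": 1})
```

Keeping one child object per run gives the portal something to poll for status and results, which matches the goals listed on slide 3.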

16 WSO2 Governance Registry
Monitoring and administration of the service ecosystem
Register algorithms as web services, and algorithms as executables
Easy, programmatic access to stored data
Registration of algorithm run results to a central location
Improves sustainability through the use of third-party, open-source software

17 HTRC Governance Registry
Algo: dynamic web service instances launched for user jobs
– Related to text analysis
Algo: registered executables
– No EPR
– Not instantiated
Special Collections
– List of the volume IDs belonging to each collection
– E.g., Victorian Literature collection, IU collection
Persistent CI services
– Not text analysis algorithms, e.g., Portal, HTRC Agent, Solr, Gov Registry, Cassandra
Derived Results
– Results of algorithm runs
– Intermediate data products
– E.g., Latent Semantic Index result from “Victorian Literature”
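
The kinds of entries listed on this slide can be modeled with a small in-memory sketch. It does not use the WSO2 Governance Registry API; the `Registry` class, its field names, and the placeholder values are assumptions for illustration only. The volume IDs are borrowed from the schema example on slide 18.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    # Algorithm name -> registered executable entry (no EPR, not yet instantiated).
    executables: dict = field(default_factory=dict)
    # Collection name -> list of the volume IDs that belong to it.
    collections: dict = field(default_factory=dict)
    # (algorithm, collection) -> stored result of an algorithm run.
    derived_results: dict = field(default_factory=dict)

    def volumes_in(self, collection):
        """Resolve a collection name to its volume IDs."""
        return list(self.collections[collection])

    def store_result(self, algorithm, collection, result):
        """Register the result of a run in one central place."""
        self.derived_results[(algorithm, collection)] = result


registry = Registry(
    executables={"latent-semantic-index": {"instantiated": False}},
    collections={"Victorian Literature": ["Inu.320001", "Inu.320002"]},
)
registry.store_result(
    "latent-semantic-index", "Victorian Literature", "placeholder result"
)
```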

18 Cassandra Schema
Each row represents a volume
– Row key is the volume ID
– Each row contains many columns
– First column contains metadata attributes about the volume
– Each subsequent column family is a page; the key is the page ID
– Page-specific columns contain the page contents and metadata about the page
Pros
– Works well for all access primitives
– Well-organized metadata, with no repetition
– Volume-level versioning could follow a similar schema, but the version number needs to be concatenated to the volume ID for historical versions
Cons
– Columns under supercolumns cannot be indexed
– Extra metadata are picked up even when only page contents are needed
– Must store historical versions of volumes as deltas; a naïve translation of the above format to historical versioning would have a high cost in space
Example rows (key: volume ID)
Inu.320001
    metadata: copyright = public; page count = 16
    Inu.320001/001: content = What’s up doc?; size = 12; MD5 = 12345f
    Inu.320001/xxx: content = Rabbits; size = 7; MD5 = aabbcc
Inu.320002
    metadata: copyright = In-copyright; page count = 2406
    Inu.320002/001: content = 2b|!2b; size = 6; MD5 = 7effdd
    Inu.320002/xxx: content = A question; size = 10; MD5 = deadbeef
…
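
The row layout on this slide can be sketched with nested dictionaries. This is a plain-Python model, not Cassandra client code: `build_row`, `read_page`, `read_volume`, and `versioned_key` are hypothetical helpers, and the `@` separator in the versioned key is an assumption, since the slide only says that the version number is concatenated to the volume ID.

```python
import hashlib


def page_entry(content):
    """Page-specific columns: the page contents plus metadata about the page."""
    data = content.encode("utf-8")
    return {
        "content": content,
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
    }


def build_row(volume_id, copyright_status, page_texts):
    """One volume row: a metadata entry first, then one entry per page."""
    row = {
        "metadata": {"copyright": copyright_status, "page_count": len(page_texts)}
    }
    for number, text in enumerate(page_texts, start=1):
        row[f"{volume_id}/{number:03d}"] = page_entry(text)
    return row


def versioned_key(volume_id, version):
    """Row key for a historical version (separator chosen for illustration)."""
    return f"{volume_id}@{version}"


def read_page(store, volume_id, page_id):
    """Fetch one page's content; the row key is the volume ID."""
    return store[volume_id][page_id]["content"]


def read_volume(store, volume_id):
    """Fetch all page contents in page order, skipping the metadata entry."""
    row = store[volume_id]
    return [row[key]["content"] for key in sorted(row) if key != "metadata"]


store = {"Inu.320001": build_row("Inu.320001", "public", ["What's up doc?", "Rabbits"])}
```

Note that `read_page` still pulls the page's `size` and `md5` along with the content before discarding them, which mirrors the second drawback listed on the slide.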

