Data Mining at Duke (“What to do with all of those hard drives”)
Molly Tamarkin, Associate University Librarian for Information Technology Services
Joel Herndon, Head, Data & GIS Services

Today’s Talk
– Rise of text analysis questions
– Challenges in providing text analysis services
– Duke University Libraries’ response

Brandaleone Center for Data and GIS Services

The Rise of Text as Data

New Questions for Research Libraries
– How has the North American press covered environmental issues over the last 20 years?
– Can we analyze all (17,000) journal articles on German studies in the 20th century?
– What might tweets reveal about the Arab Spring in social media?

Challenges in Providing Text Analysis Services

Challenges
– Collections
– Licensing
– Infrastructure
– Service model

Open (or mostly open) Access

Licensing

“We found some … text mining in fields such as biomedical sciences and chemistry and some early adoption within the social sciences and humanities… however… most text mining in UKFHE is based on Open Access documents or bespoke arrangements.” – key findings (p.2)

Licensing

Photo from editorsweblog.org

ECCO Project

“Big Data”
– ~63 drives
– ~63 terabytes
– >40 topics

Gale Backup Drive Collection

Infrastructure

Six Methods of Text Analysis
– Reading
– Counting Words
– Human Coding (researchers coding events/texts)
– Dictionary Methods (sentiment analysis)
– Supervised Machine Learning (using corpora)
– Unsupervised Machine Learning (topic modeling)
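
As a rough illustration of two of the methods listed above, counting words and a dictionary method, the following Python sketch uses only the standard library. The sample sentence, the tiny positive and negative word lists, and the function names are all assumptions made for illustration; a real project would likely use a published sentiment lexicon and a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; real dictionary methods
# rely on much larger, validated word lists.
POSITIVE = {"good", "great", "hopeful", "clean"}
NEGATIVE = {"bad", "crisis", "polluted", "fear"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text):
    """Counting words: return a Counter mapping each token to its frequency."""
    return Counter(tokenize(text))


def dictionary_score(text):
    """Dictionary method: (positive hits - negative hits) / total tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)


# Made-up sample text, not drawn from any of the collections in the talk.
sample = ("The river was polluted, but the cleanup left residents hopeful. "
          "Hopeful and clean.")
counts = count_words(sample)
score = dictionary_score(sample)
print(counts.most_common(3))
print(round(score, 3))
```

The score is normalized by document length so that long and short texts can be compared; other normalizations are equally defensible.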

Infrastructure Issues
– Storage/scratch space
– Processing power
– Tools for analytics

Our Workstations
– 16 GB of memory
– 1 TB of storage
– 64-bit computing
– Intel Xeon 3.5 GHz, 4-core
– Scanner available
– Fast networking
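
When a corpus on disk is larger than a workstation's memory, one common approach is to stream it file by file rather than load it all at once. The Python sketch below shows the idea under stated assumptions: the corpus is a directory of plain-text `.txt` files, and the function names and the temporary demo directory are hypothetical, not a description of the actual setup used at Duke.

```python
import os
import re
import tempfile
from collections import Counter


def iter_text_files(root):
    """Yield paths of .txt files under root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(".txt"):
                yield os.path.join(dirpath, name)


def streaming_word_count(root):
    """Count words line by line so only one line of text is held at a time."""
    totals = Counter()
    for path in iter_text_files(root):
        # errors="replace" keeps the scan going past badly encoded bytes
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                totals.update(re.findall(r"[a-z']+", line.lower()))
    return totals


# Self-contained demo: a temporary directory stands in for a mounted drive.
with tempfile.TemporaryDirectory() as corpus_dir:
    for i, text in enumerate(["Data mining at scale.\n", "Mining text data.\n"]):
        path = os.path.join(corpus_dir, f"doc{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    totals = streaming_word_count(corpus_dir)

print(totals.most_common(2))
```

Note that the `Counter` itself still grows with the size of the vocabulary, so this bounds the memory used for raw text, not for the results.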

Swappable Drives?

General Software

Specialized Software

Service Model

Services - Staffing

Expert on Visualization

Services - Staffing

Services – Guides

Services – Workshops

In Summary
– Lots of research potential
– Licensing may be an issue for some
– An easy way to get started with text mining with little investment, but maybe some risk?

Questions?
Joel Herndon
Molly Tamarkin