HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES Angela Zoss Data Visualization Coordinator 226 Perkins Library Duke University Libraries,

Presentation transcript:

HIGH-LEVEL TEXT ANALYSIS AND TECHNIQUES Angela Zoss Data Visualization Coordinator 226 Perkins Library Duke University Libraries, Digital Scholarship Text > Data, October 25

DOCUMENTS AS CONTEXT

But first, ANGELA AS CONTEXT

How I learned to love the document. B.A. courses: Linguistics, Communication. M.S. courses: Communication, Human-Computer Interaction. Employment: arXiv.org Administrator. Ph.D. courses: Bibliometrics/Scientometrics, Computer Mediated Discourse Analysis, Latent Structure Analysis, Natural Language Processing.

Now, DOCUMENTS AS CONTEXT

Text analysis from… documents down to words (“low-level”); words up to documents (“high-level”)

Using documents to learn about language (or other social phenomena). Analyzing documents as records/proxies of language, social structures, events, etc. Linguistic studies: morphology, word counts, syntax, etc.; … over time (e.g., Google ngram viewer); language across corpora (e.g., political speeches). Underwood, T. (2012). Where to start with text mining.
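A minimal sketch of the "word counts over time" idea, using only the Python standard library. The toy corpus, the tokenizer, and the function names are assumptions made for illustration; this is not how the Google ngram viewer is implemented.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs invented for illustration.
DOCS = [
    (1900, "He said that he would go. She stayed."),
    (1900, "He and he alone decided."),
    (2000, "She said that she would go. He stayed."),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequency_by_year(docs, word):
    """Return {year: occurrences of `word` / total tokens seen in that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in docs:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        hits[year] += tokens.count(word)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

freq_he = relative_frequency_by_year(DOCS, "he")
freq_she = relative_frequency_by_year(DOCS, "she")
```

Dividing by the total token count per year matters: raw counts would mostly reflect how much text survives from each year rather than how the language changed.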

Using documents to learn about language: Historical culturomics of pronoun frequencies

Using documents to learn about language: Universal properties of mythological networks

Using language to learn about documents. Analyzing documents as artifacts themselves, with their own properties and dynamics. Literary, documentary studies: structural/rhetorical/stylistic analysis; document categorization, classification; detecting clusters of document features (topic modeling). Underwood, T. (2012). Where to start with text mining.

Using language to learn about documents: Literary Empires, Mapping Temporal and Spatial Settings in Swinburne

Using language to learn about documents: Using Word Clouds for Topic Modeling Results

What are documents? For this discussion, digital versions of works of spoken or written language. Examples: books, articles, transcripts, emails, tweets…

Documents as context. Documents have: form(at), style, provenance, entities, intentions.

STUDIES OF DOCUMENTS

Why study documents? Describe a corpus; compare/organize documents; locate relevant information/filter out irrelevant information.

Describing a corpus: finding regularities/differences across groups of documents; developing theories of structure, style, etc. that can then be tested or applied. May be manual (content analysis) or computer-assisted (statistical).
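As a rough illustration of the computer-assisted (statistical) route, the sketch below computes a few descriptive measures for one document. The measures chosen (token count, type count, type-token ratio, mean sentence length) and the naive sentence splitter are assumptions made for this example, not measures named in the talk.

```python
import re

def describe(text):
    """Return simple descriptive statistics for one document."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Naive sentence split on ., ! and ?; real corpora need a better segmenter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }

stats = describe("The cat sat. The cat ran away!")
```

Running `describe` over every document in two groups and comparing the resulting numbers is one simple way to look for regularities or differences across groups.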

Example: Storylines

Differences of format, genre, participants… Articles may have sections, but these will vary by discipline and type of article. Books may be fiction or non-fiction (or both). Transcripts may refer to multiple speakers, non-text content. …ad infinitum

Example: Literature Fingerprinting. Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007.

Organizing documents: detect similarity between documents and a known category (or simply among themselves). Supports browsing, sentiment analysis, authorship detection.

Example: Bohemian Bookshelf. Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.

Similarity based on… common document attributes (authorship, genre); common language patterns (topics, phrases); common entity references (characters, citations).
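Two of these notions of similarity can be sketched with the standard library: cosine similarity over word counts as a stand-in for "common language patterns", and Jaccard overlap over sets as a stand-in for "common entity references". Both measures are choices made for this illustration; the talk does not prescribe a specific one.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens in a document."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard(set_a, set_b):
    """Share of items (e.g., cited works or characters) that two documents have in common."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

same = cosine_similarity(bag_of_words("sea and storm"), bag_of_words("Storm and sea"))
different = cosine_similarity(bag_of_words("sea and storm"), bag_of_words("tax report"))
shared_refs = jaccard({"ref_a", "ref_b"}, {"ref_b", "ref_c"})  # hypothetical citation IDs
```

Word-count cosine ignores word order, so two documents with the same vocabulary score as identical; that is often acceptable for browsing but too coarse for stylistic questions.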

Example: Quantitative Formalism. Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).

Example: Clinton’s DNC Speech

Example: View DHQ

Classification: assigning an object to a single class; often supervised, using an existing classification scheme and a tagged corpus.
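A small sketch of supervised classification from a tagged corpus, here a multinomial Naive Bayes classifier with add-one smoothing written with the standard library. The choice of Naive Bayes, the class labels, and the four training sentences are assumptions for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labeled_docs):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in labeled_docs:
            self.class_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def predict(self, text):
        """Return the single most probable class for `text`."""
        total_docs = sum(self.class_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical tagged corpus, invented for this example.
TAGGED = [
    ("The ship sailed across the stormy sea", "fiction"),
    ("The dragon guarded the castle", "fiction"),
    ("The study measured rainfall in the region", "nonfiction"),
    ("The report measured income in the city", "nonfiction"),
]

model = NaiveBayes().fit(TAGGED)
label_a = model.predict("A dragon sailed the sea")
label_b = model.predict("The report measured rainfall")
```

Each document receives exactly one label, which matches the "single class" framing of this slide.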

Example: Relative signatures. Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012.

Categorization: assigning documents to one or more categories; suggestive of unsupervised clustering techniques; design choices made to fit particular tasks or goals.
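To make the "one or more categories" point concrete, the sketch below assigns a document to every category whose seed vocabulary it overlaps sufficiently. This keyword-overlap rule is a deliberately simple stand-in, not an unsupervised clustering technique; the category names, seed terms, and the `min_overlap` threshold are all hypothetical.

```python
import re

# Hypothetical categories and seed terms, invented for this example.
CATEGORY_TERMS = {
    "biology": {"cell", "gene", "protein", "species"},
    "computing": {"algorithm", "data", "network", "software"},
    "medicine": {"patient", "clinical", "gene", "therapy"},
}

def categorize(text, category_terms, min_overlap=2):
    """Return every category sharing at least `min_overlap` seed terms with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        name
        for name, terms in category_terms.items()
        if len(tokens & terms) >= min_overlap
    )

mixed = categorize("An algorithm for clustering gene and protein network data", CATEGORY_TERMS)
single = categorize("A clinical therapy trial for each patient", CATEGORY_TERMS)
none = categorize("Nothing relevant here", CATEGORY_TERMS)
```

The threshold is exactly the kind of design choice the slide mentions: raising `min_overlap` yields fewer, more confident assignments, while lowering it favors recall.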

Example: UCSD Map of Science. Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7).

Example: NIH Map Viewer

Reference systems, infrastructure What do we gain by adding structure? What do we lose?

SUMMARIZING DOCUMENTS

Text is only one component of a document. Research questions often push us to be creative with how we operationalize constructs. The richness of language and documents is best preserved by using multiple, complementary approaches.

QUESTIONS?