Artificial Intelligence and the Internet Edward Brent University of Missouri – Columbia and Idea Works, Inc. Theodore Carnahan Idea Works, Inc.

Overview Objective – Consider how AI can be (and in many cases is being) used to enhance and transform social research on the Internet Framework – intersection of AI and research issues View Internet as a source of data whose size and rate of growth make it important to automate much of the analysis of data

Overview (continued) We discuss a leading AI-based approach, the semantic web, and an alternative paradigmatic approach, and the strengths and weaknesses of each We explore how other AI strategies can be used including intelligent agents, multi-agent systems, expert systems, semantic networks, natural language understanding, genetic algorithms, neural networks, machine learning, and data mining We conclude by considering implications for future research

Key Features of the Internet Decentralized Few or no standards for much of the substantive content Incredibly diverse information Massive and growing rapidly Unstructured data

The Good News About the Internet A massive flow of data Digitized A researcher’s dream

The Bad News A massive flow of data Digitized A researcher’s nightmare

Data Flows The Internet provides many examples of data flows. A data flow is an ongoing flux of new information, often from multiple sources, and typically large in volume. Data flows are the result of ongoing social processes in which information is gathered and/or disseminated by humans for assessment or consumption by others. Not all data flows are digital, but all flows on the Internet are. Data flows are increasingly available over the Internet. Examples of data flows include: news articles, published research articles, medical records, personnel records, articles submitted for publication, research proposals, arrest records, and birth and death records.

Data Flows vs Data Sets Data flows are fundamentally different from the data sets with which most social scientists have traditionally worked. A data set is a collection of data, often collected for a specific purpose and over a specific period of time, then frozen in place. A data flow is an ongoing flux of new information, with no clear end in sight. Data sets typically must be created in research projects funded for that purpose in which relevant data are collected, formatted, cleaned, stored, and analyzed. Data flows are the result of ongoing social processes in which information is gathered and/or disseminated by humans for the assessment or consumption by others. Data sets are sometimes analyzed only once in the context of the initial study, but are often made available in data archives to other researchers for further analysis. Data flows often merit continuing analysis, not only of delimited data sets from specific time periods, but as part of ongoing monitoring and control efforts.
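The data set / data flow contrast above can be sketched in code. This is a minimal Python illustration, not any particular system: the records and the word-count "analysis" are invented, and a real flow would draw from a live source such as a news feed rather than a fixed list.

```python
# Minimal sketch: a frozen data set vs. an ongoing data flow.
# The records and the word-count "analysis" are invented for illustration.

def analyze(record):
    """Placeholder analysis: count the words in one record."""
    return len(record.split())

# Data set: a fixed collection, frozen in place and analyzed once.
data_set = ["budget vote passes", "court ruling appealed"]
one_time_results = [analyze(r) for r in data_set]

# Data flow: an open-ended stream analyzed continuously as records arrive.
# In practice the source (e.g., a news feed) would never be exhausted.
def data_flow(source):
    for record in source:
        yield analyze(record)

running_results = list(data_flow(iter(data_set)))  # same records, streamed
```

The generator form is what makes ongoing monitoring natural: analysis happens per record as it arrives, rather than after the collection is closed.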

The Need for Automating Analysis Together, the tremendous volume and rate of growth of the Internet, and the prevalence of ongoing data flows make automating analysis both more important and more cost-effective. Greater cost savings result from automated analysis with very large data sets Ongoing data flows require continuing analysis and that also makes automation cost-effective

AI and Automating Research Artificial Intelligence strategies offer a number of ways to automate research on the Internet.

Contemporary Social Research on the Web Formulate the research problem Search for and sample web sites containing relevant data Process, format, store data for analysis Develop a coding scheme Code web pages for analysis Conduct analyses

Strengths and Weaknesses of Contemporary Approach May use qualitative or quantitative programs to assist with the coding and analysis Advantages  Versatile  Gives researcher much control Disadvantages  Coding schemes often not shared, requiring more effort, making research less cumulative and less objective  Expensive and time-consuming  Unlikely to keep up with rapidly changing data in data flows  Not cost-effective for ongoing analysis and monitoring

The Semantic Web The semantic web is an effort to build into the World Wide Web tags or markers for data along with representations of the semantic meaning of those tags (Berners-Lee and Lassila, 2001; Shadbolt, Hall and Berners-Lee, 2006). The semantic web will make it possible for computer programs to recognize information of a specific type in any of many different locations on the web and to “understand” the semantic meaning of that information well enough to reason about it. This will produce interoperability – the ability of different applications and databases to exchange information and to be able to use that information effectively across applications. Such a web can provide an infrastructure to facilitate and enhance many things including social science research.

Implementing the Semantic Web
Contemporary Research element - Possible Implementation in the Semantic Web:
- Coding scheme - XML Schema: a standardized set of XML tags used to mark up web pages. For example, research proposals might include tags such as
- Coded data - Web pages marked up with XML (eXtensible Markup Language): a general-purpose markup language designed to be readable by humans while at the same time providing metadata tags for various kinds of substantive content that can be easily recognized by computers.
- Knowledge representation - Resource Description Framework (RDF): a general model for expressing knowledge as subject-predicate-object statements about resources. A sampling plan in a research proposal might include these statements: systematic sampling - is a - sampling procedure; sampling procedure - is part of - a sampling plan.
- Theory - Ontology: a knowledge base of objects, classes of objects, attributes describing those objects, and relationships among objects. An ontology is essentially a formal representation of a theory.
- Analysis - Intelligent agents: software programs capable of navigating to relevant web pages and using information accessible through the semantic web to perform useful functions.
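The subject-predicate-object statements on this slide can be made concrete with a small, hand-rolled sketch. Plain Python tuples stand in for a real RDF triple store (a real system would use something like rdflib and formal URIs); the objects_of() helper is our own invention for illustration.

```python
# Hand-rolled sketch of RDF-style subject-predicate-object statements,
# using the slide's own two example statements. Plain tuples stand in
# for a real triple store; objects_of() is an invented helper.

triples = {
    ("systematic sampling", "is a", "sampling procedure"),
    ("sampling procedure", "is part of", "sampling plan"),
}

def objects_of(subject, predicate):
    """Ask: what does this subject relate to via this predicate?"""
    return {o for s, p, o in triples if s == subject and p == predicate}

# An intelligent agent could chain the two statements together:
step1 = objects_of("systematic sampling", "is a")        # a sampling procedure
step2 = objects_of("sampling procedure", "is part of")   # part of a sampling plan
```

Chaining statements this way is the simplest form of the reasoning the semantic web is meant to support across pages and applications.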

The Semantic Web: What Can It Do? Illustrate briefly

AI Strategies and the Semantic Web Several components of the semantic web make use of artificial intelligence (AI) strategies.
- Knowledge representation: Object-Attribute-Value (O-A-V) triplets commonly used in semantic networks
- Theory: Semantic network
- Analysis: Intelligent agents, expert systems, multi-agent models, distributed computing, parallel processing, grid computing
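To make the O-A-V / semantic-network idea concrete, here is a minimal sketch of triplet storage with "is_a" inheritance, the kind of reasoning an expert-system component might perform. The nodes ("survey", "research method") and the lookup function are hypothetical, invented only for illustration.

```python
# Minimal semantic network: O-A-V triplets plus "is_a" inheritance.
# Node and attribute names are hypothetical, invented for illustration.

facts = [
    ("survey", "is_a", "research method"),
    ("research method", "produces", "data"),
]

def inherits(node, attr):
    """Look up an attribute, climbing is_a links when it is absent locally."""
    for o, a, v in facts:
        if o == node and a == attr:
            return v
    for o, a, v in facts:
        if o == node and a == "is_a":
            return inherits(v, attr)
    return None

# "survey" has no "produces" fact of its own but inherits one:
inherited = inherits("survey", "produces")
```

Inheritance over is_a links is what lets a small network answer questions its triplets never state explicitly.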

Strengths of the Semantic Web Fast and efficient to develop  Most coding done by web developers one time and used by everyone Fast and efficient to use  Intelligent agents can do most of the work with little human intervention  Structure provided makes it easier for computers to process  Can take advantage of distributed processing and grid computing Interoperability  Many different applications can access and use information from throughout the web

Weaknesses of the Semantic Web (Pragmatic Concerns) Seeks to impose standardization on a highly decentralized process of web development  Requires cooperation of many if not all developers  Imposes the double burden of expressing knowledge for humans and for computers  How will tens of millions of legacy web sites be retrofitted?  What alternative procedures will be needed for noncompliant web sites? Major forms of data on the web are provided by untrained users unlikely to be able to mark up for the semantic web  E.g., blogs, input to online surveys, emails

Weaknesses of the Semantic Web (Fundamental Concerns) Assumes there is a single ontology that can be used for all web pages and all users (at least in some domain).  For example, a standard way to mark up products and prices in commercial web sites could make it possible for intelligent agents to search the Internet for the best price for a particular make and model of car. This assumption may be inherently flawed for social research for two reasons. 1) Multiple paradigms - What ontology could code web pages from multiple competing paradigms or world views (Kuhn, 1969)?  If reality is socially constructed and "beauty is in the eye of the beholder," how can a single ontology represent such diverse views? 2) Competing interests - What if developers of web pages have political or economic interests at odds with those of some viewers of those pages?

Multiple Perspectives Chomsky’s deep structure vs subtexts

Contested terms

Paradigmatic Approach We describe an alternative approach to the semantic web, one that we believe may be more suitable for many social science research applications. Recognizes there may be multiple incompatible views of data Data structure must be imposed on data dynamically by the researcher as part of the research process  (in contrast to the semantic web which seeks to build an infrastructure of web pages with data structure pre-coded by web developers)

Paradigmatic Approach (continued) Relies heavily on natural language processing (NLP) strategies to code data. NLP capabilities do not yet exist for many of these research areas and must be developed. Those NLP procedures are often developed and refined using machine learning strategies. We will compare the paradigmatic approach to traditional research strategies and the semantic web for important research tasks.

Example Areas Illustrating the Paradigmatic Approach Event analysis in international relations Essay grading Tracking news reports on social issues or for clients  E.g., campaigns, corporations, press agents Each of these areas illustrates significant data flows. These areas and programs within them illustrate elements of the paradigmatic approach. Most do not yet employ all the strategies.

Essay Grading These are programs that allow students to submit essays on a computer; the program then examines each essay and computes a score for the student. Some of the programs also provide feedback to help students improve. These programs are becoming more common for standardized assessment tests and classroom applications. Examples of programs  SAGrader™  E-rater®  C-rater®  Intelligent Essay Assessor®  Criterion® These programs illustrate large ongoing data flows and generally reflect the paradigmatic approach.

Digitizing Data
- Traditional Research: Data from the Internet is digitized by web page developers. Other data must be digitized by the researcher or analyzed manually. This can be a huge hurdle.
- Semantic Web and Paradigmatic Approach: Data digitized by web page developers.
The first step in any computer analysis must be converting relevant data to digital form, where it is expressed as a stream of digits that can be transmitted and manipulated by computers. The semantic web and paradigmatic approaches both rely on web page developers to digitize information. This gives them a distinct advantage over traditional research, where digitizing data can be a major hurdle.

Essay Grading: Digitizing Data Digitizing  Papers replaced with digital submissions SAGrader, for example, has students submit their papers over the Internet using standard web browsers.  Digitizing often still a major hurdle limiting use Access issues Security concerns

Data Conversions
- Converted data: digitized data suitable for web delivery for human interpretation, versus digitized data suitable for web delivery and machine interpretation.
- Converting: no further data conversions required once data are digitized by the web page author; further conversion sometimes required by the researcher (e.g., OCR, speech recognition, handwriting recognition).

Essay Grading: Converting Data Data conversion  Where essays are submitted on paper, optical character recognition (OCR) or handwriting recognition programs must be used to convert to digitized text. Standardized testing programs often face this issue

Encoding Data
- Traditional Research: Encoding done by the researcher (often with the use of qualitative or quantitative programs); researchers must encode massive amounts of data. Coded data are based on a coding rubric.
- Semantic Web: Each web page developer must encode a small or moderate amount of data. Coded data are XML markup based on a standard ontology; an XML schema indicates the basic structure expected for a web page.
- Paradigmatic Approach: Encoding automated using NLP strategies (including statistical, linguistic, rule-based expert systems, and combined strategies) and machine learning (unsupervised learning, supervised learning, neural networks, genetic algorithms, data mining). Coded data are XML markup based on an ontology for that paradigm; an XML schema indicates the basic structure expected for a web page.
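As one concrete illustration of the rule-based end of the NLP strategies listed above, here is a toy coder that assigns codes to text by pattern matching. The codes and patterns are invented examples, not drawn from any real coding scheme.

```python
# Toy rule-based coder of the kind listed under NLP strategies.
# The codes and regex patterns are invented examples only.
import re

RULES = {
    "protest_event": re.compile(r"\b(protest|demonstration|rally)\b", re.I),
    "economic_topic": re.compile(r"\b(unemployment|inflation|wages)\b", re.I),
}

def code_text(text):
    """Return the set of codes whose pattern matches the text."""
    return {code for code, pat in RULES.items() if pat.search(text)}

codes = code_text("Workers held a rally over stagnant wages.")
```

A real system would combine such rules with statistical and machine-learned components, but the shape of the task - text in, codes out - is the same.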

Essay Grading: Coding Essay grading programs employ a wide array of strategies for recognizing important features in essays. Intelligent Essay Assessor (IEA) employs a purely statistical approach, latent semantic analysis (LSA).  This approach treats essays like a “bag of words” using a matrix of word frequencies by essays and factor analysis to find an underlying semantic space. It then locates each essay in that space and assesses how closely it matches essays with known scores. E-rater uses a combination of statistical and linguistic approaches.  It uses syntactic, discourse structure, and content features to predict scores for essays after the program has been trained to match human coders. SAGrader uses a strategy that blends linguistic, statistical, and AI approaches.  It uses fuzzy logic to detect key concepts in student papers and a semantic network to represent the semantic information that should be present in good essays. All of these programs require learning before they can be used to grade essays in a specific domain.
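The bag-of-words matching step behind LSA-style scoring can be sketched as follows. Real LSA first reduces the term-by-document matrix to an underlying semantic space; this sketch skips that step and compares raw word-frequency vectors, and the scored essays are invented.

```python
# Bag-of-words sketch of the matching step behind LSA-style scoring.
# Real LSA reduces the term-document matrix to a semantic space first;
# this sketch compares raw word-frequency vectors. Essays are invented.
import math
from collections import Counter

def vector(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Reference essays with known human-assigned scores (invented).
scored = {"class status and power shape stratification": 5,
          "plants use sunlight to make food": 1}

def grade(essay):
    """Give the new essay the reference essay it most closely matches."""
    return max(scored, key=lambda ref: cosine(vector(essay), vector(ref)))

best_match = grade("status and power determine class position")
```

Even without the dimensionality reduction, this shows why the approach treats essays as a "bag of words": only overlap in vocabulary, not word order, drives the score.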

Knowledge
- Theory: Traditional research and the semantic web assume a single shared world-view or objective reality; the paradigmatic approach assumes multiple paradigms.
- Implementation: Traditional research uses a coding scheme implemented with a codebook (often imperfect). The semantic web uses an ontology (a knowledge base developed by web page developers and shared as a standard), implemented with RDF and ontological languages. The paradigmatic approach uses multiple ontologies, one for each paradigm (developed by researchers and shared within a paradigm), also implemented with RDF and ontological languages.

Essay Grading: Knowledge Most essay grading programs have very little in the way of a representation of theory or knowledge. This is probably because they are often designed specifically for grading essays and are not meant to be used for other purposes requiring theory, such as social science research.  For example, C-rater emphasizes semantic content in essays, yet it has no representation of semantic content other than as desirable features of the essay. The exception is SAGrader.  SAGrader employs technologies developed in a qualitative analysis program, Qualrus. Hence, SAGrader uses a semantic network to explicitly represent and reason about knowledge or theory.

Analysis
- Traditional Research: analysis by hand, perhaps with the help of qualitative or quantitative programs
- Semantic Web: intelligent agents
- Paradigmatic Approach: intelligent agents
The semantic web and paradigmatic approaches can take similar approaches to analysis.

Essay Grading: Analysis All programs produce scores, though the precision and complexity of the scores varies. Some produce explanations. Most of these essay grading programs simply perform a one-time analysis (grading) of papers. However, some of them, such as SAGrader, provide for ongoing monitoring of student performance as students revise and resubmit their papers. Since essays presented to the programs are already converted into standard formats and are submitted to a central site for processing, there is no need for the search and retrieval capabilities of intelligent agents.

Advantages of Paradigmatic Approach Suitable for multiple-paradigm fields Suitable for contested issues Does not require as much infrastructure development on the web Can be used for new views requiring different codes with little lag time

Disadvantages of Paradigmatic Approach Relies heavily on NLP technologies that are still evolving May not be feasible in some or all circumstances Requires extensive machine learning Often requires additional data conversion for automated analysis Requires individual web pages to be coded once for each paradigm rather than a single time, hence increasing costs. (However, by automating this, costs are made manageable) Current NLP capabilities are limited to problems of restricted scope. Instead of general-purpose NLP programs, they are better characterized as special-purpose NLP programs.

Structured Data Structured data - data stored in a computer in a manner that makes it efficient to examine. A good data structure does much of the work, making the algorithms required for some kinds of reasoning straightforward, even trivial. Examples of structured data include data stored in spreadsheets, statistical programs, and databases. Unstructured data - data stored in a manner that does not make it efficient to examine. Examples of unstructured data include newspaper articles, blogs, interview transcripts, and graphics files. A structured-unstructured dichotomy is an oversimplification: data well-structured for some purposes may not be well-structured for other purposes, whether for viewing by humans (e.g., photographs, protected PDF files), for processing by programs (e.g., text, doc, html), or marked for analysis (semantic web).
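The structured/unstructured contrast can be shown with the same fact stored both ways. The record and the sentence are invented, and the regular expression is deliberately brittle - which is exactly the point the slide makes about unstructured data.

```python
# Sketch of the structured/unstructured contrast: the same fact as a
# structured record versus a free-text sentence (both invented).
import csv
import io
import re

# Structured: the data structure does the work -- lookup is trivial.
structured = io.StringIO("name,age\nAda,36\n")
row = next(csv.DictReader(structured))
age_structured = int(row["age"])

# Unstructured: extracting the same fact requires pattern matching
# (full NLP in the general case), and this pattern breaks as soon as
# the sentence is phrased differently.
unstructured = "Ada, who turns 36 this year, wrote the first program."
match = re.search(r"turns (\d+)", unstructured)
age_unstructured = int(match.group(1)) if match else None
```

The same information is present in both forms; what differs is how much algorithmic work is needed before a program can reason with it.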

Event Analysis How is this a data flow?

Event Analysis Schrodt's discussion of various coding schemes

Discussion and Conclusions Both semantic web and paradigmatic approaches have advantages and disadvantages Codes on semantic web could facilitate coding by paradigmatic-approach programs Where there is much consensus the single coding for the semantic web could be sufficient While the infrastructure for the semantic web is still in development the paradigmatic approach could facilitate analysis of legacy data The paradigmatic approach could be used to build out the infrastructure for the semantic web