Making Good Use of Data at Hand: Open Source Tools Mark C. Cooke, Ph.D. Tax Management Associates, Inc.


Overview
– Open Data concept: data is produced for various purposes but can be used to derive novel insights, i.e. “Business Intelligence (BI)”
– Open Source tools exist for making good use of existing data sets: ETL (“Extract, Transform, Load”) plus analytics
– Knime and the R language are two of the most powerful resources for leveraging data
Open Source Tools for Data Analysis, Mark C Cooke, Tax Management Associates, Inc., 2013

Open Data
– Open Data concept: governments collect, through existing management systems, enormous quantities of data that can be leveraged in alternative and novel ways to find solutions.
– The goal is often to leverage the broader community to develop solutions that governments may not have previously conceived.
– Open Data and Business Intelligence should be used by internal consumers as well.

Open Data

“Data Scientist”

Doing Data the Old Way
– Data is locked inside systems :-(
– Software systems are designed to wrap a Graphical User Interface (GUI) around data.
– Historically, the GUI has had to be programmed to produce each report, view, and analysis.
– The GUI is driven by the sole purpose of the software. But the data has many purposes…

Open Data – Way Forward
– Making data talk across platforms: AS400, SQL, XML, Excel, PDFs, text files, image files (.png, .jpeg, etc.), shapefiles (ESRI), archives, web scraping, social media APIs, etc.
– Connecting data across multiple platforms
– Using data for novel insight
– Tools now exist for importing, cleaning, standardizing, and analyzing data using complex algorithms built into accessible packages
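
To make "connecting data across multiple platforms" concrete, here is a minimal Python sketch (standard library only, not Knime or R) that reads one source as CSV and another as XML, then joins them on a shared key. The account IDs, owner names, permit years, and function names are all hypothetical and exist only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the account IDs, names, and years are made up.
CSV_TEXT = "account_id,owner\nA1,Acme Tools\nA2,Bay Diner\n"
XML_TEXT = """<permits>
  <permit account="A1" year="2012"/>
  <permit account="A3" year="2013"/>
</permits>"""


def read_csv_accounts(text):
    """Parse CSV text into a dict keyed by account_id."""
    return {row["account_id"]: row for row in csv.DictReader(io.StringIO(text))}


def read_xml_permits(text):
    """Parse XML text into a dict mapping account id to permit year."""
    root = ET.fromstring(text)
    return {p.get("account"): int(p.get("year")) for p in root.iter("permit")}


def join_sources(accounts, permits):
    """Left-join permits onto accounts; accounts without a permit get None."""
    return [
        {**row, "permit_year": permits.get(account_id)}
        for account_id, row in accounts.items()
    ]


joined = join_sources(read_csv_accounts(CSV_TEXT), read_xml_permits(XML_TEXT))
```

Once both sources are reduced to plain records with a common key, the original file format no longer matters to the later steps.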

Open Data
– These systems are known as “data agnostic.”
– Database agnostic: Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system. – tabase-agnostic
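
One way to picture database-agnostic code is a helper that relies only on the calls defined by Python's DB-API 2.0 (PEP 249), so it should accept a connection from any compliant driver. This is a sketch under that assumption, not how Knime does it; `count_rows` and the `accounts` table are hypothetical, and `sqlite3` merely stands in for a vendor database.

```python
import sqlite3


def count_rows(connection, table_name):
    """Count rows using only calls defined by Python's DB-API 2.0 (PEP 249)."""
    # Table names cannot be bound as parameters, so allow only plain identifiers.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    cursor = connection.cursor()
    try:
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        return cursor.fetchone()[0]
    finally:
        cursor.close()


# sqlite3 stands in for any vendor's driver here; the table is hypothetical.
# conn.execute / conn.executemany are sqlite3 shortcuts used only for setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT)")
conn.executemany("INSERT INTO accounts (owner) VALUES (?)", [("Acme",), ("Bay",)])
row_count = count_rows(conn, "accounts")
conn.close()
```

Parameter placeholder styles differ between drivers, so truly portable code has to keep queries simple or consult each driver's `paramstyle`.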

Data Science
What is the breadth of the tool base?
– Reading in data from various resources
– Transforming data to merge various resources, translate data into a usable format, or add new data elements
– Analyzing data, from basic logical and statistical functions to higher-level machine learning tools and algorithms
“Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.”

Data Science
What is the output?
– “Business Intelligence”, or actionable information that drives business decisions through insight
– Creating new insights from existing data
– Visualizations: representations of that BI in ways that make it consumable by a non-specialist audience
According to Friedman (2008), the “main goal of data visualization is to communicate information clearly and effectively through graphical means.”

– Knime is a GUI-based, data-agnostic tool for ETL, analytics, and visualization.
– Knime is an open source platform for the desktop, with commercial enterprise server layers that include collaboration tools and web services (web portal).
– Knime supports other analytics languages, including the R language for statistical computing.

The advantages of Knime:
– Rapid development environment
– Very powerful processing: handles large datasets on commodity hardware, allowing 100% data samples of up to millions of rows
– Workflows can be saved, shared, and duplicated
– Nodes are stepwise, allowing for quick revisions
– Nodes provide access to complex algorithms

What is Knime?

The Knime Workbench

Knime Nodes
– Nodes are the workers inside a workflow
– Every node serves at least one function
– Nodes can also be built as Meta-Nodes, which are collections of nodes performing common functions
– A collection of nodes is called a “workflow”
– You can develop nodes with Java and the node development support
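
The node and workflow idea can be sketched in a few lines of Python. This is a conceptual analogy only and does not use the Knime node API (which is Java): each "node" is a function from table to table, a "workflow" is an ordered list of nodes, and a "meta-node" is a workflow wrapped as a single node. All names and rows below are hypothetical.

```python
from functools import reduce

# Hypothetical rows; the field names and values are made up for illustration.
ROWS = [
    {"account": "A1", "value": "1200"},
    {"account": "A2", "value": ""},
    {"account": "A3", "value": "800"},
]


def drop_missing_values(rows):
    """Node 1: keep only rows that have a non-empty value."""
    return [row for row in rows if row["value"]]


def parse_values(rows):
    """Node 2: convert the value field from text to an integer."""
    return [{**row, "value": int(row["value"])} for row in rows]


def run_workflow(nodes, rows):
    """Pass the table through each node in order, like a stepwise workflow."""
    return reduce(lambda table, node: node(table), nodes, rows)


def clean_values(rows):
    """A 'meta-node': a reusable workflow wrapped as a single node."""
    return run_workflow([drop_missing_values, parse_values], rows)


result = run_workflow([clean_values], ROWS)
```

Because every step takes a table and returns a table, a single node can be swapped or re-run without touching the rest of the chain, which is the property the slide calls "stepwise".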

Knime Nodes
– For example, the File Reader node is an intelligent file reader that can determine the type of file
– However, it also allows the end user to adjust parameters
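
A rough Python analogue of "guess the format, but let the user override it" uses `csv.Sniffer` from the standard library. This is not the Knime File Reader's actual logic; `read_table` and the sample text are hypothetical.

```python
import csv
import io


def read_table(text, delimiter=None):
    """Read delimited text, guessing the delimiter unless the caller sets one."""
    if delimiter is None:
        # csv.Sniffer inspects a sample of the text and infers the dialect.
        delimiter = csv.Sniffer().sniff(text[:1024], delimiters=",;\t|").delimiter
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample content.
SEMICOLON_TEXT = "account;owner;value\nA1;Acme;1200\nA2;Bay;800\n"

guessed = read_table(SEMICOLON_TEXT)               # delimiter inferred
forced = read_table(SEMICOLON_TEXT, delimiter=",")  # user override
```

Forcing the wrong delimiter leaves each row as a single unsplit field, which is why an override is only useful when the user knows the file better than the guesser does.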

Knime Nodes
– The (conveniently named) Column Filter node allows users to filter columns from a table
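
In code terms, a column filter is a projection: keep the named columns and drop the rest. The Python sketch below illustrates the operation only, not the node itself; the table and the `column_filter` helper are hypothetical.

```python
def column_filter(rows, keep):
    """Return a copy of the table containing only the columns named in keep."""
    return [{name: row[name] for name in keep} for row in rows]


# Hypothetical table; the column names and values are made up.
TABLE = [
    {"account": "A1", "owner": "Acme", "value": 1200},
    {"account": "A2", "owner": "Bay", "value": 800},
]

filtered = column_filter(TABLE, keep=["account", "value"])
```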

Knime Nodes (sample)

Knime Integrates with R
– R integration is key to expanding the data analysis and visualization capabilities of Knime
– R supports data ingestion of complex files (including ESRI)
– R supports complex data manipulation and statistical analysis
– R supports a wide variety of highly customizable visualizations
– So, what is R, exactly?

R Project for Statistical Computing
– R is an open source scripting language that can be run inside Knime or independently from the command line
– Several GUIs for R exist, such as RStudio, from a group that provides software for using R as well as training and extension packages
– Community contributions make up the bulk of R packages, which now total more than 4,700

R Project for Statistical Computing
– The R base package (standard software) provides methods for reading data, ETL, analysis, and visualization
– Community-provided packages build on this base according to the interests of their authors
– Packages stretch across all imaginable data uses, including advanced statistical analyses, machine learning and data mining, and advanced graphical visualizations (including sophisticated mapping)

Popular R Packages
A (very) brief overview of popular packages:
– plyr: for advanced data manipulation
– maps: for mapping datasets onto georeferenced outputs
– ggplot2: for advanced data visualizations
– RCurl: for reading data from webpages and repositories
– TextMining: for text mining applications
– sna: for social network analysis

R Inside Knime
Basic Data Manipulation:

R Inside Knime
Basic Visual using Maps:

Knime + R + TPP
Case examples for working with TPP:
– Look at the distribution of TPP accounts across a county, state, or region
– Map entities or create a heatmap (choropleth) of the distribution of personal property values
– Compare personal property reporting across schedules and across industry sectors (M&E across manufacturing types)
– Compare like-kind entity reporting (franchises, big-box) for consistency in values
– Compare personal property accounts with other data resources (real property accounts, permits, etc.)

Brief Demonstration
Data:
– Florida, 67 counties
– More than 1.24 million personal property accounts
Goals:
1. Group all data by industry to illustrate the taxable value and exempted value by type
2. Subset the data to include only a particular industry
3. Map the state-wide exempt value in a choropleth
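
The first two goals, and the input for the third, can be sketched in plain Python. The demonstration itself used Knime and R on the Florida data; the three records below are made up for illustration and are not figures from that data set, and `totals_by` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up records for illustration only; these are not the Florida figures.
ACCOUNTS = [
    {"county": "Alachua", "industry": "Retail", "taxable": 5000, "exempt": 25000},
    {"county": "Baker", "industry": "Retail", "taxable": 0, "exempt": 12000},
    {"county": "Alachua", "industry": "Manufacturing", "taxable": 90000, "exempt": 25000},
]


def totals_by(rows, key):
    """Sum taxable and exempt values for each distinct value of key."""
    totals = defaultdict(lambda: {"taxable": 0, "exempt": 0})
    for row in rows:
        totals[row[key]]["taxable"] += row["taxable"]
        totals[row[key]]["exempt"] += row["exempt"]
    return dict(totals)


# Goal 1: group by industry.
by_industry = totals_by(ACCOUNTS, "industry")

# Goal 2: subset to a single industry.
retail_only = [row for row in ACCOUNTS if row["industry"] == "Retail"]

# Goal 3 (input only): per-county exempt totals that a choropleth would color by.
exempt_by_county = {
    county: totals["exempt"] for county, totals in totals_by(ACCOUNTS, "county").items()
}
```

Drawing the choropleth itself needs county boundary shapes and a plotting library, which is where the R mapping packages mentioned earlier come in.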

Questions?
Thank you for your time and attention. I am always happy to discuss data, so please feel free to contact me using any of the details below.
Mark C Cooke (office) (cell)