1 Making Good Use of Data at Hand: Open Source Tools Mark C. Cooke, Ph.D. Tax Management Associates, Inc.

2 Overview
- Open Data concept: data is produced for various purposes but can be used to derive novel insights, i.e. “Business Intelligence (BI)”
- Open Source tools exist for making good use of existing data sets: ETL (“Extract, Transform, Load”) + Analytics
- Knime and the R language are two of the most powerful resources for leveraging data
2013 Open Source Tools for Data Analysis Mark C Cooke - Tax Management Associates, Inc.

3 Open Data
Open Data concept: governments collect, through existing management systems, enormous quantities of data that can be leveraged in alternative and novel ways to find solutions. The goal is often to leverage the broader community to develop solutions that governments may not have previously conceived. Open Data and Business Intelligence should be used by internal consumers as well.

4 Open Data

5 “Data Scientist”

6 Doing Data the Old Way
Data is locked inside systems :-(
- Software systems are designed to wrap a Graphical User Interface (GUI) around data.
- Historically, GUI functionality has had to be programmed to produce reports, views, and analysis.
The GUI is driven by the sole purpose of the software. But the data has many purposes…

7 Open Data – Way Forward
- Making data talk across platforms: AS400, SQL, XML, Excel, PDFs, text files, image files (.png, .jpeg, etc.), shapefiles (ESRI), email archives, web scraping, social media APIs, etc.
- Connecting data across multiple platforms
- Using data for novel insight
- Tools now exist for importing, cleaning, standardizing, and analyzing data using complex algorithms built into accessible packages
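
The idea of connecting data across platforms can be sketched in a few lines. The deck itself works in Knime and R; the Python below is only a stand-in, and the two "exports", their field names, and the helper names (`normalize_id`, `merge_sources`) are hypothetical. It shows one way to standardize a shared key so that a CSV extract and a JSON extract can be joined.

```python
import csv
import io
import json

# Hypothetical extracts from two separate systems (all values invented).
CSV_EXPORT = """account_id,owner,value
A-100,Acme Tools,25000
A-200,Birch Cafe,8000
"""

JSON_EXPORT = '[{"account_id": "a-100 ", "permit": "P-77"}]'


def normalize_id(raw):
    """Standardize an account ID so records from different systems match."""
    return raw.strip().upper()


def merge_sources(csv_text, json_text):
    """Left-join the JSON permit records onto the CSV rows by account ID."""
    permits = {
        normalize_id(rec["account_id"]): rec["permit"]
        for rec in json.loads(json_text)
    }
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = normalize_id(row["account_id"])
        merged.append({
            "account_id": key,
            "owner": row["owner"],
            "value": int(row["value"]),
            "permit": permits.get(key),  # None when no permit record exists
        })
    return merged


records = merge_sources(CSV_EXPORT, JSON_EXPORT)
```

The cleaning step matters as much as the join: without `normalize_id`, the JSON key `"a-100 "` would never match `"A-100"`.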

8 Open Data
These systems are known as “Data Agnostic”:
Database Agnostic - Database-agnostic is a term describing the capacity of software to function with any vendor’s database management system (DBMS). In information technology (IT), agnostic refers to the ability of something – such as software or hardware – to work with various systems, rather than being customized for a single system.
– http://searchdatamanagement.techtarget.com/definition/database-agnostic
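
As an illustration that is not taken from the slides, Python's DB-API 2.0 is one common form of database-agnostic interface: code written against the generic connection and cursor methods can, in principle, run against any compliant driver. The sketch below assumes a hypothetical `accounts` table and uses the standard-library `sqlite3` driver as a stand-in for any vendor's database.

```python
import sqlite3


def total_value(conn):
    """Sum a column using only the generic DB-API cursor interface."""
    cur = conn.cursor()
    cur.execute("SELECT SUM(value) FROM accounts")
    (total,) = cur.fetchone()
    cur.close()
    return total


# SQLite stands in for any vendor's database here. The setup below uses
# sqlite3-specific shortcuts (conn.execute, '?' placeholders); only
# total_value() is written to be driver-neutral.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT, value REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("A-100", 25000.0), ("A-200", 8000.0)],
)
conn.commit()
result = total_value(conn)
conn.close()
```

Parameter placeholder styles differ between drivers, which is why the driver-neutral function avoids query parameters altogether.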

9 Data Science
What is the breadth of the tool base?
- Reading in data from various resources
- Transforming data to merge various resources, translate data into a usable format, or add new data elements
- Analyzing data, from basic logical and statistical functions to higher-level machine learning tools and algorithms
“Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.” http://en.wikipedia.org/wiki/Machine_learning

10 Data Science
What is the output?
- “Business Intelligence”, or actionable information that drives business decisions through insight
- Creating new insights from existing data
- Visualizations: representation of that BI in ways that make it consumable to a non-specialist audience
According to Friedman (2008), the “main goal of data visualization is to communicate information clearly and effectively through graphical means.” http://en.wikipedia.org/wiki/Data_visualization

12
- Knime is a GUI-based, data-agnostic tool for ETL, analytics, and visualization.
- Knime is an open source platform for the desktop, with commercial enterprise server layers that include collaboration tools and web services (web portal).
- Knime supports other analytics languages, including the R language for statistical computing.
www.Knime.org

13 The advantages of Knime:
- Rapid development environment
- Very powerful processing that handles large datasets on commodity hardware; allows for 100% data samples up to millions of rows
- Workflows can be saved, shared, and duplicated
- Nodes are stepwise, allowing for quick revisions
- Nodes provide access to complex algorithms

14 What is Knime?

15 The Knime Workbench

16 Knime Nodes
- Nodes are the workers inside a workflow
- Every node serves at least one function
- Nodes can also be built as Meta-Nodes, which are a collection of nodes performing common functions
- A collection of nodes is called a “workflow”
- You can develop nodes with Java and the node development support
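
The node, meta-node, and workflow vocabulary maps onto ordinary function composition. The following is a conceptual sketch in Python, not Knime's actual Java node API; every name in it (`row_filter`, `add_flag`, `metanode`, `run_workflow`) is hypothetical, and a table is modeled as a list of dictionaries.

```python
from functools import reduce


def row_filter(min_value):
    """Build a node that keeps rows whose 'value' is at least min_value."""
    def node(table):
        return [row for row in table if row["value"] >= min_value]
    return node


def add_flag(table):
    """A node that adds a boolean 'large' column to every row."""
    return [{**row, "large": row["value"] >= 20000} for row in table]


def metanode(*nodes):
    """Bundle several nodes into one reusable node."""
    def node(table):
        return reduce(lambda current, step: step(current), nodes, table)
    return node


def run_workflow(nodes, table):
    """Run the nodes in order, feeding each node's output to the next."""
    for step in nodes:
        table = step(table)
    return table


table = [
    {"id": "A", "value": 500},
    {"id": "B", "value": 8000},
    {"id": "C", "value": 25000},
]
stepwise = run_workflow([row_filter(1000), add_flag], table)
bundled = run_workflow([metanode(row_filter(1000), add_flag)], table)
```

Because each node only takes a table and returns a table, a step can be swapped or revised without touching the rest of the workflow, which is the "stepwise" property the slides describe.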

17 Knime Nodes
- For example, the File Reader node is an intelligent file reader that can determine the type of file
- However, it also allows the end user to adjust parameters
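
A rough analogue of "guess the format, but let the user override it" exists in Python's standard library as `csv.Sniffer`. The sketch below is not how Knime's File Reader is implemented; `read_table` is a hypothetical helper that guesses the delimiter unless the caller supplies one.

```python
import csv
import io
from typing import Optional


def read_table(text: str, delimiter: Optional[str] = None) -> list:
    """Parse delimited text, guessing the delimiter unless one is given."""
    if delimiter is None:
        # csv.Sniffer inspects a sample of the text and infers the dialect.
        dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t|")
        delimiter = dialect.delimiter
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


semicolon_text = "id;name;value\n1;Acme;25000\n2;Birch;8000\n"
guessed = read_table(semicolon_text)                    # delimiter detected
forced = read_table("id|value\n1|5\n", delimiter="|")   # user override
```

All parsed values come back as strings; type conversion would be a separate step.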

18 Knime Nodes
The Column Filter node allows users to filter columns from a table (conveniently named…)
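
What a column filter does to a table can be shown with a one-line Python function. This is an illustration of the operation only, with a hypothetical `column_filter` helper and invented rows, not Knime's node.

```python
def column_filter(table, keep):
    """Return a copy of the table containing only the named columns."""
    return [{col: row[col] for col in keep} for row in table]


accounts = [
    {"account_id": "A-100", "owner": "Acme Tools", "value": 25000},
    {"account_id": "A-200", "owner": "Birch Cafe", "value": 8000},
]
slim = column_filter(accounts, ["account_id", "value"])
```

The input table is left unchanged; asking for a column that does not exist raises a `KeyError`.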

19 Knime Nodes (sample)

20 Knime Integrates with R
- R integration is key to expanding the data analysis and visualization capabilities of Knime
- R supports data ingestion of complex files (including ESRI)
- R supports complex data manipulation and statistical analysis
- R supports a wide variety of highly customizable visualizations
So, what is R, exactly?

21 R Project for Statistical Computing
www.r-project.org
- R is an open source scripting language that can be run inside Knime or independently from a command line
- Several GUIs for R exist, such as RStudio, from a group that provides software for using R as well as training and extension packages (www.rstudio.com)
- Community contributions make up the bulk of R packages, which now total more than 4,700

22 R Project for Statistical Computing
www.r-project.org
- The R base package (standard software) provides methods for reading data, ETL, analysis, and visualization
- The community-provided packages take this base and build on it, depending on the interest of the producer
- Packages stretch across all imaginable data uses, including advanced statistical analyses, machine learning and data mining, and advanced graphical visualizations (including sophisticated mapping)

23 Popular R Packages
A (very) brief overview of popular packages:
- plyr: for advanced data manipulation
- maps: for mapping datasets onto georeferenced outputs
- ggplot2: for advanced data visualizations
- RCurl: for reading data from webpages and repositories
- tm (Text Mining): for text mining applications
- sna: for social network analysis

24 R Inside Knime
Basic Data Manipulation:

25 R Inside Knime
Basic Visual using Maps:

26 Knime + R + TPP
Case examples for working with TPP:
- Look at the distribution of TPP accounts across a county, state, or region
- Map entities or create a heatmap (choropleth) of the distribution of personal property values
- Compare personal property reporting across schedules and across industry sectors (M&E across manufacturing types)
- Compare like-kind entity reporting (franchises, big-box) for consistency in values
- Compare personal property accounts with other data resources (real property accounts, permits, etc.)
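
The like-kind comparison can be sketched as a simple consistency check. The slides do not say how the comparison is actually done, so the approach below (flag any account whose reported value differs from its group's median by more than a chosen fraction) is an assumption, as are the chain names, account IDs, values, and the `flag_inconsistent` helper.

```python
from statistics import median


def flag_inconsistent(reports, tolerance=0.5):
    """Return IDs of accounts whose value differs from the median of
    their like-kind group by more than `tolerance` (as a fraction)."""
    by_group = {}
    for report in reports:
        by_group.setdefault(report["chain"], []).append(report)

    flagged = []
    for group in by_group.values():
        mid = median(r["value"] for r in group)
        for r in group:
            if mid and abs(r["value"] - mid) / mid > tolerance:
                flagged.append(r["account_id"])
    return flagged


# Invented reports for two hypothetical chains.
reports = [
    {"account_id": "F-1", "chain": "BurgerCo", "value": 100000},
    {"account_id": "F-2", "chain": "BurgerCo", "value": 110000},
    {"account_id": "F-3", "chain": "BurgerCo", "value": 30000},
    {"account_id": "M-1", "chain": "MegaMart", "value": 500000},
    {"account_id": "M-2", "chain": "MegaMart", "value": 520000},
]
flagged = flag_inconsistent(reports)
```

A flag is only a prompt for review; locations of the same chain can legitimately differ in size and equipment.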

27 Brief Demonstration
Data:
- Florida, 67 counties
- More than 1.24 million personal property accounts
Goals:
1. Group all data by industry to illustrate the taxable value and exempted value by type
2. Subset the data to include only a particular industry
3. Map the state-wide exempt value in a choropleth
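
The first two goals, and the per-county totals that would feed the choropleth in the third, can be sketched as follows. The demonstration itself was run in Knime and R on the real Florida data; this Python stand-in uses three invented accounts, and the field names and the `totals_by` helper are hypothetical. Drawing the map is left out because it needs a mapping library.

```python
from collections import defaultdict

# Invented sample rows; the real data set has more than 1.24 million accounts.
ACCOUNTS = [
    {"county": "Alachua", "industry": "Retail", "taxable": 120000, "exempt": 25000},
    {"county": "Alachua", "industry": "Manufacturing", "taxable": 900000, "exempt": 25000},
    {"county": "Duval", "industry": "Retail", "taxable": 80000, "exempt": 10000},
]


def totals_by(accounts, key):
    """Sum taxable and exempt value for each distinct value of `key`."""
    totals = defaultdict(lambda: {"taxable": 0, "exempt": 0})
    for acct in accounts:
        totals[acct[key]]["taxable"] += acct["taxable"]
        totals[acct[key]]["exempt"] += acct["exempt"]
    return dict(totals)


# Goal 1: group by industry.
by_industry = totals_by(ACCOUNTS, "industry")

# Goal 2: subset to a single industry.
retail_only = [a for a in ACCOUNTS if a["industry"] == "Retail"]

# Goal 3 (input only): exempt value per county, ready to join to map shapes.
exempt_by_county = {
    county: totals["exempt"]
    for county, totals in totals_by(ACCOUNTS, "county").items()
}
```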

28 Questions?
Thank you for your time and attention. I am always happy to discuss data, so please feel free to contact me using any of the information below.
Mark C Cooke
Mark.Cooke@tma1.com
704.847.1234 (office)
704.953.6349 (cell)
www.linkedin.com/in/markccooke

