REU Summer 2009 Project: Association Rule Preprocessing. By: Walter Garcia, University of Houston - Downtown.

Project Goals
Convert the Heartfelt Study data set into MAFIA format.
Run the converted data set through the MAFIA program to find maximal frequent itemsets.
Convert the MAFIA output into SemAna format.
Run the converted output through the SemAna program to find unknown and correct relations.
Use our method to validate other studies that have been performed on the Heartfelt Study.
Find a relation that is interesting, useful, and correct that has not yet been discovered.

The Heartfelt Study examined 383 children aged years. It included 140 African-American, 117 Hispanic, and 126 Non-Hispanic White children. The original Heartfelt itemset contains unique transactions, and each transaction contains 101 different attributes (items) such as heart rate, age, posture, BMI, and obesity. Here is a screenshot of the file.

MAFIA is an acronym for MAximal Frequent Itemset Algorithm. It finds the maximal frequent itemsets in a transactional dataset: the frequent itemsets that are not contained in any larger frequent itemset. MAFIA accepts input in the format below. As you can see, every transaction is a set of integers. However, the original dataset includes integers, real numbers, and "?" characters that represent missing data.
[Figure: MAFIA format and original format]
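
To make the idea of a maximal frequent itemset concrete, here is a minimal brute-force sketch in Python. It is not the MAFIA algorithm, which uses a much more efficient search; the function name and the toy transactions are made up for illustration only.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force search: fine for toy data, far too slow for real datasets."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            # Support count = number of transactions containing every item.
            count = sum(1 for t in transactions if cset <= t)
            if count / n >= min_support:
                frequent.append(cset)
    # A frequent itemset is maximal if no frequent proper superset exists.
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical toy data: four transactions over the items 1..4.
transactions = [frozenset(t) for t in ([1, 2, 3], [1, 2, 3], [1, 2], [2, 4])]
result = maximal_frequent_itemsets(transactions, min_support=0.5)
```

With a minimum support of 50%, every subset of {1, 2, 3} is frequent, but only {1, 2, 3} itself is maximal, so it is the only itemset returned.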

I used WEKA, a program from the University of Waikato, to analyze each attribute and discretize its values into 10 or fewer unique items. This step assigned a unique integer to each item, as required by the MAFIA program.
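
The discretization step can be sketched roughly as follows. This is only an illustration in Python, not the WEKA filter that was actually used: it assumes equal-width binning, assumes that missing values ("?") simply produce no item, and uses hypothetical function names and attribute values.

```python
def equal_width_bins(values, max_bins=10):
    """Map each value to a bin index 0..max_bins-1; '?' (missing) maps to None."""
    numeric = [float(v) for v in values if v != "?"]
    lo, hi = min(numeric), max(numeric)
    width = (hi - lo) / max_bins or 1.0  # avoid a zero width when all values are equal
    bins = []
    for v in values:
        if v == "?":
            bins.append(None)  # assumption: missing data yields no item
        else:
            # The maximum value would land in bin max_bins, so clamp it.
            bins.append(min(int((float(v) - lo) / width), max_bins - 1))
    return bins


def assign_item_ids(attribute_bins, start_id=1):
    """Give every (attribute, bin) pair its own integer so items never collide."""
    ids = {}
    next_id = start_id
    for attribute, bins in attribute_bins.items():
        for b in sorted({b for b in bins if b is not None}):
            ids[(attribute, b)] = next_id
            next_id += 1
    return ids


# Hypothetical values for two attributes.
age_bins = equal_width_bins(["0", "5", "?", "10"], max_bins=2)
item_ids = assign_item_ids({"AGE": age_bins, "SBP": [0, 0]})
```

Numbering the items across all attributes, rather than restarting at 1 for each attribute, is what lets a single integer identify both the attribute and the value range.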

After discretizing the items in each attribute, I saved the results in an Excel file for cross-referencing later. Here is a screenshot:

My program converts the original itemset file into MAFIA format by performing the following actions:
Reads a transaction as a string.
Converts the string into a character array.
Tokenizes the character array into multiple character arrays and, as it tokenizes, writes the matching integer value for each token to an outputset.ascii file.
Repeats until the end of the file.
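
The steps above can be sketched as follows. This is a Python illustration, not the original program (which worked with character arrays). The comma delimiter, the lookup table keyed by column position and token, and the choice to skip "?" tokens are all assumptions.

```python
import io


def convert_to_mafia(lines, lookup, delimiter=","):
    """Yield one space-separated line of item integers per input transaction."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        items = []
        for column, token in enumerate(line.split(delimiter)):
            token = token.strip()
            if token == "?":
                continue  # assumption: missing values produce no item
            items.append(str(lookup[(column, token)]))
        yield " ".join(items)


# Hypothetical lookup table: (column index, discretized value) -> item integer.
lookup = {(0, "low"): 1, (0, "high"): 2, (1, "0"): 3, (1, "1"): 4}
source = io.StringIO("low,0\nhigh,?\nhigh,1\n")
converted = list(convert_to_mafia(source, lookup))
```

To write a real file, the same generator could be fed an open input file and each yielded line written to outputset.ascii.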

Once the program completes the conversion, the outputset file looks like this: Ready for MAFIA!

When the outputset file is loaded into the MAFIA program, we get the following output:

What does the output mean?
In the previous example we ran MAFIA with the following parameters: mafia -mfi .7 -ascii outputset.ascii mfi.txt
This means that MAFIA will accept the input file in ASCII format and find the maximal frequent itemsets in the dataset with a minimum support of 70%, or found in at least transactions.
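
The relationship between a percentage threshold and a transaction count can be sketched as below. The transaction totals here are hypothetical, and rounding up is an assumption; how MAFIA itself rounds the threshold is not stated in these slides.

```python
def min_support_count(num_transactions, min_support_percent):
    """Smallest number of transactions that reaches the given support percentage."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-num_transactions * min_support_percent // 100)


# Hypothetical dataset sizes, with the 70% threshold from the command above.
count_for_1000 = min_support_count(1000, 70)
count_for_15 = min_support_count(15, 70)
```

For 15 transactions, 70% is 10.5, so an itemset must appear in at least 11 transactions under this rounding assumption.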

What does the output mean?
If we take one line from the MAFIA output (MFI) file, we can find out what it means by cross-referencing it with the Excel file. For example, we examine the line below. The number in parentheses (11874) is the number of times the subset {351, 314, 239, 136} was found in the dataset. By looking at the Excel file:
351 means a RELAX1 selection of 1 (Child was relaxed)
314 means a TAXHYN selection of 0 (Anger Traits were High)
239 means an AGE2 selection of <= 14 (Age less than or equal to 14 years)
136 means a RAW.S.AN selection of <= 113 (Raw Trait Anger score less than or equal to 113)
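
The cross-referencing can also be done programmatically. The Python sketch below assumes each MFI line consists of space-separated item integers followed by the count in parentheses, as in the example on this slide. The small codebook stands in for the Excel file, and the function name is hypothetical.

```python
import re


def parse_mfi_line(line):
    """Split a line such as '351 314 239 136 (11874)' into items and a count."""
    match = re.fullmatch(r"\s*((?:\d+\s+)+)\((\d+)\)\s*", line)
    if match is None:
        raise ValueError(f"unexpected MFI line: {line!r}")
    items = [int(token) for token in match.group(1).split()]
    return items, int(match.group(2))


# Stand-in for the Excel cross-reference, using the four items from this slide.
codebook = {
    351: "RELAX1 = 1",
    314: "TAXHYN = 0",
    239: "AGE2 <= 14",
    136: "RAW.S.AN <= 113",
}
items, support = parse_mfi_line("351 314 239 136 (11874)")
decoded = [codebook[item] for item in items]
```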

What does the output mean?
One problem with the MAFIA output that we saw in the previous slide is that MAFIA will find every single frequent subset, including subsets that are trivial or incorrect. What we need now is a way to filter the MAFIA output to find subsets that are interesting, useful, unknown, and correct. For this we use a program called SemAna (Semantic Analyzer). When the MAFIA output set is converted and processed through the SemAna program, it places all of the trivial subsets in a file called trivial.rule and all of the unknown and correct subsets in a file called UnKnownCorrect.rule. Here is a screenshot of the unknown/correct file.

Validating our Method
In order to validate our findings (frequent subsets), I am comparing our results to studies that other scientists have already performed on the Heartfelt Study to see if they match.

The first study I analyzed was “Blood Pressure and Sexual Maturity in Adolescents” found in the American Journal of Human Biology (2001). This study found that Systolic Blood Pressure in adolescents increases as their Sexual Maturity increases.

Validating our Method
Using our method I found the subsets below. They show that as TANNER (the sexual maturity measurement) increases, systolic blood pressure (SBP) also increases.
TANNER='( ]' MATURE=0 SBP='( ]' ZHTCM='( ]'
TANNER='( ]' MATURE=0 SBP='( ]' OBESITY=0
TANNER='( ]' MATURE=0 SBP='( ]' WHRATIO='( ]'
TANNER='( ]' SBP='( ]' MATURE=0
TANNER='( ]' SBP='( ]' APWAIST='( ]' MATURE=1 OBESITY=0
TANNER='( ]' SBP='( ]' MAP='(80-94]' MATURE=1

Future Work
I will continue to validate more studies using our method, and I will keep looking for a relation that is interesting, useful, and correct that has not yet been discovered.