PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte.

Slides:



Advertisements
Similar presentations
Abstract Travel time based performance measures are widely used for transportation systems and particularly freeways. However, it has become evident that.
Advertisements

INTRODUCTION TO TRANSPORT Lecture 7 Introduction to Transport Lecture 7: Signal Coordination.
1 NATMEC 2008 Christopher Monsere Kristin Tufte, Robert L. Bertini, and Soyoung Ahn Intelligent Transportation Systems Laboratory Maseeh College of Engineering.
Chapter 2 (supplement): Capacity and Level-of-Service Analysis for Freeways and Multilane Highways Objectives of this presentation: By the end of this.
Case Study 4 New York State Alternate Route 7. Key Issues to Explore: Capacity of the mainline sections of NYS-7 Adequacy of the weaving sections Performance.
Econometric Details -- the market model Assume that asset returns are jointly multivariate normal and independently and identically distributed through.
Noam Segev, Israel Chernyak, Evgeny Reznikov Supervisor: Gabi Nakibly, Ph. D.
1Chapter 9-4e Chapter 9. Volume Studies & Characteristics Understand that measured volumes may not be true demands if not careful in data collection and.
2015/6/161 Traffic Flow Theory 2. Traffic Stream Characteristics.
1 CSI 101 Elements of Computing Fall 2009 Lecture #4 Using Flowcharts Monday February 2nd, 2009.
2/27/06Michael Dixon1 CE 578 Highway Traffic Operations Lecture 16: Freeways Basic Sections.
Chapter 3. Highway Design for Performance
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 5): Outliers Fall, 2008.
Testing an individual module
Probability Distributions Continuous Random Variables.
Month XX, 2004 Dr. Robert Bertini Using Archived Data to Measure Operational Benefits of ITS Investments: Ramp Meters Oregon Department of Transportation.
The one sample t-test November 14, From Z to t… In a Z test, you compare your sample to a known population, with a known mean and standard deviation.
CSA and Roadside Inspections Trooper John Sova Motor Carrier Operations Division North Dakota Highway Patrol.
Data Editing United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile.
Measures of Central Tendency
Exploring Engineering
Investigation of Speed-Flow Relations and Estimation of Volume Delay Functions for Travel Demand Models in Virginia TRB Planning Applications Conference.
Lec 17, Ch.9, pp : Capacity of freeway sections (objectives) Understand capacity and level of service are the heart of transportation analyses Understand.
Simpson County Travel Demand Model Mobility Analysis November 7, 2003.
University of Maryland Department of Civil & Environmental Engineering By G.L. Chang, M.L. Franz, Y. Liu, Y. Lu & R. Tao BACKGROUND SYSTEM DESIGN DATA.
CHAPTER 18: Inference about a Population Mean
Estimating Traffic Flow Rate on Freeways from Probe Vehicle Data and Fundamental Diagram Khairul Anuar (PhD Candidate) Dr. Filmon Habtemichael Dr. Mecit.
Abstract Raw ITS data is commonly aggregated to 20-30sec intervals for collection and communication, and further aggregated to 1-60min intervals for archiving.
Accuracy in Real-Time Estimation of Travel Times Galen McGill, Kristin Tufte, Josh Crain, Priya Chavan, Enas Fayed 15 th World Congress on Intelligent.
Abstract Transportation sustainability is of increasing concern to professionals and the public. This project describes the modeling and calculation of.
Abstract The Portland Oregon Regional Transportation Archive Listing (PORTAL) is the official intelligent transportation systems data archive for the Portland.
1 Single Table Queries. 2 Objectives  SELECT, WHERE  AND / OR / NOT conditions  Computed columns  LIKE, IN, BETWEEN operators  ORDER BY, GROUP BY,
Abstract The Portland Oregon Transportation Archive Listing (PORTAL) archives high resolution traffic data including speed, volume, and occupancy collected.
Assessment and Refinement of Real-Time Travel Time Algorithms for Use in Practice November 8 th, 2007.
Advanced Topics- Functions Introduction to MATLAB 7 Engineering 161.
Assessment and Refinement of Real-Time Travel Time Algorithms for Use in Practice September 6, 2007.
Traffic Flow Fundamentals
Calibrating Model Speeds, Capacities, and Volume Delay Functions Using Local Data SE Florida FSUTMS Users Group Meeting February 6, 2009 Dean Lawrence.
Assessment and Refinement of Real-Time Travel Time Algorithms for Use in Practice April 11, 2007.
© Department of Statistics 2012 STATS 330 Lecture 20: Slide 1 Stats 330: Lecture 20.
Chapter 10 Verification and Validation of Simulation Models
This document gives one example of how one might be able to “fix” a meteorological file, if one finds that there may be problems with the file. There are.
Using Signal Systems Data and Buses as Probes to Create Arterial Performance Measures Mathew Berkow, Michael Wolfe, John Chee, Robert Bertini,
Rescaling Reliability Bounds for a New Operational Profile Peter G Bishop Adelard, Drysdale Building, Northampton Square,
1 Techniques for Validating an Automatic Bottleneck Detection Tool Using Archived Freeway Sensor Data Jerzy Wieczorek, Rafael J. Fernández-Moctezuma, and.
1 Using Archived ITS Data to Automatically Identify Freeway Bottlenecks in Portland, Oregon Robert L. Bertini, Rafael J. Fernández-Moctezuma, Jerzy Wieczorek,
Archived Data Management System Study Advisory Committee Meeting May 14, 2003.
Abstract The value of creating an ITS data archive is somewhat undisputed, and a number exist in states and major metropolitan regions in North America.
July 13, 2005ITE District 6 Annual Meeting Using Ground Truth Geospatial Data to Validate Advanced Traveler Information Systems Freeway Travel Time Messages.
Using Archived Data to Measure Operational Benefits of a System-wide Adaptive Ramp Metering (SWARM) System Data Collection Plan / Experimental Design May.
ITE District 6 June 27, 2006 Incorporating Incident Data into a Freeway Data Archive for Improved Performance Measurement ITE District 6 June 27, 2006.
Uncertainty2 Types of Uncertainties Random Uncertainties: result from the randomness of measuring instruments. They can be dealt with by making repeated.
ITE District 6 June 27, 2006 Incorporating Incident Data into a Freeway Data Archive for Improved Performance Measurement ITE District 6 June 27, 2006.
1 Bottleneck Identification and Forecasting in Traveler Information Systems Robert L. Bertini, Rafael Fernández-Moctezuma, Huan Li, Jerzy Wieczorek, Portland.
Distance Vector Routing
Abstract Dynamic Message Signs (DMS) on freeways are used to provide a variety of information to motorists including incident and construction information,
PORTAL: Portland Transportation Archive Listing Improving Travel Demand Forecasting Conclusion Introduction Metro is working closely with PSU researchers.
The Data Collection and Statistical Analysis in IB Biology John Gasparini The Munich International School Part II – Basic Stats, Standard Deviation and.
Retrieving Information Pertemuan 3 Matakuliah: T0413/Current Popular IT II Tahun: 2007.
The latte Stream-Archive Query Project - Exploring Stream+Archive Data in Intelligent Transportation Systems Jin Li (with Kristin Tufte, Vassilis Papadimos,
Quality Control of Soil Moisture and Temperature For US Climate Reference Network Basic Methodology February 2009 William Collins USCRN.
Lessons learned from Metro Vancouver
Travelling to School.
The Selection Structure
CE 3500 Transportation Engineering Basic Traffic Flow Concepts
Accumulation AP Calculus AB Days 7-8
Chapter 10 Verification and Validation of Simulation Models
CHAPTER 21: Comparing Two Means
CHAPTER 18: Inference about a Population Mean
CHAPTER 18: Inference about a Population Mean
Presentation transcript:

PORTAL Data Quality & Aggregation Sue Ahn, Kristin Tufte

Road Map Initial Data Cleaning (more cleaning later)  Set null values as appropriate  Set Data Quality Flags based on row-level data integrity violations Aggregate Calculations  Calculate 5-minute aggregates  Calculate 15-min, 1-hour aggregates Portal Output Calculations  Timeseries  Grouped Data  Performance

Road Map: Initial Data Cleaning Set NULL values Data Quality Flags

Set Null Values as Appropriate Case 1: No Data Received from Detector  ODOT indicates vol = occ = speed = -1, status = 0 means no data received from detector ** (status 0 is inhibited)  We set volume, occupancy and speed to NULL in this case Case 2: No Activity at Detector  ODOT indicates vol = occ = speed = 0, status = 0 means no activity at detector **  We set speed to NULL in this case Case 3: Disabled (?)  Also get readings: vol = occ = speed = -1, status = 1 (status 1 is disabled)  Similar to Case 1, except status = 1, so assume this does mean disabled  We set volume, occupancy and speed to NULL in this case  Should verify the meaning of vol = occ = speed = -1, status = 1 with ODOT ** as of ; fields were different before No -999’s in data received live; there are -999s in 2001, 2002 data

Data Quality Flags Detect data inconsistencies and erroneous values Row-level/Reading-level flags  Flags are set for each (vol, occ, speed) reading we get  For each reading (row), a flag is set to indicate if that row fails a particular criteria Examples  Max Speed, set if speed > 100 mph  Max Occ, set if occ > 95% Adapted from TTI (see wiki for link)

Data Quality Flags – Single Value Flags Maximum Volume (DQ_MAXVOL)  Condition: volume > 17 (20-sec data)  Comments: Equivalent to 3060 vph; may be caused by motorcycles or lane changes for example Maximum occupancy (DQ_MAXOCC)  Condition: occupancy > 95%  Comments: detectors may be “stuck-on” Maximum speed (DQ_MAXSPD)  Condition: speed > 100 mph  Comments: based on empirical evidence Minimum speed (DQ_MINSPD)  Condition: speed < 5 mph  Comments: based on empirical evidence

Data Quality Flags – Multi-variate Consistency Multi-variate consistency #1 (DQ_MVC_S0VGT0)  Speed = 0 and Volume > 0  Indicates potential problem with: speed  Is this necessary? We already eliminate speed < 5 mph Multi-variate consistency #2 (DQ_MVC_SGT0V0)  Speed > 0 and Volume = 0  Indicates potential problem with: speed Multi-variate consistency #3 (DQ_MVC_OGT0V0)  Occupancy > 0 and Volume = 0  Note: alternative: set occupancy = 0  Indicates potential problem with: occupancy

Road Map: Aggregate Calculations 5-minute aggregates 15-min, 1-hr aggregates

Missing Values in 5-min Aggregates of Vol, Speed, Occ Raw Aggregates  Sum/avg data with no imputation, rescaling or cleaning Processed Aggregates  Two issues:  #1 Is there enough data to calculate an aggregate?  #2 If there is enough data, how to handle the fact that we are missing values?

Missing Values in 5-min Aggregates of Vol, Speed, Occ, Cont’d Issue #1: Determining if a time period has enough valid data to calculate aggregates for the time period  Aggregate vol, speed, occ set to null if less than X% of the readings are valid  Do sensitivity analysis to pick X (start with 75% - set 5-min aggregate to null if we have less than 75% of the data)

Bad Data Threshold Rob - In general, throughout the development of the aggregation process, we consistently encountered the issue of how to determine when there is enough data to create an aggregate vs. when an aggregate value should be declared null because of lack of data. For starters, we’ve chosen a threshold of 75%. If 75% or more of the 20- second readings in a 5-minute interval are valid, we calculate an aggregate for that 5-minute interval. The same issue arises in the following situations:  Calculating 5-min aggregates from 20-second data  Calculating 15-min aggregates from 5-min data  Calculating 1-hr aggregates from 15-min data  Grouped Data: grouping over multiple days (group by hour)  Grouped Data: aggregating 1-hr data to daily data for group by day of week and week of year Related issues:  Aggregating over lanes (how to handle the situation when data is missing from one lane, note that data may be missing from a lane for only a portion of a day) We wonder if studying this percentage (data sensitivity) would make a good paper topic.

#2 Handling Missing Values Two basic methods for dealing with missing values  1) Use preceding/following values (replace missing volume with preceding volume reading for that detector)  2) Aggregate values in a 5 minute range and scale based on # of readings Since we are aggregating, suggest method 2  We are aggregating, so interpolation may not be necessary  Think 2 will perform similarly to 1 in cases when only a few readings are missing, when lots of readings missing, should report that data is not available (null)  Method 2 is well-defined (simple to define and express/explain)  Method 2 is easier to implement than 1

5-min Aggregates (vol, spd, occ) Volume  For each detector, sum all valid volume readings in 5- minute period (eliminate any readings with non-zero flags) and rescale by # of missing readings Speed, Occupancy  For each detector, calculate the volume-weighted average of all valid readings in a 5-minute period  Assume 20-second readings are space-mean speed  Weight by volume (to be technically correct, we need to weight by density, but that requires estimating density)

5-Min Aggregate – Other Measures Vehicle Miles Traveled (VMT)  5-min VMT = 5-min volume * length of midpoint influence area of associated detector Vehicle Hours Traveled (VHT)  5-min VHT = 5-min volume * (length of midpoint influence area / 5-min space-mean speed) Travel Time  5-min travel time = length of midpoint influence area / (5-min space-mean speed)

5-min Aggregates – Other – Cont’d Total Delay (in veh-mins)  5-min delay = 5-min volume * (travel time – (influence area / free-flow speed))  Where: free-flow speed = min {assumed free-flow speed, actual 5-min speed}  Note: Free-flow speed is computed in this way in order to avoid negative delay Average Delay/Vehicle (in minutes)  5-min delay = Total Delay/5-min volume Should we report Average or Total Delay or both?

Summary: Which values/measures affect which calculations? 20-sec volume 20-sec speed 20-sec occupancy 5-min volume 5-min speed 5-min occupancy 5-min VMT 5-min Travel Time 5-min Delay 5-min VHT ** * May be needed, depends on method used ** Need 5-min speed to calculate free-flow speed *

Road Map: 15-min, 1-hr Aggregates Issues:  Calculated from 5-min aggregates or from 20- second data? We chose to use 5-min data  What should the threshold be for % of bad data? Chose to toss 15-min reading if we are missing more than one 5-min reading For now, toss 1 hr reading if we have less than 75% of 5-min readings (or less than 9 readings) Note the “bad data” threshold issue arises here, also see next page.

Aggregation over Larger Intervals 15-min aggregates are calculated from 5-min aggregates, set aggregate value to null if missing more than one 5- minute reading 1-hr aggregates are calculated from 5- min aggregates, set aggregate value to null if missing more than three 5- minute readings

Aggregation over Time, Cont’d Volume  Sum 5-min volumes Occupancy  Volume-weighted average of 5-min occupancies Speed  Volume-weighted average of 5-min speeds Vehicle Miles Traveled  sum of 5-min VMTs Vehicle Hours Traveled  sum of 5-min VHTs Travel Time  Volume-weighted average of 5-min travel times Delay  Total Delay: Sum of 5-min delays  Average Delay: Volume-weighted average of 5-min delays

Road Map: PORTAL Output We have described how to do the 5-min, 15-min and 1-hr aggregations For each page in PORTAL, analyze what calculations are needed and how those calculations should be done

Timeseries – Line Graphs Timeseries  Options: graph by station or lane, 7 basic measures, 5-min, 15-min, 1-hr granularity of input data  For Lane Graph These values are in aggregate table (calculation described in previous slides) Requires 5-min (15-min, 1-hr) aggregates of each value for each detector If data is missing, put blank space in the graph  For Station Graph: Requires aggregation over lanes Calculation described in subsequent slides If data is missing from one or more lanes, show blank space on graph

Timeseries – Contour Plots Contour Plot (Surface Plot?) for a Highway  Requires same calculations as for station plot on previous slide

Aggregation over Lanes Volume  Sum Occupancy  Volume-weighted average of by-lane occupancies Speed  Volume-weighted average of by-lane speeds Vehicle Miles Traveled  Sum over lanes Vehicle Hours Traveled  Sum over lanes Travel Time  Volume-weighted average of by-lane travel times Average Delay  Volume-weighted average of by-lane delays Total Delay  Sum of by-lane delays

Grouped Data – Line Plots Line Plots by Station/Detector  Group by: 5 minute (minute of day) Input is 5-minute aggregations Simple arithmetic average, min, max, and std dev of data from appropriate 5-min period Result shows variation between days  Group by: Hour of day Input is 1-hr aggregations As above, simple arithmetic average, min, max and std dev of data from appropriate 1-hr periods Result shows variation between days What to do if we are missing data? What if we are missing 50% of the values for 8:00-9:00, for example? We may be missing different %s for different hour intervals??

Grouped Data – Line Plots, Cont’d Line Plots by Station/Detector  Group by: Day of Week Input must be 1-hr aggregations Aggregate 1-hr data to daily data What if one or more hours in a day is/are missing?? (What if we are missing a peak hour?) Then simple arithmetic avg, min, max, std dev of aggregates Result shows variation within day of week  Group by: Week of Year Similar to day of week Note: “bad data threshold” issue again. Seems reasonable to rescale when going from 20-sec data to 5-min data, and again 5-min to 1-hr seems OK. But neither interpolating or rescaling seem really acceptable in this situation.

Aggregation to Day of Week Volume  Sum 1-hr volumes Occupancy  ?? Volume-weighted average of 5-min occupancies Speed  (  1-hr VMTs)/(  1-hr VHTs)  ensure for VMT and VHT - either are both null or neither are null Vehicle Miles Traveled   1-hr VMTs  Rescale??? Vehicle Hours Traveled   1-hr VHTs Travel Time  Volume-weighted average of 5-min travel times Delay  Total Delay: Sum of 5-min delays  Average Delay: Volume-weighted average of 5-min delays

WARNING: STOP HERE Following slides are under construction (hard hats required)

Grouped Data – Highway Plots Highway Plots  What does this do anyway???  Group by: 5 minute (minute of day) Requires input to be 5-minute aggregate data Need to group 5-minute periods from different days together Use appropriate aggregation-by-time method  Group by: Hour of day Input can be 5-min, 15-min or 1-hr For 1hr input need to group 1-hr periods from different days together  Use appropriate aggregation-by-time method 5-min, 15-min input – Should we restrict this to 1-hr input??  Just use appropriate aggregation-by-time method??

Grouped Data – Highway Plots, Cont’d Highway Plots  Group by: Day of Week Input is 5-min, 15-min, 1-hr Need to group 5-min data into 5-min groups, 15- min data into 15-min groups, etc. Use appropriate aggregation by time method??  Group by: Week of Year Same as Day of Week, I think

Grouped Data – All Highways All Highways Plots

Performance Reports MONTHLY FIGURES Avg TT Avg TS Avg VS FFTT FFTS 95th% TT 95th% TS Mins MPH MPH Mins MPH Mins MPH VMT VHT % Cngst Hwy Len Days Used 211,298,304 4,118, % Miles 31 WEATHER NUMBER OF DAYS Rain ≥ 0.5" Rain ≥ 1.0" Frzn Precip Fog TIME WINDOW FIGURES (primarily for daily report – BI, TTI are new) Avg TT Avg TS 95th% TT 95th% TS Mins MPH Mins MPH VMT VHT % Cngst TTI BI 211,298,304 4,118, %

Future Flags Calibration Error (not in this presentation, see paper) Multi-variate consistency  Volume > n max when Occupancy = 0  This should be tested after correcting for calibration error Maximum reduction in speed  Speed n+1 > 0.45*Speed n Identical records  If more than 8 consecutive records (volume, occupancy, and speed) are identical

Extra Slides

Questions No vehicles present Speed = 0 when Volume = 0Set Speed = null Am not doing this because of the -1, -1, -1, and 0, 0, 0 tests At what point do we call a 5-minute interval invalid?? 5 missing readings? 8? 9? 10? Aggregation over space – we need something for vol and Aggregation over space – need to deal with missing data Need aggregation over time (i.e. combining data from different days together)

ODOT SWARM Flags 0 - inhibited 1 - disabled 2 - ok 3 - suspect 4 - softfail 5 - hardfail

Data Notes Status 0 Received Data As of Feb 14, 2005, ODOT reports vol -1, occ -1, speed -1, status 0, for detectors with no controller communications (no data received from detector), and vol 0, occ 0, speed 0, status 0 for detectors reporting no traffic data (no activity) (info from David Gaarsoe). See Data Fidelity and minus 1 and minus 999 for more info.Data Fidelityminus 1 and minus 999 Data Cleaning The nightly scripts "clean" the -1 and -999 data as follows:  Sets volume = 0, speed = NULL, occupancy = 0 where volume = -999  Sets volume = NULL, speed = NULL, occupancy = NULL, where volume = -1  I don’t belive we receive -999 data anymore (KT) Unknown  We seem to also get vol = 0, speed = 0, status = 2 (or other status), with occ > 0. What does this mean?  What was that weird thing I noted in the file on badger?

Steve H’s notes from s from ODOT -999 When we received blocks of data from 2001 and 2002, the 20-second data included records where volume=-999. This indicated that no cars passed. As of July 14, 2004 when PORTAL went live, we no longer received volume= are no longer an issue. -1 From July 14, 2004 to November 17, 2004, we were receiving records with volume=-1. In all cases, when volume=-1, the SWARM algorithm variable status=2, which is the code for "OK." As I understand it, this is supposed to represent a communication error. From November 18, 2004 to February 13, 2005, we did not receive any records where volume = -1. This is due to "Transport ATMS build 2.1, distributed on Nov 17, included many internal changes to traffic sensor and detector station data, (including station level data smoothing). Among other changes it started sending a 0 detector volume during Communications Failure with the Wapiti controllers, as noted by Robert Bertini." -David Gaarsoe During this time period, it is impossible to distinguish between communication errors and periods where no cars pass. Beginning on February 14, 2005, it appears that Transport ATMS build 2.3 was released. "Transport ATMS build 2.3 will... once again report a -1 vol, -1 occupancy, and -1 speed, Inhibited status (0) for Traffic Detectors with no controller communications, and a 0 vol, 0 occupancy, 0 speed, Inhibited status for detectors of reporting controllers with no traffic data (no activity)." -David Gaarsoe It appears that, beginning on February 14, 2005 we are now able to accurately distinguish between communication errors and periods where no vehicles pass a detector.

Data Quality Flags - Implementation Queries or scripts analyze the data and set flags when criteria are met We will be able to tell which data values failed which criteria

Aggregation over Lanes/Corridors Volume  Over lanes: Sum  (average volume over detector stations doesn’t seem useful) Occupancy  Over lanes: Weighted average of by-lane occupancies by volumes  (average occupancy over detector stations doesn’t seem useful) Speed  Over lanes: Weighted average of by-lane speeds by volumes  Over detector stations: Weighted average of speeds by volumes (?)

Aggregation over Space, Cont’d Vehicle Miles Traveled  Sum over lanes or loop detector stations Vehicle Hours Traveled  Sum over lanes or loop detector stations Travel Time  Over lanes: Weighted average of travel time by volumes  Over loop detector stations: Sum of segment-wide travel times Delay  Over lanes: Weighted average of travel time by volumes  Over loop detector stations: Sum of segment-wide travel times