Data Science Application in Intelligent Transportation Systems: An Integrative Approach for Border Delay Prediction and Traffic Accident Analysis

PhD Student: Lei Lin

Advisors: Qian Wang and Adel Sadek

With the great progress in information and communications technologies over the past few decades, intelligent transportation systems (ITS) have accumulated vast amounts of data on the movement of people and goods from one location to another. Besides traditional fixed sensors and GPS devices, emerging data sources and approaches such as social media and crowdsourcing can be used to extract travel-related data, especially given the wide popularity of mobile devices such as smartphones and tablets and their associated apps. To take advantage of all these data and to address the associated challenges, big data techniques and the emerging field of data science are receiving increasing attention. Data science employs techniques and theories from many fields, such as statistics, machine learning, data mining, analytical modeling and computer programming, to solve data analysis tasks. It is therefore timely and important to explore how data science may best be employed for transportation data analysis. In this doctoral study, an integrative approach is proposed for data science applications in ITS. The proposed approach consists of either an integration of multiple steps in the data analysis process or an integration of different models to build a more powerful one. The integrative approach is applied and tested on two case studies: border crossing delay prediction and traffic accident data analysis.

For the first case study, a two-step border crossing delay prediction model is proposed, consisting of a short-term traffic volume prediction model and a multi-server queueing model. As such, it can be seen as an integration of data-driven and analytical models. For the first step, the short-term traffic volume prediction model, an integration of a data "width" decreasing (i.e., data grouping) step and a model development step is applied. For model development, a Seasonal Autoregressive Integrated Moving Average (SARIMA) model and Support Vector Regression (SVR) are combined to achieve better performance than either model alone. In addition, the spinning network (SPN) forecasting paradigm is enhanced for border crossing traffic prediction through the use of a dynamic time warping (DTW) similarity metric. The DTW-SPN is shown to offer several advantages, including computational efficiency and accuracy, as demonstrated by a promising Mean Absolute Percent Error (MAPE) compared to SARIMA and SVR.
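To make the two measures named above concrete, the following is a minimal sketch of a textbook DTW distance and of MAPE. It is an illustration under stated assumptions, not the DTW-SPN implementation from the dissertation: the function names `dtw_distance` and `mape`, the absolute-difference local cost, and the unconstrained warping window are all choices made here for illustration.

```python
import math


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Assumption: the local cost is the absolute difference and no warping
    window is imposed. Runs in O(len(a) * len(b)) time.
    """
    n, m = len(a), len(b)
    # d[i][j] holds the cheapest cost of aligning a[:i] with b[:j].
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def mape(actual, predicted):
    """Mean Absolute Percent Error, in percent. Actual values must be nonzero."""
    errors = [abs((y - f) / y) for y, f in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)
```

Because DTW allows one point to align with several, two volume profiles that differ only by a local stretch in time (for example `[1, 2, 3]` and `[1, 1, 2, 3]`) have a distance of zero, whereas a point-by-point Euclidean comparison would not even be defined for sequences of different length.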

This dissertation also proposes the introduction of a data diagnosis step before short-term traffic prediction. To develop a methodology for model selection guidance, the author calculated statistical measures of nonlinearity and complexity for multiple datasets and correlated them with the performance of multiple models: SARIMA, SVR and k nearest neighbor (k-NN). The data diagnosis results yield useful insights into parameter setting and model selection.

For the second step, namely the queueing model development, heuristic solutions are presented for two types of queueing models, M/E_k/n and BMAP/PH/n. These models take the predicted traffic volume as input and use it to calculate future waiting time. The analytical results are compared to the results of a VISSIM simulation model and shown to be comparable. Finally, an Android smartphone app, which utilizes the two-step border delay prediction methodology described above, is developed to collect, share and predict waiting times at the three Niagara Frontier border crossings.
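The idea of turning a predicted traffic volume into a waiting time can be sketched with a much simpler multi-server queue. The example below uses the standard Erlang C formula for an M/M/n queue (Poisson arrivals, exponential service times, n identical servers) as a simplified stand-in. It is not the M/E_k/n or BMAP/PH/n heuristic from the dissertation, and the function name `mmn_mean_wait` and its parameters are assumptions made for illustration.

```python
import math


def mmn_mean_wait(arrival_rate, service_rate, servers):
    """Mean steady-state waiting time in queue for an M/M/n system (Erlang C).

    Assumption: this is a simplified stand-in, not the M/E_k/n or BMAP/PH/n
    models named in the text. Rates share one time unit (e.g. vehicles per
    minute); the result is in that same time unit.
    """
    offered_load = arrival_rate / service_rate
    utilization = offered_load / servers
    if utilization >= 1.0:
        raise ValueError("queue is unstable: arrival rate exceeds capacity")
    # Erlang C: probability that an arriving vehicle has to wait.
    head = sum(offered_load ** k / math.factorial(k) for k in range(servers))
    tail = offered_load ** servers / (math.factorial(servers) * (1.0 - utilization))
    prob_wait = tail / (head + tail)
    return prob_wait / (servers * service_rate - arrival_rate)
```

In a two-step setting, `arrival_rate` would come from the short-term volume forecast and `servers` from the number of open inspection booths. The sketch assumes steady state, which a real border crossing with time-varying demand would not satisfy.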

For the second case study, involving traffic accident data analysis, an integration of a data "depth" decreasing step and a model development step is applied first. To do this, the modularity-optimizing community detection algorithm is used to cluster the dataset, and for each cluster, the association rule algorithm is applied to yield insight into traffic accident hotspots and incident clearance time. The results show that more meaningful association rules can be derived when the data are clustered than when the whole dataset is used directly. Secondly, an integration of a data "width" decreasing step (variable selection) and a model development step is applied for real-time traffic accident risk prediction. For this, a novel variable selection method based on the Frequent Pattern tree (FP tree) algorithm is proposed and tested before applying the Bayesian network and k-NN algorithms. The experiment shows that the models based on variables selected by the FP tree always performed better than those using variables selected by the random forest method. Lastly, an integration of a data mining model, the M5P tree, and a statistical method, the hazard-based duration model (HBDM), is applied to traffic accident duration prediction. The M5P-HBDM method is shown to be capable of identifying more meaningful factors that impact traffic accident duration, and to have better prediction performance, than either M5P or HBDM alone.
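The association rules mentioned above are typically judged by three standard measures: support, confidence and lift. The sketch below computes them for a single rule over a handful of made-up accident records. The record contents, the item names and the function name `rule_metrics` are hypothetical and are not taken from the dissertation's data.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each record is a set of items; a rule "fires" on a record when the record
    contains every item of the antecedent and of the consequent.
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    total = len(records)
    n_ante = sum(1 for r in records if antecedent <= r)
    n_cons = sum(1 for r in records if consequent <= r)
    n_both = sum(1 for r in records if (antecedent | consequent) <= r)
    support = n_both / total
    confidence = n_both / n_ante if n_ante else 0.0
    lift = confidence / (n_cons / total) if n_cons else 0.0
    return support, confidence, lift


# Hypothetical toy records, for illustration only.
toy_records = [
    {"rain", "night", "long_clearance"},
    {"rain", "long_clearance"},
    {"clear", "short_clearance"},
    {"rain", "short_clearance"},
]
support, confidence, lift = rule_metrics(toy_records, {"rain"}, {"long_clearance"})
```

A lift above 1 indicates that the consequent occurs more often with the antecedent than it does overall. Running the same computation separately within each cluster, rather than on the pooled records, is one way to see how clustering can surface rules that the whole dataset dilutes.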

The two case studies considered in this dissertation serve to illustrate the advantages of an integrative data science approach to analyzing transportation data. With this approach, invaluable insight is gained that can help solve transportation problems and guide public policy.