The Heterogeneity and Endogeneity Issues in the Powered Two- and Three-Wheeled Vehicle (PTW) Crash-Related Injury Severity Analysis


Student thesis: Doctoral Thesis




Awarding Institution
Award date: 8 Dec 2021


Powered Two- and Three-Wheeled Vehicles (PTWs) are motor-operated two- and three-wheeled vehicles, powered by either a combustion engine or rechargeable batteries. The main categories of PTWs covered in this thesis are motorcycles (which include mopeds) and electric bikes (e-bikes). The PTW fleet is growing rapidly in most parts of the world because PTWs are affordable, offer personal mobility and flexible trips, and save time in congested traffic. PTW users are the most vulnerable road users because they have the least protection on the road. The growing PTW fleet has brought great traffic safety challenges around the world.

To better understand road crash mechanisms, various models have been developed to identify the risk factors of road traffic crash injuries based on crash history data. The unobserved heterogeneity issue resulting from omitted variables can play a critical role in road traffic crash data analysis. If the unobserved factors are correlated with the observed factors, parameter estimates are likely to be biased, resulting in incorrect inferences. Over the last decade, a wide variety of statistical approaches have been developed and employed to accommodate unobserved heterogeneity in crash severity analysis. In addition to unobserved heterogeneity, endogeneity is another important issue in traffic safety analysis. Endogeneity violates the assumption of basic regression models that regressors and disturbances are uncorrelated, and ignoring it could result in erroneous conclusions and inferences. Although the existence of the endogeneity issue has been pointed out by a few studies, it has not received enough attention in traffic safety analysis.
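The bias described above can be illustrated with a small simulation. This is a minimal sketch under assumed values, not an analysis from the thesis: the variable names, the true coefficient of 1.0, and the data-generating process are all hypothetical. An unobserved factor `u` drives both the observed regressor `x` and the outcome `y`, so a regression of `y` on `x` alone overstates the effect of `x`.

```python
import random

def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x

def simulate(n=20000, true_beta=1.0, seed=0):
    """Hypothetical data: u is unobserved and affects both x and y."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                      # omitted factor (assumed)
        x = u + rng.gauss(0, 1)                  # observed regressor, correlated with u
        y = true_beta * x + u + rng.gauss(0, 1)  # u ends up in the disturbance
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)

biased_slope = simulate()
```

In this setup the estimated slope settles near 1.5 rather than the true 1.0, because the term cov(x, u) / var(x) = 0.5 is absorbed into the coefficient of `x`.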

This thesis, comprising five studies, aims to address the unobserved heterogeneity and endogeneity issues in the PTW crash-related injury severity analysis and investigate the influential factors of the injury severity sustained by the PTW users involved in the crash.

In the first study, a latent class clustering approach integrated with a random parameters binary logit model (LCRBL) was compared to the traditional random parameters binary logit model (RBL) in terms of injury severity prediction and the ability to reveal the influential factors of motorcycle rider injury levels. The empirical analysis was based on 23,881 motorcycle crashes in Hunan, China. The goodness-of-fit comparison confirms that applying the latent class clustering approach as a preliminary tool to segment the whole dataset into meaningful subsets before conducting the rider injury severity analysis improves the model's predictive accuracy. Comparing the pooled and cluster-based model results suggests several important findings: 1) clustering can help reveal new information, including important contributing factors in subgroups that might be ignored in the pooled model, factors whose magnitudes of influence differ across cluster models, and factors showing opposite effects in the clustered samples; 2) the different contributing variables found in individual clusters and in the whole dataset indicate that some factors are influential only under specific conditions, such as no violation committed by riders, turning right prior to the crash, and single-vehicle crashes; 3) clustering has great potential for reducing the heterogeneity of crash data and explaining its sources.

In the second study, a latent segmentation random parameters ordered logit model (LSROL) was compared to the latent class clustering ordered logit (LCROL) model based on the motorcycle crash data of Queensland, Australia, from the years 2012 through 2016. The latent class clustering- and latent segmentation-based models are employed to account for heterogeneity across different groups. Further, the random parameter variants of these modeling frameworks are employed to consider heterogeneity within the group. Both of these approaches have recently gained significant attention in road safety literature. However, the similarities and differences between these two methods are seldom explained and investigated. This study thus proposes to compare the performance of latent class clustering and latent segmentation based random parameter models in examining crash injury severity outcomes. To accommodate the ordinal nature of injury severity levels, these models have been estimated based on an ordered logit modeling framework. For examining crash injury severity outcomes, this is the first study to consider the random parameter variant of ordered modeling structure within a latent segmentation modeling scheme. The comparison exercise is also augmented by estimating aggregate level elasticity effects of exogenous variables. The comparison exercise clearly highlights the latent segmentation approach's superiority in examining injury severity compared to the latent class clustering-based modeling approach. Moreover, both frameworks' random parameter variants performed better than their fixed-parameter counterparts, which highlights the need to account for both across- and within-group heterogeneity.
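As a rough illustration of the ordered logit structure and of how a latent segmentation scheme mixes segment-specific outcomes, the sketch below computes severity-level probabilities. It is a generic textbook formulation, not the thesis's estimated model: the threshold values, latent propensities, and segment shares are hypothetical.

```python
import math

def logistic_cdf(z):
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(xb, thresholds):
    """Probability of each ordinal level, given a latent propensity xb
    and strictly increasing thresholds (cut points)."""
    cdf = [logistic_cdf(t - xb) for t in thresholds]
    bounds = [0.0] + cdf + [1.0]
    return [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]

def latent_segment_probs(shares, segment_probs):
    """Mix segment-specific probabilities using segment membership shares."""
    levels = len(segment_probs[0])
    return [sum(s * p[k] for s, p in zip(shares, segment_probs))
            for k in range(levels)]

thresholds = [0.0, 1.5]  # hypothetical cut points, giving 3 severity levels
low_risk = ordered_logit_probs(-1.0, thresholds)   # assumed low-propensity segment
high_risk = ordered_logit_probs(1.0, thresholds)   # assumed high-propensity segment
mixed = latent_segment_probs([0.6, 0.4], [low_risk, high_risk])
```

A higher latent propensity shifts probability mass toward the most severe level, and the mixed probabilities are a share-weighted average of the two segments.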

In the third study, the endogeneity issue in motorcycle crash severity analysis was examined and discussed. Driver fault status is recorded as one of the important factors directly influencing driver injury severity. In previous studies, its effects were evaluated without considering its potential endogeneity with respect to injury severity. That is, intrinsically unsafe riders may tend to be at fault and also be more likely to be involved in severe crashes due to unobserved factors. However, this endogeneity issue and its influence on model estimation have seldom been investigated. To fill this research gap, a Hierarchical Bayesian Simultaneous model with a Recursive Structure (HBS-RS) was developed and estimated using 5,296 two/three-vehicle crashes between motorcycles and motor vehicles in Queensland during the period 2011 to 2018. The model results confirmed 1) the endogeneity between fault status and injury severity, through the significantly positive error-correlation coefficient (0.130); 2) the inflated parameter estimate when endogeneity is not considered (0.220 vs 0.116); 3) the indirect effects of exogenous variables on injury severity (e.g. the parameter for intersection-cross was underestimated by 16% in the non-recursive structure model); and 4) the heterogeneity in the analysis (e.g. the random parameters for age over 59 and traffic give-way/stop sign).

In the fourth study, a classification and regression trees (CART) approach was applied to identify high-risk scenarios in which motorcycle riders are more likely to sustain severe injuries. CART is one of the most commonly applied data mining techniques and can address the unobserved heterogeneity and endogeneity issues by avoiding the assumptions imposed by statistical methods. CART is a white-box model that can display its results graphically in a way that is easy to interpret. CART is also able to capture non-additive behaviors, allowing it to highlight sophisticated relationships that are difficult to reveal otherwise. In addition, correlations among explanatory variables and outliers are not problematic in the CART approach. The motorcycle high-risk scenarios were identified using a comprehensive dataset of 4,587 police-reported crashes involving motorcycles during 2015–2017 in Hunan province, China. The findings are expected to shed more light on the mechanism of motorcycle crash injuries.
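The core step of a classification tree, choosing the split that leaves the purest child nodes, can be sketched as follows. This is a minimal illustration using Gini impurity on binary features, assuming a toy dataset. The feature names `night` and `curve` and the six records are hypothetical and are not drawn from the Hunan data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_binary_split(rows, labels):
    """Return (feature, weighted_impurity) for the binary feature
    whose split yields the lowest weighted child impurity."""
    best = None
    n = len(labels)
    for feature in rows[0]:
        left = [y for r, y in zip(rows, labels) if r[feature] == 0]
        right = [y for r, y in zip(rows, labels) if r[feature] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (feature, score)
    return best

# Hypothetical toy crashes; label 1 = severe injury, 0 = not severe.
rows = [
    {"night": 1, "curve": 1}, {"night": 1, "curve": 0},
    {"night": 1, "curve": 1}, {"night": 0, "curve": 1},
    {"night": 0, "curve": 0}, {"night": 0, "curve": 0},
]
labels = [1, 1, 1, 0, 0, 0]
feature, impurity = best_binary_split(rows, labels)
```

A full tree repeats this step recursively within each child node, which is how combinations of conditions (scenarios) emerge without assuming an additive functional form.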

In the fifth study, a random parameters generalized ordered probit model with heterogeneity in means (RGOP-HM) was proposed to investigate the contributing factors of electric bike (e-bike) crash severity. The proposed model can account for the ordinal nature of crash severity, accommodate heterogeneity, relax the assumption of fixed means of random parameters, and relax the fixed threshold assumption. In addition, the proposed model can account for part of the source of the unobserved heterogeneity by revealing the factors that influence the means of the random parameters. The empirical analyses were based on 2,222 police-reported e-bike crashes in Hunan province, China, from 2014 to 2016. The DIC values underscored the superiority of the proposed model over the comparative models, indicating the importance of relaxing the limitations of traditional ordinal probability methods. The results of the proposed model yield several findings: 1) collision with heavy motor vehicles, rider over 59, and not-at-fault produced random effects on e-bike riders' injury severity; 2) lighting conditions (dim light and darkness-unlighted) and rainy weather affect the means of the random parameters estimated for not-at-fault and collision with heavy motor vehicles, respectively; 3) the factors increasing injury severity include horizontal curves, high posted speed limits, single-vehicle crashes, rider age over 44 (45-59 and above 59), and rural residence.
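The idea of heterogeneity in means can be sketched with a generic formulation in which an observation-specific coefficient is drawn as beta_i = base_mean + mean_shift * z_i + std_dev * eta_i, with eta_i standard normal. The code below is an illustrative simulation only: the numeric values and the indicator `z` (which could stand for, say, a dim-light crash) are hypothetical and are not estimates from the thesis.

```python
import random

def draw_random_parameter(base_mean, mean_shift, z, std_dev, rng):
    """One draw of beta_i = base_mean + mean_shift * z + std_dev * eta,
    where eta is a standard normal deviate."""
    return base_mean + mean_shift * z + std_dev * rng.gauss(0, 1)

rng = random.Random(1)
# Hypothetical values: the mean of the random parameter shifts by 0.3 when z = 1.
draws_z0 = [draw_random_parameter(0.5, 0.3, 0, 0.4, rng) for _ in range(20000)]
draws_z1 = [draw_random_parameter(0.5, 0.3, 1, 0.4, rng) for _ in range(20000)]
mean_z0 = sum(draws_z0) / len(draws_z0)
mean_z1 = sum(draws_z1) / len(draws_z1)
```

With a fixed-mean random parameter both groups would share one mean. Here the two groups center on different values, which is how such a model attributes part of the unobserved heterogeneity to observed factors.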

This work makes both methodological and empirical contributions. The methods developed or applied in each study add to the literature on addressing the heterogeneity and endogeneity issues in crash-related injury severity analysis. As for the empirical contribution, all studies focus on the analysis of PTW crash-related injury severity, and the findings provide evidence-based references for traffic engineers and policymakers to develop effective interventions aimed at improving PTW traffic safety.

    Research areas

  • PTW crash, Road safety, Injury severity analysis, Heterogeneity issue, Endogeneity issue