XBART: A Novel Tree-Based Machine Learning Framework for Regression, Classification and Treatment Effect Estimation

Project: Research


Description

In the era of big data, the sheer size and variety of available data have become the new norm in many fields of science, enabling researchers to deploy more sophisticated machine learning algorithms. Tree ensemble methods are among the most successful machine learning models, as reported in many studies. Bayesian additive regression trees (BART) is the state-of-the-art tree ensemble algorithm in terms of prediction accuracy and is robust to the choice of hyper-parameters. It is well known that tree-based algorithms are prone to overfitting, so regularization of the trees is critical for good performance. We conjecture that BART's clever regularization is behind its remarkable performance. In addition, BART comes with a natural measure of statistical uncertainty that most other machine learning methods lack. Despite these virtues, BART is underutilized by data scientists. One of the main obstacles to wider adoption is that BART fits the model by a random-walk Metropolis-Hastings algorithm, which does not scale to data sets with many observations or predictors. In contrast, alternative tree-based methods such as random forests or gradient boosting fit trees with a fast recursive partitioning algorithm.

Inspired by these observations, we ask: can BART's regularization be combined with a recursive partitioning algorithm? The proposed research will develop a new algorithm, accelerated Bayesian additive regression trees (XBART), that combines regularization and stochastic search strategies from Bayesian modeling with computationally efficient techniques from recursive partitioning algorithms.

Our plan includes developing the algorithm for both regression and classification problems, and applying it to treatment effect estimation. We also plan to explore the theoretical aspects of the algorithm, seeking to understand how it works from the perspective of consistency theory or Bayesian posterior concentration.

Furthermore, we believe that a machine learning algorithm cannot be useful unless an efficient implementation is available to the public. We will develop optimized, efficient software packages for R, Python and Julia, and release them as open source for broader use.
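To make the core idea concrete, the sketch below shows how a single tree could be grown by recursive partitioning while keeping a Bayesian flavour: at each node, every candidate cutpoint (plus a "no split" option) is weighted by the integrated likelihood of the resulting partition under a conjugate normal model, and one option is sampled in proportion to those weights. This is only an illustrative sketch under simplifying assumptions (a single predictor, known noise variance sigma2, a N(0, tau) prior on leaf means); it is not the actual XBART implementation, and the names grow_from_root, log_marginal, tau and sigma2 are hypothetical.

import numpy as np

def log_marginal(y, tau=1.0, sigma2=1.0):
    # Log marginal likelihood of y under a normal mean model with a
    # conjugate N(0, tau) prior on the mean and known noise variance sigma2.
    # Terms shared by all partitions of the same data are dropped.
    n = len(y)
    if n == 0:
        return 0.0
    s = y.sum()
    return 0.5 * (np.log(sigma2 / (sigma2 + tau * n))
                  + tau * s**2 / (sigma2 * (sigma2 + tau * n)))

def grow_from_root(x, y, depth=0, max_depth=4, rng=None):
    # Recursively sample splits, weighting the "no split" option and each
    # candidate cutpoint by its integrated likelihood -- a stochastic,
    # Bayesian analogue of the greedy CART split rule.
    rng = rng or np.random.default_rng()
    cutpoints = np.unique(x)[:-1]          # exclude the max so both sides are nonempty
    if depth >= max_depth or len(cutpoints) == 0:
        return {"leaf": y.mean() if len(y) else 0.0}
    log_w = [log_marginal(y)]              # weight of stopping at this node
    for c in cutpoints:
        log_w.append(log_marginal(y[x <= c]) + log_marginal(y[x > c]))
    log_w = np.array(log_w)
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    choice = rng.choice(len(probs), p=probs)
    if choice == 0:                        # sampled the "no split" option
        return {"leaf": y.mean()}
    c = cutpoints[choice - 1]
    return {"cut": c,
            "left": grow_from_root(x[x <= c], y[x <= c], depth + 1, max_depth, rng),
            "right": grow_from_root(x[x > c], y[x > c], depth + 1, max_depth, rng)}

Sampling a cutpoint in proportion to its marginal likelihood, rather than always taking the best one, is what distinguishes this stochastic Bayesian search from the greedy rule used by random forests and gradient boosting, while the recursive pass over the data keeps the computation fast.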

Detail(s)

Project number: 9048224
Grant type: ECS
Status: Active
Effective start/end date: 1/01/22 → …