TY - JOUR
T1 - Earning and Learning with Varying Cost
AU - Zhong, Ying
AU - Hong, L. Jeff
AU - Liu, Guangwu
PY - 2021/8
Y1 - 2021/8
N2 - We study a dynamic pricing problem where the observed cost in each selling period varies from period to period, and the demand function is unknown and only depends on the price. The decision maker needs to select a price from a menu of K prices in each period to maximize the expected cumulative profit. Motivated by the classical upper confidence bound (UCB) algorithm for the multi-armed bandit problem, we propose a UCB-like policy to select the price. When the cost is a continuous random variable, as the cost varies, the profit of the optimal price can be arbitrarily close to that of the second-best price, making it very difficult to make the correct decision. In this situation, we show that the expected cumulative regret of our policy grows in the order of (log T)^2, where T is the number of selling periods. When the cost takes discrete values from a finite set and all prices are optimal for some costs, we show that the expected cumulative regret is upper bounded by a constant for any T. This result suggests that in this situation, a suboptimal price is selected in only a finite number of periods; beyond a certain period, the trade-off between earning and learning vanishes and learning is no longer necessary.
KW - earning and learning
KW - multi-armed bandit
KW - regret analysis
KW - upper confidence bound
KW - varying cost
UR - http://www.scopus.com/inward/record.url?scp=85102261394&partnerID=8YFLogxK
U2 - 10.1111/poms.13380
DO - 10.1111/poms.13380
M3 - 21_Publication in refereed journal
VL - 30
SP - 2379
EP - 2394
JO - Production and Operations Management
JF - Production and Operations Management
SN - 1059-1478
IS - 8
ER -