Performance Guarantee of an Approximate Dynamic Programming Policy for Robotic Surveillance
Abstract: This paper is focused on the development and analysis of suboptimal decision algorithms for a collection of robots that assist a remotely located operator in perimeter surveillance. The operator is tasked with the classification of incursions across the perimeter. Whenever there is an incursion into the perimeter, an unattended ground sensor (UGS) in the vicinity signals an alert. A robot services the alert by visiting the alert location, collecting information, e.g., photo and video imagery, and transmitting it to the operator. The accuracy of the operator's classification depends on the volume and freshness of the information gathered and provided by the robots at the locations where incursions occur. There are two competing objectives for a robot: it needs to spend adequate time at an alert location to collect evidence that aids the operator in accurate classification, but it also needs to service other alerts as soon as possible, so that the evidence collected there remains relevant. The decision problem is to determine the optimal amount of time a robot must spend servicing an alert. The incursions are stochastic and their statistics are assumed to be known. This problem can be posed as a Markov Decision Problem. However, even for two robots and five UGS locations, the number of states is of the order of billions, rendering exact dynamic programming methods intractable. Approximate dynamic programming (ADP) via linear programming (LP) provides a way to approximate the value function and derive suboptimal strategies. The novel feature of this paper is the derivation of a tractable lower bound via LP and the construction of a suboptimal policy whose performance improves upon the lower bound. An illustrative perimeter surveillance example corroborates the results derived in this paper.

Note to Practitioners: In practice, one often encounters the curse of dimensionality in the application of dynamic programming to determine optimal policies for controlled Markov chains. This is true, in particular, for dynamic scheduling problems involving multiple robots/servers and queues of tasks that arrive in a stochastic fashion. The computation of the value function, which is critical to the determination of optimal policies, is practically infeasible. Hence, one must settle for suboptimal policies. Two natural questions arise: (1) How does one construct a suboptimal policy? (2) How “good” is the constructed suboptimal policy? A common strategy to tackle the first question is to approximate the value function and construct a suboptimal policy that is greedy with respect to the approximate value function. Typically, an approximate value function is constructed via a choice of basis functions. How to choose the basis functions systematically for an arbitrary problem is a difficult question; usually, the structure of the problem at hand is exploited in their construction. The same approach is taken here: the state space is partitioned based on the reward structure, and the optimal cost-to-go (value function) is approximated by a constant over each partition. The second question is related to the first in the sense that one needs to construct bounds on the performance of a suboptimal policy. In this paper, we construct upper and lower bounds for the value function (the optimal performance) and use the lower bound as the approximate value function. Furthermore, we show that the resulting suboptimal policy comes with a performance guarantee, in that it improves upon the lower bound from which it was derived. The literature is replete with techniques for computing upper bounds; however, there is little work on lower bounds, which are also required for bounding the suboptimality of the policy. One encounters a prohibitively large number of constraints when computing an upper bound and must deal with disjunctive linear inequalities when computing a lower bound. The problem structure is exploited here to circumvent these difficulties.
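As a concrete point of reference for the general approach described above, the following is a minimal sketch of ADP via LP with a piecewise-constant (partition-indicator) basis, followed by extraction of a one-step greedy policy. It is not the algorithm of this paper: it shows only the standard approximate LP, which yields an upper bound on the value function, and does not reproduce the paper's tractable lower bound or its treatment of disjunctive linear inequalities. The MDP data, the discount factor, the state-space partition, and the use of scipy.optimize.linprog are all illustrative assumptions.

```python
# Minimal sketch (assumed data, not the paper's algorithm): approximate dynamic
# programming via linear programming with a piecewise-constant value-function
# approximation, followed by a greedy (one-step lookahead) policy.

import numpy as np
from scipy.optimize import linprog

# --- Small illustrative MDP with random data as a stand-in for the surveillance model ---
rng = np.random.default_rng(0)
n_s, n_a, gamma = 12, 3, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s'] transition probabilities
r = rng.uniform(0.0, 1.0, size=(n_s, n_a))         # r[s, a] one-step rewards

# --- Piecewise-constant basis: one indicator per block of an (assumed) state partition ---
partition = np.array([s % 4 for s in range(n_s)])  # 4 blocks, chosen arbitrarily here
n_b = partition.max() + 1
Phi = np.zeros((n_s, n_b))
Phi[np.arange(n_s), partition] = 1.0               # Phi @ w is constant on each block

# --- Approximate LP (any feasible Phi @ w upper-bounds the optimal value function):
#     minimize c^T (Phi w)  s.t.  (Phi w)(s) >= r(s,a) + gamma * P(.|s,a)^T (Phi w), all (s,a)
c_state = np.ones(n_s) / n_s                       # state-relevance weights
A_ub, b_ub = [], []
for s in range(n_s):
    for a in range(n_a):
        A_ub.append(-(Phi[s] - gamma * P[s, a] @ Phi))   # rewritten as A_ub w <= b_ub
        b_ub.append(-r[s, a])
res = linprog(c=Phi.T @ c_state, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_b)
J_approx = Phi @ res.x                             # piecewise-constant approximation of J*

# --- Suboptimal policy: greedy with respect to the approximate value function ---
Q = r + gamma * np.einsum('sat,t->sa', P, J_approx)
policy = Q.argmax(axis=1)
print("approximate values per state:", np.round(J_approx, 3))
print("greedy policy:", policy)
```

The constraint count here is already |S| x |A|, which illustrates why, at the scale quoted in the abstract, the exact and even the constraint-sampled LPs become the bottleneck, and why exploiting problem structure, as this paper does for its lower bound, matters.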