Senior Researcher, Google DeepMind
Learning Safe Policies in Sequential Decision-Making
Location: Halls department, Hall 3
Date and Time: Wednesday, 27 December 2017
14:30 - 15:30
Abstract: In many practical problems, from online advertisement to health informatics and computational finance, it is often important to be able to guarantee that the policy or strategy generated by our algorithm performs at least as well as a baseline. Such a guarantee reduces the risk of deploying the policy and helps convince the product (hospital, investment) manager that it is not going to harm the business.
In this talk, we first discuss four different approaches to this fundamental problem that we have studied over the last four years, which we call model-based, model-free, online, and risk-sensitive. We then focus on the model-based and online approaches, which are related to the important problems of robust learning and control, and of safe exploration. In the model-based approach, we first use a batch of logged data to build a simulator that mimics the behavior of the dynamical system under study (an online advertisement platform, a hospital's ER, a financial market), and then use this simulator to generate data and learn a policy. The main challenge here is to guarantee the performance of the learned policy given the error in the simulator. In the online approach, the goal is to control the algorithm's exploration so that, at no point during its execution, the loss of using it instead of the baseline strategy exceeds a given margin. We present algorithms based on these approaches and demonstrate their usefulness in real-world applications such as personalized ad recommendation, energy arbitrage, traffic signal control, and American option pricing.
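To make the model-based recipe concrete, the following is a minimal sketch in Python, assuming a small discrete MDP and a hypothetical batch of logged transitions: it estimates transitions and rewards from the batch (the "simulator") and then plans in the estimated model with value iteration. All sizes, constants, and the random batch are illustrative; the sketch shows the pipeline only, not the performance guarantees discussed in the talk.

import numpy as np

n_states, n_actions, gamma = 5, 2, 0.95
rng = np.random.default_rng(0)

# Hypothetical logged batch of (state, action, reward, next_state) tuples,
# standing in for data collected under the deployed baseline policy.
batch = [(rng.integers(n_states), rng.integers(n_actions),
          float(rng.normal()), rng.integers(n_states))
         for _ in range(10_000)]

# Step 1: build the "simulator" -- empirical transition and reward estimates.
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))
for s, a, r, s2 in batch:
    counts[s, a, s2] += 1
    reward_sum[s, a] += r
n_sa = counts.sum(axis=2, keepdims=True)
P_hat = counts / np.maximum(n_sa, 1)              # estimated dynamics
R_hat = reward_sum / np.maximum(n_sa[..., 0], 1)  # estimated mean rewards

# Step 2: plan in the learned model (value iteration) to obtain a policy.
V = np.zeros(n_states)
for _ in range(500):
    Q = R_hat + gamma * (P_hat @ V)               # shape: (states, actions)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)
print("greedy policy in the learned model:", policy)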
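The online approach's margin constraint can likewise be illustrated in a stylized multi-armed bandit, in the spirit of conservative exploration methods: an exploratory (UCB) arm is pulled only when the reward collected so far keeps the learner above a (1 - alpha) fraction of what the baseline is expected to have earned. The baseline arm, its assumed-known mean, the margin alpha, and the Bernoulli arms are all assumptions for illustration, not the algorithms presented in the talk.

import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.5, 0.4, 0.7])   # unknown Bernoulli arm means
baseline_arm, baseline_mean = 0, 0.5     # baseline strategy: always play arm 0
alpha = 0.1                              # tolerated loss margin vs. the baseline

counts = np.ones(len(true_means))        # one initial pull per arm
sums = np.array([float(rng.random() < m) for m in true_means])
cum_reward = float(sums.sum())           # reward collected so far

for t in range(1, 5001):
    ucb = sums / counts + np.sqrt(2.0 * np.log(t + 2) / counts)
    candidate = int(np.argmax(ucb))
    # Safety check: even if the exploratory pull pays nothing (the worst case
    # for a Bernoulli arm), stay above a (1 - alpha) fraction of what the
    # baseline is expected to have earned by round t.
    if cum_reward >= (1 - alpha) * t * baseline_mean:
        arm = candidate                  # margin allows exploration
    else:
        arm = baseline_arm               # fall back on the baseline strategy
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward
    cum_reward += reward
print("pulls per arm:", counts, " total reward:", cum_reward)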