Formalizing Model-Based Multi-Objective Reinforcement Learning With a Reward Occurrence Probability Vector

Tomohiro Yamaguchi, Yuto Kawabuchi, Shota Takahashi, Yoshihiro Ichikawa, Keiki Takadama
DOI: 10.4018/978-1-7998-8686-0.ch012

Abstract

The mission of this chapter is to formalize multi-objective reinforcement learning (MORL) problems in which there are multiple conflicting objectives with unknown weights. The objective is to collect all Pareto optimal policies so that they can be adapted to a learner's situation. However, previous methods incur huge learning costs, so this chapter proposes a novel model-based MORL method based on the reward occurrence probability (ROP) with unknown weights. There are three main features. The first is that the average reward of a policy is defined by the inner product of its ROP vector and a weight vector. The second is that the method learns the ROP vector of each policy instead of Q-values. The third is that Pareto optimal deterministic policies directly form the vertices of a convex hull in the ROP vector space; therefore, Pareto optimal policies are calculated independently of the weights and only once, by the Quickhull algorithm. This chapter reports the authors' current work under stochastic learning environments with up to 12 states, three actions, and three or four reward rules.
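
As a rough illustration of the first feature (the policy names, ROP vectors, and weights below are hypothetical, not taken from the chapter), the average reward of each stationary policy can be computed as the inner product of its ROP vector and a weight vector, and the best policy for a given weight vector follows by comparison.

import numpy as np

# Hypothetical ROP vectors: the nth element is the occurrence probability per
# step of the nth reward rule under each stationary policy (values made up).
rop = {
    "pi_1": np.array([0.30, 0.05, 0.10]),
    "pi_2": np.array([0.10, 0.25, 0.05]),
    "pi_3": np.array([0.05, 0.10, 0.30]),
}

# A weight vector expressing the trade-off among the objectives (also made up).
weights = np.array([0.5, 0.2, 0.3])

# Average reward of each policy = inner product of its ROP vector and the weight vector.
average_reward = {name: float(np.dot(v, weights)) for name, v in rop.items()}
best_policy = max(average_reward, key=average_reward.get)
print(average_reward, "-> best policy for these weights:", best_policy)
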
Chapter Preview

Introduction

This chapter addresses multi-objective reinforcement learning problems in which there are multiple conflicting objectives with unknown weights. Reinforcement learning (RL) is a popular approach for automatically solving sequential decision problems, which are commonly modeled as Markov decision processes (MDPs). There are many reinforcement learning methods, and most of them focus on single-objective settings where the goal of the agent is to decide a single solution according to an optimality criterion. These methods can be classified by their learning algorithms and by their optimality criteria. Regarding the former, there are two kinds of learning algorithms, distinguished by whether or not they directly estimate the parameters of the MDP model: the model-based approach, such as the Dyna architecture (Sutton, 1990, 1991), real-time dynamic programming (RTDP) (Barto et al., 1995), and H-Learning (Tadepalli and Ok, 1998), which has a small time complexity but a large space complexity; and the model-free approach (Yang et al., 2016), such as Q-learning, which has a large time complexity but a small space complexity.

The model-based approach starts by directly estimating the MDP model statistically. When s is a state, a is an action, and (s, a) is a state-action pair called a rule, it calculates V(s), the value of each state, or Q(s, a), the quality of each rule, using the estimated MDP. The goal is to search for the optimal solution that maximizes V(s) for each state. In contrast, the model-free approach directly learns V(s) or Q(s, a) without estimating the MDP model. According to Plaat et al. (2020), in the deep reinforcement learning (DRL) research field, the above model-based approaches are called classic (non-deep) model-based methods, since model-based DRL targets high-dimensional problems. An overview of model-based RL, including deep RL, is given in the next section.
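
As a generic illustration of this approach (not the chapter's method, which uses the average reward criterion; the function names and the uniform fallback for unvisited pairs are choices made for this sketch), the transition and reward models can be estimated from counted experience and then solved by value iteration under a discounted criterion:

import numpy as np

def estimate_mdp(transitions, n_states, n_actions):
    # Estimate P(s'|s,a) and R(s,a) from observed (s, a, r, s') tuples.
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform transition distribution
    # and zero reward; this is an arbitrary choice for the sketch.
    P = np.divide(counts, visits,
                  out=np.full_like(counts, 1.0 / n_states), where=visits > 0)
    R = np.divide(reward_sum, visits[:, :, 0],
                  out=np.zeros_like(reward_sum), where=visits[:, :, 0] > 0)
    return P, R

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    # Compute V(s) and a greedy deterministic policy on the estimated MDP.
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)   # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new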

Regarding the latter, there are two kinds of optimality criteria, distinguished by whether or not a discount factor is used: one maximizes the sum of discounted rewards, and the other maximizes the average reward without any discount factor (Mahadevan, 1996; Tadepalli and Ok, 1998; Gao, 2006; Yang et al., 2016). Most previous RL methods are model-free approaches with a discount factor, since the model-based approach has a large space complexity. Note that the role of the discount factor is to control the range of the sum of expected rewards; in other words, it accounts for the risk of achieving future rewards. The drawback is the large time complexity required until the discounted sum of rewards converges for every V(s) or Q(s, a).
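
For reference, the two criteria can be written as follows (standard textbook formulations, not reproduced from the chapter), where \gamma \in [0, 1) is the discount factor and r_t is the reward received at step t:

V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s, \pi \right]
\qquad
\rho^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T-1} r_{t} \,\middle|\, \pi \right]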

A multi-objective MDP (MOMDP) is an MDP in which the reward function describes a vector of n rewards (reward vector), one for each objective. To decide a single solution, a scalarization function with a vector of n weights (weight vector), one for each objective, is commonly used. The simplest scalarization function is a linear scalarization such as the weighted sum. This chapter mainly targets the weighted sum as the scalarization function; however, the method can also be applied to other scalarization functions such as the Tchebycheff norm method.
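
As a small illustration of scalarization (the reward vectors, weights, and reference point below are hypothetical), the weighted sum scores a vector by its inner product with the weight vector (larger is better), while the Tchebycheff method scores it by its largest weighted deviation from an ideal reference point (smaller is better).

import numpy as np

def weighted_sum(reward_vec, weights):
    # Linear scalarization: larger is better.
    return float(np.dot(weights, reward_vec))

def tchebycheff(reward_vec, weights, ideal_point):
    # Tchebycheff scalarization: largest weighted distance to the ideal
    # point over all objectives; smaller is better.
    return float(np.max(weights * np.abs(ideal_point - reward_vec)))

# Hypothetical two-objective reward vectors of three candidate solutions.
candidates = {"A": np.array([0.8, 0.2]),
              "B": np.array([0.5, 0.5]),
              "C": np.array([0.1, 0.9])}
w = np.array([0.4, 0.6])
ideal = np.array([1.0, 1.0])

print({k: weighted_sum(v, w) for k, v in candidates.items()})
print({k: tchebycheff(v, w, ideal) for k, v in candidates.items()})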

There are several multi-objective reinforcement learning methods (Roijers et al., 2013; Roijers et al., 2015; Liu et al., 2016; Pinder, 2016), and they follow two main approaches: scalar combination and Pareto optimization (Herrmann, 2015). In the former case, scalar combination finds a single policy that optimizes a combination of the rewards: the MOMDP and known weights are input to the learning algorithm, which then outputs a single solution. In the latter case, Pareto optimization finds multiple policies that cover the Pareto front, which requires a collective search for sampling the Pareto set (Natarajan and Tadepalli, 2005): the MOMDP is input to the learning algorithm, which then outputs a solution set. Note that there are two ways to select a single solution from the set: one is scalarization with a known weight vector, and the other is user selection, as sketched below.
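
A small sketch of the latter route under made-up data (the ROP vectors and weights below are hypothetical, and SciPy's ConvexHull is used as a stand-in for the Quickhull computation mentioned in the abstract): the convex hull of the policies' ROP vectors is computed once, independently of the weights, and a single solution is then selected either by scalarization with a known weight vector or by a user.

import numpy as np
from scipy.spatial import ConvexHull  # SciPy's ConvexHull wraps Qhull, a Quickhull implementation

# Hypothetical ROP vectors of five deterministic policies (two reward rules each).
rop = np.array([
    [0.40, 0.10],
    [0.30, 0.30],
    [0.10, 0.45],
    [0.20, 0.20],   # strictly interior point: never maximizes a weighted sum
    [0.05, 0.05],
])

# Only convex hull vertices can maximize a weighted-sum scalarization, so the
# hull is computed just once, independently of any weight vector (cf. the
# chapter's third feature, which applies Quickhull in the ROP vector space).
hull_vertices = ConvexHull(rop).vertices
candidates = rop[hull_vertices]

# Selection way 1: scalarization with a known weight vector.
weights = np.array([0.7, 0.3])
chosen = hull_vertices[np.argmax(candidates @ weights)]
print("hull vertex policies:", hull_vertices, "-> chosen policy:", chosen)

# Selection way 2: present the candidate set to a user for manual selection.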

Key Terms in this Chapter

Reward Occurrence Probability (ROP) Vector Space: A rectangular coordinate system in which the ROP vector associated with a stationary policy is mapped to a point whose coordinates are given by that ROP vector.

Reinforcement Learning: A popular learning approach for automatically solving sequential decision problems, which are commonly modeled as Markov decision processes (MDPs).

Deterministic Policy: It is defined by a function that selects an action for every possible state.

Reward Occurrence Probability (ROP): The expected occurrence probability per step for the reward.

Model-Based Approach: A reinforcement learning approach that starts by directly estimating the MDP model statistically, then calculates the value V(s) of each state or the quality Q(s, a) of each state-action pair using the estimated MDP, in order to search for the optimal solution that maximizes V(s) for each state.

Reward Occurrence Probability (ROP) Vector: A vector whose nth element is the occurrence probability of the nth reward rule under a policy.

Pareto Optimization: Finding multiple policies that cover the Pareto front, which requires a collective search for sampling the Pareto set.

Weight Vector: A vector expressing the trade-off among multiple objectives; each element represents the weight of one objective.

Markov Decision Process (MDP): A discrete-time, discrete-state-space stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.

Average Reward: The expected reward received per step when an agent performs state transitions routinely according to a policy. In this research, the average reward of a policy is defined by the inner product of the policy's ROP vector and a weight vector.

Markov Chain: A stochastic model describing a sequence of possible states in which the probability of each state depends only on the previous state. It is a simplification of a Markov decision process obtained by removing the actions and rewards.

Rule: A state-action pair (s, a) in an MDP model.

Multi-Objective MDP (MOMDP): An MDP in which the reward function describes a vector of n rewards (reward vector), one for each objective, instead of a scalar.

Reward Rule: A state-action pair (s, a) that is assigned a non-zero reward in an MDP model.

Simple Markov Property: The property that the next state s' depends only on the executed rule (s, a), that is, the current state s and the decision maker's action a, and is independent of all previously executed rules.

LC-Learning: One of the average reward model-based reinforcement learning methods. It collects all reward acquisition deterministic policies under the unichain condition.
