Markov Theory based Planning and Sensing under Uncertainty (基于马尔科夫理论的不确定性规划和感知问题研究)


University of Science and Technology of China
A dissertation for doctor's degree

Markov Theory based Planning and Sensing under Uncertainty

Author: Aijun Bai
Speciality: Computer Science
Supervisor: Prof. Xiaoping Chen
Finished Time: September, 2014



ABSTRACT

In the research of Artificial Intelligence, the agent-based paradigm aims to provide a unifying framework for conceptualizing, designing, and implementing intelligent systems that sense, act and learn autonomously in dynamic and/or stochastic environments, in order to solve a growing number of complex problems. Agents, particularly various kinds of robots, are playing increasingly important roles in the world economy and in people's everyday life, from satellites to smartphones. Generally speaking, perceptual inputs from sensors contain inevitable noise and errors. The effects of actuators are likewise unpredictable, subject to noise or even failures. There may also exist different levels of hidden information that cannot be observed directly. Such uncertainties pose huge challenges to the problem of agent planning and sensing. Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) provide an important theoretical and algorithmic basis for optimal planning and sensing under uncertainty. However, solving large MDPs and POMDPs exactly is usually intractable due to the curse of dimensionality: the state space grows exponentially with the number of state variables. To address this challenge in practice, researchers usually resort to approximation techniques such as online planning, hierarchical planning, Monte-Carlo simulation, and particle filtering. Following the theories of MDPs and POMDPs, this thesis focuses on developing efficient approximate algorithms for large MDPs and POMDPs. Specifically, we propose MAXQ-OP, a MAXQ hierarchical decomposition based online planning algorithm; we develop the DNG-MCTS and D²NG-POMCP algorithms, which apply the idea of Thompson sampling to Monte-Carlo planning in MDPs and POMDPs; and we develop a particle filtering over sets (PFS) approach to the multi-human tracking problem.

The proposed hierarchical online planning algorithm, namely MAXQ-OP, is a novel algorithm that combines the advantages of both online planning and hierarchical planning. It provides a more sophisticated solution for programming autonomous agents in large stochastic domains. Specifically, we perform online decision-making by following the MAXQ value function decomposition. We empirically evaluate our algorithm on the Taxi problem, a common benchmark for MDPs. The experimental results show that MAXQ-OP is able to find a near-optimal policy online, with far less computation time compared to traditional online planning algorithms. The RoboCup soccer simulation 2D domain is a very large test-bed for artificial intelligence research. The key challenge lies in the fact that it is a fully distributed, multi-agent stochastic system with continuous state, action and observation spaces.

We have conducted a long-term case study in the RoboCup 2D domain and developed a team named WrightEagle, which has won multiple world championships and national championships in the annual RoboCup competitions. The results of our case study confirm MAXQ-OP's important potential of scaling up to very large domains.

Monte-Carlo tree search (MCTS) has been drawing great interest recently in domains of planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. We develop novel approaches to MCTS, namely DNG-MCTS and D²NG-POMCP, which use posterior action sampling to select actions for online planning in MDPs and POMDPs. Specifically, we treat the cumulative reward obtained by taking an action in a search node of the Monte-Carlo search tree and following a tree policy thereafter as a random variable following an unknown distribution. We parametrize the distribution by introducing necessary hidden parameters, and infer the posterior in a Bayesian setting. Thompson sampling is then used to exploit and explore the search tree, by selecting an action according to its posterior probability of being optimal. Experimental results confirm that the proposed algorithms outperform state-of-the-art approaches on several benchmark problems, showing the potential of successfully applying them to very large real-world problems.

The ability of an autonomous robot to detect, track and identify potentially multiple humans is essential for socialized human-robot interaction in dynamic environments. The online multi-object tracking problem is equivalent to real-time belief update in a complex POMDP. The main challenge is that, without knowing the actual number of humans, the robot needs to estimate each human's state in real time from sequentially ambiguous observations, including inevitable false and missing detections, while both the robot and the humans are constantly moving. In this thesis, we propose a novel particle filtering over sets (PFS) approach to address this challenge. We define joint states and observations both as finite sets, and develop motion and observation functions accordingly. The target identification problem is then solved by the expectation-maximization (EM) method, given the updated particles. The set formulation enables us to avoid performing explicit observation-to-target association, leading to high fault tolerance and robustness in complex dynamic environments with frequent detection noise and errors. The overall PFS algorithm outperforms the state of the art, in terms of the CLEAR MOT metrics, on the PETS2009 dataset. We also demonstrate the effectiveness of PFS on a real robot, namely CoBot.

Keywords: Markov Decision Process, Partially Observable Markov Decision Process, Decision-Theoretic Planning, Hierarchical Online Planning, Monte-Carlo Tree Search, Multi-Object Tracking


23 Artificial Intelligence Intelligence WrightEagle KeJia Agent Markov Property 1.1 / [1] Intelligent Robots Web Crawlers Siri Mars Rover Information State Sufficient Statistics State Belief State 1.1 [2] Observation State Estimation / Action 1

24 / Hidden Information [3] Belief Update Information Fusion Sensor Fusion [4] [5] Odometry Bayesian Method 2

25 Data Association [6] Multi-Object Tracking MOT Simultaneous Localization and Mapping, SLAM Landmark Decision Making Sequential Decision-Making Problem Planning [1] Scheduling [7 9] [10] Control Policy Utility Function Classical Planning Shortest Path Problem Action Cost Depth-First Search Breadth-First Search Backward Chaining Heuristic Search 3

26 (a) (b) 1.2 Wikipedia [11] Reinforcement Learning [12] Model Based Reinforcement Learning Bayesian Reinforcement Learning BRL [13 18] Exploitation Exploration [19 22] Markov Decision Process, MDP [23] Partially Observable Markov Decision Process, POMDP [24] Markov Decision Theory 4

27 1.2 Stochastic Process [25] S X S {X t t T} T = {0, 1, 2,... } Time Index Differential Equation X 0 Markov Process Markov Chain Andrey Markov 1906 [26] Pr(X t+1 X 0, X 1,..., X t ) = Pr(X t+1 X t ). (1.1) 1.2(a) A 40% E 60% A E 70% A 30% E Partially Observable Hidden Markov Model, HMM [27, 28] (b) X y a b Belief Space MDP MDP MDP POMDP POMDP MDP 1.1 MDP MDP s a 5
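To make the Markov property of Eq. (1.1) concrete, here is a minimal simulation sketch of a two-state weather chain in the spirit of the example in Figure 1.2(a); since parts of that example are garbled in this copy, the state names and the exact transition probabilities below are illustrative assumptions rather than values taken from the thesis.

```python
import random

# Two-state Markov chain with states "A" and "E".  The next state depends
# only on the current state, never on the earlier history -- Eq. (1.1).
TRANSITIONS = {
    "A": {"A": 0.4, "E": 0.6},   # from A: stay with prob. 0.4, switch with 0.6
    "E": {"A": 0.7, "E": 0.3},   # from E: switch with prob. 0.7, stay with 0.3
}

def step(state):
    """Sample the next state from Pr(X_{t+1} | X_t)."""
    r, acc = random.random(), 0.0
    for nxt, p in TRANSITIONS[state].items():
        acc += p
        if r < acc:
            return nxt
    return nxt  # numerical fallback

def simulate(x0, horizon):
    """Return a trajectory X_0, X_1, ..., X_horizon."""
    traj = [x0]
    for _ in range(horizon):
        traj.append(step(traj[-1]))
    return traj

print(simulate("A", 10))
```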

28 1.1 Markov Chain HMM MDP POMDP 1.3 Wikipedia s Immediate Reward R(s, a, s ) Pr(s s, a) MDP s a s Planning Horizon Cumulative Reward MDP Optimization Problem [29] MDP Dynamic Programming Monte-Carlo [30] MDP P [31] MDP 1.3 MDP S a +5 1 POMDP MDP 6

29 POMDP System Dynamics MDP MDP POMDP MDP MDP Belief MDP MDP MDP MDP MDP POMDP POMDP POMDP PSPACE POMDP POMDP POMDP 100% POMDP / MDP POMDP Multi-Agent Systems [32, 33] Joint Action MDP POMDP 7

30 1.4 Nested Belief A B B A B A B A B POMDP I-POMDP [34] I-POMDP Decentralized MDP, DEC-MDP Decentralized POMDP, DEC-POMDP [35] DEC-POMDP Policy Space Decision Tree [36] a o DEC-POMDP 8

31 DEC-POMDP NEXP DEC-POMDP 1.3 MDP POMDP MAXQ [37] MDP MAXQ-OP [38 41] Thompson [42] MDP POMDP DNG-MCTS D²NG-POMCP [43, 44] PFS Particle Filtering over Sets PFS [45] RoboCup 2D [46] * MDP POMDP MDP MAXQ-OP MAXQ-OP MAXQ MAXQ-OP Monte- Carlo Tree Search, MCTS DNG-MCTS D²NG-POMCP Dirichlet-NormalGamma MDP DNG- MCTS MDP UCT Thompson DNG-MCTS POMDP * RoboCup 2D Wiki 9

32 Dirichlet-Dirichlet-NormalGamma POMDP D²NG-POMCP POMDP POMCP POMDP Thompson PFS Expectation-Maximization, EM Observation Likelihood PETS2009 [47] CoBot PFS 10

33 MDP POMDP MDP POMDP QMDP MDP 2.1 MDP MDP ( ). S, A, T, R, S State Space A Action Space T : S A S [0, 1] Transition Function T(s s, a) = Pr(s s, a) s a s R : S A R Reward Function R(s, a) s a MDP S A S A Flat Representation Factored Representation [48] s 1, s 2, s 3,... Tabular Method s a s T(s s, a) S A S s = [x 1, x 2,..., x n ] n 11

34 2.1 DBN x i (1 i n) [x, y, ẋ, ẏ] (x, y) (ẋ, ẏ) s a s T(s s, a) Dynamic Bayesian Network, DBN [49] DBN DBN s s u i s x i T(s s, a) = Pr(x i u i, a). (2.1) 1 i n 2.1 DBN (x, y) (v_x, v_y) π π : S A [0, 1] π(s, a) s a π π s π : S A {0, 1} π : S A π(s) s 12

35 π s 0 Follow π H π s 0 U(π s 0 ) H [ ] U(π s 0 ) = E R(s t, π(s t )), (2.2) 0 t<h R(s t, π(s t )) t γ (0, 1] U(π s 0 ) [ ] U(π s 0 ) = E γ t R(s t, π(s t )). (2.3) t γ γ = 1 γ 1 R max /(1 γ) R max MDP Optimal Policy [50, 51] π π = argmax U(π s 0 ). (2.4) π π s π s t π t t H t s H t π = {π H, π H 1,..., π 1 } π = {π H, π H 1,..., π 1 } t s π Value Function V π t (s) V π 1 (s) = R(s, π 1(s)) 13

36 t s a π π Q π t (s, a) Q π t (s, a) = R(s, a) + γ T(s s, π t (s))vt 1(s π ). (2.5) s S s a π R(s, a) t 1 T(s s, π t (s)) s Vt π (s) = Q π t (s, π t (s)). (2.6) V π = { VH π, V H 1 π,..., } Vπ 1 VH π π s 0 U(π s 0 ) = VH(s π 0 ). (2.7) π s V π (s) s π s a π Q π (s, a) Q π (s, a) = R(s, a) + γ T(s s, π t (s))v π (s ). (2.8) s S V π (s) V π (s) = Q π (s, π(s)). (2.9) 2.9 V π (s) V π π s 0 U(π s 0 ) = V π (s 0 ). (2.10) V Q π Bellman Optimality Equation [52] V Q : 14 V t (s) = max a A Q t(s, a), (2.11)

37 V t (s) = max a A { V (s) = max a A Q (s, a). (2.12) R(s, a) + γ s S T(s s, a)v t 1(s ) }, (2.13) V (s) = max a A { R(s, a) + γ s S T(s s, a)v (s ) }. (2.14) V V π π t(s) = argmax Vt (s), (2.15) a A π (s) = argmax V (s). (2.16) a A MDP MDP Near-Optimal Policy π s V (s) V π (s) ϵ π ϵ- V π 1 Vπ 2 V π 3 V π H H V H V MDP Multi- Armed Bandit MAB MDP MDP s R(s, a) := X a X a f Xa (x) t a X at MAB MAB Cumulative Regret CR [ T ] R T = E (X a X at ), (2.17) t=1 15

38 a Simple Regret SR r n = E [X a Xā], (2.18) ā = argmax a A X a MDP c(s, a) R(s, a) min max MDP MDP max min MDP MDP 2.2 MDP Stochastic Optimal Control Dynamic Programming The Curse of Dimensionality MDP Offline Online Policy Iteration Value Iteration Policy Evaluation Policy Improvement π V π v 2.9 v v V π = v π V π 16

39 Input: An MDP S, A, T, R, and a small positive number ϵ Output: A near-optimal policy π 1 Initialize π(s) A arbitrarily for all s S 2 repeat 3 repeat foreach s S do 6 V (s) R(s, π(s)) + λ s S T(s s, a)v(s ) 7 max {, V(s) V (s) } 8 V(s) V (s) 9 end 10 until < ϵ 11 converged T rue 12 foreach s S do 13 π (s) argmax a A { R(s, a) + λ s S T(s s, a)v(s ) } 14 if π (s) π(s) then 15 converged False 16 end 17 π(s) π (s) 18 end 19 until converged = T rue 20 return π 2.1: MDP π Greedy Algorithm π (s) = argmax a A Q π (s, a) Q π π V π (s) V π (s) s π π 0 PE V π 0 PI π 1 PE V π 1 PI π 2 PE... PI π PE V, (2.19) PE PI MDP S A [53] 2.1 MDP ϵ

40 Input: An MDP S, A, T, R, and a small positive number ϵ Output: A near-optimal policy π 1 Let V(s) 0 for all s S 2 repeat foreach s S do 5 foreach a A do 6 Q(s, a) R(s, a) + λ s S T(s s, a)v(s ) 7 end 8 π(s) argmax a A Q(s, a) 9 max {, V(s) Q(s, π(s)) } 10 V(s) Q(s, π(s)) 11 end 12 until < ϵ 13 return π 2.2: MDP Backup Operation { } V t+1 (s) = max a A R(s, a) + λ s S T(s s, a)v t (s ), (2.20) V t t 2.20 Asynchronous Dynamic Programming V t V t MDP 2.2 2ϵ γ 1 γ - [24] Sweep
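Algorithm 2.2 above is given in pseudocode; the following is one possible concrete rendering of value iteration for a small tabular MDP (the dictionary-based model representation and the function name are ours, not part of the thesis).

```python
def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    """Tabular value iteration.

    S: list of states; A: list of actions.
    T[(s, a)]: dict mapping next state s2 to Pr(s2 | s, a).
    R[(s, a)]: immediate reward of taking a in s.
    Returns (V, pi) with V the value function and pi a greedy policy.
    """
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman backup, Eq. (2.20):
            # V(s) <- max_a [ R(s,a) + gamma * sum_s2 T(s2|s,a) V(s2) ]
            q = {a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                 for a in A}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    # Extract a greedy (near-optimal) policy from the converged values.
    pi = {s: max(A, key=lambda a: R[(s, a)] + gamma *
                 sum(p * V[s2] for s2, p in T[(s, a)].items()))
          for s in S}
    return V, pi
```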

41 (a) (b) MAXQ 2.2 MAXQ AND/OR Tree Search [54] Real-Time Dynamic Programming, RTDP [55] Monte-Carlo Tree Search [56] AND/OR Tree [54] 2.2(a) s 0 s 0 a 1 a 2 Pr(s 1 s 0, a 1 ) = p Pr(s 2 s 0, a 1 ) = 1 p Pr(s 3 s 0, a 2 ) = q Pr(s 4 s 0, a 2 ) = 1 q AO* Best-First Search G 1. G G 2. G G G AO* 2.3 AO* [57] RTDP Trial-based Search MDP 19

42 Input: An MDP S, A, T, R, graph G initially empty, heuristic function h, current state s 0, and planing horizon H Output: An action a 1 Let G G (s 0, H) 2 Let V(s 0, H) h(s 0, H) 3 Initialize best partial graph to G 4 while True do 5 Let (s, d) non-terminal tip node in best partial graph 6 if (s, d) is null then 7 break 8 end 9 foreach a A do 10 Add node (a, s, d) as child of s, d 11 foreach s S do 12 if T(s s, a) > 0 then 13 Add node s, d 1 as child of (a, s, d) 14 if d 1 = 0 then 15 V(s, d 1) 0 16 end 17 else 18 V(s, d 1) h(s, d 1) 19 end 20 end 21 end 22 end 23 foreach s G in a bottom-up way do 24 Q(s, a, d) R(s, a) + γ s S T(s s, a)v(s, d 1) 25 V(s, d) max a A Q(s, a, d) 26 if V(s, d) = Q(s, a, d) then 27 Mark state s and action a 28 end 29 end 30 Recompute best partial graph to G by following marked actions 31 end 32 return marked action for state s 0 in best partial graph to G 2.3: MDP AO* RTDP AO* RTDP RTDP RTDP 2.4 MDP RTDP [58, 59] 20

43 Input: An MDP S, A, T, R, heuristic function h, current state s 0, and planing horizon H Output: An action a 1 foreach s S do 2 Initialize V(s) h(s) 3 end 4 repeat 5 Let s s 0 6 Let d 0 7 while True do 8 d d foreach a A do 10 Q(s, a) R(s, a) + γ s S T(s s, a)v(s ) 11 end 12 Let a argmax a A Q(s, a) 13 Update V(s) Q(s, a ) 14 Sample s T(s s, a ) 15 if s is goal state or d > H then 16 break 17 end 18 Let s s 19 end 20 until resource budgets reached 21 return a argmax a A Q(s 0, a) 2.4: MDP RTDP MCTS [60] MCTS MDP MDP Generative Model / Simulator MCTS MCTS Rollout 2.3 MCTS [61] MCTS MCTS 2.4 [62] MCTS Anytime MCTS MCTS MCTS 21

44 UCT MCTS [56] UCT MAB UCB [63] UCB(s, a) = Q(s, a) + c log N(s) N(s, a), (2.21) Q(s, a) a s N(s, a) a s N(s) = a A N(s, a) s c c UCT MDP UCT [57] 22
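The UCB1 rule of Eq. (2.21), with exploration bonus c * sqrt(ln N(s) / N(s, a)), is the action-selection step inside the UCT tree policy that follows; a minimal sketch of that step is shown below (the function name and data layout are ours, and untried actions receive an infinite bonus, as in the pseudocode).

```python
import math

def ucb1_select(Q, N_s, N_sa, actions, c=1.0):
    """Return argmax_a Q(s,a) + c * sqrt(ln N(s) / N(s,a)); untried actions first."""
    best_a, best_v = None, float("-inf")
    for a in actions:
        if N_sa[a] == 0:
            return a                                   # infinite exploration bonus
        v = Q[a] + c * math.sqrt(math.log(N_s) / N_sa[a])
        if v > best_v:
            best_a, best_v = a, v
    return best_a
```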

45 Input: An MDP simulator sim, current state s 0, search graph G initially empty, rollout policy π, planning horizon H, exploration constant C Output: An action a 1 UCT(s : state, d : depth, sim : simulator, G : graph, π : policy, H : horizon, C : constant) 2 if d H or s is terminal then 3 return 0 4 end 5 if node (s, d) / G then 6 Add node (s, d) to graph G 7 Initialize N(s, d) 0 and N(s, a, d) 0 for all a A 8 Initialize Q(s, a, d) 0 for all a A 9 Play rollout policy π from s for H d steps according to simulator sim 10 Let r the sampled cumulative discounted reward 11 return r 12 end 13 else 14 foreach a A do 15 if N(s, a, d) > 0 then 16 Let Bonus(a) C log N(s, d)/n(s, a, d) 17 end 18 else 19 Let Bonus(a) 20 end 21 end 22 Select a argmax a A {Q(s, a, d) + Bonus(a)} 23 Sample s T(s s, a) according to simulator sim 24 Let nv R(s, a) + γuct(s, d + 1, sim, G, π, H, C) 25 Increment N(s, d) and N(s, a, d) 26 Update Q(s, a, d) Q(s, a, d) + (nv Q(s, a, d))/n(s, a, d) 27 return nv 28 end 29 repeat 30 UCT(s 0, 0, sim, G, π, H, C) 31 until resource budgets reached 32 return a argmax a A Q(s 0, a, 0) 2.5: MDP UCT 2.3 MDP [64] Option [65] Hierarchies of Abstract Machines [66] MAXQ MAXQ Hierarchical Decomposition [37] Option Macro Action Option 23

46 Option o I o S π o β o : S [0, 1] β o (s) s Option o Option MDP MAXQ MDP MDP MDP MDP Semi-Markov Decision Process, SMDP SMDP MAXQ SMDP MDP MDP MDP SMDP τ N + s a SMDP T(s, τ s, a) s a τ s R(s, a) s a SMDP { V (s) = max a A R(s, a) + s S,τ N + γ τ T(s, τ s, a)v (s ) }. (2.22) SMDP SMDP MAXQ MAXQ MDP M MDP {M 0, M 1,, M n } MDP M 0 M 0 MDP M M i T i, A i, R i T i M i Active States S i Terminal States G i 24

47 A i M i R i M i s S i g G i MAXQ Task Graph 2.2(b) MAXQ Root Task M 0 M 1 M 2 M 3 M 1 M 2 M 3 M i 4 i 8 MAXQ Hierarchical Policy π π = {π 0, π 1,, π n } π i : S i A i M i Projected Value Function V π (i, s) s π = {π 0, π 1,, π n } M i g G i Q π (i, s, a) M a π = {π 0, π 1,, π n } M i M a V π (a, s) = R(s, a) π Q π (i, s, a) = V π (a, s) + C π (i, s, a), (2.23) V π (i, s) = { R(s, i), Mi Q π (i, s, π(s)), (2.24) C π (i, s, a) Completion Function M i M a M a M i π C π (i, s, a) = γ N Pr(s, N s, a)v π (i, s ), (2.25) s,n Pr(s, N s, a) s M a N s Recursively Optimal Policy π Q (i, s, a) = V (a, s) + C (i, s, a), (2.26) V (i, s) = { R(s, i), Mi max a Ai Q (i, s, a), (2.27) 25

48 C (i, s, a) = C π (i, s, a) π M i π i(s) = argmax Q (i, s, a). (2.28) a A i 2.4 MDP POMDP MDP [67] ( ). S, A, O, T, Ω, R, S, A, T, R O Observation Space Ω : S A O [0, 1] Observation Function Ω(o s, a) a s o POMDP MDP b b(s) s b(s) = Pr(s b) b 0 h = (a 0, o 1, a 1, o 2,... a t 1, o t ) b (s ) = ηω(o s, a) s S T(s s, a)b(s), (2.29) η = 1/P(o b, a) η = 1 s S Ω(o s, a) s S T(s s, a)b(s). (2.30) 2.29 b a o b = ζ(b, a, o) ζ Bayesian Filter B POMDP π π : B A MDP POMDP π POMDP MDP MDP Bayesian-Adaptive 26

49 2.5 POMDP MDP, BAMDP B, A, T +, r B BAMDP A r(b, a) = s S b(s)r(s, a) T + T + (b b, a) = o O 1[b = ζ(b, a, o)]ω(o b, a), (2.31) 1 Indicator Function BAMDP { } V (b) = max a A r(b, a) + γ o O Ω(o b, a)v (ζ(b, a, o)) V π { π (b) = argmax a A r(b, a) + γ o O Ω(o b, a)v (ζ(b, a, o)) }. (2.32). (2.33) BAMDP POMDP BAMDP BAMDP MDP BAMDP 2.5 MDP POMDP POMDP 27
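The exact belief update b' = ζ(b, a, o) of Eq. (2.29), with normalizer η = 1 / Pr(o | b, a) as in Eq. (2.30), can be implemented directly for finite state spaces; the sketch below is such an illustration, using a dictionary-based model representation of our own choosing.

```python
def belief_update(b, a, o, S, T, Omega):
    """Return b'(s2) = eta * Omega(o | s2, a) * sum_s T(s2 | s, a) * b(s)."""
    b_next = {}
    for s2 in S:
        b_next[s2] = Omega[(o, s2, a)] * sum(T[(s2, s, a)] * b[s] for s in S)
    eta = sum(b_next.values())          # eta = Pr(o | b, a), Eq. (2.30)
    if eta == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: p / eta for s2, p in b_next.items()}
```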

50 POMDP Piecewise-Linear and Convex [68] t V t S Γ t = {α 0, α 1,..., α m } α- a A b α- V t (b) = max α Γ t s S α(s)b(s). (2.34) POMDP [24] 3 POMDP MDP POMDP V t 1 V t α- α- Γ t = O( A Γ t 1 O ) t POMDP O( A Z S 2 Γ t 1 Z ) Γ t One-Pass [69] Witness [70] [71] QMDP QMDP POMDP MDP QMDP QMDP MDP POMDP ˆQ(b, a) = Q MDP (s, a)b(s), (2.35) s S Q MDP (s, a) MDP ˆV(b) = max a A ˆQ(b, a). (2.36) MDP POMDP QMDP 28

51 2.6 MDP [72] α- B t α- Γ t α a (s) = R(s, a), { Γt a,o = α a,o i αa,o i (s) = γ } T(s s, a)ω(o s, a)α i(s ), α i Γ t 1, s S { Γt b = α a b αa b = α a b + } argmax α(s)b(s), a A, (2.37) o O α Γt a,o s S { } Γ t = α b α b = argmax b(s)α(s), b B. α Γt b s S Γ 0 α- α 0 (s) = 1 1 γ min s S,a A R(s, a) Γ t 1 B O( A O S B ( S + B )) PBVI [73] Perseus [74] 29

52 Input: Current belief b 0, AND-OR search tree T, planning horizon H, lower bound on L, upper bound U 1 Let b b 0 2 Initialize T to contain only b at the root 3 while not ExectionTerminated() do 4 while not PlanningTermiated() do 5 Let b ChooseNextNodeToExpand() 6 Expand(b, H) 7 UpdateAncestors(b ) 8 end 9 Execute best action a for b 10 Perceive a new observation o 11 Update b ζ(b, a, o) 12 Update tree T so that b is the new root 13 end 2.6: POMDP HSVI [75] SARSOP [76] POMDP MDP POMDP α MDP POMDP 2.6 POMDP [2] s a o 2.29 s s Expectation Maximization 2.6 POMDP [77] 30

53 Input: Current belief b 0, initial approximate value function V 0, hashtable of beliefs and approximate values V, discretization resolution k 1 Initialize b to the b 0 and V to an empty hashtable 2 while not ExectionTerminated() do 3 foreach a A do 4 Evaluate Q(b, a) r(b, a) + γ o O Pr(o b, a)v(discretize(ζ(b, a, o), k)) 5 end 6 Select a argmax a A Q(b, a) 7 Execute best action a for b 8 V(Discretize(b, k)) Q(b, a ) 9 Perceive a new observation o 10 Let b ζ(b, a, o) 11 end 2.7: POMDP RTDP-Bel Satia-Lave [78] BI-POMDP [79] AEMS [80] RTDP-Bel MDP POMDP [58] RTDP-Bel RTDP-Bel ˆQ(b, a) = r(b, a) + γ o O Ω(o b, a)v(ζ(b, a, o)), (2.38) V(b) b RTDP-Bel 2.7 RTDP-Bel [77] Discretize(b, k) b b b (s) = round(kb(s))/k O((k + 1) S ) 31

54 Input: An MDP simulator sim, current history h 0, search tree T initially empty, rollout policy π, termination condition ϵ, exploration constant C Output: An action a 1 Rollout(s : state, h : history, d : depth) 2 if γ d < ϵ then 3 return 0 4 end 5 Select a π(h, ) 6 Sample (s, o, r) sim(s, a) 7 return r + γrollout(s, hao, d + 1) 8 Simulate(s : state, h : history, d : depth) 9 if γ d < ϵ then 10 return 0 11 end 12 if h / T then 13 foreach a A do 14 T(ha) (N init (ha), V init (hs), ) 15 end 16 return Rollout(s, h, d) 17 end 18 Select a argmax a A {V(ha) + C } log N(h)/N(ha) 19 Sample (s, o, r) sim(s, a ) 20 R r + γsimulate(s, ha o, d + 1) 21 B(h) B(h) {s} 22 Increment N(h) and N(ha ) 23 Update V(ha ) V(ha ) + R V(ha ) N(ha ) 24 return R 25 repeat 26 Sample s B(h) 27 Simulate(s, h, 0) 28 until resource budgets reached 29 return a argmax a A V(ha) 2.8: POMDP POMCP MDP POMDP POMCP UCT POMDP [81] POMCP h POMCP b(h) Root Sampling MCTS MDP 32

55 POMDP POMCP POMCP UCB UCB(h, a) = Q(h, a) + c log N(h) N(h, a), (2.39) Q(h, a) h a N(h, a) h a N(h) = a A N(h, a) h c POMCP [82, 83] c POMCP 1 POMCP [84 86] 2.8 POMCP [81] MDP MDP MDP 2. MDP Option MAXQ 3. POMDP MDP POMDP MDP POMDP POMDP POMDP 33

56 4. UCB UCB MAB MAXQ MAXQ-OP DNG-MCTS D²NG-POMCP PFS 34

57 MAXQ MAXQ MDP MAXQ MDP MDP MAXQ-OP MAXQ MAXQ-OP MAXQ MAXQ-OP MDP MAXQ-OP MAXQ-OP 2D RoboCup 4 2 RoboCup 2D MAXQ-OP 3.1 MAXQ-OP MAXQ-OP RoboCup 2D [23] MDP [31] RoboCup 2D

58 MAXQ RoboCup 2D RoboCup 2D 6000 [55] LAO* [87] UCT [56] DNG-MCTS [43] RoboCup 2D RoboCup 2D 100 MDP [64] MAXQ MDP [37] MAXQ Temporal Abstraction State Abstraction Subtask Sharing MAXQ MDP RoboCup 2D RoboCup 2D MAXQ MAXQ-OP MAXQ-OP MDP MAXQ-OP RoboCup 2D 36

59 MAXQ moving-to-target MAXQ-OP MAXQ-OP MAXQ MAXQ MAXQ MDP S, A, T, R G S g G a A Pr(g g, a) = 1 R(g, a) = 0 MDP MDP Undiscounted Negative-Reward Goal-directed MDP [88] MDP Stochastic Shortest Path [89] MAXQ [39, 41] Terminating Distribution MDP MAXQ-OP MAXQ-OP WrightEagle RoboCup RoboCup 2D MAXQ-OP 3.2 MDP RTDP [55, 90 92] 37

60 MAXQ AO* [57, 87, 93] MDP MCTS [43, 56, 94 97] Trial-based Heuristic Tree Search, THTS THTS MAXQ-OP Hierarchical Reinforcement Learning, HRL MDP [64] State Abstraction [98 104] Sutton Option HRL SMDP [65] Option Dietterich MAXQ [37] MAXQ HRL MDP SMDP [105, 106] MDP Hauskrecht MDP Abstract MDP MDP Variable Influence Structure Analysis, VISA MDP DBN Causal Graph [107] Barry DetH* MDP DetH* MDP 3.3 MAXQ MAXQ M = {M 0, M 1,..., M n } MAXQ- 38

61 MAXQ OP V (i, s) Q (i, s, a) MAXQ M 0 a A 0 a p A a p MAXQ-OP 2.27 MAXQ-OP MAXQ-OP M a M i 2.25 π C (i, s, a) = γ N Pr(s, N s, a)v (i, s ), (3.1) s,n Pr(s, N s, a) = s,s 1,...,s N 1 Pr(s 1 s, π a(s)) Pr(s 2 s 1, π a(s 1 ))... Pr(s s N 1, π a(s N 1 )) Pr(N s, a). (3.2) s, s 1,..., s N 1 π a π s s N π s, a s C (i, s, a) π MAXQ 1 γ = γ N 1 C (i, s, a) = Pr(s s, a)v (i, s ), (3.3) s Pr(s s, a) = N Pr(s, N s, a) M i Pr(s s, a) 39

62 MAXQ Input: an MDP model with its MAXQ hierarchical structure Output: the accumulated reward r after reaching a goal 1 r 0 2 s GetInitState() 3 while s G 0 do 4 v, a p EvaluateState(0, s, [0, 0,..., 0]) 5 r r+ ExecuteAction(a p, s) 6 s GetNextState() 7 end 8 return r 3.1: MAXQ-OP OnlinePlanning 2.27 { V (i, s) max a A i V (a, s) + s Pr(s s, a)v (i, s ) }. (3.4) MAXQ-OP d D d[i] D[i] M i M i H 3.4 H(i, s) if d[i] D[i] V(i, s, d) max a Ai {V(a, s, d)+ s Pr(s s, a)v(i, s, d[i] d[i] + 1)} (3.5) 3.5 MAXQ-OP MAXQ V(0, s, [0, 0,..., 0]) s M 0 MAXQ-OP s a Pr(s s, a) G s,a = {s s Pr(s s, a)} C (i, s, a) C (i, s, a) 1 G s,a s G s,a V (i, s ). (3.6) 3.5 H(i, s) if d[i] D[i] V(i, s, d) max a Ai {V(a, s, d)+ 1 s G V(i, s,a G s,a s, d[i] d[i] + 1)} 40 (3.7)

63 MAXQ Input: subtask M i, state s and depth array d Output: V (i, s), a primitive action a p 1 if M i is primitive then return R(s, M i ), M i 2 else if s S i and s G i then return, nil 3 else if s G i then return 0, nil 4 else if d[i] D[i] then return HeuristicValue(i, s), nil 5 else 6 v, a p, nil 7 for M k Subtasks(M i ) do 8 if M k is primitive or s G k then 9 v, a p EvaluateState(k, s, d) 10 v v+ EvaluateCompletion(i, s, k, d) 11 if v > v then 12 v, a p v, a p ; 13 end 14 end 15 end 16 return v, a p 17 end 3.2: MAXQ-OP EvaluateState(i, s, d) MAXQ-OP 3.1 MAXQ-OP OnlinePlanning s GetInitState GetNextState ExecuteAction g G 0 MAXQ-OP EvaluateState EvaluateState s s [54, 87] 3.2 MAXQ-OP MAXQ-OP M i s d[i] D[i] 41
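Algorithms 3.2 and 3.3 are mutually recursive; the compact sketch below shows how EvaluateState and EvaluateCompletion fit together. The hierarchy interface H (is_primitive, subtasks, heuristic_value, sample_terminal_states, and so on) is a placeholder we introduce for illustration, not an API defined in the thesis.

```python
def evaluate_state(task, s, depth, H):
    """Estimate V*(task, s) and a best primitive action, as in Algorithm 3.2."""
    if H.is_primitive(task):
        return H.reward(s, task), task
    if H.is_terminated(s, task):
        return H.terminal_reward(s, task), None
    if depth[task] >= H.max_depth[task]:
        return H.heuristic_value(task, s), None          # depth cut-off
    best_v, best_a = float("-inf"), None
    for sub in H.subtasks(task):
        if H.is_primitive(sub) or not H.is_terminated(s, sub):
            v, a = evaluate_state(sub, s, depth, H)
            v += evaluate_completion(task, s, sub, depth, H)
            if v > best_v:
                best_v, best_a = v, a
    return best_v, best_a

def evaluate_completion(task, s, sub, depth, H):
    """Approximate the completion value C*(task, s, sub) by averaging over
    sampled terminal states of the subtask, as in Algorithm 3.3 / Eq. (3.6)."""
    terminal_states = H.sample_terminal_states(s, sub)
    total = 0.0
    for s2 in terminal_states:
        deeper = dict(depth)
        deeper[task] += 1                                # deepen only this subtask
        v, _ = evaluate_state(task, s2, deeper, H)
        total += v
    return total / max(1, len(terminal_states))
```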

64 MAXQ Input: subtask M i, state s, action M a and depth array d Output: estimated C (i, s, a) 1 G s,a {s s Pr(s s, a)} 2 v 0 3 for s G s,a do 4 d d 5 d [i] d [i] v v+ EvaluateState(i, s, d ) 7 end 8 v v G s,a 9 return v 3.3: MAXQ-OP EvaluateCompletion(i, s, a, d) d[i] D[i] M i nil nil EvaluateState MAXQ MAXQ-OP NextAction Subtasks NextAction A* 42

65 MAXQ Input: subtask index i and state s Output: selected action a 1 if SearchStopped(i, s) then 2 return nil 3 end 4 else 5 a argmax a Ai H i [s, a] + c 6 N i [s] N i [s] N i [s, a ] N i [s, a ] return a 9 end ln Ni [s] N i [s,a] 3.4: MAXQ-OP NextAction(i, s) (a) (b) MAXQ 3.1 MAXQ 3.4 UCB [63] NextAction N i [s] N i [s, a] s (s, a) M i, c SearchStopped H i [s, a] M i s a 3.4 MDP [37] 3.1(a) R G Y B 4 x y pl dl pl taxi dl 4 pl dl pl dl

66 MAXQ 3.1 Root pl = dl Get Put 2 Get pl taxi pl = taxi Nav(t) Pickup 2 Put pl = taxi pl = dl Nav(t) Putdown 2 Nav(t) (x, y) = t North, South, East West MAXQ-OP ± ± 0.16 ms LRTDP ± ± 3.71 ms AOT ± ± 2.37 ms UCT ± ± 4.24 ms DNG-MCTS ± ± 4.75 ms R-MAXQ ± ± 50 - MAXQ-Q ± ± [106] North South East West 2. Pickup 3. Putdown Pickup Putdown -10 MAXQ-OP [37] MAXQ 3.1(b) Nav(t) t t R G Y B 3.1 EvaluateCompletion Root Get Put Nav(t) 1 North South East West T(s s, a) HeuristicValue Manhattan Get 44

67 MAXQ Manhattan((x, y), pl) 1 Manhattan((x 1, y 1 ), (x 2, y 2 )) (x 1, y 1 ) (x 2, y 2 ) Manhattan x 1 x 2 + y 1 y 2 s M i 0 d[i] = 0 v, a p cache[i, hash(i, s)] v, a p, cache hash(i, s) s M i s 0.9 MAXQ-OP 3.2 LRTDP [90] AOT [57] UCT [56] DNG-MCTS [43] Anytime LRTDP AOT min-min [90] UCT DNG-MCTS min-min Rollout UCT DNG-MCTS MDP R-MAXQ MAXQ-Q [106] Linux 3.8 CPU 2.90 GHz 8GB MAXQ-OP 3.93 ± ± 0.15 MAXQ-OP MAXQ-OP 3.5 RoboCup 2D RoboCup 2D [108, 109] * RoboCup RoboCup 2D RoboCup 2D Simulator * Peter Stone 45

68 MAXQ 3.2 RoboCup D WrightEagle Helios [110] RoboCup D WrightEagle Helios 3.2 [111] RoboCup 2D Tile-Coding Sarsa(λ) Keepaway [112] [113] MAXQ-OP RoboCup MAXQ-OP 2D 2009 RoboCup 4 2 MAXQ-OP RoboCup 2D RoboCup 2D RoboCup 2D Server

69 MAXQ RoboCup 2D RoboCup 2D MDP RoboCup 2D RoboCup 2D MDP RoboCup 2D s = (s 0, s 2,..., s 22 ) 23 s u, u [1, 11] {s 1,..., s 11 } s u {s 12,..., s 22 } s 0 s = (x, y, ẋ, ẏ, α, β) (x, y) (ẋ, ẏ) α β s = (x, y, ẋ, ẏ) RoboCup 2D dash kick tackle turn turn_neck dash kick tackle kick turn turn_neck dash dash power [0, 1] angle [0, 2π) power = p angle = θ (ẍ, ÿ) = (pa cos θ, pa sin θ) + r a A = 1.0m/s 2 r a (x, y) (x, y)+(ẋ, ẏ)+(ẍ, ÿ) (ẋ, ẏ) (ẋ, ẏ)ω + (ẍ, ÿ)ω ω = 0.4 MDP 47
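The dash dynamics sketched above (acceleration (p·a·cosθ, p·a·sinθ) plus noise, followed by the position update and velocity decay with ω = 0.4) can be written as a one-step transition function. The constants ultimately come from the RoboCup 2D soccer server, so the values in this sketch should be read as illustrative assumptions.

```python
import math
import random

ACCEL = 1.0   # maximal acceleration a (m/s^2)
DECAY = 0.4   # velocity decay factor omega

def dash_step(x, y, vx, vy, power, angle, noise=0.0):
    """One simulation cycle of a dash(power, angle) command.

    power in [0, 1], angle in [0, 2*pi).  Returns the next (x, y, vx, vy).
    """
    ax = power * ACCEL * math.cos(angle) + random.gauss(0.0, noise)
    ay = power * ACCEL * math.sin(angle) + random.gauss(0.0, noise)
    x, y = x + vx + ax, y + vy + ay                   # position update
    vx, vy = (vx + ax) * DECAY, (vy + ay) * DECAY     # velocity decays each cycle
    return x, y, vx, vy
```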

70 MAXQ RoboCup 2D MAXQ-OP dash turn RoboCup MDP WrightEagle [24] b b(s) s b(s) b(s) = b i (s[i]), (3.8) 0 i 22 s s[i] i b i (s[i]) s[i] m i b i b i (s[i]) {x ij, w ij } j=1...mi, (3.9) x ij i w ij 1 j m i w ij = 1 RoboCup 2D Motion Model Sensor Model [114, 115] s s[i] = w ij x ij. (3.10) 1 j m i 3.3 neck_dir RoboCup 2D turn_neck 3.3 WrightEagle 48

71 MAXQ 3.3 x (m) y (m) ẋ (m/s) ẏ (m/s) α (Deg) β (Deg) e MAXQ-OP MAXQ-OP RoboCup MAXQ kick turn dash tackle -1 KickTo TackleTo NavTo KickTo TackleTo NavTo KickTo kick turn NavTo dash turn turn / kick KickTo TackleTo NavTo 49

72 MAXQ 3.4 MAXQ Shoot Dribble Pass Position Intercept Block Trap Mark Formation 1. Shoot 2. Dribble 3. Pass 4. Position 5. Intercept 6. Block 7. Trap 8. Mark 9. Formation Shoot Dribble Pass Shoot Dribble Pass Intercept Position Attack Defense Attack Defense Attack Defense Root Root Attack Attack Defense MAXQ 3.4 Attack Pass Intercept KickTo kick s Q (Root, s, Attack) = V (Attack, s)+ s Pr(s s, Attack)V (Root, s ), (3.11) 50 V (Root, s) = max{q (Root, s, Attack), Q (Root, s, Defense)}, (3.12)

73 MAXQ V (Attack, s) = max{q (Attack, s, Pass), Q (Attack, s, Dribble), Q (Attack, s, Shoot), Q (Attack, s, Intercept), Q (Attack, s, Position)}, (3.13) Q (Attack, s, Pass) = V (Pass, s)+ s Pr(s s, Pass)V (Attack, s ), (3.14) Q (Attack, s, Intercept) = V (Intercept, s)+ s Pr(s s, Intercept)V (Attack, s ), (3.15) V (Pass, s) = max position p Q (Pass, s, KickTo(p)), (3.16) V (Intercept, s) = max position p Q (Intercept, s, NavTo(p)), (3.17) Q (Pass, s, KickTo(p)) = V (KickTo(p), s)+ s Pr(s s, KickTo(p))V (Pass, s ), (3.18) Q (Intercept, s, NavTo(p)) = V (NavTo(p), s)+ s Pr(s s, NavTo(p))V (Intercept, s ), (3.19) V (KickTo(p), s) = V (NavTo(p), s) = max power a, angle θ Q (KickTo(p), s, kick(a, θ)), (3.20) max power a, angle θ Q (NavTo(p), s, dash(a, θ)), (3.21) Q (KickTo(p), s, kick(a, θ)) = R(s, kick(a, θ))+ s Pr(s s, kick(a, θ))v (KickTo(p), s ), (3.22) Q (NavTo(p), s, dash(a, θ)) = R(s, dash(a, θ))+ s Pr(s s, dash(a, θ))v (NavTo(p), s ). (3.23) R(s, kick(a, θ)) = 1 Pr(s s, kick(a, θ)) KickTo(p) p 3.20 p Pass 51

74 MAXQ Attack Attack Intercept Position NavTo(p) p R(s, dash(a, θ)) = 1 Pr(s s, dash(a, θ)) 3.21 p 3.17 Attack Root Root Defense Defense p b = (b x, b y, bẋ, bẏ) p = (p x, p y, pẋ, pẏ, p α, p β ) p Pr(p b b, p) p Pr(p b b, p) = max {Pr(p b, t b, p)} Pr(p b, t b, p) p t Pr(p b, t b, p) = g(t f(p, b t )) b t t f(p, b t ) (p x, p y ) t g(δ) b t δ 3.5 g(δ) Pr(s s, Attack) = 1 (1 Pr(o b b, o)), (3.24) Pr(s s, Defense) = 1 opponent o teammate t (1 Pr(t b b, t)), (3.25) Pr(s s, Intercept) = 1[ player i : i b] Pr(i b b, i) (1 Pr(p b b, p)), (3.26) player p i Pr(s s, Position) = 1[ non-teammate i : i b] Pr(i b b, i) (1 Pr(p b b, p)), (3.27) player p i b = s[0] MAXQ kick dash MAXQ-OP Pass KickTo NavTo A* dash 52

75 MAXQ 1 Intercepting Probability 0.8 Probability Cycle Difference 3.5 turn MAXQ-OP Attack Impelling Speed V (Attack, s t ) s t t s s impelling_speed(s, s, α) = dist(s, s, α) + pre_dist(s, α), (3.28) step(s, s ) + pre_step(s ) α Aim Angle dist(s, s, α) α s s step(s, s ) pre_dist(s ) s α pre_step(s ) s α aim_angle(s) V (Attack, s) V (Attack, s t ) = impelling_speed(s 0, s t, aim_angle(s 0 )), (3.29) s 0 impelling_speed(s 0, s t, aim_angle(s 0 )) s t MAXQ-OP Full: MAXQ-OP Random: Full Attack Pass Dribble 53

76 MAXQ 3.6 RoboCup D Hand-coded: Random Pass Dribble 3 Full Random Hand-coded Attack Pass Dribble Full MAXQ-OP EvaluateState(Pass,, ) EvaluateState(Dribble,, ) Random Hand-coded Pass-Dribble, Shoot Pass Dribble Intercept Full RoboCup 2D Trainer 100 Helios2011 RoboCup 2011 Episode RoboCup WrightEagle success 2. x failure timeout 3.6 RoboCup 2011 # RoboCup #

77 MAXQ 3.4 WrightEagle Success Failure Timeout Full Random Hand-coded WrightEagle BrainsStomers : : ± 7.5% Helios : : ± 5.0% Helios : : ± 8.8% Oxsy : : ± 5.6% 3.4 Full Random Hand-coded 86.7% and 64.7% Random Hand-coded Pass-Dribble Full Pass Dribble Attack MAXQ-OP Pass Dribble Attack Defense Shoot Pass MAXQ-OP WrightEagle Full RoboCup 2D 4 4 BrainsStomers08 Helios10 Helios11 Oxsy11 BrainStormers08 Helios10 RoboCup 2008 RoboCup WrightEagle 3.5 p = n/n n N WrightEagle 82.0% 93.0% 83.0% 91.0% BrainsStomers08 Helios10 Helios11 Oxsy RoboCup 2D WrightEagle MAXQ MAXQ-OP RoboCup 2D 55

78 MAXQ 3.6 RoboCup 2D RoboCup : : 0.84 RoboCup : : 0.43 RoboCup : : 0.64 RoboCup : : 1.13 RoboCup : : 1.21 RoboCup : : 0.54 RoboCup : : 0.25 RoboCup : : 0.86 RoboCup : : MAXQ-OP 2. RoboCup 2D 3.6 MAXQ-OP MAXQ-OP MAXQ-OP MAXQ MAXQ-OP MAXQ-OP MDP RoboCup 2D MAXQ-OP MDP MAXQ-OP 56

79 MCTS MCTS MCTS Thompson MDP POMDP Dirichlet-NormalGamma Dirichlet-NormalGamma based Monte-Carlo Tree Search DNG-MCTS Dirichlet-Dirichlet-NormalGamma Dirichlet-Dirichlet-NormalGamma based Partially Observable Monte-Carlo Planning D²NG-POMCP MDP MDP [23] POMDP MDP [24] MDP POMDP MCTS [60] MCTS MCTS MCTS / [13, 14] UCB [63, 116] UCB 57

80 MAB [117] MAB UCB UCB UCB Auer UCB MAB [63] Thompson MAB Thompson Randomized Probability Matching [42] UCB Thompson [118] MAB Cumulative Regret Simple Regret [119] [120] Thompson UCB [121] Thompson MAB [ ] [97] [119] Bubeck MCTS [126] Thompson Thompson Thompson MDP POMDP Thompson MCTS Thompson MDP POMDP [43, 44] MDP POMDP Dirichlet-NormalGamma Dirichlet-NormalGamma based Monte-Carlo Tree Search DNG-MCTS Dirichlet-Dirichlet-NormalGamma Dirichlet-Dirichlet-NormalGamma based Partially Observable Monte-Carlo Planning D²NG-POMCP DNG-MCTS MCTS MDP 58

81 Dirichlet NormalGamma Thompson DNG-MCTS MDP POMDP POMDP POMDP D²NG-POMCP Dirichlet NormalGamma Thompson 4.2 DNG-MCTS [127] 4.3 [128] MCTS UCT UCT MAX/MIN UCB MAX/MIN MAB Thompson POMDP [77] 3 Branch-and-Bound Pruning [ ] [75, 76, 79, 80, 133, 134] [81, 83, ] 59

82 4.3 MDP MDP DNG-MCTS MDP [138, 139] ( ). X = {x 0, x 1,... } X w f X µ = E w [f] = X w(x)f(x) dx σ = Var w (f(x 0 )) + 2 i=1 Cov w (f(x 0 ), f(x i )) N(0, σ 2 ) x 0 n ( ) 1 n n f(x t ) µ N(0, σ 2 ). (4.1) n t= n 1 n n t=0 f(s t) N(µ, σ 2 /n) n 1 n n t=0 f(s t) n t=0 f(s t) n MDP π X s,π s π X s,a,π s a π X s,π X s,a,π π MDP S {s t } Pr(s s) = T (s s, π(s)) {s t } MDP H γ = 1 X s0,π = H t=0 R(s t, π(s t )) f(s t ) = R(s t, π(s t )) H X s0,π s 0 S γ 1 H γ 1 X s0,π π X s,π DNG-MCTS X s,π 60

83 s a π X s,a,π = R(s, a) + γx s,π, (4.2) s T(s s, a) Y s,a,π Y s,a,π = 1 γ (X s,a,π R(s, a)). (4.3) Y s,a,π s X s,π f Ys,a,π (x) = s S T(s s, a)f Xs,π (x). (4.4) s X s,π Y s,a,π X s,a,π Y s,a,π X s,a,π ( ). X θ L(x θ) θ Pr(θ) X Z = {x 1, x 2,... } θ Pr(θ Z) = η Pr(Z θ) Pr(θ) = η i L(x i θ) Pr(θ), (4.5) η = 1/ Pr(Z) N(µ s, 1/τ s ) X s,π µ s τ s τ = 1/σ 2 NormalGamma [140] (NormalGamma ). NormalGamma Hyper Parameters µ 0, λ, α, β λ > 0 α 1 β 0 Γ( ) Gamma (µ, τ) NormalGamma NormalGamma(µ 0, λ, α, β) (µ, τ) f(µ, τ µ 0, λ, α, β) = βα λ Γ(α) 2π τα 1 2 e βτ e λτ(µ µ 0 )2 2. (4.6) τ Gamma τ Gamma(α, β) τ µ µ N (µ 0, 1/(λτ)) 61

84 4.3.3 (NormalGamma ). X µ τ X N(µ, 1/τ) (µ, τ) NormalGamma (µ, τ) NormalGamma(µ 0, λ 0, α 0, β 0 ) n X {x 1, x 2,..., x n } x = n i=1 x i/n s = n i=1 (x i x) 2 /n (µ, τ) NormalGamma (µ, τ) NormalGamma(µ n, λ n, α n, β n ) µ n = λ 0µ 0 + n x λ 0 + n, (4.7) λ n = λ 0 + n, (4.8) α n = α 0 + n 2, (4.9) β n = β (ns + λ ) 0n( x µ 0 ) 2. (4.10) 2 λ 0 + n Y s,a,π Y s,a,π s S w s,a,s N(µ s, 1 τ s ), (4.11) w s,a,s = T(s s, a) w s,a,s 0 s S w s,a,s = 1 w s,a,s Dirichlet Dirichlet [140] s a Dirichlet Dirichlet(ρ s,a ) ρ s,a = (ρ s,a,s1, ρ s,a,s2,... ) Dirichlet (s, a) s ρ s,a,s 1 T(s s, a) (s, a) s T(s s, a) Dirichlet ρ s,a,s ρ s,a,s + 1. (4.12) X s,π X s,a,π MCTS s a µ s,0, λ s, α s, β s ρ s,a Thompson (Thompson ). Thompson a [ ] Pr(a) = 1 a = argmax E [X a θ a ] P a (θ a Z) dθ, (4.13) a a θ a a θ = (θ a1, θ a2,... ) E[X a θ a ] = xl a (x θ a ) dx θ a a 62

85 1 OnlinePlanning(s : state, T : tree) 2 Initialize H maximal planning horizon 3 repeat 4 DNG-MCTS(s, T, 0) 5 until resource budgets reached 6 return ThompsonSampling(s, 0, False) 7 DNG-MCTS(s : state, T : tree, d : depth) 8 if d H or s is terminal then 9 return 0 10 end 11 else if node s, d is not in tree T then 12 Initialize (µ s,0, λ s, α s, β s ), and ρ s,a for a A 13 Add node s, d to T 14 Play rollout policy by simulation for H d steps 15 Get the cumulative reward r 16 return r 17 end 18 else 19 a ThompsonSampling(s, d, T rue) 20 Execute a by simulation 21 Observe next state s and reward R(s, a) 22 r R(s, a) + γdng-mcts(s, T, d + 1) 23 α s α s β s β s + (λ s (r µ s,0 ) 2 /(λ s + 1))/2 25 µ s,0 (λ s µ s,0 + r)/(λ s + 1) 26 λ s λ s ρ s,a,s ρ s,a,s return r 29 end 4.1: Dirichlet-NormalGamma Thompson a A P a (θ a Z) θ a a = argmax E[X a θ a ]. (4.14) a DNG-MCTS Thompson s NormalGamma(µ s,0, λ s, α s, β s ) Dirichlet(ρ s,a ) s µ s w s,a,s Q(s, a) Q(s, a) = R(s, a) + γ s S w s,a,s µ s. (4.15) 63
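Putting Theorem 4.3.3 and Thompson sampling together, the sketch below shows the two node-level operations that DNG-MCTS relies on: updating a NormalGamma posterior with a single newly observed return (Eqs. 4.7-4.10 with n = 1, in the same update order as Algorithm 4.1) and drawing a sampled mean for action selection. The class layout is ours.

```python
import random

class NormalGamma:
    """Conjugate prior over the mean and precision of a Normal return distribution."""
    def __init__(self, mu0=0.0, lam=0.01, alpha=1.0, beta=100.0):
        self.mu0, self.lam, self.alpha, self.beta = mu0, lam, alpha, beta

    def update(self, r):
        """Posterior update with one observed return r (Eqs. 4.7-4.10, n = 1)."""
        self.alpha += 0.5
        self.beta += 0.5 * self.lam * (r - self.mu0) ** 2 / (self.lam + 1.0)
        self.mu0 = (self.lam * self.mu0 + r) / (self.lam + 1.0)
        self.lam += 1.0

    def sample_mean(self):
        """Draw (mu, tau) ~ NormalGamma and return mu, as used by Thompson sampling."""
        tau = random.gammavariate(self.alpha, 1.0 / self.beta)   # rate -> scale
        return random.gauss(self.mu0, 1.0 / (self.lam * tau) ** 0.5)

def thompson_select(q_sampler, actions):
    """Pick argmax_a of one posterior sample of Q(s, a)."""
    return max(actions, key=q_sampler)
```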

86 1 ThompsonSampling(s : state, d : depth, sampling : boolean) 2 foreach a A do 3 q a QValue(s, a, d, sampling) 4 end 5 return argmax a q a 6 QValue(s : state, a : action, d : depth, sampling : boolean) 7 r 0 8 foreach s S do 9 if sampling = True then 10 Sample w s Dirichlet(ρ s,a ) 11 end 12 else 13 w s ρ s,a,s / s S ρ s,a,s 14 end 15 r r + w s Value(s, d + 1, sampling) 16 end 17 r R(s, a) + γr 18 return r 19 Value(s : state, d : depth, sampling : boolean) 20 if d H or s is terminal then 21 return 0 22 end 23 else 24 if sampling = T rue then 25 Sample (µ, τ) NormalGamma(µ s,0, λ s, α s, β s ) 26 return µ 27 end 28 else 29 return µ s,0 30 end 31 end 4.2: DNG-MCTS Thompson DNG-MCTS DNG-MCTS ThompsonSampling sampling sampling Thompson Q(s, a) Q(s, a) = R(s, a) + γ s S ρ s,a,s s S ρ s,a,s µ s,0. (4.16) DNG-MCTS Thompson T 64

87 Rollout OnlinePlanning s T OnlinePlanning DNG-MCTS Rollout Z n = POMDP POMDP D²NG- POMCP POMDP π POMDP s, b s b J = S B S B J { s t, b t } P ( s, b s, b ) = T (s s, π(b)) T + (b b, π(b)). (4.17) X b,a b a X s,b,π s, b π X b,π b π X b,a X s,b,π X b,π POMDP I I = {r 1, r 2,..., r k } r i = R(s, a) s a X b,a Multinomial Distribution Multinomial(p 1, p 2,..., p k ) k i=1 p i = 1 p i = s S 1[R(s, a) = r i]b(s) X b,a = r i [141]. POMDP b 0-65

88 s 0, b 0 H POMDP γ = 1 X s0,b 0,π = H t=0 R(s t, π(b t )) f(s t, b t ) = R(s t, π(b t )) H s 0, b 0 J X s0,b 0,π γ 1 1 H X s0,b 0,π b π X b,π = X s,b,π s b(s) X b,π X s,b,π f Xb,π (x) = s S b(s)f Xs,b,π (x). (4.18) s, b J X s,b,π X b,π π X b,π X b,π X b,a X b,a Multinomial(p 1, p 2,..., p k ) Dirichlet b a p i Dirichlet Dirichlet(ψ b,a ) ψ b,a = (ψ b,a,r1, ψ b,a,r2,..., ψ b,a,rk ) r Dirichlet ψ b,a,r ψ b,a,r + 1. (4.19) X s,b,π N(µ s,b, 1/τ s,b ) µ s,b τ s,b NormalGamma (µ s,b, τ s,b ) NormalGamma (µ s,b, τ s,b ) NormalGamma(µ s,b,0, λ s,b, α s,b, β s,b ) µ s,b,0 λ s,b α s,b β s,b X b,π b(s) and X s,b,π s b a π X b,a,π 66 X b,a,π = X b,a + γx b,π, (4.20)

89 b T + (b b, a) X b,a,π E[X b,a,π ] = E[X b,a ] + γ b B E[X b,π]t + (b b, a) = E[X b,a ] + γ o O 1[b = ζ(b, a, o)]ω(o b, a)e[x b,π]. (4.21) E[X b,a,π ] Q π (b, a) = r(b, a) + γ o O Ω(o b, a)v π (ζ(b, a, o)). (4.22) X b,a X b,π Ω( b, a) Ω( b, a) Dirichlet Dirichlet(ρ b,a ) ρ b,a = (ρ b,a,o1, ρ b,a,o2,... ) (b, a) o Ω( b, a) ρ b,a,o ρ b,a,o + 1. (4.23) X b,a,π MCTS b s a µ s,b,0, λ s,b, α s,b, β s,b ψ b,a ρ b,a Thompson D²NG-POMCP Thompson s X b,a,π Dirichlet(ρ b,a ) o O w b,a,o Dirichlet(ψ b,a ) r I w b,a,r NormalGamma(µ s,b,0, λ s,b, α s,b, β s,b ) s, b J µ s,b b = ζ(b, a, o) b a o Q(b, a) Q(b, a) = r I w b,a,r r + γ o O D²NG-POMCP 1[b = ζ(b, a, o)]w b,a,o µ s,b b (s ). (4.24) D²NG-POMCP h s S 67

90 1 OnlinePlanning(h : history, T : tree) 2 repeat 3 Sample s P(h) 4 D²NG-POMCP(s, h, T, 0) 5 until resource budgets reached 6 return ThompsonSampling(h, 0, False) 7 Agent(b 0 : initial belief) 8 Initialize H maximal planning horizon 9 Initialize I {possible immediate rewards} 10 Initialize h 11 Initialize P(h) b 0 12 repeat 13 a OnlinePlanning(h, ) 14 Execute a and get observation o 15 h hao 16 P(h) ParticleFilter(P(h), a, o) 17 until terminating conditions 18 D²NG-POMCP(s : state, h : history, T : tree, d : depth) 19 if d H or s is terminal then 20 return 0 21 end 22 else if node h is not in tree T then 23 Initialize (µ s,h,0, λ s,h, α s,h, β s,h ) for s S, and ρ h,a and ψ h,a for a A 24 Add node h to T 25 Play rollout policy for H d steps 26 Get cumulative reward r 27 return r 28 end 29 else 30 a ThompsonSampling(h, d, T rue) 31 Execute a by simulation 32 Get state s, observation o and reward i 33 h hao 34 P(h ) P(h ) s 35 r i + γd²ng-pomcp(s, h, T, d + 1) 36 α s,h α s,h β s,h β s,h + (λ s,h (r µ s,h,0 ) 2 /(λ s,h + 1))/2 38 µ s,h,0 (λ s,h µ s,h,0 + r)/(λ s,h + 1) 39 λ s,h λ s,h ρ h,a,o ρ h,a,o ψ h,a,i ψ h,a,i return r 43 end 4.3: Dirichlet-Dirichlet-NormalGamma POMCP 68

91 1 ThompsonSampling(h : history, d : depth, sampling : boolean) 2 foreach a A do 3 q a QValue(h, a, d, sampling) 4 end 5 return argmax a q a 6 QValue(h : history, a : action, d : depth, sampling : boolean) 7 r 0 8 foreach o O do 9 if sampling = True then 10 Sample w o Dirichlet(ρ h,a ) 11 end 12 else 13 w o ρ h,a,o / o O ρ h,a,o 14 end 15 h hao 16 r r + w o Value(h, d + 1, sampling) 17 end 18 r γr 19 foreach i I do 20 if sampling = T rue then 21 Sample w i Dirichlet(ψ h,a ) 22 end 23 else 24 w i ψ h,a,i / i I ψ h,a,i 25 end 26 r r + w i i 27 end 28 return r 29 Value(h : history, d : depth, sampling : boolean) 30 if d H then 31 return 0 32 end 33 else 34 if sampling = T rue then 35 Sample (µ s, τ s ) NormalGamma(µ s,h,0, λ s,h, α s,h, β s,h ) 36 for s P(h) 1 37 return P(h) s P(h) µ s 38 end 39 else 1 40 return P(h) s P(h) µ s,h,0 41 end 42 end 4.4: D²NG-POMCP Thompson 69

92 MCTS h s a µ s,h,0, λ s,h, α s,h, β s,h ψ h,a ρ h,a P(h) [142, 143] 4.24 Q(h, a) = r I w h,a,r r + γ o O w h,a,o s P(hao) µ s,hao, (4.25) w h,a,r w h,a,o µ s,hao Dirichlet(ψ h,a ) Dirichlet(ρ h,a ) NormalGamma(µ s,hao,0, λ s,hao, α s,hao, β s,hao) Q(h, a) Q(h, a) = r I ψ h,a,r r I ψ h,a,r r + γ o O ρ h,a,o o O ρ h,a,o s P(hao) µ s,hao,0. (4.26) D²NG-POMCP T Rollout OnlinePlanning h T P(h) D²NG-POMCP Agent OnlinePlanning ParticleFilter Uninformative Priors [144] Principle of Indifference NormalGamma τ µ N(µ 0, 1/(λτ)) 1/(λτ) λτ 0 τ E[τ] = α/β Gamma Gamma(α, β) λα/β 0 70

93 λ > 0, α 1, β 0 λ α = 1 β µ 0 = 0 β Dirichlet Informative Priors DNG-MCTS DNG-MCTS NormalGamma λ µ 0 2α α/β µ τ NormalGamma(µ 0, λ, α, β) [140] MAB Thompson [124] Thompson MAB 1 DNG-MCTS X s,π π Q Q X s,π H Thompson 1 MAB Rollout H 1 MAB Thompson H 1 DNG-MCTS Rollout D²NG-POMCP DNG-MCTS D²NG- POMCP H Rollout 4.6 Thompson DNG-MCTS D²NG-POMCP Linux GHz 8G 71

94 Simple Regret e-05 RoundRobin Randomized 0.5-Greedy UCB1 ThompsonSampling Number of Action Pulls Simple Regret e-05 RoundRobin Randomized 0.5-Greedy UCB1 ThompsonSampling Number of Action Pulls (a) 8 arms. (b) 32 arms Simple Regret e-05 RoundRobin Randomized 0.5-Greedy UCB1 ThompsonSampling Number of Action Pulls Simple Regret e-05 RoundRobin Randomized 0.5-Greedy UCB1 ThompsonSampling Number of Action Pulls (c) 128 arms. (d) 512 arms. 4.1 MAB Thompson Simple Regret MAB Thompson RoundRobin, Randomized, 0.5-Greedy UCB RoundRobin [97] Randomized 0.5-Greedy [126] UCB UCB Bernoulli UCB 2 Thompson Beta (α = 1, β = 1) MAB Thompson Thompson Thompson MCTS 72
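For the Bernoulli bandit setting used in this experiment, Thompson sampling with a Beta(1, 1) prior per arm takes only a few lines; the following sketch of that baseline is our own illustration and is not the evaluation code behind the reported results.

```python
import random

def thompson_bandit(true_means, pulls):
    """Thompson sampling on a Bernoulli multi-armed bandit with Beta(1,1) priors."""
    k = len(true_means)
    alpha = [1.0] * k          # 1 + number of observed successes per arm
    beta = [1.0] * k           # 1 + number of observed failures per arm
    for _ in range(pulls):
        # Sample a mean for each arm from its Beta posterior and pull the best.
        arm = max(range(k), key=lambda i: random.betavariate(alpha[i], beta[i]))
        reward = 1 if random.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
    # Recommend the arm with the highest posterior mean (relevant for simple regret).
    return max(range(k), key=lambda i: alpha[i] / (alpha[i] + beta[i]))
```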

95 4.6.2 MDP DNG-MCTS UCT Canadian Traveler Problem, CTP Racetrack Problem Sailing Problem c(s, a) R(s, a) min max MDP min max MDP MDP DNG-MCTS Thompson min MDP-engine * MDP s S, a A s S (µ s,0, λ s, α s, β s ) (0, 0.01, 1, 100) ρ s,a,s 0.01 [57] UCT Q(s, a, d) CTP [145] CTP POMDP MDP n 3 m n m γ = 1 Anytime AO* AOT [57] UCTB UCTO [146] UCTB UCTO UCT CTP Rollout CTP DNG-MCTS UCT [57] Rollout 4.1 [57] UCTB UCTO MDP Rollout DNG-MCTS UCT Rollout UCT * MDP-engine 73

96 CTP UCT Rollout Rollout UCTB UCTO UCT DNG UCT DNG ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ±3 total Avg. Accumulated Cost UCT DNG-MCTS Avg. Accumulated Cost UCT DNG-MCTS Number of Iterations Number of Iterations (a) Barto-Big (b) DNG-MCTS UCTO UCT [55] DNG-MCTS UCT Rollout H = 100 Barto-Big s = γ = (a) DNG-MCTS UCT [56] 74

97 Avg. Returns Avg. Time Usage (ms) RTDP AOT -90 UCT DNG-MCTS Grid Width RTDP AOT UCT DNG-MCTS Grid Width (a) (b) 4.3 etaxi γ = 0.95 H = DNG-MCTS UCT Rollout (b) DNG-MCTS UCT Taxi Taxi n etaxi[n] n n R G Y B (0, 0) (0, n 1) (n 2, 0) (n 1, n 1) n 1 (0, 0) (1, 0) (1, n 1) (2, n 1) 2 (n 3, 0) (n 2, 0) Taxi n = 5 etaxi[n] Taxi DNG-MCTS LRTDP [90] AOT [57] UCT LRTDP AOT MDP Min-Min [90] LRTDP AOT Min-Min UCT DNG-MCTS Rollout etaxi 4.3(a) 4.3(b) 1000 etaxi[5] 4.2 DNG-MCTS UCT LRTDP AOT DNG-MCTS DNG-MCTS 75

98 4.2 etaxi[5] (ms) LRTDP ± ± 3.71 AOT ± ± 2.37 UCT ± ± 4.24 DNG-MCTS ± ± POMDP RockSample Problem PocMan Problem D²NG-POMCP POMCP D²NG-POMCP POMCP * POMCP s S a A r I o O h (µ s,h,0, λ s,h, α s,h, β s,h ) (0, 0.01, 1, 100) ψ h,a,r 0.01 ρ h,a,o 0.01 D²NG-POMCP POMCP [81] Rollout POMCP RockSample(n, k) k n n γ = OnlinePlanning D²NG-POMCP OnlinePlanning D²NG-POMCP POMCP POMCP 4.3 D²NG-POMCP AEMS2 [80] HSVI-BFS [75, 77] SARSOP [76] POMCP [81] AEMS2 HSVI-BFS * POMCP 76

99 Avg. Discounted Return POMCP -5 D 2 NG-POMCP e+06 Number of Iterations (a) RockSample(7,8) Avg. Discounted Return POMCP -5 D 2 NG-POMCP 1e Avg. Time Per Action (Seconds) (b) RockSample(7,8) Avg. Discounted Return POMCP -5 D 2 NG-POMCP e+06 Number of Iterations (c) RockSample(11,11) Avg. Discounted Return POMCP -5 D 2 NG-POMCP 1e Avg. Time Per Action (Seconds) (d) RockSample(11,11) Avg. Discounted Return POMCP -5 D 2 NG-POMCP Number of Iterations (e) RockSample(15,15) Avg. Discounted Return POMCP -5 D 2 NG-POMCP Avg. Time Per Action (Seconds) (f) RockSample(15,15) 4.4 RockSample D²NG-POMCP SARSOP POMDP AEMS2 HSVI- BFS PBVI [73] [77] SARSOP 1000 [76] POMCP D²NG-POMCP Rollout POMCP [81] D²NG-POMCP RockSample(7, 8) RockSample(11, 11) RockSample(15, 15) POMCP 77

100 4.3 RockSample D²NG-POMCP RockSample [7, 8] [11,11] [15,15] s 12, ,808 7,372,800 AEMS ± 0.22 N/A N/A HSVI-BFS ± 0.22 N/A N/A SARSOP ± ± 0.11 N/A POMCP ± ± ± 0.28 D²NG-POMCP ± ± ± 0.24 Avg. Discounted Return POMCP D 2 NG-POMCP Number of Iterations (a) Pocman Avg. Discounted Return POMCP D 2 NG-POMCP 1e Avg. Time Per Action (Seconds) (b) Pocman 4.5 PocMan D²NG-POMCP Avg. Undiscounted Return POMCP -50 D 2 NG-POMCP Number of Iterations (a) Pocman Avg. Undiscounted Return POMCP -50 D 2 NG-POMCP 1e Avg. Time Per Action (Seconds) (b) Pocman 4.6 PocMan D²NG-POMCP PocMan [81] PocMan PocMan PocMan 78

101 γ = [81] 4.6 D²NG-POMCP POMCP DNG-MCTS D²NG-POMCP width depth width depth UCT POMCP 3D MCTS DNG-MCTS D²NG-POMCP PocMan 4.7 DNG-MCTS D²NG-POMCP DNG-MCTS D²NG- POMCP MDP POMDP Thompson Thompson MDP UCT DNG-MCTS CTP RaceTrack Sailing POMDP D²NG-POMCP RockSample PocMan AEMS2 HSVI-BFS SARSOP POMCP 79

102

103 Particle Filter over Sets PFS Target Identification EM PFS PETS2009 CoBot [147, 148] CoBot [149, 150] CoBot CoBot CoBot Multi-Object Tracking MOT Tracking-by-Detection [151, 152] Object Detector Tracker [151, 153, 154] Global Nearest Neighbor GNN 81

104 [155] Joint Probabilistic Data- Association JPDA Association Probability [153] Multiple Hypothesis Tracking MHT [154] Particle Filter over Sets PFS PFS EM PFS PETS2009 PFS PFS CoBot PFS 5.2 Joint Multi-Target Probability Density JMPD [156] JMPD JMPD [157] Markov Random Field MRF Monte-Carlo Markov Chain MCMC [158] Interacting 82

105 [159] Rao-Blackwellized Random Finite Set RFS [ ] Finite Set Statistics FISST [165] FISST 5.3 PFS PFS S n S = {X i } i=1:n n S = {x i } i=1:n Pr(S) = σ A n Pr(X 1 = x σ(1), X 2 = x σ(2),..., X n = x σ(n) ) A n {i} i=1:n Pr(X 1 = x σ(1), X 2 = x σ(2),..., X n = x σ(n) ) X 1 = x σ(1) X 2 = x σ(2) X n = x σ(n). S = {x i } i=1:n x {x i } i=1:n S X {X i } i=1:n S ψ ψ : {X i } i=1:n {x i } i=1:n S O = {o i } i=1:n n k S = {o (i) } i=1:k S Pr(S) = 1 k! = 1 n(n 1) (n k+1) ( n k) ( ) n k = n! k!(n k)! X f X (x) S = {X i } i=1:n S X n S = {x i } i=1:n Pr(S) = n! 1 i n f X(x i ) S = {X i } i=1:n n S 3 {Head} {Head, Tail} {Tail} Pr({Head, Tail}) = 1/ Head Tail 83

106 5.3.2 PFS S = {s i } i=1: S s = (x, y, ẋ, ẏ) (x, y) (ẋ, ẏ) (x, y) (x, y) + (ẋ, ẏ)τ + 1(ẍ, 2 ÿ)τ2 (ẋ, ẏ) (ẋ, ẏ) + (ẍ, ÿ)τ τ (ẍ, ÿ) (ẍ, ÿ) = (p cos θ, p sin θ) p N(0, σ 2 p) Dash Power θ U(0, 2π) Dash Direction N U σ 2 p Filed of View FOV Birth-Death Process λ S µ O = {o i } i=1: O o = (x, y, c) (x, y) c [0, 1] Human Detector Support Vector Machine SVM [166] 0.5 s = (x, y, ẋ, ẏ) Pr(o s) o = (x, y, c) Pr(o s) = Pr(c 1) Pr(x, y x, y) Pr(c 1) c Pr(x, y x, y) (x, y) (x, y ) Beta Pr(c 1) = Beta(c 2, 1) Pr(x, y x, y) = N(x, y x, y, Σ) Σ False Detection Pr(o ) o = (x, y, c) Pr(o ) = Pr(c 0)f b (x, y ) Pr(c 0) c f b (x, y ) (x, y ) Beta Pr(c 0) = Beta(c 1, 2) f b F O M S Missing Detection F M 84

107 O F = S M O S = { F i, M i } i=1: O S F-M O S = ( O )( S ) ( 0 i min{ O, S } i i = O + S ) O ν S ξ τ Pr(O S) = F,M O S Pr(O F S M) (ντ) F e ντ o F P(o ) ( S ξτ) M e S ξτ M! 1 ), (5.1) ( S M Pr(O F S M) f F (F) = (ντ) F e ντ o F P(o ) f M(M) = ( S ξτ) M e S ξτ 1 Ψ O F S M M! ( M ) S S M O F Pr(O F S M) = ψ Ψ O F s S M S M Pr(ψ(s) s). (5.2) ( O )( S ) 0 i min{ O, S } i i i! = Ω(( max{ O, S } ) min{ O, S } ) e PFS m! m = S M = O F m > 2 Pr(o s) c(s, o) = log(pr(o s)) Murty [167] Murty N N in O(kN 3 ) N N k Top k [168] F-M ( ) O + S O F-M F, M f F (F)f M (M) 85
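In practice the dominant contribution to Pr(O-F | S-M) in Eq. (5.2) comes from the best assignment ψ*, which can be obtained from the cost matrix c(s, o) = -log Pr(o | s) with an optimal-assignment solver. The snippet below illustrates that single step using scipy's Hungarian-algorithm implementation; it is a simplification of, not a substitute for, the Murty-based top-k enumeration described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_assignment_loglik(log_lik):
    """Given log_lik[i, j] = log Pr(o_j | s_i) for the non-missing humans and
    non-false detections, return the assignment maximizing the total
    log-likelihood, together with that log-likelihood."""
    cost = -log_lik                        # Hungarian algorithm minimizes cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), float(log_lik[rows, cols].sum())

# Example: 2 humans versus 2 detections.
ll = np.log(np.array([[0.8, 0.1],
                      [0.2, 0.7]]))
print(best_assignment_loglik(ll))
```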

108 Input: A set of detections O, and a set of humans S Output: Probability of observing O given S 1 Let Q a descending priority queue initially empty 2 Let F a list of all possible false detections F 3 Let M a list of all possible missing detections M 4 Sort F according to f F ( ) in descending order 5 Sort M according to f M ( ) in descending order 6 Add (1, 1) to Q with priority f F (F[1])f M (M[1]) 7 Let p 0 8 repeat 9 Let (i, j) Pop(Q) 10 Let q f F (F[i])f M (M[j]) 11 if F[i] = M[j] then 12 p p + q Murty(F[i], M[j]) 13 end 14 if i + 1 F then 15 Add (i + 1, j) to Q with priority f F (F[i + 1])f M (M[j]) 16 end 17 if j + 1 M then 18 Add (i, j + 1) to Q with priority f F (F[i])f M (M[j + 1]) 19 end 20 until q < threshold or Q is empty 21 return p 5.1: 5.1 Murty 5.2. f FM (i, j) = f F (F[i])f M (M[j]) k 1 k F M Q k Pop (i k, j k ) (i k, j k ) = argmax (i,j) Qk f FM (i, j) Q k+1 (i k, j k ) = Q k 1[i k +1 F ](i k +1, j k ) 1[j k +1 M ](i k, j k +1) f FM (i k +1, j k ) f FM (i k, j k ) f FM (i k, j k + 1) f FM (i k, j k ) f FM (i k+1, j k+1 ) f FM (i k, j k ) for 1 k F M F-M X = {s i } i=1: X t Pr(S t O 1, O 2,..., O t ) P t = { X (i) t, w (i) t } i=1:n N i=1 w = 1 Proposal Distribution π( X t 1, O t ) i N ˆX (i) t π( X (i) t 1, O t),

109 2. 1 i N, (a) Motion Weight m (i) t (b) Observation Weight o (i) t (c) Proposal Weight p (i) t (d) w (i) t 3. 1 i N ŵ (i) t = = w (i) t 1 w (i) t 1 j N w(j) t = Pr( ˆX (i) t X (i) t 1 ), = Pr(O t ˆX (i) t ), = π( ˆX (i) t m (i) t. o(i) t p (i) t. X (i) t 1, O t), 4. Resample 1 i N 1 i N ŵ(i) t { ˆX (i) t, ŵ (i) t } i=1:n X (i) t P t = { X (i) t, 1 } N i=1:n. δ ˆX (i) t (X (i) t ) w t w t 1 o t o = (x, y, c) Pr(1) = Pr(c 1) Pr(1) Pr(0) = 0.5 o Pr(1 c) = = c Pr(c 1) Pr(1)+Pr(c 0) Pr(0) o o s = (x, y,, ) Pr(s o) = η Pr(o s) Pr(s) = N(x, y x, y, Σ) o O c Pr(s o) s π s ( o) o = (x, y, c) s = (x, y,, ) π s ( o) = 1 c π s (s o) = cn(x, y x, y, Σ) s X o O X PFS X O O X O 3 φ = F, M, ψ F X M O ψ Ψ O F X M X M O F 5.1 Pr(O X) = φ Pr(O, φ X) φ = argmax φ Pr(O, φ X) Observation Likelihood φ = F, M, ψ F X π r ( X t 1, O t ) 87

110 1. X t Pr( X t 1 ), 2. φ = F, M, ψ given X t, 3. X = {s s π s ( o), o F }, 4. X t X t X, 5. ˆX t argmax X {X t,x t } Pr(O t X) Acceptance Test O P = {X i } i=1:n P P = {X X Pr( X), X P} P P = {X X π r ( X ), X P } [169] P P Probability Density Estimation Pr(X X) Pr(X P ) π r (X X) Pr(X P ) X X f s Pr(X P) = n! Pr( X = n P) s X f s(s P) γ γ Gamma (α 0, β 0 ) γ Gamma (α = α 0 + X P X, β = β 0 + N) Posterior Predictive Negative Binomial Distribution r = α p = 1 1+β Pr( X = n P) = NB(n; r, p) = ( ) n+r 1 n p n (1 p) r. s = (x, y, ẋ, ẏ) (ẋ, ẏ) x y Multivariate Kernel Density Estimator f s (s P) f f {x i } i=1:n f f ˆf(x {x i } i=1:n ) = 1 i n K(x x i) K( ) Kernel Function 1 n 88
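Abstracting away the set-specific details, the update described above follows the usual propose / weight / resample pattern of sequential importance resampling; a generic sketch of one such step is given below, where the callbacks propose, motion_prob, obs_prob and proposal_prob stand in for the PFS-specific functions and are placeholders of ours.

```python
import random

def pf_step(particles, obs, propose, motion_prob, obs_prob, proposal_prob):
    """One sequential importance resampling step over weighted particles.

    particles: list of (X, w) pairs, where X is a (set-valued) joint state.
    Returns an equally weighted, resampled particle set.
    """
    proposed, weights = [], []
    for X_prev, w_prev in particles:
        X_new = propose(X_prev, obs)                       # draw from the proposal
        w = w_prev * motion_prob(X_new, X_prev) * obs_prob(obs, X_new) \
            / max(proposal_prob(X_new, X_prev, obs), 1e-300)
        proposed.append(X_new)
        weights.append(w)
    total = sum(weights)
    weights = [w / total for w in weights]                 # normalize weights
    n = len(proposed)
    resampled = random.choices(proposed, weights=weights, k=n)  # multinomial resampling
    return [(X, 1.0 / n) for X in resampled]
```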

111 H(P) = {s s X, X P} P f s f s (s P) 1 H(P) s H(P) ϕ(x x )ϕ(y y ) s = (x, y,, ) ϕ H(P) H(P) P t Human Identification Identified Human Identity 3 h = (s, c, ρ) s c [0, 1] ρ ID h H(h) H(P t ) s = 1 H(h) s H(h) s c = H(h) N State Pool L t = {h i } i=1: Lt t L 0 = o O t h o H(h o ) L Ot = {h o o O t } t C t = L t 1 L Ot h C t H(h) s Labeling H(h) = {s l(s) = h, s H(P t )} 1. s H(P t )!h C t : l(s) = h 2. X P t s 1 X s 2 X s 1 s 2 l(s 1 ) l(s 2 ) f h h C t P = {f h h C t } P P = argmax P max l Pr(P t, l P) EM Maximum a Posterior MAP K- K-Means [170] E step: l (k) = argmax l Pr(P t, l P (k 1) ), M step: P (k) = argmax P Pr(P t, l (k 1) P). E f h P (k 1) l s H(P t ) f l (s)(s) X P t s X f l (s)(s) N. 0 X P 0 s X s o O 0 X O 0 C 0 = L O0 = O 0 89

112 X P 0 : X C 0 0 l 0 t T T l T L T = {l T (s) s X, X P T } X P T : X L T T + 1 C T+1 = L T L OT +1 L T L OT+1 = C T+1 = L T + L OT+1 X P T +1 X = (X _ X ) X O X _ P T X X X _ X X _ X O T + 1 O O T+1 X _ X O = X = X _ X + X O X _ L T X O = O O T+1 = L OT+1 X C T+1 X P T+1 l T+1 t 0 E l t M O t Maximal Likelihood Estimation MLE o O t H(o) H(P t ) X P t 1. X F, M, ψ = φ = argmax φ Pr(O t, φ X) 2. s X H(ψ (s)) H(ψ (s)) s H(ψ (s)) h C t f (k) h f (k) h (s) = o O t f h (s, o) + f h (s, ) = o O t Pr(s o)f h (o) + f h (s, ) = o O t 1[s H(o)]f h (o) + 1[ o : s / H(o)]f h ( ) f h (o) o h f h ( ) h f h (o) = Pr(o h) Pr(h) 1 H(o) H(h) f N h( ) = Pr( h) Pr(h) 1 N H(h) o O t H(o) H(h), P (k) = {f (k) h h C t} M l (0) P (k) l (k+1) L t L t = {l(s) s H(P t )} L t C t L t 5.2 FindBestAssignment ApproximateHuman h L t H(h) 5.4 PETS2009 PFS CoBot PFS 90

113 Input: Identities L t 1, state pools H(h) for h L t 1, particles P t, observation O t, and maximal EM steps EM Output: Identities L t, and state pools H(h) for h L t 1 Let L t, L Ot 2 foreach o O t do 3 Let H o 4 Propose h o as a potential new identity from o 5 L Ot L Ot h o 6 Let H(h o ) 7 end 8 Let C t L t 1 L Ot 9 foreach X P t do 10 Let F, M, ψ = φ = argmax φ Pr(O t, φ X) 11 H(ψ (s)) H(ψ (s)) s for each s X 12 end 13 foreach h L t 1 do 14 H(h) H(h) taking account particle filtering from P t 1 to P t 15 end 16 foreach s H(P t ) do 17 if h C t : s H(h) then 18 l(s) h 19 end 20 end 21 Let n 0 22 repeat 23 foreach X P t do 24 n n Let converged T rue 26 Let c(s, h) log(f h (s)) for s X, h C t 27 Let l FindBestAssignment(c) 28 foreach s X do 29 if l(s) l (s) then 30 converged False 31 l(s) l (s) 32 end 33 end 34 foreach h C t do 35 H(h) = {s l(s) = h, s H(P t )} 36 end 37 end 38 until converged = T rue or n > EM 39 L t L t l(s) for s H(P t ) 40 h ApproximateHuman(H(h)) for each h L t 41 return L t, {H(h) h L t } 5.2: Human Identification 91

114 Avg. Relative error (%) Avg. Relative Error Avg. Time Usage Avg. Time Usage (µs) Avg. Relative error (%) Avg. Relative Error Avg. Time Usage Avg. Time Usage (ms) e Assignments pruning ratio threshold e False-missing pruning threshold (a) (b) T a = 0.1 T fm = ± ± ± ± % 97.95% 0.026% 3.30% λ s = 0.06/s µ s = 0.02 ν s = 0.5 ξ s = v v v v T v T T T PFS (a) T T 1 2 2% T 0.1 T PFS 5.1(b) T

115 5.2 PFS Parameter PETS2009 Real Robot λ (1/s) µ (1/s) σ p (m 2 /s) ν (1/s) ξ (1/s) τ (s) T T Σ 0.5I 0.3I α 0 Gamma α β 0 Gamma β A (m 2 ) A (m 2 ) R N H EM PETS2009 [47] S2L1 PFS 7fps [171] m Bounding Box * 0 PFS CLEAR MOT [177] PFS 1 Multiple Object Tracking Accuracy MOTA False Positive False Negative ID Identity Switch Multiple Object Tracking Precision MOTP d (1 d) 100% n t t * [172] 93

Table 5.3: Quantitative results on the PETS2009 S2L1 dataset

Algorithm                     MOTA    MOTP    IDS   MT   FM
PFS^1 (proposed)              93.1%   76.1%
PFS^1,2 (proposed)            90.6%   74.5%
Milan [172]                   90.6%   80.2%
Milan et al. [173]            90.3%   74.3%
Segal et al. [174]            92%     75%
Segal et al. [174]^2          90%     75%
Zamir et al.^2 [175]          90.3%   69.0%
Andriyenko et al. [171]       81.4%   76.1%
Breitenstein et al. [176]^2   56.3%   79.7%

^1 Average over repeated runs.
^2 Evaluation over the whole range.

Figure 5.2: Example tracking results of the PFS algorithm on the PETS2009 S2L1 dataset.

Here n_t is the number of matches found by solving the optimal assignment problem at frame t: a ground-truth state and an estimated state are considered matched if the distance between them is less than 1 meter. Let d_t^(i) be the distance between a ground-truth state and the estimated state matched to it; MOTP is then computed as

MOTP = (1 - (sum_t sum_{1<=i<=n_t} d_t^(i)) / (sum_t n_t)) * 100%.  (5.3)

(a) Hardware configuration of CoBot. (b) A deployed CoBot service robot.
Figure 5.3: The CoBot experimental platform (images courtesy of the CORAL research group).

Let g_t be the true number of actual targets, a_t the number of estimated targets reported by the algorithm, and m_t the number of ID-switch errors; MOTA is then defined as

MOTA = (1 - sum_t (g_t + a_t - 2 n_t + m_t) / sum_t g_t) * 100%.  (5.4)

MOTP reflects the algorithm's ability to estimate target states precisely, while MOTA reflects its ability to successfully track targets and maintain their trajectories. In addition, this section also reports several metrics proposed in [178], including the number of mostly tracked (MT) targets, the number of track fragmentations (FM), and the number of ID switches. A target is called mostly tracked if it is successfully tracked for at least 80% of its lifetime; the number of track fragmentations counts how many times a target's ground-truth trajectory switches between being tracked and not being tracked. Table 5.3 presents the main experimental results, and Figure 5.2 shows some tracking examples, in which the white bounding boxes are the raw detections, and the estimated target trajectories and current particle states are drawn in different colors to indicate that they belong to different individuals.*

For comparison, [176] tracks each target independently within a particle filtering framework using greedy data association, and can be regarded as a good baseline for PFS; [174] models the multi-

* A complete video of the full experimental results is available at
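Returning to Eqs. (5.3) and (5.4), the two CLEAR MOT scores can be computed from simple per-frame counts; the small helper below is our own illustration of that arithmetic (the frame dictionary layout is an assumption, not a format used by the thesis).

```python
def clear_mot(frames):
    """Compute CLEAR MOT metrics from per-frame statistics.

    Each frame is a dict with:
      g - number of ground-truth targets,
      a - number of hypotheses reported by the tracker,
      n - number of matched target/hypothesis pairs (distance < 1 m),
      m - number of ID switches,
      d - sum of distances over the n matched pairs.
    """
    total_g = sum(f["g"] for f in frames)
    total_n = sum(f["n"] for f in frames)
    total_d = sum(f["d"] for f in frames)
    # Per-frame errors: misses (g - n), false positives (a - n), ID switches m,
    # i.e. exactly g + a - 2n + m as in Eq. (5.4).
    total_err = sum(f["g"] + f["a"] - 2 * f["n"] + f["m"] for f in frames)
    mota = (1.0 - total_err / total_g) * 100.0   # Eq. (5.4)
    motp = (1.0 - total_d / total_n) * 100.0     # Eq. (5.3)
    return mota, motp
```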

118 5.4 CoBot I Switch Linear Dynamical System Outlier [175] Generalized Minimum Clique Graph [171] [172] [173] Energy Function Spline Extended Kalman Filter EKF PFS PFS CoBot PFS CoBot CoBot CoBot Carnegie Mellon University Manuela M. Veloso CORAL Cooperate, Observe, Reason, Act, and Learn 96

119 5.5 CoBot II [149, 150] CoBot-1 CoBot-2 CoBot-3 CoBot-4 CoBot-2 5.3(a) CoBot 4 PTZ Microsoft Kinect Hokuyo CoBot CoBot CoBot-4 Kinect CoBot CoBot-2 CoBot-2 5.3(b) CoBot-2 PFS CoBot-2 Kinect 30Hz Histogram of Oriented 97

120 5.6 PFS CoBot Depth HOD 10Hz [179] 2.7GHz CPU 4GB Linux 3.5 PFS X 1m 1m * 5.5 PFS PFS * 98
