University of Science and Technology of China
A Dissertation for the Doctoral Degree
Decision-Theoretic Planning for Multi-Agent Systems
Author: Feng Wu
Speciality: Computer Application and Technology
Supervisor: Professor Xiao-Ping Chen
Finished Time: April 20th, 2011
ABSTRACT
Planning under uncertainty is one of the fundamental challenges in Artificial Intelligence. Decision theory offers a mathematical framework for optimizing decisions in these domains. In recent years, researchers have made rapid progress on decision making in single-agent settings, which enables relatively large problems to be solved by state-of-the-art MDP and POMDP algorithms. However, research on decentralized decision making is still in its infancy, and existing solutions can only solve very small toy problems. As a natural extension of Markov decision theory to multi-agent systems, the DEC-POMDP model has very high computational complexity, namely NEXP-hard. This is not surprising, given that agents in multi-agent settings must track not only the state transitions of the environment but also the potential behaviors of the other agents. As a result, the joint policy space can be huge, and searching this large space for the best joint policy is a key challenge when solving large DEC-POMDPs. Due to this complexity, exact algorithms can solve merely tiny problems. Therefore, my thesis work focuses on developing effective algorithms for solving general DEC-POMDPs approximately. Generally, planning in multi-agent systems can work either online or offline; this thesis contributes to the literature by proposing both online and offline algorithms. Furthermore, a model-free algorithm is presented for planning when the exact model is not available. Online algorithms plan while interacting with the environment. During execution, only very small regions of the joint policy space can be visited; thus, online algorithms can scale better in large real-world applications. However, online algorithms have their own challenges: the time for planning is very limited, because agents have to react to the environment immediately.
In DEC-POMDP settings, each agent receives only its own local observations, so a distributed planning framework is required to guarantee coordination among agents. In order to cooperate, agents must reason about all possible information held by the others, and this information grows exponentially over time. The communication resource is often limited by bandwidth, the environment, or the computation device; therefore online algorithms must optimize the usage of communication during planning. In this thesis, a novel approach called MAOP-COMM is presented to handle these challenges properly. Offline algorithms compute a complete plan prior to interacting with the environment. The key advantages are that there is no limit on planning time, and planning can be done in a centralized manner as long as the resulting policies can be executed by each agent according to its local information. Currently, the leading offline approaches combine bottom-up dynamic programming with top-down heuristic search to construct policies. The bottleneck is that the policy trees built at each step grow exponentially, so these methods may run out of time and memory very quickly. To address this challenge, this thesis improves on existing work and proposes two novel algorithms called PBPG and TBDP. The contribution of PBPG lies in constructing the best policy for each belief state directly, instead of enumerating all candidates before selecting the best one. TBDP is developed to address the challenge of the large state spaces of DEC-POMDPs; its main contribution is to use trial-based policy evaluation for reachable states and to perform the computation only when necessary. Complete knowledge of the model is required by both offline and online algorithms. Unfortunately, the exact form of a DEC-POMDP may not be available. Therefore, it is important to develop learning algorithms that can compute decentralized policies by only interacting with the environment. Motivated by this, a Monte-Carlo algorithm called DecRSPI is proposed to learn policies using a set of rollout samples drawn from the environment. DecRSPI is model-free and requires only a simulator or an environment that can be sampled. The contributions of this thesis to the multi-agent planning community are mainly fourfold: (1) It systematically studied multi-agent online planning and proposed the MAOP-COMM algorithm, which guarantees coordination among agents.
MAOP-COMM has three key components: a fast policy-search method based on linear programming that meets online time constraints; a policy-based merging solution for histories that identifies the most useful information while bounding memory usage; and a new communication strategy that detects inconsistency among beliefs to make better use of the communication resource. In the experiments, MAOP-COMM performed very well in a variety of testing domains. (2) It systematically studied the policy-generation problem in multi-agent offline planning and proposed the PBPG algorithm. PBPG completely replaces the backup operation and re-formulates policy generation as an optimization problem of finding the best mapping. An approximate algorithm was also proposed to find this mapping efficiently. Consequently, more policy trees can be kept as building blocks for the next iteration, and the total solution quality is greatly improved. In the experiments, PBPG achieved an order-of-magnitude improvement in runtime and obtained near-optimal solutions when the number of sub-policies was sufficiently large. (3) It systematically studied the policy-evaluation problem in multi-agent offline planning and proposed the TBDP algorithm. TBDP uses trial-based evaluation for reachable states and performs the computation only when necessary. A new policy representation with a layered structure and stochastic decision nodes was introduced; it formulates policy construction as an optimization over the parameter space and further speeds up the policy-generation process. Besides, TBDP can be implemented in a distributed manner and can exploit the computational power of multi-core computers. In the experiments, TBDP solved a problem with more than ten thousand states. (4) It introduced model-free techniques to multi-agent planning and proposed the DecRSPI algorithm. DecRSPI is a Monte-Carlo algorithm that requires only a simulated environment, learning a policy from rollout samples drawn from the simulator. An important property of DecRSPI is its linear time and space complexity in the number of agents. In the experiments, DecRSPI solved problems with up to 20 agents, an order-of-magnitude improvement in agent count compared with state-of-the-art algorithms.
Keywords: Multi-Agent Systems, Markov Decision Model, Planning Under Uncertainty, Coordination and Cooperation, Decentralized Partially-Observable Markov Decision Process (DEC-POMDP).
Abbreviations: MDP (Markov Decision Process); POMDP (Partially Observable MDP); MMDP (Multi-agent MDP); DEC-MDP (Decentralized MDP); DEC-POMDP (Decentralized POMDP); I-POMDP (Interactive POMDP); POSG (Partially Observable Stochastic Game); DP (Dynamic Programming); JESP (Joint Equilibrium-based Search for Policies); PBDP (Point-Based Dynamic Programming); MBDP (Memory-Bounded Dynamic Programming); IMBDP (Improved MBDP); MBDP-OC (MBDP with Observation Compression); PBIP (Point-Based Incremental Pruning); PBIP-IPG (PBIP with Incremental Policy Generation); MAOP-COMM (Multi-Agent Online Planning with Communication); PBPG (Point-Based Policy Generation); TBDP (Trial-Based Dynamic Programming); DGD (Distributed Gradient Descent); RTDP (Real-Time Dynamic Programming); DecRSPI (Decentralized Rollout Sampling Policy Iteration); Dec-Comm; BaGA (Bayesian Game Approximation).
Notation: $I$, the set of agents, with agent index $i$; $S$, the set of states, $s \in S$; $A_i$, the action set of agent $i$, $a_i \in A_i$, with joint action set $\vec A = \times_{i \in I} A_i$ and joint actions $\vec a \in \vec A$; $\Omega_i$, the observation set of agent $i$, $o_i \in \Omega_i$, with $\vec \Omega = \times_{i \in I} \Omega_i$ and joint observations $\vec o \in \vec \Omega$; $P(s' \mid s, \vec a)$, the transition function; $O(\vec o \mid s', \vec a)$, the observation function; $R(s, \vec a)$, the reward function; $b_0 \in \Delta(S)$, the initial state distribution; $Q_i$, the policy set of agent $i$, $q_i \in Q_i$, with $\vec Q = \times_{i \in I} Q_i$ and joint policies $\vec q \in \vec Q$; $V(s, \vec q)$, the value of joint policy $\vec q$ at state $s$; $V(b, \vec q) = \sum_{s \in S} b(s) V(s, \vec q)$; $H_i^t$, the set of local histories of agent $i$ at step $t$, $h_i^t \in H_i^t$, with $\vec H^t = \times_{i \in I} H_i^t$ and joint histories $\vec h^t \in \vec H^t$.
27 1 1.1 Agent Sequential Decision-Making Problem Planning Shortest Path Problem Automatic Planning Classical Planning 1
28 1.2 Markov Decision Theory Software Agent Physical Robot State Markovian Property Single-Agent MDP MDP Action Reward MDP Transition Function MDP Reward Function 2
29 MDP MDP MDP Horizon MDP Policy MDP P MDP MDP POMDP POMDP Observation POMDP Observation Function MDP POMDP POMDP POMDP MDP POMDP PSPACE Tiger Problem POMDP POMDP State={ Tiger_Left, Tiger_Right } Action ={ Open_Left, Open_Right, Listen } Observation = { Roar_Left, Roar_Right, None } 3
30 MDP POMDP POMDP POMDP POMDP The Curse of Dimensionality MDP N N 6 MDP POMDP 4
31 1.3 Multi-Agent Systems Joint Action Robot Soccer Game Theory Normal-Form Game Player Prisoner s Dilemma { } Nash Equilibrium Extensive-Form Game 5
32 1-1 0 Markov Game MDP MDP Non-Cooperative Game Cooperative Game Zero-Sum Game General-Sum Game Team Markov Game MMDP 6
33 ?? Roar R1 R2 Treasure Left Door Right Door Tiger 1.1 MMDP MDP MDP MMDP MDP MMDP MMDP MMDP [1] Internet DEC-POMDP POMDP Multi-Agent Tiger Problem [2] 7
34 POMDP DEC-POMDP DEC-POMDP NEXP [3] POMDP NSPACE DEC-POMDP POSG POSG 1.4 DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP MAOP-COMM MAOP-COMM PBPG TBDP PBPG 8
35 MBDP PBPG DEC-POMDP TBDP TBDP DecRSPI DecRSPI DecRSPI DecRSPI DecRSPI DecRSPI TBDP TBDP PBPG TBDP PBPG TBDP PBPG DecRSPI TBDP PBPG DecRSPI 9
36 DEC-POMDP DecRSPI MAOP-COMM MAOP-COMM MAOP-COMM 10
2
BDI Negotiation DEC-POMDP Universal Model Benchmark Problem
2.1 The DEC-POMDP Model
Definition (DEC-POMDP). A DEC-POMDP is a tuple $\langle I, S, \{A_i\}, \{\Omega_i\}, P, O, R, b_0 \rangle$, where $I = \{1, 2, \dots, n\}$ is the set of agents and $n = |I|$; when $n = 1$, the DEC-POMDP reduces to a POMDP.
The components are:
$S$: the finite set of system states, $s \in S$, satisfying the Markov property $P(s^{t+1} \mid s^0, \vec a^0, \dots, s^{t-1}, \vec a^{t-1}, s^t, \vec a^t) = P(s^{t+1} \mid s^t, \vec a^t)$.
$A_i$: the action set of agent $i$; the joint action set is $\vec A = \times_{i \in I} A_i$ with joint actions $\vec a = \langle a_1, a_2, \dots, a_n \rangle$.
$\Omega_i$: the observation set of agent $i$; the joint observation set is $\vec \Omega = \times_{i \in I} \Omega_i$ with joint observations $\vec o = \langle o_1, o_2, \dots, o_n \rangle$, where $o_i$ is agent $i$'s local observation.
$P: S \times \vec A \times S \to [0,1]$: the transition function; $P(s' \mid s, \vec a)$ is the probability of reaching $s'$ after taking joint action $\vec a$ in state $s$, so $P(\cdot \mid s, \vec a)$ is a distribution over next states.
$O: S \times \vec A \times \vec \Omega \to [0,1]$: the observation function; $O(\vec o \mid s', \vec a)$ is the probability of observing $\vec o$ after taking $\vec a$ and reaching $s'$.
$R: S \times \vec A \to \mathbb{R}$: the reward function; $R(s, \vec a)$ is the immediate reward for taking $\vec a$ in state $s$.
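The tuple above maps naturally onto a small data structure. The sketch below is a minimal illustration written for this text, not code from the thesis (all names are assumptions): it stores the components with joint actions and joint observations as tuples, and checks that $P$, $O$ and $b_0$ are proper distributions.

```python
import itertools
from dataclasses import dataclass

@dataclass
class DecPOMDP:
    """Container for a DEC-POMDP tuple <I, S, {A_i}, {Omega_i}, P, O, R, b0>."""
    states: list          # S
    actions: list         # actions[i] = A_i, one list per agent
    observations: list    # observations[i] = Omega_i, one list per agent
    P: dict               # P[(s, a)][s2] = P(s2 | s, a); a is a joint-action tuple
    O: dict               # O[(s2, a)][o] = O(o | s2, a); o is a joint-observation tuple
    R: dict               # R[(s, a)] = immediate reward
    b0: dict              # initial state distribution

    def joint_actions(self):
        return list(itertools.product(*self.actions))

    def validate(self, tol=1e-9):
        # every conditional distribution must sum to one
        for a in self.joint_actions():
            for s in self.states:
                assert abs(sum(self.P[(s, a)].values()) - 1.0) < tol
                assert abs(sum(self.O[(s, a)].values()) - 1.0) < tol
        assert abs(sum(self.b0.values()) - 1.0) < tol
```

Dictionaries of dictionaries keep the sketch readable; a real solver would use dense arrays indexed by state and joint-action ids.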
[Figure 2.1: two agents (R1, R2) interacting with a shared environment state and receiving a joint reward]
$b_0 \in \Delta(S)$ is the initial state distribution and $T$ is the horizon. At each step $t = 0, 1, 2, \dots, T-1$ the agents receive the immediate reward $r(t) = R(s, \vec a)$, and the cumulative reward is $r(0) + r(1) + r(2) + \cdots + r(T-1)$. Solving a DEC-POMDP means finding a joint policy $\vec q$ that maximizes the expected cumulative reward
$$V(\vec q) = E\left[ \sum_{t=0}^{T-1} R(s, \vec a) \,\middle|\, \vec q, b_0 \right] \quad (2.1)$$
Definition (policy). A policy of agent $i$ is a mapping $q_i \in Q_i$ from local observation sequences to actions, $q_i: \Omega_i^* \to A_i$: given the observation history $\vec o_i = (o_i^1, o_i^2, \dots, o_i^t)$, it prescribes an action $a_i \in A_i$. A joint policy is a tuple $\vec q = \langle q_1, q_2, \dots, q_n \rangle$. Its value function satisfies
$$V(s, \vec q) = R(s, \vec a) + \sum_{s' \in S} \sum_{\vec o \in \vec \Omega} P(s' \mid s, \vec a)\, O(\vec o \mid s', \vec a)\, V(s', \vec q_{\vec o}) \quad (2.2)$$
where $\vec a$ is the joint action at the roots of $\vec q$ and $\vec q_{\vec o}$ is the joint sub-policy of $\vec q$ after observing $\vec o$.
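Equation (2.2) lends itself to a direct recursive evaluation over the joint policy trees. In the sketch below (illustrative only; the pair encoding `(action, {observation: subtree})` for a policy tree is an assumption of this example), the value at a state is the immediate reward plus the expected value of the joint subtrees selected by each joint observation:

```python
def eval_joint_policy(s, trees, P, O, R, states, joint_obs, horizon):
    """V(s, q) per Eq. (2.2); trees = one (action, {obs: subtree}) pair per agent."""
    a = tuple(t[0] for t in trees)           # joint action at the tree roots
    v = R[(s, a)]
    if horizon <= 1:
        return v                             # leaves contribute only the reward
    for s2 in states:
        p = P[(s, a)].get(s2, 0.0)
        if p == 0.0:
            continue
        for o in joint_obs:
            po = O[(s2, a)].get(o, 0.0)
            if po == 0.0:
                continue
            # each agent descends into the subtree indexed by its own observation
            subtrees = tuple(t[1][oi] for t, oi in zip(trees, o))
            v += p * po * eval_joint_policy(s2, subtrees, P, O, R,
                                            states, joint_obs, horizon - 1)
    return v
```

The recursion visits every reachable (state, joint observation) branch, which is exactly the exponential cost that the thesis's later trial-based evaluation avoids.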
40 o b V (b, q) = s S b(s)v (s, q) DEC-POMDP q b 0 V (b 0, q) DEC-POMDP DEC-POMDP DEC-POMDP MTDP [4] POIPSG [5] DEC-POMDP DEC-POMDP DEC-POMDP NEXP P [6] DEC-POMDP MMDP [1] DEC-MDP DEC-MDP DEC-POMDP DEC-MDP DEC-MDP DEC-POMDP NEXP DEC-POMDP DEC-POMDP DEC- POMDP Transition-Independent DEC-MDP [7] Goal-Oriented DEC-POMDP [6] Event-Driven DEC-MDP [8] Network Distributed POMDP [9] 14
41 DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP POMDP I-POMDP) [10] Nested Belief I-POMDP I-POMDP DEC-POMDP DEC- POMDP DEC-POMDP 2.2 DEC-POMDP DEC- POMDP Decision Tree DEC-POMDP 15
[Figure: example depth-$T$ policy trees for Agent 1 and Agent 2, with action-labeled nodes and observation-labeled branches]
For horizon $T$, solving a DEC-POMDP by brute force requires evaluating
$$O\!\left( \left( |A_i|^{\frac{|\Omega_i|^T - 1}{|\Omega_i| - 1}} \right)^{|I|} \right) \quad (2.3)$$
joint policies, since each depth-$T$ policy tree has $\frac{|\Omega_i|^T - 1}{|\Omega_i| - 1}$ nodes, each labeled with one of $|A_i|$ actions. This double-exponential growth in $T$ reflects the NEXP-hardness of DEC-POMDPs. One exact approach is the heuristic-search algorithm MAA* [11], a multi-agent variant of A*.
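The blow-up of Eq. (2.3) is easy to reproduce numerically; a small helper written for this text (it assumes at least two observations, so the geometric sum is well defined):

```python
def num_joint_policies(n_actions, n_obs, horizon, n_agents):
    """Size of the joint policy space per Eq. (2.3), assuming n_obs >= 2."""
    nodes_per_tree = (n_obs ** horizon - 1) // (n_obs - 1)  # nodes in one policy tree
    trees_per_agent = n_actions ** nodes_per_tree
    return trees_per_agent ** n_agents

# 2 actions, 2 observations, horizon 3, 2 agents:
# 7 nodes per tree -> 2^7 = 128 trees per agent -> 128^2 = 16384 joint policies
print(num_joint_policies(2, 2, 3, 2))   # 16384
```

Moving to horizon 4 already gives $2^{30} \approx 1.1 \times 10^9$ joint policies for the same tiny model.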
Algorithm 2.1: Dynamic programming for DEC-POMDPs
    Initialize all depth-1 policy trees
    for t = 1 to T do                         // backwards
        Perform full backup on Q^t
        Evaluate policies in Q^{t+1}
        Prune dominated policies in Q^{t+1}
    return the best joint policy in Q^T for b_0
In contrast, MAA* searches top-down, from step 1 to step T-1, expanding partial joint policies and pruning with an upper-bound heuristic such as the value of the underlying MMDP. Other exact approaches for DEC-POMDPs include a mixed-integer linear programming (MILP) formulation based on the sequence form [12] and dynamic programming for POSGs [13].
Algorithm 2.2: Alternating maximization over joint policies
    Generate a random joint policy
    repeat
        foreach agent i do
            Fix policies of all agents except i
            for t = T downto 1 do             // forwards
                Generate the set of all possible belief states:
                    B^t(B^{t+1}, q_{-i}^t, a_i, o_i), for all a_i in A_i, o_i in Omega_i
            for t = 1 to T do                 // backwards
                foreach b^t in B^t do
                    Compute the best value V^t(b^t, a_i)
            forall possible observation sequences do
                for t = T downto 1 do         // forwards
                    Update the belief state b^t given q_{-i}^t
                    Select the best action by V^t(b^t, a_i)
    until no improvement in the policies of all agents
    return the current joint policy
In the bottom-up dynamic programming of Algorithm 2.1, the backup step builds all depth-(t+1) policy trees from the depth-t trees, the evaluation step computes their values, and the pruning step removes dominated trees before the next iteration. The exhaustive backup generates
$$|Q_i^{t+1}| = O\!\left( |A_i| \cdot |Q_i^t|^{|\Omega_i|} \right) \quad (2.4)$$
policy trees for agent $i$, where $Q_i^t$ is the set of depth-$t$ trees; this growth from step $t$ to $t+1$ makes full dynamic programming intractable for large DEC-POMDPs.
Algorithm 2.3: Point-based dynamic programming
    Initialize all depth-1 policy trees
    for t = 1 to T do                         // backwards
        Perform full backup on Q^t
        Evaluate policies in Q^{t+1}
        foreach agent i do
            foreach possible history h_i^{T-t-1} do
                Generate the set of all possible multi-agent belief states: B(h_i^{T-t-1}, Q_{-i}^{t+1})
                foreach b_i^{t+1} in B(h_i^{T-t-1}, Q_{-i}^{t+1}) do
                    Select the best policy q_i^{t+1} for b_i^{t+1}
        Prune all policies except the selected ones
    return the best joint policy in Q^T for b_0
JESP [2] performs alternating maximization over DEC-POMDP policies: fixing the policies of all agents but $i$, it computes a best response for agent $i$ over multi-agent belief states $b_i \in \Delta(S \times Q_{-i})$ and iterates over the agents; JESP converges only to a local optimum. PBDP [14] instead selects policies only for belief states reachable from $b_0$.
Algorithm 2.4: Memory-bounded dynamic programming (MBDP)
    Initialize all depth-1 policy trees
    for t = 1 to T do                         // backwards
        Perform full backup on Q^t
        Evaluate policies in Q^{t+1}
        for k = 1 to maxTrees do
            Select a heuristic h from the heuristic portfolio
            Generate a state distribution b in Delta(S) using h
            Select the best joint policy q^{t+1} in Q^{t+1} for b
        Prune all policies except the selected ones
    return the best joint policy in Q^T for b_0
PBDP prunes, at each step t, every policy that is not optimal for some reachable belief; however, the number of reachable belief states grows exponentially with t, so PBDP still scales poorly. MBDP [15] addresses this with a memory bound: at each iteration every agent keeps only a fixed number maxTrees of policy trees, so memory use is constant per step and runtime is linear in the horizon, allowing horizons of T = 10 and far beyond. As illustrated in Figure 2.3, MBDP combines the bottom-up dynamic programming of PBDP with top-down heuristics borrowed from POMDP methods: a heuristic portfolio generates joint belief states, and only the best joint policies for those beliefs (maxTrees of them) are retained.
[Figure 2.3: MBDP identifies relevant belief points with top-down heuristics and constructs policies with bottom-up DP [15]]
MBDP makes memory-bounded DEC-POMDP planning possible, but it still performs an exhaustive backup from step t to step t+1, which is exponential in the number of observations. IMBDP [16] replaces this with a partial backup over only the maxObs most likely observations, where maxObs <= |Omega_i|. MBDP-OC [17] instead compresses the observation space of MBDP while bounding the loss in solution value.
48 Point-Based Incremental Pruning PBIP [18] Branch-and-Bound MBDP PBIP PBIP PBIP PBIP-IPG [19] PBIP-IPG PBIP PBIP-IPG PBIP-IPG MBDP DEC-POMDP maxt ree maxt ree DEC-POMDP DEC-POMDP 22
49 DEC-POMDP DEC-POMDP Mis-Coordination DEC-POMDP DEC-POMDP 23
50 BaGA [20] BaGA DEC-POMDP Alternating-Maximization Type BaGA BaGA Dec-Comm) [21] Dec-Comm Q P OMDP Dec-Comm Dec-Comm Dec-Comm DEC-POMDP 24
51 Dec-Comm Communication DEC- POMDP DEC-POMDP DEC-POMDP POMDP NEXP NSPACE DEC-POMDP Broadcast Peer-to-Peer 25
52 DEC-POMDP Dec-Comm Value of Information [22] dec_pomdp_valued_com [23] Belief Divergence KL 26
53 2.4 DEC-POMDP DEC-POMDP DEC-POMDP POMDP DEC- POMDP POSG DEC-POMDP NEXP DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP MBDP MBDP 27
54 28
55 3 100 MAOP-COMM 3.1 MAOP-COMM MAOP- COMM Planning Executing 29
56 Agent 1 Agent 2 Plan Communication Plan B Execute o 1 World o 2 Execute B a 1 a 2 Update Update 3.1 Updating ( ). t {H t i i I}, B t H t i i B t B t = {b(h t ) h t H t, H t = i I H t i } ( ). i δ i : H i A i δ i (h i ) δ i h i δ = δ 1, δ 2,, δ n δ(h) δ h MAOP-COMM δ t i h t i h t 1 i o t i a i = δ t i(h t i) ( ). 30
Algorithm 3.1: Expansion of joint histories and beliefs
    Input: H^t, B^t, delta^t
    Output: H^{t+1}, B^{t+1}
    for h^t in H^t, o in Omega do
        a <- delta^t(h^t)
        // append a, o to the end of h^t
        h^{t+1} <- h^t + <a, o>
        // calculate the distribution of h^{t+1}
        p(h^{t+1}) <- p(o, a | h^t) p(h^t)
        // test if h^{t+1} is a reachable joint history
        if p(h^{t+1}) > 0 then
            H^{t+1} <- H^{t+1} + {h^{t+1}}
            // compute the belief state of h^{t+1}
            b^{t+1}(h^{t+1}) <- update belief with b^t(h^t), a, o
            // add b^{t+1} into a hash table indexed by h^{t+1}
            B^{t+1} <- B^{t+1} + {b^{t+1}(h^{t+1})}
    return H^{t+1}, B^{t+1}
DEC-POMDP MAOP-COMM ( ). p p
58 MAOP-COMM MAOP-COMM MAOP-COMM. MAOP-COMM b 0 32
59 MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM b 0 MAOP- COMM 3.2 MAOP-COMM NP MAOP-COMM MAOP-COMM ( ). t i h t i h t i = (a 0 i, o 1 i, a 1 i,, o t 1 i, a t 1, o t i) h t = h t 1, h t 2,, h t n ( ). b( h) (S) h i 33
Algorithm 3.2: MAOP-COMM
    Input: b_0, seed[1..T-1]
    foreach i in I (in parallel) do
        a^0 <- argmax_a Q(a, b_0)
        Execute the action a_i^0 and initialize h_i^0
        H^0 <- {a^0}; B^0 <- {b_0}; tau_comm <- false
        for t = 1 to T-1 do
            Set the same random seed by seed[t]
            H^t, B^t <- expand histories and beliefs in H^{t-1}, B^{t-1}
            o_i^t <- get the observation from the environment
            h_i^t <- update agent i's own local history with o_i^t
            if H^t is inconsistent with o_i^t then tau_comm <- true
            if tau_comm = true and communication available then
                Synch h_i^t with the other agents
                tau_comm <- false
            if agents communicated then
                h^t <- construct the communicated joint history
                b^t(h^t) <- calculate the joint belief state for h^t
                a^t <- argmax_a Q(a, b^t(h^t))
                H^t <- {h^t}; B^t <- {b^t(h^t)}
            else
                pi^t <- search the stochastic policy for H^t, B^t
                a_i^t <- select an action according to pi^t(a_i | h_i^t)
                H^t, B^t <- merge histories based on pi^t
            h_i^t <- update agent i's own local history with a_i^t
            Execute the action a_i^t
Given joint action $\vec a^{t-1}$ and joint observation $\vec o^t$, the joint belief is updated for every $s' \in S$ by
$$b^t(s' \mid \vec h^t) = \alpha\, O(\vec o^t \mid s', \vec a^{t-1}) \sum_{s \in S} P(s' \mid s, \vec a^{t-1})\, b^{t-1}(s \mid \vec h^{t-1}) \quad (3.1)$$
where $\alpha$ is a normalizing constant; below, $b(\vec h)$ is abbreviated as $b(h)$.
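The update of Eq. (3.1) is a standard Bayes filter over the joint model. A minimal sketch (the dictionary encoding of the model, keyed by state and joint action, is an assumption of this illustration):

```python
def update_belief(b, a, o, P, O, states):
    """b'(s') = alpha * O(o | s', a) * sum_s P(s' | s, a) b(s)   (Eq. 3.1)."""
    unnorm = {}
    for s2 in states:
        # predict: push the current belief through the transition model
        pred = sum(P[(s, a)].get(s2, 0.0) * b.get(s, 0.0) for s in states)
        # correct: weight by the likelihood of the joint observation
        unnorm[s2] = O[(s2, a)].get(o, 0.0) * pred
    alpha = sum(unnorm.values())
    if alpha == 0.0:
        raise ValueError("joint observation has zero probability under this belief")
    return {s2: p / alpha for s2, p in unnorm.items() if p > 0.0}
```

A zero normalizer means the observation is impossible under the maintained histories, which is exactly the inconsistency that triggers communication in MAOP-COMM.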
61 Agent 1 Agent 2 a 1 o 1 o 2 a 3 a 3 o 1 o 2 a 2 o 1 h 1 o 1 a 1 a 2 o 2 o 1 a 3 o 2 a 2 a 2 o 2 o 1 o 2 o 1 o 2 o 1 o 2 h 2 h 3 h 4 h 5 h 6 h 7 h 8 t o 1 h 1 o 1 a 2 a 1 o 2 o 1 a 1 o 2 a 3 a 3 o 2 o 1 o 2 o 1 o 2 o 1 o 2 h 2 h 3 h 4 h 5 h 6 h 7 h 8 q 1 q 2 q 3 q 4 t + 1 q 1 q 2 q 3 q h i MAOP-COMM δ V (δ) = p(s h)v (δ(h), s) (3.2) h H s S p(s h) h δ(h) = δ 1 (h 1 ), δ 2 (h 2 ),, δ n (h n ) MAOP-COMM δ(h) a i = δ i (h i ) MAOP-COMM MAOP-COMM DEC-POMDP T = 1 DEC-POMDP DEC-POMDP b 0 DEC-POMDP 35
Finding the optimal joint decision rule even for a single step is NP-hard [24]. MAOP-COMM therefore uses stochastic policies: $\pi_i(q_i \mid h_i)$ is the probability that agent $i$ with local history $h_i$ executes policy $q_i$, and $\pi = \langle \pi_1, \pi_2, \dots, \pi_n \rangle$. The value of a joint stochastic policy is
$$V(\pi) = \sum_{\vec h \in \vec H} p(\vec h) \sum_{\vec q} \prod_{i \in I} \pi_i(q_i \mid h_i)\, Q(\vec q, b(\vec h)) \quad (3.3)$$
where $p(\vec h)$ is the probability of the joint history $\vec h$, $b(\vec h)$ its belief state, and $Q(\vec q, b(\vec h))$ the value of executing $\vec q$ from $b(\vec h)$. Computing exact $Q$ values for a DEC-POMDP is itself intractable, so MAOP-COMM approximates them using the value function $V(s)$ of the underlying MDP, which can be computed efficiently offline.
Table 3.1: The linear program for improving agent i's stochastic policy
    Variables: epsilon, pi_i(q_i | h_i)
    Objective: maximize epsilon
    Improvement constraint:
        V(pi) + epsilon <= sum_{h in H} p(h) sum_q pi_i(q_i | h_i) prod_{k != i} pi_k(q_k | h_k) Q(q, b(h))
    Probability constraints:
        forall h_i in H_i: sum_{q_i} pi_i(q_i | h_i) = 1
        forall h_i in H_i, q_i: pi_i(q_i | h_i) >= 0
The joint Q-value is approximated using the underlying MDP:
$$Q(b, \vec a) = \sum_{s \in S} b(s) \left[ R(s, \vec a) + \sum_{s' \in S} P(s' \mid s, \vec a)\, V_{MDP}(s') \right] \quad (3.4)$$
where $V_{MDP}$ is the optimal value function of the underlying MDP. This $Q_{MDP}$ value upper-bounds the true $Q$ value, analogous to the QMDP approximation for POMDPs, and MAOP-COMM uses it when improving each $\pi_i$.
64 i π i (q i h i ) V (π) p(h) π i (q i h i )π i (q i h i )Q( q, b(h)) (3.5) h H q π i (q i h i ) = k i π k(q k h k ) 3.1 π i ε π Local Search Random Restarts 3.3 MAOP-COMM q i q i q i q i q i q i q i 3.3 DEC-POMDP 38
Algorithm 3.3: Stochastic policy search with random restarts
    Input: H, B
    for several restarts do
        // select the start point randomly
        pi <- initialize the parameters to be deterministic with random actions
        repeat
            epsilon <- 0
            foreach i in I do
                // optimize the policies alternately
                pi_i, epsilon' <- solve the linear program in Table 3.1 with H, B, pi_{-i}
                epsilon <- epsilon + epsilon'
        until epsilon is sufficiently small
        if pi is the current best policy then
            pi* <- pi
    return pi*
The number of local histories grows as $O(|\Omega_i|^T)$ per agent, i.e. $O(|\Omega_i|^{T|I|})$ jointly, so an online DEC-POMDP planner cannot afford to keep them all; this motivates merging histories.
66 ( ). i h i, h i Probabilistically Equivalent h i, p(h) = p(h ) h i, s, b(s h) = b(s h ) h = h i, h i, h = h i, h i h i i (Oliehoek et al. 25 ). h i h i h i i h i i i h i h i i b( h i, h i ) i i h i qi 40
67 3.3.2 ( ). i h i h i Policy Equivalent h i h i Conditional Plan DEC-POMDP i h i h i. 0 0 T b 0 t i h t i h t i h t i h t i q t i t + k k h k i h t i h k i h t i h k i q t+k i h t i h t i q t+k i q t i h t i h k i h t i h k i q t i q t+k i h t i h t i h t i h t i h t i h t i h k i h t i h k i h t i h k i t h t i, h t i k h t+k i, h t+k i h t i, h t i t + k h t+k i, h t+k i k h t+k i, h t+k i Q [25] Q i h t+k i, h t+k i k = 1, 2, 3, h t i, h t i h t i, h t i 41
68 i h i OL h i OR h i h i h i {OL} h i {OR} h i {OL} h i {OR} POMDP DEC-POMDP DEC-POMDP MBDP 42
69 DEC-POMDP MAOP-COMM k k k k k k k i k MAOP-COMM k MAOP-COMM 3.4 MAOP-COMM k k k 3.4 MAOP-COMM
Algorithm 3.4: Policy-based history merging
    Input: H^t, Q^t, pi^t
    foreach i in I do
        H_i^t <- {}
        // H_i is a hash table indexed by q_i
        H_i(q_i) <- {}, forall q_i in Q_i
        // group histories based on the policy
        foreach h_i in H_i^t do
            // get the policy of h_i according to pi_i^t
            q_i <- select a policy according to pi_i^t(q_i | h_i)
            // add h_i to the hash table with key q_i
            H_i(q_i) <- H_i(q_i) + {h_i}
        // generate a new set of histories
        foreach q_i in Q_i^t do
            // keep one history per policy
            if H_i(q_i) is not empty then
                h_i <- select a history from H_i(q_i) randomly
                H_i^t <- H_i^t + {h_i}
        // fill up the history set
        while |H_i^t| < |Q_i^t| do
            q_i <- select a policy from Q_i^t randomly
            if H_i(q_i) is not empty then
                h_i <- select a history from H_i(q_i) randomly
                H_i^t <- H_i^t + {h_i}
    return H^t
MAOP-COMM DEC-POMDP i o_i epsilon
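The merging step of Algorithm 3.4 — hash histories by the policy chosen for them, keep one representative per group, then fill back up to the policy count — can be sketched as follows (the `policy_of` callback stands in for sampling from $\pi_i^t(q_i \mid h_i)$ and is an assumption of this illustration):

```python
import random

def merge_histories(histories, policies, policy_of, seed=0):
    """Bound memory by keeping at most |Q_i| representative histories (cf. Alg. 3.4)."""
    rng = random.Random(seed)
    groups = {q: [] for q in policies}        # hash table indexed by q_i
    for h in histories:
        groups[policy_of(h)].append(h)        # group histories by their policy
    nonempty = [hs for hs in groups.values() if hs]
    merged = [rng.choice(hs) for hs in nonempty]          # one history per policy
    while len(merged) < len(policies) and nonempty:
        merged.append(rng.choice(rng.choice(nonempty)))   # fill up the history set
    return merged
```

Histories mapped to the same policy are interchangeable for planning purposes, so discarding all but one representative bounds memory at the cost of some information.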
71 Agent 1 Agent 2 N B Need o 1 o 2 Need B N Y Y Available Y Communication Y Available N N Postpone h h Postpone Plan Plan 3.3 h t i i t o t i i t B(h t i) h t i ( ϵ ). t B t i o t i ϵ max b, o t i { s S O( o t s, a) s S P (s s, a)b(s) } < ϵ (3.6) o t i k i Ω k b B(h t i) a o i MAOP-COMM ϵ ϵ 45
72 MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM [26] KL KL MAOP-COMM MAOP-COMM Dec-Comm 46
73 MAOP-COMM 3.5 MAOP-COMM MAOP-COMM h t = θ, b t θ = θ 1,, θ n θ i i b MAOP-COMM θ i = q t 1 i, o t i q t 1 i o t i θ i = a t 1 i, o t i h t H t h t = q1 t 1, o t 1, q2 t 1, o t 2,, qn t 1, o t n, b t }{{}}{{}}{{} } θ 1 θ 2 {{ θ n } θ h i a i q, 3.4 θ 0, b 0 q 1, q 2, θ 1, b 1, θ 2, b 2, θ 3, b 3, θ 4, b 4 1 o 1 q 1, q 1, o 1 2 o 2 q 2, q 2, o 2 47
74 Agent 1 q 1, t-1 o 1 Belief Pool θ 0, b 0 q 1, q 2, Agent 2 t-1 q 2, o 2 q 1, o 1 t θ 1, b 1 q 1, o 1 θ 2, b 2 q 1, o 1 θ 3, b 3 q 1, o 2 q 2, o 1 q 2, o 2 q 2, o 2 (o 1, o 1 ) (o 1, o 2 ) (o 2, o 2 ) θ 4, b 4 q 1, o 2 t q 2, o 2 q 2, o 1 (o 2, o 1 ) Agent Belief Pool Agent 2 q 1, o q 2, o 2 q 1, o 1 q 1, o 2 q 2, o 1 q 2, o 2 q 3 q 4 q 5 q 3, q 3, q 4, q 5, q 5, q 3 q q t i, o i q t i o i q t i, o i q t i, o i q t i, o i a 0 i, o 1 i, a 1 i, o 2 i,, a t 1 i, o t i 3.5 q 2, o 1 q 2, o 2 q 5 q 2, o 1 q 5, q 1, o 1 q 3 q 3, q 4 48
75 q 1, o 2 q 4, 1 q 1, o 2 q 3 1 q 3 q 3, 2 q 5 q 5, q 2, o 2 q 5 q i, o i q i o i 3.6 MAOP-COMM DEC-POMDP MDP MDP 20 Reward Time(s) Comm(%) MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM Java 2GB 2.4GHz lp_solve 5.5 Java 0.01 CPU BaGA-Comm [27] Dec-Comm [21] MAOP-COMM particle Filtering Dec-Comm Dec-Comm-PF [28] BaGA-Comm Dec-Comm
76 Clustering BaGA-Cluster MAOP-COMM Dec-Comm-PF BaGA-Comm Dec-Comm Dec-Comm-PF Dec-Comm MAOP-COMM Dec-Comm-PF MAOP- COMM FULL-COMM ϵ MAOP ϵ 0 MAOP-COMM ϵ = 0.01 Dec-Comm-PF 100 Dec-Comm-PF 100 DEC-POMDP 1 FULL-COMM FULL-COMM MAOP-COMM MDP FULL-COMM POMDP MDP FULL-COMM DEC-POMDP POMDP NEXP NSPACE FULL-COMM MAOP-COMM FULL-COMM 50
77 3.2 T Algorithm Reward Time(s) Comm(%) Reward Time(s) Comm(%) Broadcast Channel Meeting in a 3 3 Grid MAOP < MAOP-COMM < Dec-Comm-PF < FULL-COMM < < MAOP < MAOP-COMM < Dec-Comm-PF 90.0 < FULL-COMM < < Cooperative Box Pushing Stochastic Mars Rover MAOP MAOP-COMM Dec-Comm-PF FULL-COMM < < MAOP MAOP-COMM Dec-Comm-PF FULL-COMM < < Broadcast Channel [29] Meeting in a Grid [29] Cooperative Box Pushing [16] Stochastic Mars Rover [19] DEC-POMDP Multi-Agent Tiger [2] Recycling Robots [30] Fire Fighting [31]
78 0.01 MAOP-COMM Dec-Comm-PF MAOP FULL-COMM MAOP-COMM Dec-Comm-PF MAOP-COMM MAOP Dec-Comm MAOP T = 100 MAOP-COMM Dec-Comm-PF Dec-Com-PF MAOP MAOP-COMM Dec-Comm-PF MAOP MAOP-COMM MAOP-COMM 52
79 1 O Dec-Comm-PF MAOP-COMM Dec-Comm-PF MAOP 100 MAOP-COMM MAOP-COMM ϵ ϵ
80 3.3 T Algorithm Reward Time(s) Comm(%) Reward Time(s) Comm(%) Grid Soccer 2 3 Grid Soccer 3 3 MAOP MAOP-COMM Dec-Comm-PF FULL-COMM < < MAOP MAOP-COMM Dec-Comm-PF FULL-COMM < < DEC-POMDP 3.3 MAOP-COMM Dec-Comm-PF MAOP MAOP-COMM ϵ MAOP-COMM MAOP Dec-Comm-PF 10 MAOP-COMM Dec-Comm-PF MAOP-COMM Dec-Comm-PF MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM MAOP-COMM 54
81 3.6.2 MAOP-COMM Dec-Comm MAOP-COMM MAOP-COMM MAOP-COMM-POSTPONE MAOP-COMM-DROP MAOP-COMM- DROP MAOP-COMM-POSTPONE 55
[Figure 3.7: accumulated reward of MAOP-COMM-POSTPONE, MAOP-COMM-DROP and Dec-Comm as the threshold of possibility varies]
MAOP-COMM Dec-Comm MAOP-COMM-POSTPONE Dec-Comm Dec-Comm
[Figure 3.8: percentage of communication of MAOP-COMM-POSTPONE, MAOP-COMM-DROP and Dec-Comm as the threshold of possibility varies]
MAOP-COMM-POSTPONE MAOP-COMM-POSTPONE MAOP-COMM 3.7 MAOP-COMM MAOP-COMM Scalability DEC-POMDP MAOP-COMM
84 DEC-POMDP MDP Q DEC-POMDP MAOP-COMM MAOP-COMM MAOP-COMM DEC-POMDP MAOP-COMM MAOP-COMM MAOP-COMM 58
85 4 DEC-POMDP MBDP [15] maxtrees MBDP T [15] DEC-POMDP MBDP MBDP Exhaustive Backup i Q i A i Q i Ωi MBDP [16 19] maxtrees 10 PBPG MBDP PBPG DEC-POMDP PBPG PBPG MBDP MBDP PBPG MBDP 59
86 PBPG MBDP Trial MDP POMDP TBDP TBDP RTDP [32] TBDP Scalability TBDP PBPG TBDP MBDP TBDP PBPG 4.1 PBPG PBPG MBDP 60
[Figure: (1) depth-t policy trees; (2) backup for a belief b; (3) the best mapping from observations to sub-policies]
Definition (mapping). Let $b$ be a belief state, $\vec a$ a joint action, and $Q^t = \langle Q_1^t, Q_2^t, \dots, Q_n^t \rangle$ the sets of depth-$t$ policy trees with value function $V^t: \vec Q^t \times S \to \mathbb{R}$. A mapping for agent $i$ is $\delta_i: \Omega_i \to Q_i^t$, $i \in I$. The value of the depth-$(t+1)$ policy induced by $\vec a$ and the mappings is
$$V^{t+1}(\vec a, b) = R(\vec a, b) + \sum_{s', \vec o} Pr(\vec o, s' \mid \vec a, b)\, V^t(\vec \delta(\vec o), s') \quad (4.1)$$
where $\vec \delta(\vec o) = \langle \delta_1(o_1), \delta_2(o_2), \dots, \delta_n(o_n) \rangle = \langle q_1^t, q_2^t, \dots, q_n^t \rangle$, $R(\vec a, b) = \sum_s b(s) R(s, \vec a)$, and $Pr(\vec o, s' \mid \vec a, b) = O(\vec o \mid s', \vec a) \sum_s P(s' \mid s, \vec a)\, b(s)$. From step $t$ to $t+1$, each agent $i$ assigns to every observation $o_i$ a sub-policy $q_i^t = \delta_i(o_i)$; for example, $\delta_1: o_1 \mapsto q_3, o_2 \mapsto q_1$ and $\delta_2: o_1 \mapsto q_2, o_2 \mapsto q_3$. The best joint action for $b$ is
$$\vec a^* = \arg\max_{\vec a \in \vec A} V^{t+1}(\vec a, b) \quad (4.2)$$
Proposition (PBPG selects the same values as MBDP). For a belief $b$, let $\vec q^*_{t+1}$ be the best depth-$(t+1)$ joint policy that MBDP would select at step $t$, with value
$$V^{t+1}(\vec q^*_{t+1}, b) = \sum_{s \in S} b(s)\, V^{t+1}(\vec q^*_{t+1}, s) \quad (4.3)$$
where $V^{t+1}(\vec q^*_{t+1}, s)$ is the value of $\vec q^*_{t+1}$ at state $s$. Expanding and rearranging,
$$\begin{aligned} V^{t+1}(\vec q^*_{t+1}, b) &= \sum_s b(s) \Big[ R(s, \vec a) + \sum_{s', \vec o} P(s' \mid s, \vec a)\, O(\vec o \mid s', \vec a)\, V^t(\vec q^t_{\vec o}, s') \Big] \\ &= \sum_s b(s) R(s, \vec a) + \sum_{s', \vec o} \Big[ O(\vec o \mid s', \vec a) \sum_s P(s' \mid s, \vec a)\, b(s) \Big] V^t(\vec q^t_{\vec o}, s') \\ &= R(\vec a, b) + \sum_{s', \vec o} Pr(\vec o, s' \mid \vec a, b)\, V^t(\vec q^t_{\vec o}, s') = V^{t+1}(\vec a, b) \end{aligned}$$
where $\vec a$ is the joint action at the root of $\vec q^*_{t+1}$ and $\vec q^t_{\vec o}$ its joint sub-policy after observing $\vec o$. Hence constructing the best mapping for $b$ as in Eq. (4.1) achieves the same value as MBDP's enumerate-then-select backup.
[Figure 4.1: PBPG — (1) a belief b and candidate joint actions; (2) the depth-t trees; (3) the best mapping constructed directly, instead of MBDP's exhaustive backup]
It remains to find the best mappings $\delta_i$, $i \in I$.
The space of candidate mappings has size $(|Q_i^t|^{|\Omega_i|})^{|I|}$, and finding the optimal ones is NP-hard [24]. PBPG therefore searches over stochastic mappings $\pi_i: \Omega_i \times Q_i^t \to \mathbb{R}$, $i \in I$, where $\pi_i(q_i^t \mid o_i)$ is the probability that agent $i$ maps observation $o_i$ to sub-policy $q_i^t$. Since $R(b, \vec a)$ is constant once $b$ and $\vec a$ are fixed, it can be dropped, and deterministic and stochastic mappings are valued by
$$V^{t+1}(\vec \delta, b) = \sum_{s', \vec o} Pr(\vec o, s' \mid \vec a, b)\, V^t(\vec \delta(\vec o), s') \quad (4.4)$$
$$V^{t+1}(\vec \pi, b) = \sum_{s', \vec o} Pr(\vec o, s' \mid \vec a, b) \sum_{\vec q^t} \prod_i \pi_i(q_i^t \mid o_i)\, V^t(\vec q^t, s') \quad (4.5)$$
PBPG improves the $\pi_i$, $i \in I$, alternately, one agent at a time, until no agent can improve — a (Nash) local optimum. Each $\pi_i$ can then be rounded to a deterministic mapping: for every $o_i$, assign probability 1 to a best $q_i^t \in Q_i^t$ and 0 to the rest.
Table 4.1: The linear program for improving agent i's mapping
    Variables: epsilon, pi_i(q_i^t | o_i)
    Objective: maximize epsilon
    Improvement constraint:
        V^{t+1}(pi, b) + epsilon <= sum_{s',o} Pr(o, s' | a, b) sum_q pi_i(q_i^t | o_i) pi_{-i}(q_{-i}^t | o_{-i}) V^t(q^t, s')
    Probability constraints:
        forall o_i in Omega_i: sum_{q_i^t in Q_i^t} pi_i(q_i^t | o_i) = 1
        forall o_i in Omega_i, q_i^t in Q_i^t: pi_i(q_i^t | o_i) >= 0

Algorithm 4.1: Point-based policy generation (PBPG)
    Input: T — horizon of the DEC-POMDP model; maxTrees — max number of trees at each step
    Q^1 <- initialize and evaluate all 1-step policy trees
    for t = 1 to T-1 do
        Q^{t+1} <- {}
        for k = 1 to maxTrees do
            b <- generate a belief using a heuristic portfolio
            nu* <- -infinity
            for a in A do
                pi* <- compute the best mappings with b, a
                q <- build a joint policy tree based on a, pi*
                nu <- evaluate q given the belief state b
                if nu > nu* then q* <- q; nu* <- nu
            Q^{t+1} <- Q^{t+1} + {q*}
        evaluate every joint policy in Q^{t+1}
    q*_T <- select the best joint policy from Q^T for b_0
    return q*_T

Fixing the mappings of all agents except $i$, Eq. (4.5) is linear in $\pi_i(q_i \mid o_i^t)$:
$$V^{t+1}(\vec \pi, b) = \sum_{s', \vec o, \vec q^t} Pr(\vec o, s' \mid \vec a, b)\, \pi_i(q_i \mid o_i^t)\, \pi_{-i}(\vec q_{-i} \mid \vec o_{-i}^t)\, V^t(\vec q^t, s')$$
where $\pi_{-i}(\vec q_{-i} \mid \vec o_{-i}^t) = \prod_{k \neq i} \pi_k(q_k^t \mid o_k)$. The linear program of Table 4.1 then finds the $\pi_i$ with the largest improvement $\varepsilon$.
91 4.1.3 i π i π i (qi o t i ) o i qi t 4.1 PBPG PBPG ( A maxt rees) MBDP ( A i maxt rees Ωi ) I PBPG ( A i I maxt rees + maxt rees I ) MBDP PBPG MBDP 0 PBPG MBDP PBPG PBPG MDP MDP DEC-POMDP MDP PBPG MBDP PBPG MBDP PBPG MBDP IMBDP MBDP-OC PBIP IPG 65
92 MBDP PBPG PBPG MBDP MBDP MBDP PBPG 4.2 (PO)MDP RTDP [32] TBDP DEC-POMDP TBDP TBDP MBDP TBDP 4.2 TBDP POMDP b (S) DEC-POMDP 66
Algorithm 4.2: Trial-based dynamic programming (TBDP)
    Generate a random joint policy
    for t = 1 to T do                         // bottom-up iteration
        foreach unimproved joint policy q^t do
            Sample a state distribution b^t by trials
            repeat
                foreach agent i do
                    Fix the policies of all the agents except i
                    Formulate a linear program:
                        if V(s, q^{t-1}) is required then
                            Evaluate s, q^{t-1} by trials
                    Improve the policy q_i^t by solving the LP
            until no improvement in the policies of all agents
    return the current joint policy
PBDP and MBDP compute belief states exactly by Bayes' rule,
$$b'(s') = \alpha\, O(\vec o \mid s', \vec a) \sum_{s \in S} P(s' \mid s, \vec a)\, b(s) \quad (4.6)$$
where $\alpha$ is a normalizing constant; each update sums over the whole state space. TBDP instead samples state distributions by trials (Algorithm 4.3), in the spirit of RTDP for MDPs.
Algorithm 4.3: Sampling a state distribution by trials
    Input: t — the sampled step; delta — the heuristic policy
    b(s) <- 0 for every state s in S
    for several number of trials do
        s <- draw a state from the start distribution b^T
        for k = T downto t do                 // top-down trial
            a <- execute a joint action according to delta
            s', o <- get the system responses
            s <- s'
        b(s) <- b(s) + 1
    return the normalized b
TBDP represents policies as layered finite state controllers with mixed strategies.
Definition (stochastic policy node). A policy node $q_i \in Q_i^t$ of agent $i$ at layer $t$ has parameters $\psi_i, \eta_i$: an action-selection distribution $\psi_i$ over $A_i$ with probabilities $p(a_i \mid q_i)$, and a node-transition distribution $\eta_i$ giving, for each observation $o_i$, probabilities $p(q_i' \mid q_i, o_i)$ over the next-layer nodes $q_i' \in Q_i^{t-1}$. Together, $p(q_i', a_i \mid q_i, o_i) = p(a_i \mid q_i)\, p(q_i' \mid q_i, o_i)$.
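Algorithm 4.3 replaces exact Bayes updates with Monte-Carlo trials: simulate the heuristic policy from the start distribution down to the target step and count the states reached. A generic sketch (the `step` and `heuristic` callbacks are assumptions of this example, standing in for the simulator and the heuristic portfolio):

```python
import random
from collections import Counter

def sample_state_distribution(b_start, n_steps, step, heuristic,
                              n_trials=1000, seed=0):
    """Estimate the state distribution after n_steps by counting trials (cf. Alg. 4.3).

    step(s, a, rng) -> next state : one transition of the simulator
    heuristic(k)    -> action     : the top-down heuristic policy at step k
    """
    rng = random.Random(seed)
    states, weights = zip(*b_start.items())
    counts = Counter()
    for _ in range(n_trials):
        s = rng.choices(states, weights=weights)[0]   # draw a start state
        for k in range(n_steps):                      # top-down trial
            s = step(s, heuristic(k), rng)
        counts[s] += 1                                # record the reached state
    return {s: c / n_trials for s, c in counts.items()}
```

Only states actually reached appear in the result, which is why trial-based sampling scales to large state spaces where exact updates over all of $S$ do not.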
The linear program for improving node $q_i$ of agent $i$ (variables $x$):
    Maximize:
        sum_a x(a_i | q_i) prod_{k != i} p(a_k | q_k) R(b, a)
          + sum_{a,o,s'} P(s', o | b, a) sum_{q'} x(q_i', a_i | q_i, o_i) prod_{k != i} p(q_k', a_k | q_k, o_k) V(s', q')
    Subject to:
        forall a_i: x(a_i | q_i) >= 0
        forall a_i, o_i: sum_{q_i'} x(q_i', a_i | q_i, o_i) = x(a_i | q_i)
        sum_{a_i} x(a_i | q_i) = 1
        forall a_i, o_i, q_i': x(q_i', a_i | q_i, o_i) >= 0
The value of a joint policy node $\vec q$ at state $s$ is
$$V(s, \vec q) = \sum_{\vec a} \prod_i p(a_i \mid q_i)\, R(s, \vec a) + \sum_{\vec a, \vec o, s'} P(s', \vec o \mid s, \vec a) \sum_{\vec q'} \prod_i p(q_i', a_i \mid q_i, o_i)\, V(s', \vec q\,') \quad (4.7)$$
where $P(s', \vec o \mid s, \vec a) = P(s' \mid s, \vec a)\, O(\vec o \mid s', \vec a)$. For a belief $b$, $V(b, \vec q) = \sum_s b(s) V(s, \vec q)$, and the best joint node is $\vec q^* = \arg\max_{\vec q} V(b, \vec q)$. Unlike PBPG, TBDP improves one policy node $q_i$ of one agent $i$ at a time, as in Section 4.2, comparing favorably with the tree-based backups of MBDP and PBPG.
Algorithm 4.4: Trial-based policy evaluation
    Input: s, q^{t-1}; V — the value table; c — the count table
    for several number of trials do
        s' <- s, v <- 0, w <- <>
        for k = t downto 1 do                 // forward trial
            if c(s', q^k) >= numTrials then
                w_k <- <s', q^k, V(s', q^k)> and break
            a <- execute a joint action according to q^k
            s'', o <- get the system responses
            r <- get the current reward
            w_k <- <s', q^k, r>
            q^{k-1} <- q^k(o); s' <- s''
        for k = 1 to length(w) do             // backward update
            s', q, r <- w_k
            n <- c(s', q), v <- v + r
            V(s', q) <- [n * V(s', q) + v] / (n + 1)
            c(s', q) <- n + 1
    return V(s, q^{t-1})
TBDP evaluates values lazily: $V(s', \vec q\,')$ is computed by trials only when it is actually required by the linear program, i.e. when its coefficient is nonzero,
$$Pr(s', \vec o \mid b, \vec a) \prod_{k \neq i} p(q_k', a_k \mid q_k, o_k) > 0 \quad (4.8)$$
otherwise the corresponding variable $x(q_i', a_i \mid q_i, o_i)$ contributes nothing and the evaluation is skipped.
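The backward update at the end of Algorithm 4.4 is an incremental mean: each trial return is folded into the cached value $V(s, q)$ without storing past samples. A sketch of just that update:

```python
def update_value_cache(V, c, key, trial_return):
    """V[key] <- (n * V[key] + v) / (n + 1), c[key] <- n + 1   (cf. Alg. 4.4)."""
    n = c.get(key, 0)
    V[key] = (n * V.get(key, 0.0) + trial_return) / (n + 1)
    c[key] = n + 1

# folding in the returns 2.0, 4.0, 6.0 yields the running means 2.0 -> 3.0 -> 4.0
```

Keeping only the running mean and a count per (state, policy) pair is what makes the shared value cache cheap enough to be updated concurrently by background evaluators.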
Algorithm 4.5: Background policy evaluation
    repeat in parallel                        // run on other processors
        t <- the current step of the main process
        foreach joint policy q of step t-1 do
            S' <- sort states in descending order by c(s, q)
            foreach s in S' do
                while c(s, q) < numTrials do
                    V(s, q) <- evaluate s, q by trial
    until the main process is terminated
The value table acts as a value cache shared by all processes: a pair (s, q) whose count c(s, q) is below numTrials is refined by further trials, and once evaluated, V(s, q) is simply reused — much like the value caching of RTDP. With Algorithm 4.5, TBDP exploits multi-core machines by evaluating in the background while the main process improves policies.
98 c(s, q) DEC-POMDP 4.3 PBPG TBDP PBPG maxtrees maxtrees PBPG maxtrees TBDP TBDP DEC-POMDP PBPG MBDP maxtrees maxtrees DEC-POMDP PBIP-IPG [19] MBDP IMBDP MBDP-OC PBIP maxtrees maxtrees PBPG maxtrees Anytime MDP PBIP-IPG 45% MDP 55% PBIP-IPG PBPG MDP PBPG 72
99 maxt rees 4.3 PBPG PBIP-IPG PBPG Average Time Average Value Average Time Average Value Meeting in a 3 3 Grid, S = 81, O = 9, T = s s x x s x x s x x s x x s V MDP = Cooperative Box Pushing, S = 100, O = 5, T = s s x x 69.12s x x s x x s x x s V MDP = Stochastic Mars Rover, S = 256, O = 8, T = s s x x 59.97s x x s x x s x x s V MDP = Grid Soccer 2 3, S = 3843, O = 11, T = 20 3 x x s V MDP = CUP maxtrees PBIP-IPG 4.3 x 12 MDP MDP PBPG Java 2GB 2.8GHz lp_solve Meeting in a 3 3 Grid [29] DEC-POMDP [19]
100 4.3 maxtrees PBPG PBIP-IPG maxtrees 20 PBPG maxtrees 3 PBIP-IPG maxtrees 20 maxtree 3 PBPG maxtrees 100 PBIP-IPG maxtrees MDP maxtrees 100 PBPG Cooperative Box Pushing [16] DEC- POMDP PBPG maxtrees 3 PBIP-IPG 10 maxtrees PBPG maxtrees PBPG maxtrees 100 PBPG maxtrees PBPG Stochastic Mars Rover [19] DEC-POMDP PBIP-IPG PBPG PBIP-IPG PBPG PBPG PBPG maxtrees 100 PBIP-IPG maxtrees 3 maxtrees PBPG 74
101 Value PBPG PBIP-IPG PBIP 20 IMBDP MBDP-OC Time (s) Value MDP RAND 4.3 PBPG maxtrees maxtrees 10 maxtrees 4.2 PBPG MDP PBPG maxtrees 75
102 4.3 x 0 MDP 1 MDP x = 3 30% 70% MDP MDP 45% 55% PBPG DEC-POMDP TBDP DEC-POMDP PBIP-IPG TBDP TBDP TBDP TBDP TBDP 20 CPU TBDP Java 2GB 2.8GHz TBDP Java lp_solve 5.5 DEC-POMDP 3 3 DEC-POMDP TBDP PBIP-IPG PBIP-IPG [19] MBDP IMBDP MBDP-OC PBIP DEC-POMDP TBDP 76
103 4.4 TBDP Horizon Algorithm Average Value Average Time Meeting in a 3 3 Grid S =81, A i =5, Ω i =9 PBIP-IPG s TBDP s PBIP-IPG s TBDP s Cooperative Box Pushing S =100, A i =4, Ω i = PBIP-IPG s TBDP s PBIP-IPG s TBDP s Stochastic Mars Rover S =256, A i =6, Ω i =8 PBIP-IPG s TBDP s PBIP-IPG s TBDP s TBDP PBIP-IPG 4.4 TBDP PBIP-IPG TBDP TBDP TBDP PBIP-IPG
[Figure 4.4: the experimental grid domain; agents move in the four compass directions W, S, N, E]
[Figure 4.5: solution value and runtime of TBDP over the horizon, with MMDP-Value as an upper bound and INDEP-Value as a lower bound]
TBDP TBDP TBDP 20 TBDP
105 Value TBDP-Value TBDP-Time Number of Trials Time (s) 4.6 TBDP MMDP INDEP DEC-POMDP TBDP TBDP MMDP DEC-POMDP TBDP TBDP 10 TBDP TBDP TBDP 4.4 PBPG MBDP PBPG 79
106 maxtrees PBPG PBPG maxtrees PBPG PBPG PBPG PBPG MBDP MBDP PBPG TBDP TBDP TBDP MBDP TBDP RTDP TBDP TBDP TBDP TBDP DEC-POMDP TBDP 80
107 5 DEC-POMDP DEC-POMDP DEC-POMDP DEC-POMDP NEXP [3] Double Exponential DEC-POMDP (PO)MDP DEC-POMDP DEC-POMDP DEC-POMDP Reinforcement Learning [33] [34] Monte Carlo DEC-POMDP DEC-POMDP DecRSPI DEC-POMDP 81
Algorithm 5.1: Rollout sampling policy iteration (DecRSPI)
    Generate a random joint policy Q given T, N
    Sample a set of beliefs B_n^t for t in 1..T, n in 1..N
    for t = T to 1 do
        for n = 1 to N do
            b <- B_n^t, q <- Q_n^t
            repeat
                foreach agent i in I do
                    Keep the other agents' policies q_{-i} fixed
                    foreach action a_i in A_i do
                        Phi_i <- estimate the parameter matrix
                        Build a linear program with Phi_i
                        pi_i <- solve the linear program
                        Pi_i <- Pi_i + {<a_i, pi_i>}
                    <a_i, pi_i>* <- argmax over Pi_i of Rollout(b, a_i, pi_i)
                    Update agent i's policy q_i by <a_i, pi_i>*
            until no improvement in all agents' policies
    return the joint policy Q
DecRSPI shares TBDP's layered policy structure and point-based improvement, but, unlike TBDP, it never touches the DEC-POMDP model directly: every quantity is estimated from rollouts (hence RSPI — rollout sampling policy iteration), as outlined in Algorithm 5.1.
[Figure 5.1: the policy representation used by DecRSPI, a layered stochastic policy in which a start node selects an action (a_1, a_2, or a_3) and, after each observation (o_1 or o_2), control passes to a node in the next layer, down to horizon T.]

In principle the belief could be maintained by the Bayesian update b^{t+1}(s') = Pr(s' | b^t, a^t, o^{t+1}), but DecRSPI has no model to compute it with. Instead, the beliefs used at step t are approximated with K samples drawn by simulating a heuristic policy theta^t.
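A minimal sketch of the sampled-belief idea behind Equation (5.1): run the heuristic policy in a simulator K times and use the empirical state frequencies as the belief. The function names and the toy simulator are assumptions for illustration, not the thesis implementation:

```python
import random
from collections import Counter

def sample_belief(simulate_to_t, K, rng):
    """Approximate b^t by the empirical distribution of K sampled states.

    simulate_to_t(rng) runs the heuristic policy theta^t from the start
    state distribution and returns the state reached at step t.
    """
    counts = Counter(simulate_to_t(rng) for _ in range(K))
    return {s: c / K for s, c in counts.items()}

# Toy two-state chain: the heuristic policy ends in 's1' with probability 0.7.
def toy_simulate(rng):
    return 's1' if rng.random() < 0.7 else 's0'

belief = sample_belief(toy_simulate, 1000, random.Random(0))
```

As K grows the empirical distribution concentrates around the true belief, which is exactly the role the K samples play in Equation (5.1).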
Each belief b^t is approximated by the empirical distribution of the K sampled states:

    b^t(s) = (1/K) * Sum_{k=1}^{K} 1{ b_k^t = s },  for all s in S        (5.1)

where b_k^t denotes the state reached in the k-th simulation trial under the heuristic policy theta^t (the underlying MDP policy is a natural choice of heuristic). Given a joint policy q, its value at a belief b satisfies

    V(b, q) = R(b, a) + Sum_{s', o, q'} Pr(s', o | b, a) pi(q' | o) V(s', q')        (5.2)

with

    Pr(s', o | b, a) = Sum_{s in S} b(s) P(s' | s, a) O(o | s', a),
    R(b, a) = Sum_{s in S} b(s) R(s, a),

and pi(q' | o) = Prod_i pi(q'_i | o_i). DecRSPI improves one agent at a time: fixing the other agents' policies, for each action a_i in A_i it solves a linear program over the variables x(o_i, q'_i) = pi_i(q'_i | o_i):

    Maximize    Sum_{o_i in Omega_i} Sum_{q'_i in Q_i^{t+1}} Phi_i(o_i, q'_i) x(o_i, q'_i)
    subject to  x(o_i, q'_i) >= 0,  for all o_i, q'_i
                Sum_{q'_i} x(o_i, q'_i) = 1,  for all o_i
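Because the objective of this linear program is linear and each x(o_i, .) only has to lie on a probability simplex, the program decomposes per observation and always admits a deterministic vertex optimum: put probability 1 on the best successor for each o_i. The sketch below exploits this; it is my own illustration (a general LP solver such as lp_solve, which the thesis uses, returns the same optimum):

```python
def improve_policy(phi):
    """Solve max_x sum_{o,q'} phi[o][q'] * x[o][q'] subject to x[o][.]
    being a distribution for every observation o.  The LP decomposes per
    observation, so a vertex optimum picks argmax_{q'} phi[o][q']."""
    policy, value = {}, 0.0
    for o, row in phi.items():
        best = max(row, key=row.get)          # best successor policy for o
        policy[o] = {q: (1.0 if q == best else 0.0) for q in row}
        value += row[best]
    return policy, value

# Hypothetical parameter matrix for two observations and two successors.
phi = {'o1': {'qA': 1.5, 'qB': 0.2},
       'o2': {'qA': 0.1, 'qB': 2.0}}
policy, value = improve_policy(phi)   # picks qA after o1, qB after o2
```

The argmax shortcut is valid because a linear objective over a simplex attains its maximum at a vertex; the LP formulation is kept in the thesis for generality.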
Algorithm 5.2: Estimating the parameter matrix Phi_i

    Input: b, a_i, q_{-i}
    a_{-i} <- get actions from q_{-i}
    for k = 1 to K do
        s <- draw a state from b
        s', o <- simulate the model with s, a
        w_{o_i}(s', o_{-i}) <- w_{o_i}(s', o_{-i}) + 1
    normalize w_{o_i} for each o_i in Omega_i
    foreach o_i in Omega_i, q'_i in Q_i^{t+1} do
        for k = 1 to K do
            s', o_{-i} <- draw a sample from w_{o_i}
            q'_{-i} <- get the other agents' policy pi(q'_{-i} | o_{-i})
            Phi_i(o_i, q'_i)_k <- Rollout(s', q')
        Phi_i(o_i, q'_i) <- (1/K) Sum_{k=1}^{K} Phi_i(o_i, q'_i)_k
    return the parameter matrix Phi_i

The parameter matrix is defined by

    Phi_i(o_i, q'_i) = Sum_{s', o_{-i}, q'_{-i}} Pr(s', o | b, a) pi(q'_{-i} | o_{-i}) V(s', q'),
    where pi(q'_{-i} | o_{-i}) = Prod_{k != i} pi(q'_k | o_k).

Algorithm 5.2 estimates it with K samples: the first loop approximates Pr(s', o | b, a) by simulation, and the second loop draws <s', q'> pairs and replaces the exact successor value V(s', q') with rollout estimates v_k. The accuracy of such an estimate V~ is bounded by Hoeffding's inequality.

Theorem (Hoeffding). Let V be a random variable with support in [V_min, V_max] and mean V- = E[V], let v_1, v_2, ..., v_K be independent samples of V, and let V~ = (1/K) Sum_{k=1}^{K} v_k. Then

    Pr(V~ <= V- + eps) >= 1 - exp(-2 K eps^2 / DV^2)   and
    Pr(V~ >= V- - eps) >= 1 - exp(-2 K eps^2 / DV^2),

where DV = V_max - V_min. Hence with K samples and confidence 1 - delta, the estimate V~ is PAC with accuracy

    eps = DV * sqrt( ln(1/delta) / (2K) )        (5.3)

and, conversely, achieving accuracy eps with confidence 1 - delta requires

    K(eps, delta) = DV^2 * ln(1/delta) / (2 eps^2)        (5.4)

samples. Note that K depends only on the desired accuracy and confidence of V~, not on the size of the DEC-POMDP. As for complexity, DecRSPI performs O(nTN) policy improvements for n agents, horizon T, and N belief points per step; generating the sampled beliefs takes Sum_{t=1}^{T} t = (T^2 + T)/2 = O(T^2) simulation passes, and memory grows linearly with T, n, and N.
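The rollout averaging in Algorithm 5.2 is plain Monte Carlo estimation: simulate forward under the fixed joint policy and average the accumulated rewards. A minimal self-contained sketch, where the toy simulator and all names are illustrative assumptions:

```python
import random

def rollout_estimate(simulate_step, s0, horizon, K, rng):
    """Estimate V(s0) by averaging K rollouts of length `horizon`.

    simulate_step(s, rng) -> (next_state, reward) samples one step of
    the environment under the fixed joint policy.
    """
    total = 0.0
    for _ in range(K):
        s, ret = s0, 0.0
        for _ in range(horizon):
            s, r = simulate_step(s, rng)
            ret += r
        total += ret
    return total / K

# Toy chain with random transitions but a constant reward of 1 per step,
# so every rollout returns exactly `horizon` and the estimate is exact.
def toy_step(s, rng):
    return (rng.choice(['s0', 's1']), 1.0)

v = rollout_estimate(toy_step, 's0', 3, 20, random.Random(0))  # 3.0
```

With stochastic rewards the estimate fluctuates around the true value, and the Hoeffding bound above tells how large K must be for a given accuracy.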
This section evaluates DecRSPI empirically. Unless otherwise stated, the horizon is up to 20, N = 3 belief points are used per step, and K = 20 samples per estimate. DecRSPI is compared with DGD, a distributed gradient-descent policy-search method [5], and with the model-based PBIP-IPG [19], on the standard DEC-POMDP benchmarks: Meeting in a 3x3 Grid [29], Cooperative Box Pushing [16], and Stochastic Mars Rover [19]. Figures 5.2(a), 5.2(b), and 5.2(c) show the values achieved at different horizons: DecRSPI clearly outperforms DGD on all three domains and remains competitive with PBIP-IPG despite using no model.
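The sample-size bound (5.4) shows how the number of rollouts K trades off against the accuracy eps and the confidence delta; a small sketch (the function name is mine):

```python
import math

def sample_size(eps, delta, value_range):
    """K(eps, delta) = DV^2 * ln(1/delta) / (2 * eps^2), rounded up: the
    number of rollouts needed so the mean estimate is within eps of the
    true value with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))

# Halving eps roughly quadruples the required number of rollouts.
k1 = sample_size(0.3, 0.1, 1.0)    # 13 rollouts
k2 = sample_size(0.15, 0.1, 1.0)   # 52 rollouts
```

Because the bound is independent of the numbers of states, observations, and agents, a small fixed K (such as the K = 20 used in the experiments) is justified across problem sizes.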
[Figure 5.2: experimental results for DecRSPI, DGD, and PBIP-IPG: (a) Meeting in a 3x3 Grid, (b) Cooperative Box Pushing, (c) Stochastic Mars Rover (Value vs. Horizon); (d) Horizon vs. Runtime; (e) Trials vs. Value; (f) Trials vs. Runtime.]
Figure 5.2(d) shows how DecRSPI's runtime grows with the horizon T on the three domains, and Figure 5.2(e) shows, at T = 20, how the solution value improves as the number of trials K increases.
Figure 5.2(f) shows that runtime also grows with the number of trials, as expected. To evaluate scalability in the number of agents |I|, DecRSPI was further tested on a Distributed Sensor Network domain [35], in which the problem size grows with the number of agents; problems of this size are beyond the reach of existing model-based DEC-POMDP algorithms, so DGD serves as the baseline.
[Figure 5.3: DecRSPI-Value and DGD-Value, and DecRSPI-Time and DecRSPI+SIM-Time, as the Number of Agents grows; axes: Value, Time (s).]
As Figure 5.3 shows, DecRSPI obtains substantially higher value than DGD across team sizes. DecRSPI-Time counts only DecRSPI's own computation, whereas DecRSPI+SIM-Time also includes the time spent inside the simulator; both grow modestly with the number of agents, confirming that DecRSPI scales to DEC-POMDPs with many agents.
In summary, this chapter presented DecRSPI, a rollout sampling policy iteration algorithm that computes DEC-POMDP policies without an explicit model. In the experiments DecRSPI scales to problems with up to 20 agents, and its MBDP-style bounded policy representation keeps memory usage under control; extending such model-free methods to even larger DEC-POMDPs is left for future work.
6 Conclusions and Future Work

6.1 Conclusions
Multi-agent systems frequently face sequential decision-making problems under uncertainty. For a single agent, the Markov decision process (MDP) provides the standard framework for such problems.
Extending the MDP with partial observability yields the POMDP, and extending it further to multiple decentralized agents yields the DEC-POMDP, whose worst-case complexity is NEXP-hard [3]. This thesis developed approximate algorithms for general DEC-POMDPs, making four main contributions: (1) the online planning algorithm MAOP-COMM; (2) the point-based policy generation algorithm PBPG; (3) the trial-based dynamic programming algorithm TBDP; and (4) the model-free rollout sampling policy iteration algorithm DecRSPI.
MAOP-COMM plans online, maintaining consistent joint histories among the agents and communicating only when necessary. PBPG improves on MBDP by replacing the NP-hard exhaustive point-based backup with direct generation of joint policies via linear programming, which lets it handle far larger maxtrees settings than MBDP at the same cost. TBDP builds on PBPG and reorganizes the computation as trial-based dynamic programming.
TBDP represents policies as layered stochastic trees and updates them only at reachable belief points, solving significantly larger DEC-POMDPs than previous methods. DecRSPI carries the trial-based scheme of TBDP over to the model-free setting, estimating all the required quantities by Monte Carlo rollouts.
DecRSPI scales linearly with the number of agents and was tested with up to 20 agents, whereas earlier general DEC-POMDP algorithms [15-19] handle only 2. Together, MAOP-COMM, PBPG, TBDP, and DecRSPI cover online, offline, and model-free planning for general DEC-POMDPs.

6.2 Future Work
Since solving DEC-POMDPs is NEXP-hard [3], one important direction is to exploit more problem structure, such as transition independence [7], and to combine it with the TBDP framework so that larger DEC-POMDPs become tractable.
Another direction is scale: DecRSPI currently handles tens of agents, but applications such as Internet-scale systems call for DEC-POMDP methods that work with far larger teams. Finally, most existing work assumes pre-coordination among the agents [36, 37]; planning in ad hoc teams, where agents must collaborate without prior coordination [38], is a promising avenue for future research.
References

[1] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge.
[2] Ranjit Nair, Milind Tambe, Makoto Yokoo, David V. Pynadath, and Stacy Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the 18th International Joint Conference on Artificial Intelligence.
[3] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. The complexity of decentralized control of Markov decision processes. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 32-37.
[4] David V. Pynadath and Milind Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 16.
[5] Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie Pack Kaelbling. Learning to cooperate via policy search. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence.
[6] Claudia V. Goldman and Shlomo Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research, 22.
[7] Raphen Becker, Shlomo Zilberstein, Victor R. Lesser, and Claudia V. Goldman. Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22.
[8] Raphen Becker, Shlomo Zilberstein, and Victor R. Lesser. Decentralized Markov decision processes with event-driven interactions. In Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multi-Agent Systems.
[9] Ranjit Nair, Pradeep Varakantham, Milind Tambe, and Makoto Yokoo. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. In Proceedings of the 20th National Conference on Artificial Intelligence.
[10] P. J. Gmytrasiewicz and P. Doshi. A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research, 24:49-79.
[11] Daniel Szer, Francois Charpillet, and Shlomo Zilberstein. MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence.
[12] Raghav Aras, Alain Dutech, and Francois Charpillet. Mixed integer linear programming for exact finite-horizon planning in decentralized POMDPs. In Proceedings of the 17th International Conference on Automated Planning and Scheduling, pages 18-25.
[13] Eric A. Hansen, Daniel S. Bernstein, and Shlomo Zilberstein. Dynamic programming for partially observable stochastic games. In Proceedings of the 19th National Conference on Artificial Intelligence.
[14] Daniel Szer and Francois Charpillet. Point-based dynamic programming for DEC-POMDPs. In Proceedings of the 21st National Conference on Artificial Intelligence.
[15] Sven Seuken and Shlomo Zilberstein. Memory-bounded dynamic programming for DEC-POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence.
[16] Sven Seuken and Shlomo Zilberstein. Improved memory-bounded dynamic programming for decentralized POMDPs. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence.
[17] Alan Carlin and Shlomo Zilberstein. Value-based observation compression for DEC-POMDPs. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multi-Agent Systems.
[18] Jilles Steeve Dibangoye, Abdel-Illah Mouaddib, and Brahim Chaib-draa. Point-based incremental pruning heuristic for solving finite-horizon DEC-POMDPs. In Proceedings of the 8th International Joint Conference on Autonomous Agents and Multi-Agent Systems.
[19] Christopher Amato, Jilles Steeve Dibangoye, and Shlomo Zilberstein. Incremental policy generation for finite-horizon DEC-POMDPs. In Proceedings of the 19th International Conference on Automated Planning and Scheduling, pages 2-9.
[20] Rosemary Emery-Montemerlo, Geoffrey J. Gordon, Jeff G. Schneider, and Sebastian Thrun. Approximate solutions for partially observable stochastic games with common payoffs. In Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multi-Agent Systems.
[21] Maayan Roth, Reid G. Simmons, and Manuela M. Veloso. Reasoning about joint beliefs for execution-time communication decisions. In Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems.
[22] Raphen Becker, Victor R. Lesser, and Shlomo Zilberstein. Analyzing myopic approaches for multi-agent communication. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Intelligent Agent Technology.
[23] Simon A. Williamson, Enrico H. Gerding, and Nicholas R. Jennings. Reward shaping for valuing communications during multi-agent coordination. In Proceedings of the 8th International Joint Conference on Autonomous Agents and Multiagent Systems.
[24] John Tsitsiklis and Michael Athans. On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control, 30.
[25] Frans A. Oliehoek, Shimon Whiteson, and Matthijs T. J. Spaan. Lossless clustering of histories in decentralized POMDPs. In Proceedings of the 8th International Joint Conference on Autonomous Agents and Multi-Agent Systems.
[26] Simon A. Williamson, Enrico H. Gerding, and Nicholas R. Jennings. A principled information valuation for communication during multi-agent coordination. In The 3rd Workshop on Multi-agent Sequential Decision-Making in Uncertain Domains.
[27] Rosemary Emery-Montemerlo. Game-Theoretic Control for Robot Teams. Doctoral Dissertation, Robotics Institute, Carnegie Mellon University, August.
[28] Maayan Roth. Execution-Time Communication Decisions for Coordination of Multi-Agent Teams. PhD thesis, The Robotics Institute, Carnegie Mellon University.
[29] Daniel S. Bernstein, Eric A. Hansen, and Shlomo Zilberstein. Bounded policy iteration for decentralized POMDPs. In Proceedings of the 19th International Joint Conference on Artificial Intelligence.
[30] Christopher Amato, Alan Carlin, and Shlomo Zilberstein. Bounded dynamic programming for decentralized POMDPs. In AAMAS 2007 Workshop on Multi-Agent Sequential Decision Making in Uncertain Domains.
[31] Frans A. Oliehoek, Matthijs T. J. Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32.
[32] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81-138.
[33] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[34] Lucian Busoniu, Robert Babuska, and Bart D. Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2).
[35] Xinhua Zhang, Douglas Aberdeen, and S. V. N. Vishwanathan. Conditional random fields for multi-agent reinforcement learning. In Proceedings of the 24th International Conference on Machine Learning, volume 227.
[36] Sven Seuken and Shlomo Zilberstein. Formal models and algorithms for decentralized decision making under uncertainty. Journal of Autonomous Agents and Multi-Agent Systems, 17(2).
[37] Feng Wu, Shlomo Zilberstein, and Xiaoping Chen. Online planning for multi-agent systems with bounded communication. Artificial Intelligence, 175(2).
[38] P. Stone, G. A. Kaminka, S. Kraus, and J. S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proc. of the 24th AAAI Conf. on Artificial Intelligence.
[39] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: a survey. Computer Networks, 38(4).
[40] Christopher Amato, Daniel S. Bernstein, and Shlomo Zilberstein. Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Autonomous Agents and Multi-Agent Systems.
[41] Christopher Amato, Daniel S. Bernstein, and Shlomo Zilberstein. Optimizing memory-bounded controllers for decentralized POMDPs. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 1-8.
[42] Christopher Amato and Shlomo Zilberstein. Achieving goals in decentralized POMDPs. In Proceedings of the 8th International Joint Conference on Autonomous Agents and Multi-Agent Systems.
[43] Raphen Becker, Alan Carlin, Victor Lesser, and Shlomo Zilberstein. Analyzing myopic approaches for multi-agent communication. Computational Intelligence, 25(1):31-50.
[44] Raphen Becker, Shlomo Zilberstein, Victor R. Lesser, and Claudia V. Goldman. Transition-independent decentralized Markov decision processes. In Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems, pages 41-48.
[45] R. Bellman. Dynamic Programming. Princeton University Press.
[46] Daniel S. Bernstein, Christopher Amato, Eric A. Hansen, and Shlomo Zilberstein. Policy iteration for decentralized control of Markov decision processes. Journal of Artificial Intelligence Research, 34:89-132.
[47] Aurelie Beynier and Abdel-Illah Mouaddib. An iterative algorithm for solving constrained decentralized Markov decision processes. In Proceedings of the 21st National Conference on Artificial Intelligence.
[48] Aurelie Beynier and Abdel-Illah Mouaddib. A polynomial algorithm for decentralized Markov decision processes with temporal constraints. In Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems.
[49] Blai Bonet and Hector Geffner. Solving POMDPs: RTDP-Bel vs. point-based algorithms. In Proceedings of the 21st International Joint Conference on Artificial Intelligence.
[50] Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proceedings of the 13th International Conference on Automated Planning and Scheduling, pages 12-31.
[51] Abdeslam Boularias and Brahim Chaib-draa. Exact dynamic programming for decentralized POMDPs with lossless policy compression. In Proceedings of the 18th International Conference on Automated Planning and Scheduling.
[52] Alan Carlin and Shlomo Zilberstein. Myopic and non-myopic communication under partial observability. In Proceedings of the 2009 IEEE/WIC/ACM International Conference on Intelligent Agent Technology.
[53] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In National Conference on Artificial Intelligence.
[54] Rosemary Emery-Montemerlo, Geoffrey J. Gordon, Jeff G. Schneider, and Sebastian Thrun. Game theoretic control for robot teams. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation.
[55] Hector Geffner and Blai Bonet. Solving large POMDPs using real time dynamic programming. In AAAI Fall Symposium on POMDPs.
More informationA dissertation for Master s degree Metro Indoor Coverage Systems Analysis And Design Author s Name: Sheng Hailiang speciality: Supervisor:Prof.Li Hui,
中 国 科 学 技 术 大 学 工 程 硕 士 学 位 论 文 地 铁 内 移 动 通 信 室 内 覆 盖 分 析 及 应 用 作 者 姓 名 : 学 科 专 业 : 盛 海 亮 电 子 与 通 信 导 师 姓 名 : 李 辉 副 教 授 赵 红 媛 高 工 完 成 时 间 : 二 八 年 三 月 十 日 University of Science and Technology of Ch A dissertation
More information國立高雄大學數位論文典藏
AfterUse Evaluation and Study of the Student Walking Lanes in Communities of Kaohsiung City under the Subsidies of the Central Government 97 01 IV 30 39 10 19 I An Evaluative Study of the Use of Central-subsidized
More information第三章 国内外小组合作学习的应用情况
摘 要 论 文 题 目 : 小 组 合 作 学 习 在 上 海 高 中 信 息 科 技 教 学 中 的 应 用 专 业 : 现 代 教 育 技 术 学 位 申 请 人 : 朱 翠 凤 指 导 教 师 : 孟 琦 摘 要 小 组 合 作 学 习 是 目 前 世 界 上 许 多 国 家 普 遍 采 用 的 一 种 富 有 创 意 的 教 学 理 论 与 策 略, 其 在 培 养 学 生 的 合 作 精
More informationFig. 1 Frame calculation model 1 mm Table 1 Joints displacement mm
33 2 2011 4 ol. 33 No. 2 Apr. 2011 1002-8412 2011 02-0104-08 1 1 1 2 361003 3. 361009 3 1. 361005 2. GB50023-2009 TU746. 3 A Study on Single-span RC Frame Reinforced with Steel Truss System Yuan Xing-ren
More informationOutline Speech Signals Processing Dual-Tone Multifrequency Signal Detection 云南大学滇池学院课程 : 数字信号处理 Applications of Digital Signal Processing 2
CHAPTER 10 Applications of Digital Signal Processing Wang Weilian wlwang@ynu.edu.cn School of Information Science and Technology Yunnan University Outline Speech Signals Processing Dual-Tone Multifrequency
More information08_toukei03.dvi
2013 61 1 123 134 c 2013 2012 6 28 8 28 9 5 LDPC Low-Density Parity-Check LDPC MIMO 1. 2002 Feldman LP Feldman, 2003; Feldman et al., 2005 Feldman Vontobel and Koetter 2006 Koetter and Vontobel 2003 Burshtein
More information(,1999) ( 1) ( ) 1., : ( Gurr,1970) (Smelser,1962) (, 1995) (McCarthy & Zald,1977), : ; ( ) (context bounded rationality) (Victor Nee,1995 :10) :, ; (
2003 6 3 Abstract : This thesis attempts to do an empirical research of workers responses to the status deprivation in the process of State2owned Enterprises institutional transformations, especially the
More informationMicrosoft Word - 98全國大學校長會議
目 錄 壹 會 議 議 程 1 貳 專 題 演 講 4 專 題 演 講 ( 一 ): 校 園 民 主 與 學 術 自 由 4 專 題 演 講 ( 二 ): 高 等 教 育 新 視 界 - 彈 性 化 優 質 化 國 際 化 6 專 題 演 講 ( 三 ): 從 學 校 到 工 作 - 談 尊 嚴 勞 動 14 專 題 演 講 ( 四 ): 如 何 重 建 道 德 推 廣 大 學 有 品 運 動 24
More information硕 士 学 位 论 文 论 文 题 目 : 北 岛 诗 歌 创 作 的 双 重 困 境 专 业 名 称 : 中 国 现 当 代 文 学 研 究 方 向 : 中 国 新 诗 研 究 论 文 作 者 : 奚 荣 荣 指 导 老 师 : 姜 玉 琴 2014 年 12 月
硕 士 学 位 论 文 论 文 题 目 : 北 岛 诗 歌 创 作 的 双 重 困 境 专 业 名 称 : 中 国 现 当 代 文 学 研 究 方 向 : 中 国 新 诗 研 究 论 文 作 者 : 奚 荣 荣 指 导 老 师 : 姜 玉 琴 2014 年 12 月 致 谢 文 学 是 我 们 人 类 宝 贵 的 精 神 财 富 两 年 半 的 硕 士 学 习 让 我 进 一 步 接 近 文 学,
More information考試學刊第10期-內文.indd
misconception 101 Misconceptions and Test-Questions of Earth Science in Senior High School Chun-Ping Weng College Entrance Examination Center Abstract Earth Science is a subject highly related to everyday
More informationMicrosoft Word - 01李惠玲ok.doc
康 寧 學 報 11:1-20(2009) 1 數 位 學 習 於 護 理 技 術 課 程 之 運 用 與 評 值 * 李 惠 玲 ** 高 清 華 *** 呂 莉 婷 摘 要 背 景 : 網 路 科 技 在 教 育 的 使 用 已 成 為 一 種 有 利 的 教 學 輔 助 工 具 網 路 教 學 的 特 性, 在 使 學 習 可 不 分 時 間 與 空 間 不 同 進 度 把 握 即 時 性 資
More information中國傳統醫學及養生保健學說中,與經脈及穴道有密不可分的關係
Journal of National Kaohsiung University of Applied Sciences The Study of Correlation between Time and Human Meridian Point Bioenergy Chien-Min Cheng Ching-Yi Sung 577 * ** 578 The study of correlation
More information~ ~
* 40 4 2016 7 Vol. 40 No. 4 July 2016 35 Population Research 2014 1 2016 2016 9101. 0 40 49. 6% 2017 ~ 2021 1719. 5 160 ~ 470 100872 Accumulated Couples and Extra Births under the Universal Tw o-child
More information601988 2010 040 113001 2010 8 26 2010 8 12 2010 8 26 15 15 2010 15 0 0 15 0 0 6035 20022007 20012002 19992001 200720081974 1999 2010 20082008 2000 197
BANK OF CHINA LIMITED 3988 2010 8 26 ** ** *** # Alberto TOGNI # # # * # 1 601988 2010 040 113001 2010 8 26 2010 8 12 2010 8 26 15 15 2010 15 0 0 15 0 0 6035 20022007 20012002 19992001 200720081974 1999
More informationD A
2015 4 D822.333 A 0452 8832 2015 4 0014-12 14 The Second ASEAN Regional Forum: The ASEAN Regional Forum, A Concept Paper, in ASEAN Regional Forum Documents Series 1994-2006, ASEAN Secretariat, Jakarta,
More information1 目 錄 1. 簡 介... 2 2. 一 般 甄 試 程 序... 2 3. 第 一 階 段 的 準 備... 5 4. 第 二 階 段 的 準 備... 9 5. 每 間 學 校 的 面 試 方 式... 11 6. 各 程 序 我 的 做 法 心 得 及 筆 記... 13 7. 結 論..
如 何 準 備 研 究 所 甄 試 劉 富 翃 1 目 錄 1. 簡 介... 2 2. 一 般 甄 試 程 序... 2 3. 第 一 階 段 的 準 備... 5 4. 第 二 階 段 的 準 備... 9 5. 每 間 學 校 的 面 試 方 式... 11 6. 各 程 序 我 的 做 法 心 得 及 筆 記... 13 7. 結 論... 20 8. 附 錄 8.1 推 甄 書 面 資 料...
More information10384 27720071152270 UDC SHIBOR - Research on Dynamics of Short-term Shibor via Parametric and Nonparametric Models 2 0 1 0 0 5 2 0 1 0 0 5 2 0 1 0 0 5 2010 , 1. 2. Shibor 2006 10 8 2007 1 4 Shibor
More informationPublic Projects A Thesis Submitted to Department of Construction Engineering National Kaohsiung First University of Science and Technology In Partial
Public Projects A Thesis Submitted to Department of Construction Engineering National Kaohsiung First University of Science and Technology In Partial Fulfillment of the Requirements For the Degree of Master
More information<4D6963726F736F667420576F7264202D20B169B74FC5EF2020A8E2A9A4B0EABB79B1D0ACECAED1A56AA8E5B8D6BA71BFEFBFFDA4A7ACE3A8732E646F63>
國 立 臺 北 教 育 大 學 人 文 藝 術 學 院 語 文 與 創 作 學 系 語 文 教 學 碩 士 班 ( 暑 期 班 ) 碩 士 論 文 Master Program of Language Instruction ( Summer Program) Department of Language and Creative Writing College of Humanities and
More informationθ 1 = φ n -n 2 2 n AR n φ i = 0 1 = a t - θ θ m a t-m 3 3 m MA m 1. 2 ρ k = R k /R 0 5 Akaike ρ k 1 AIC = n ln δ 2
35 2 2012 2 GEOMATICS & SPATIAL INFORMATION TECHNOLOGY Vol. 35 No. 2 Feb. 2012 1 2 3 4 1. 450008 2. 450005 3. 450008 4. 572000 20 J 101 20 ARMA TU196 B 1672-5867 2012 02-0213 - 04 Application of Time Series
More informationI
The Effect of Guided Discovery on The Learning Achievement and Learning Transfer of Grade 5 Students in Primary Schools I II Abstract The Effect of Guided Discovery on The Learning Achievement And Learning
More informationWTO
10384 200015128 UDC Exploration on Design of CIB s Human Resources System in the New Stage (MBA) 2004 2004 2 3 2004 3 2 0 0 4 2 WTO Abstract Abstract With the rapid development of the high and new technique
More information國 立 新 竹 教 育 大 學 音 樂 學 系 音 樂 教 學 碩 士 班 學 位 論 文 新 瓦 屋 客 家 花 鼓 之 研 究 A Research on Hsin-Wa-Wu Hakka Flower-Drum 研 究 生 : 陳 怡 妃 指 導 教 授 : 明 立 國 中 華 民 國 九 十 八 年 三 月 本 論 文 獲 行 政 院 文 化 建 設 委 員 會 文 化 資 產 總 管 理
More informationVLBI2010 [2] 1 mm EOP VLBI VLBI [3 5] VLBI h [6 11] VLBI VLBI VLBI VLBI VLBI GPS GPS ( ) [12] VLBI 10 m VLBI 65 m [13,14] (referen
31 2 Vol. 31, No. 2 2013 5 PROGRESS IN ASTRONOMY May., 2013 doi: 10.3969/j.issn.1000-8349.2013.02.08 VLBI 1,2 1 ( 1. 200030 2. 100049 ) VLBI VLBI VLBI VLBI VLBI VLBI P228.6 A 1 (VLBI) 20 60 (ITRF) (EOP)
More information國立中山大學學位論文典藏.PDF
The Study on the New Pension Scheme for Civil Servants Evidence from Kaohsiung County I II 1. III Thesis Abstract Title of Thesis The Study on the New Pension Scheme for Civil Servants: Evidence from Kaohsiung
More information密 级 : 保 密 学 号 :20083008 博 士 学 位 论 文 中 医 学 研 究 生 教 育 理 论 与 发 展 研 究 研 究 生 指 导 教 师 学 科 专 业 所 在 学 院 毕 业 时 间 马 婷 戴 慎 教 授 中 医 医 史 文 献 基 础 医 学 院 2011 年 7 月 Research on Theory and Developing of the Graduate Education
More informationWTO
10384 X0115018 UDC MBA 2004 5 14 2004 6 1 WTO 2004 2006 7 2 Abstract According to the promise after our country enter into WTO, our country will open the readymade oil retail market in the end of 2004
More information<4D6963726F736F667420576F7264202D20342EC555A5DFA5C1A7EFADB2B67DA9F1A548A8D3A4A4A640B0EAAE61B56FAE69BED4B2A4B357B9BA2E646F63>
改 革 開 放 以 來 的 中 共 國 家 發 展 規 劃 : 以 經 濟 發 展 為 中 心 的 探 討 顧 立 民 國 防 大 學 戰 略 研 究 所 助 理 教 授 摘 要 鄧 小 平 於 1978 年 提 出 改 革 開 放 的 國 家 戰 略, 並 提 出 三 步 走 的 國 家 發 展 策 略, 江 澤 民 進 一 步 表 示 二 十 一 世 紀 的 頭 二 十 年, 是 中 共 國 家
More information5 1 linear 5 circular 2 6 2003 10 3000 2 400 ~ 500 4500 7 2013 3500 400 ~ 500 8 3 1900 1. 2 9 1 65
2014 2 43 EXAMINATIONS RESEARCH No. 2 2014 General No. 43 李 迅 辉 随 着 新 课 程 改 革 的 不 断 深 入, 教 学 理 念 逐 步 更 新, 学 生 的 英 语 水 平 也 在 逐 渐 提 高, 但 沿 用 多 年 的 高 考 英 语 书 面 表 达 的 评 分 标 准 并 没 有 与 时 俱 进, 已 经 不 能 完 全 适 应
More information一 课 程 负 责 人 情 况 姓 名 吴 翊 性 别 男 出 生 年 月 1948.08.28 1. 基 本 信 息 学 位 硕 士 职 称 教 授 职 务 所 在 院 系 理 学 院 数 学 与 系 统 科 学 系 电 话 13808485157 研 究 方 向 数 据 处 理 近 三 年 来
湖 南 省 普 通 高 等 学 校 省 级 精 品 课 程 复 核 申 报 表 学 校 名 称 : 课 程 名 称 : 课 程 负 责 人 : 立 项 时 间 : 国 防 科 技 大 学 概 率 论 与 数 理 统 计 吴 翊 2009 年 课 程 网 站 地 址 :jpkc2009.nudt.edu.cn/gllysltj jpkc.nudt.edu.cn/gltj 湖 南 省 教 育 厅 制 二
More information输电线路智能监测系统通信技术应用研究
Smart Grid 智 能 电 网, 2014, 4, 11-15 http://dx.doi.org/10.12677/sg.2014.41003 Published Online February 2014 (http://www.hanspub.org/journal/sg.html) Application Research of Communication Technology for
More information苗 栗 三 山 國 王 信 仰 及 其 地 方 社 會 意 涵 The Influences and Implications of Local Societies to Three Mountain Kings Belief, in Taiwan Miaoli 研 究 生 : 林 永 恩 指 導
國 立 交 通 大 學 客 家 文 化 學 院 客 家 社 會 與 文 化 學 程 碩 士 論 文 苗 栗 三 山 國 王 信 仰 及 其 地 方 社 會 意 涵 The Influences and Implications of Local Societies to Three Mountain Kings Belief, in Taiwan Miaoli 研 究 生 : 林 永 恩 指 導 教
More information