MAXQ
BA11011028
June 7, 2016
Outline
1. …
2. …
3. …
4. …
RoboCup 2D simulation: two teams of 11 players each; one decision per 100 ms cycle.
Figure 1: RoboCup 2D
Figure 2: …
Markov decision theory [Puterman, 1994]: three models.
1. MDP
2. POMDP: an MDP with partial observability
3. DEC-POMDP: a decentralized, multi-agent extension of the POMDP
An MDP is a tuple:
1. States S = {s_1, s_2, …, s_|S|}
2. Actions A = {a_1, a_2, …, a_|A|}
3. Transition function T(s' | s, a) ∈ [0, 1]
4. Reward function R(s, a) ∈ ℝ
5. Policy π : S → A

A POMDP additionally has:
1. Observations O = {o_1, o_2, …, o_|O|}
2. Observation function Z(o | a, s') ∈ [0, 1]
3. Policy over beliefs π : Δ(S) → A

Figure 3: MDP
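The MDP tuple above can be made concrete with value iteration on a toy model. The two-state chain below is hypothetical; only the update rule V(s) = max_a [R(s, a) + γ Σ_s' T(s' | s, a) V(s')] follows from the definition.

```python
# Value iteration for the MDP tuple (S, A, T, R) defined above.
# The toy two-state chain is hypothetical: "s2" pays reward 1 for
# staying, so the optimal policy moves to "s2" and stays there.

GAMMA = 0.9

S = ["s1", "s2"]
A = ["stay", "move"]

# T[(s, a)] is a dict {s': probability}
T = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s2": 0.8, "s1": 0.2},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 0.8, "s2": 0.2},
}

# R[(s, a)] is the immediate reward
R = {
    ("s1", "stay"): 0.0,
    ("s1", "move"): 0.0,
    ("s2", "stay"): 1.0,
    ("s2", "move"): 0.0,
}

def value_iteration(eps=1e-6):
    """Iterate the Bellman optimality backup until convergence."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(
                R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

def greedy_policy(V):
    """Extract pi(s) = argmax_a [R(s, a) + gamma * E[V(s')]]."""
    return {
        s: max(A, key=lambda a: R[(s, a)] + GAMMA *
               sum(p * V[s2] for s2, p in T[(s, a)].items()))
        for s in S
    }

V = value_iteration()
pi = greedy_policy(V)   # pi("s1") = "move", pi("s2") = "stay"
```

For this chain, V("s2") converges to 1/(1 − γ) = 10, illustrating why exact dynamic programming is tractable for flat MDPs but infeasible at RoboCup 2D scale, which motivates the decomposition methods that follow.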
A DEC-POMDP extends the POMDP with:
1. A set of agents I
2. Joint actions A = ×_{i∈I} A_i
3. Joint observations O = ×_{i∈I} O_i
4. …
5. Local policies π_i : H_i → A_i, mapping each agent's observation history H_i to its actions
Complexity: P ⊆ NP ⊆ PSPACE ⊆ EXP ⊆ NEXP.
Solving an MDP is P-complete; a finite-horizon POMDP is PSPACE-complete; a finite-horizon DEC-POMDP is NEXP-complete.
Further reading: MDPs [Puterman, 1994]; online POMDP planning [Ross et al., 2008]; hierarchical reinforcement learning [Barto and Mahadevan, 2003].
MAXQ hierarchical decomposition [Dietterich, 1999]: an MDP M is decomposed into a hierarchy of subtasks {M_0, M_1, …, M_n}. Each subtask M_i = ⟨T_i, A_i, R_i⟩, where T_i is the termination predicate (the terminal states of M_i), A_i is the set of actions available in M_i (primitive actions or other subtasks), and R_i is the pseudo-reward function. M_0 is the root subtask: solving M_0 solves M.
Figure 4: MAXQ
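The subtask structure M_i = ⟨T_i, A_i, R_i⟩ can be sketched in code together with the MAXQ value decomposition from [Dietterich, 1999], V(i, s) = max_{a ∈ A_i} [V(a, s) + C(i, s, a)]. The task names and the completion-function table C below are hypothetical illustrations, not a full implementation.

```python
# A sketch of a MAXQ subtask M_i = <T_i, A_i, R_i> and the recursive
# value decomposition V(i, s) = max_{a in A_i} [V(a, s) + C(i, s, a)].

class Subtask:
    def __init__(self, name, children, terminal):
        self.name = name          # subtask label
        self.children = children  # A_i: child subtasks or primitive actions
        self.terminal = terminal  # T_i: set of terminal states

# V_prim[(a, s)] : expected reward of primitive action a in state s
# C[(i, s, a)]   : value of completing subtask i after executing a
def V(task, s, V_prim, C, hierarchy):
    if task not in hierarchy:            # primitive action: leaf of hierarchy
        return V_prim[(task, s)]
    node = hierarchy[task]
    if s in node.terminal:               # terminated subtask contributes 0
        return 0.0
    return max(V(a, s, V_prim, C, hierarchy) + C[(task, s, a)]
               for a in node.children)

# Toy usage: a root task with two primitive children (names hypothetical).
hierarchy = {"Root": Subtask("Root", ["a1", "a2"], terminal=set())}
V_prim = {("a1", "s"): 1.0, ("a2", "s"): 2.0}
C = {("Root", "s", "a1"): 0.5, ("Root", "s", "a2"): 0.0}
v_root = V("Root", "s", V_prim, C, hierarchy)   # max(1.0 + 0.5, 2.0 + 0.0)
```

The recursion mirrors the hierarchy: a subtask's value is the best child value plus the completion value for finishing the subtask afterwards, which is what allows the root value V(0, s) to decompose over the whole tree.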
Related work: POMDP planning [Cassandra et al., 1995]; MAXQ [Dietterich, 1999]; Monte-Carlo planning in large POMDPs [Silver and Veness, 2010]; a survey of Monte-Carlo tree search [Browne et al., 2012]; online multi-agent planning with bounded communication [Wu et al., 2011].
Related work (continued): approximate hierarchical solution of large MDPs [Barry et al., 2011]; a framework for sequential planning in multi-agent settings [Gmytrasiewicz and Doshi, 2005]; goal recognition over POMDPs [Ramírez and Geffner, 2011]; MAXQ.
Experimental platform: RoboCup 2D simulation.
MAXQ-OP: online planning for large MDPs using MAXQ decomposition [Bai et al., 2012b, Bai et al., 2012c], applied to RoboCup 2D [Bai et al., 2012a, Bai et al., 2013, Bai et al., 2012c].
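The online-planning idea can be sketched as a depth-bounded recursive search over the task hierarchy from the current state, falling back to a heuristic at the cutoff. This is a hypothetical illustration of the general approach, not the algorithm of [Bai et al., 2012b]; all names and the toy model are assumptions.

```python
# A minimal sketch of online hierarchical planning: instead of
# precomputing a policy, search the task hierarchy from the current
# state to a bounded depth at every decision point.

def plan(task, s, depth, hierarchy, step, heuristic):
    """Return (estimated value, best child action) for `task` at state s."""
    if task not in hierarchy:               # primitive action: simulate one step
        reward, _next_s = step(task, s)
        return reward, None
    node = hierarchy[task]
    if depth == 0 or s in node["terminal"]:
        return heuristic(task, s), None     # depth cutoff / terminated subtask
    best_v, best_a = float("-inf"), None
    for a in node["children"]:              # evaluate each child subtask
        v, _ = plan(a, s, depth - 1, hierarchy, step, heuristic)
        if v > best_v:
            best_v, best_a = v, a
    return best_v, best_a

# Toy usage: a root task choosing between two primitive actions.
hierarchy = {"Root": {"children": ["left", "right"], "terminal": set()}}
step = lambda a, s: (3.0 if a == "right" else 1.0, s)
value, action = plan("Root", "s0", 2, hierarchy, step, lambda t, s: 0.0)
```

Searching online from the current state fits the 100 ms decision cycle of RoboCup 2D: only the states actually reachable now are expanded, rather than the full state space.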
Publications
Bai, A., Chen, X., MacAlpine, P., Urieli, D., Barrett, S. and Stone, P. (2012a). Wright Eagle and UT Austin Villa: RoboCup 2011 Simulation League Champions. In RoboCup-2011: Robot Soccer World Cup XV (Roefer, T., Mayer, N. M., Savage, J. and Saranli, U., eds), vol. 7416 of Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin.
Bai, A., Wu, F. and Chen, X. (2012b). Online Planning for Large MDPs with MAXQ Decomposition (Extended Abstract). In Proc. of the 11th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2012).
Bai, A., Wu, F. and Chen, X. (2012c). Online Planning for Large MDPs with MAXQ Decomposition. In Proc. of the Autonomous Robots and Multirobot Systems Workshop (at AAMAS-12).
Bai, A., Wu, F. and Chen, X. (2013). Towards a Principled Solution to Simulated Robot Soccer. In RoboCup-2012: Robot Soccer World Cup XVI (Chen, X., Stone, P., Sucar, L. E. and van der Zant, T., eds), vol. 7500 of Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin.
Schedule
1. 2013.1–2013.3: …
2. 2013.4–2013.7: …
3. 2013.8–2013.11: …
4. 2013.12–2014.2: …
5. 2014.3–2014.6: …
References I
Bai, A., Chen, X., MacAlpine, P., Urieli, D., Barrett, S. and Stone, P. (2012a). Wright Eagle and UT Austin Villa: RoboCup 2011 Simulation League Champions. In RoboCup-2011: Robot Soccer World Cup XV (Roefer, T., Mayer, N. M., Savage, J. and Saranli, U., eds), vol. 7416 of Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin.
Bai, A., Wu, F. and Chen, X. (2012b). Online Planning for Large MDPs with MAXQ Decomposition (Extended Abstract). In Proc. of the 11th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2012).
Bai, A., Wu, F. and Chen, X. (2012c). Online Planning for Large MDPs with MAXQ Decomposition. In Proc. of the Autonomous Robots and Multirobot Systems Workshop (at AAMAS-12).
Bai, A., Wu, F. and Chen, X. (2013). Towards a Principled Solution to Simulated Robot Soccer. In RoboCup-2012: Robot Soccer World Cup XVI (Chen, X., Stone, P., Sucar, L. E. and van der Zant, T., eds), vol. 7500 of Lecture Notes in Artificial Intelligence. Springer Verlag, Berlin.
Barry, J., Kaelbling, L. and Lozano-Perez, T. (2011). DetH*: Approximate Hierarchical Solution of Large Markov Decision Processes. In International Joint Conference on Artificial Intelligence, pp. 1928–1935.
Barto, A. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13, 341–379.
References II
Browne, C., Powley, E. J., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S. and Colton, S. (2012). A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games 4, 1–43.
Cassandra, A., Kaelbling, L. and Littman, M. (1995). Acting optimally in partially observable stochastic domains. In Proceedings of the National Conference on Artificial Intelligence, pp. 1023–1028. John Wiley & Sons.
Dietterich, T. G. (1999). Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Machine Learning Research 13, 63.
Gmytrasiewicz, P. and Doshi, P. (2005). A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research 24, 49–79.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.
Ramírez, M. and Geffner, H. (2011). Goal recognition over POMDPs: Inferring the intention of a POMDP agent. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three, pp. 2009–2014. AAAI Press.
References III
Ross, S., Pineau, J., Paquet, S. and Chaib-Draa, B. (2008). Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research 32, 663–704.
Silver, D. and Veness, J. (2010). Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NIPS).
Wu, F., Zilberstein, S. and Chen, X. (2011). Online planning for multi-agent systems with bounded communication. Artificial Intelligence 175, 487–511.