Feedback-Driven Learn to Reason in Adversarial Environments for Autonomic Cyber Systems
The growing complexity of cyber systems has made them difficult for human operators to defend, particularly against intelligent and resourceful adversaries who target multiple system components simultaneously, employ previously unobserved attack vectors, and use stealth and deception to evade detection. There is a need for autonomic cyber systems that integrate statistical learning with rules-based formal reasoning to provide adaptive, robust situational awareness and a resilient system response. In this collaborative research effort, we propose to develop a feedback-driven Learn to Reason (L2R) framework that integrates statistical learning with formal reasoning in adversarial environments. Our insight is that realizing the potential benefits of L2R requires continuous interaction between the statistical and formal components, both at intermediate time steps and at multiple layers of abstraction.
Date: March 28, 2021
Multi-agent reinforcement learning involves multiple agents interacting with each other and a shared environment to complete tasks. When rewards provided by the environment are sparse, agents may not receive immediate feedback on the quality of the actions they take, which impedes learning of effective policies. In this paper, we propose a method called Shaping Advice in deep Multi-agent reinforcement learning (SAM) that augments the reward signal from the environment with an additional reward termed shaping advice. The shaping advice is given by a difference of potential functions at consecutive time-steps, where each potential function depends on the observations and actions of the agents. The shaping advice needs to be specified only once at the start of training and can be easily provided by non-experts. We show through theoretical analyses and experimental validation that shaping advice provided by SAM does not distract agents from completing tasks specified by the environment reward. Theoretically, we prove that convergence of policy gradients and value functions when using SAM implies convergence of these quantities in the absence of SAM. Experimentally, we evaluate SAM on three tasks in the multi-agent Particle World environment that have sparse rewards. We observe that SAM enables agents to learn policies that complete tasks faster and obtain higher rewards than: i) using sparse rewards alone; ii) a state-of-the-art reward redistribution method, Iterative Relative Credit Refinement (IRCR).
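To make the potential-difference construction concrete, the sketch below shows how shaping advice can be computed from a potential over observations and actions. The particular potential (negative distance from an agent to a landmark), the observation layout, and all variable names are illustrative assumptions for this sketch, not the potentials used in the paper's experiments.

import numpy as np

# Illustrative potential: negative distance from an agent to its landmark.
# The observation layout (agent position in obs[:2], landmark position in
# obs[2:4]) is a hypothetical convention, not the SAM code base's encoding.
def potential(obs, action):
    agent_pos, landmark_pos = obs[:2], obs[2:4]
    return -np.linalg.norm(agent_pos - landmark_pos)

def shaping_advice(obs_t, act_t, obs_next, act_next, gamma=0.95):
    # Difference of potentials at consecutive time-steps. Because the
    # advice is a potential difference, adding it to the sparse
    # environment reward does not change which policies are optimal.
    return gamma * potential(obs_next, act_next) - potential(obs_t, act_t)

# During training, each agent would be trained on the augmented reward:
#   r_augmented = r_env + shaping_advice(obs_t, act_t, obs_next, act_next)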
GitHub link: https://github.com/baicenxiao/SAM
This repository provides a Python implementation of the SAM algorithm from the paper:
Shaping Advice in Deep Multi-Agent Reinforcement Learning
It is configured to run with the Multi-Agent Particle Environments (MPE). Unlike the original MPE environments, where rewards are dense, our work uses a sparse reward structure; a setup sketch follows the requirements below. Note: this code base has been restructured compared to the original paper, and some results may differ.
• Python version - 3.5.4
• Python libraries required - OpenAI gym (0.10.5), TensorFlow (1.9.0), numpy (1.15.2)
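As a quick orientation, the snippet below sketches how an MPE scenario can be loaded and its dense reward replaced with a sparse one. It follows the public openai/multiagent-particle-envs API; the sparse_reward function and its 0.1 threshold are illustrative placeholders, not the reward definition used in our experiments.

import numpy as np
from multiagent.environment import MultiAgentEnv
import multiagent.scenarios as scenarios

# Load a standard MPE scenario and build its world.
scenario = scenarios.load("simple_spread.py").Scenario()
world = scenario.make_world()

# Illustrative sparse reward: 1 only when the agent reaches a landmark.
# The 0.1 distance threshold is a placeholder value.
def sparse_reward(agent, world):
    dists = [np.linalg.norm(agent.state.p_pos - landmark.state.p_pos)
             for landmark in world.landmarks]
    return 1.0 if min(dists) < 0.1 else 0.0

env = MultiAgentEnv(world, scenario.reset_world, sparse_reward,
                    scenario.observation)
obs_n = env.reset()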
For additional information, contact: Baicen Xiao, email: firstname.lastname@example.org
Acknowledgement: This work was supported by the U.S. Office of Naval Research via Grant N00014-17-S-B001.
The MADDPG code is based on the publicly available implementation: https://github.com/openai/maddpg.