A Deep Reinforcement Learning Exploration Method Combined with Stochastic Policies

Information Technology and Network Security

Yang Shangtong, Wang Zilei

(School of Cyberspace Security, University of Science and Technology of China, Hefei, Anhui 230027, China)

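Abstract: Deep reinforcement learning algorithms can now solve many complex tasks, yet how to balance exploration and exploitation remains a fundamental difficulty in reinforcement learning. To address this, a deep reinforcement learning exploration method combined with a stochastic policy is proposed. The method takes advantage of the exploration ability of stochastic policies: experience samples generated by a stochastic policy are used to train a deterministic policy, which encourages the deterministic policy to learn to explore while preserving its own strengths. Combining the deterministic policy algorithm DDPG with the proposed exploration method yields the stochastic guidance for deterministic policy gradient (SGDPG) algorithm. Experiments in several complex environments show that, on exploration problems, the exploration efficiency and sample utilization of SGDPG are better than those of DDPG.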

Keywords: reinforcement learning; deep reinforcement learning; exploration-exploitation dilemma

CLC number: TP18
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.06.008
Citation format: Yang Shangtong, Wang Zilei. A deep reinforcement learning exploration method combined with stochastic policies[J]. Information Technology and Network Security, 2021, 40(6): 43-49.

Efficient exploration with stochastic policy for deep reinforcement learning

Yang Shangtong, Wang Zilei

(School of Cyberspace Security, University of Science and Technology of China, Hefei 230027, China)

Abstract: Deep reinforcement learning algorithms have been shown to solve many complex tasks, but balancing exploration and exploitation remains a fundamental problem. This paper therefore proposes an efficient exploration strategy for deep reinforcement learning that incorporates a stochastic policy. The main contribution is to train a deterministic policy on experience generated by a stochastic policy, which exploits the exploration ability of stochastic policies and encourages the deterministic policy to learn to explore while retaining its own advantages. Combining Deep Deterministic Policy Gradient (DDPG) with the proposed exploration method yields an algorithm called stochastic guidance for deterministic policy gradient (SGDPG). Experiments in several complex environments show that SGDPG achieves higher exploration efficiency and sample efficiency than DDPG on deep exploration problems.

Key words: reinforcement learning; deep reinforcement learning; exploration-exploitation dilemma

0 Introduction

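Reinforcement learning, currently a research hotspot in machine learning, has made great progress on sequential decision-making problems and is widely applied in game playing [1], robot control [2], industrial applications [3], and other areas. In recent years, many reinforcement learning methods have used neural networks to improve their performance, giving rise to a new research field known as deep reinforcement learning (DRL) [4]. However, reinforcement learning still faces a major problem: the exploration-exploitation dilemma. During learning, exploration means that the agent tries actions it has not taken before, which may yield higher returns, whereas exploitation means that the agent selects the currently best action according to its past experience. Current research on deep reinforcement learning focuses mainly on combining deep learning to improve the generalization ability of reinforcement learning algorithms, while how to explore the state space effectively remains a key challenge.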
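To make the exploration-exploitation trade-off concrete, the following is a minimal sketch. It is not the SGDPG method of this paper; it is a generic epsilon-greedy agent on a hypothetical multi-armed bandit, chosen only because it is the simplest setting in which the dilemma appears. The function name `epsilon_greedy_bandit`, its parameters, and the reward values are assumptions introduced for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, noise_std=1.0, seed=0):
    """Run an epsilon-greedy agent on a bandit with Gaussian rewards.

    Illustrative only: all names and values here are hypothetical.
    Returns (pull counts per arm, value estimates per arm, total reward).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running mean reward of each arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: pick the arm with the best current estimate
            # (ties resolve to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])

        reward = rng.gauss(true_means[arm], noise_std)
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward


if __name__ == "__main__":
    # Pure exploitation with noise-free rewards: the agent never tries arm 1.
    print(epsilon_greedy_bandit([0.1, 0.9], epsilon=0.0, steps=50, noise_std=0.0))
    # Some exploration lets the agent discover the better arm.
    print(epsilon_greedy_bandit([0.1, 0.9], epsilon=0.1, steps=50, noise_std=0.0))
```

With `epsilon=0.0` and noise-free rewards, the agent pulls the first arm once, sees a positive reward, and never leaves it, even though the second arm pays more. Any `epsilon > 0` gives it a chance to find the better arm at the cost of some random pulls, which is the tension the dilemma describes.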

The full text of this article can be downloaded at: http://forexkbc.com/resource/share/2000003599
