Journal of Jilin University (Information Science Edition), Vol. 41, No. 3, May 2023

Proximal Policy Optimization Algorithm Based on Correntropy Induced Metric

ZHANG Huizhen, WANG Qiang
(School of Electrical and Information Engineering, Northeast Petroleum University, Daqing 163318, China)

Abstract: In deep reinforcement learning, the proximal policy optimization (PPO) algorithm performs well on many experimental tasks. However, in KL-PPO, the adaptive KL (Kullback-Leibler) divergence penalty is asymmetric, which degrades the efficiency of its policy updates. To address this, a proximal policy optimization algorithm based on the correntropy induced metric, CIM-PPO (Correntropy Induced Metric PPO), is proposed. Because the CIM is symmetric, it is better suited to characterizing the difference between the old and new policies, so the policy can be updated more accurately and the negative effect of the KL asymmetry is mitigated. Experiments on OpenAI Gym show that, compared with the mainstream proximal policy optimization algorithms Clip-PPO and KL-PPO, the proposed algorithm achieves more than 50% higher reward, converges about 500~1100 episodes faster across different environments, and also exhibits good robustness.

Keywords: Kullback-Leibler (KL) divergence; proximal policy optimization (PPO); correntropy induced metric (CIM); surrogate objective; deep reinforcement learning

CLC number: TP273    Document code: A

0 Introduction

Proximal policy optimization is a model-free deep reinforcement learning al…
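The abstract's central claim is that the KL penalty used in KL-PPO is asymmetric (D_KL(p‖q) ≠ D_KL(q‖p)), whereas the CIM is symmetric in its two arguments. The sketch below illustrates, under stated assumptions, how a CIM penalty could stand in for the KL penalty in a KL-PPO-style surrogate objective. It assumes the standard Gaussian-kernel definition of correntropy from the correntropy literature and discrete action distributions; the function names, the kernel bandwidth sigma, and the penalty coefficient beta are illustrative and are not taken from the paper, whose exact formulation may differ.

import torch

def gaussian_kernel(x, sigma=1.0):
    # Gaussian kernel kappa_sigma(x) = exp(-x^2 / (2 sigma^2)); note kappa_sigma(0) = 1.
    return torch.exp(-x.pow(2) / (2.0 * sigma ** 2))

def cim(p, q, sigma=1.0):
    # Correntropy induced metric between two discrete action distributions
    # (standard form from the correntropy literature):
    #   CIM(p, q) = sqrt(kappa(0) - E[kappa(p - q)]).
    # Symmetric by construction: cim(p, q) == cim(q, p).
    v = gaussian_kernel(p - q, sigma).mean(dim=-1)   # sample estimate of correntropy
    return torch.sqrt((1.0 - v).clamp_min(1e-12))    # kappa(0) = 1; clamp guards the sqrt

def cim_ppo_surrogate(new_log_prob, old_log_prob, advantages,
                      new_probs, old_probs, beta=1.0, sigma=1.0):
    # KL-PPO-style penalised surrogate with the KL term replaced by a CIM term:
    #   L(theta) = E[ r_t(theta) * A_t ] - beta * CIM(pi_old, pi_theta)
    ratio = torch.exp(new_log_prob - old_log_prob)   # importance ratio r_t(theta)
    return (ratio * advantages).mean() - beta * cim(old_probs, new_probs, sigma).mean()

Because cim(p, q) equals cim(q, p), the penalty does not depend on the direction in which the old and new policies are compared, which is exactly the property the abstract claims the KL divergence lacks.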