
Reinforcement Learning from Diverse Human Preferences

Wanqi Xue, Bo An, Shuicheng Yan, Zhongwen Xu

IJCAI 2024 Conference

August 2024

Keywords: Reinforcement Learning, Human Preferences, Human Feedback, Rewards

Abstract:

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels over behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMControl and Meta-world and shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
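The three ingredients the abstract names — preference-based reward learning, a latent-space constraint toward the prior, and confidence-based ensembling — can be illustrated with a minimal sketch. This is not the paper's implementation; the Bradley-Terry preference loss is the standard choice in preference-based RL, the KL term assumes a diagonal-Gaussian latent with a standard-normal prior, and the median-agreement confidence weighting is a hypothetical stand-in for the paper's ensembling rule.

```python
import numpy as np

def bradley_terry_loss(r_a, r_b, pref):
    """Standard preference loss: P(a > b) = sigmoid(sum(r_a) - sum(r_b)).

    r_a, r_b: per-step reward predictions along two trajectory segments.
    pref: human label, 1.0 if segment a is preferred, 0.0 otherwise.
    """
    logit = np.sum(r_a) - np.sum(r_b)
    p = 1.0 / (1.0 + np.exp(-logit))
    p = np.clip(p, 1e-8, 1.0 - 1e-8)  # numerical safety
    return -(pref * np.log(p) + (1.0 - pref) * np.log(1.0 - p))

def kl_to_prior(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)): the constraint pulling the reward
    model's latent space toward the prior distribution."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def ensemble_reward(preds, temperature=1.0):
    """Confidence-weighted ensemble prediction (assumed scheme):
    members far from the ensemble median are downweighted."""
    preds = np.asarray(preds, dtype=float)
    dev = np.abs(preds - np.median(preds))       # disagreement per member
    w = np.exp(-dev / temperature)               # softer weight when far off
    w = w / np.sum(w)
    return float(np.dot(w, preds))

# Training objective per labeled pair: preference loss + beta * latent KL.
def total_loss(r_a, r_b, pref, mu, logvar, beta=0.1):
    return bradley_terry_loss(r_a, r_b, pref) + beta * kl_to_prior(mu, logvar)
```

With equal segment returns the preference loss reduces to log 2, and the KL term vanishes at the prior (mu = 0, logvar = 0), so the regularizer only penalizes latents that drift from it.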
