A fundamental observation in nature is that nearly all forms of locomotion are inherently periodic.
The rhythmic gaits of quadrupeds, the oscillatory motions of fish, and even human walking patterns
share a distinct periodic structure. However, existing unsupervised skill discovery methods have rarely considered the role of
periodicity. To address this gap, we propose a novel unsupervised skill discovery objective for learning periodic
behaviors, which we call Periodic Skill Discovery (PSD).
PSD is a framework for unsupervised skill discovery
that captures the periodic structure of behaviors by mapping states into a
circular latent space. By optimizing a constrained objective that encodes
temporal distance, PSD enables agents to learn periodic behaviors with
controllable periods across multiple timescales.
1. Circular Latent Representation
PSD trains an encoder \( \phi \) that maps each state \( s \) to a point on a circle of diameter \( L \), where \( L \) denotes the period variable. The objective encourages states \( s_t \) and \( s_{t+L} \) to lie at diametrically opposite points on the circle while maintaining uniform angular spacing between consecutive states.
These constraints ensure that the latent representation forms a regular \( 2L \)-gon on the circle, making each skill’s trajectory periodic with a period of \( 2L \) steps.
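To see where the \( 2L \)-gon comes from, the two conditions above can be written as distance constraints in latent space. This is a sketch derived from the stated geometry (a circle of diameter \( L \), hence radius \( L/2 \)) and is not necessarily the exact form of the PSD objective:

\[
\|\phi(s_{t+L}) - \phi(s_t)\| = L,
\qquad
\|\phi(s_{t+1}) - \phi(s_t)\| = 2 \cdot \frac{L}{2} \, \sin\!\left(\frac{1}{2} \cdot \frac{\pi}{L}\right) = L \sin\!\left(\frac{\pi}{2L}\right).
\]

The first equality says that \( s_t \) and \( s_{t+L} \) are separated by a full diameter. The second is the chord length for a central angle of \( \pi / L \): half the circle (angle \( \pi \)) is covered in \( L \) equal steps. For example, with \( L = 2 \) the latent points form a square inscribed in a circle of diameter 2, whose side length is \( 2 \sin(\pi/4) = \sqrt{2} \).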
2. Single-step Intrinsic Reward
While a circular representation is being learned, the RL agent is jointly trained with a single-step intrinsic reward
that encourages periodic behaviors. Since the circular latent space is designed to capture periodicity,
rewarding the policy for moving along this circular space naturally promotes the learning of periodic
behaviors.
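Formally, the reward \( r_{\text{PSD}} \) is defined by the deviation of the latent distance between consecutive states from the
ideal single-step length \( L \sin(\pi/(2L)) \), the side length of the regular \( 2L \)-gon.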
Maximizing \( r_{\text{PSD}} \) makes the policy follow a circular trajectory in latent space, yielding behaviors that naturally repeat every \( 2L \) steps.
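The following minimal sketch illustrates the idea of a single-step reward based on this deviation. The exact functional form of \( r_{\text{PSD}} \) is not given here, so the negative squared deviation used below is an assumption, and the names `ideal_step_length`, `psd_reward_sketch`, and `polygon_latents` are hypothetical. Latent points are plain 2D tuples standing in for encoder outputs \( \phi(s) \).

```python
import math


def ideal_step_length(L: int) -> float:
    """Side length of a regular 2L-gon inscribed in a circle of diameter L."""
    return L * math.sin(math.pi / (2 * L))


def psd_reward_sketch(z, z_next, L: int) -> float:
    """Hypothetical single-step reward (an assumption, not the paper's formula).

    Returns the negative squared deviation of the latent step length
    ||z_next - z|| from the ideal single-step length, so the maximum
    value 0 is reached exactly when the step matches the ideal length.
    """
    step = math.dist(z, z_next)
    return -((step - ideal_step_length(L)) ** 2)


def polygon_latents(L: int):
    """Ideal latent trajectory: the 2L vertices of a regular 2L-gon
    on a circle of diameter L (radius L / 2), in traversal order."""
    radius = L / 2
    return [
        (radius * math.cos(math.pi * k / L), radius * math.sin(math.pi * k / L))
        for k in range(2 * L)
    ]


if __name__ == "__main__":
    L = 4
    zs = polygon_latents(L)
    # A step along the ideal polygon incurs (numerically) zero penalty.
    print("ideal step reward:", psd_reward_sketch(zs[0], zs[1], L))
    # Standing still in latent space is penalized.
    print("stalled reward:", psd_reward_sketch(zs[0], zs[0], L))
    # States L steps apart sit at diametrically opposite points.
    print("distance after L steps:", math.dist(zs[0], zs[L]))
```

Under this assumed form, a policy that stalls or jumps too far in latent space receives a lower reward than one that advances by exactly one polygon side per step, which is the behavior the text describes.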
3. Adaptive Sampling Method
To enable the agent to discover a maximally diverse range of periods without any prior knowledge of which periods it can achieve, we introduce an adaptive sampling method that dynamically adjusts the range from which periods are sampled during training.