Periodic Skill Discovery

Abstract

Unsupervised skill discovery in reinforcement learning (RL) aims to learn diverse behaviors without relying on external rewards. However, current methods often overlook the periodic nature of learned skills, focusing instead on increasing the mutual dependency between states and skills or maximizing the distance traveled in latent space. Considering that many robotic tasks—particularly those involving locomotion—require periodic behaviors across varying timescales, the ability to discover diverse periodic skills is essential. Motivated by this, we propose Periodic Skill Discovery (PSD), a framework that discovers periodic behaviors in an unsupervised manner. The key idea of PSD is to train an encoder that maps states to a circular latent space, thereby naturally encoding periodicity in the latent representation. By capturing temporal distance, PSD can effectively learn skills with diverse periods in complex robotic tasks, even with pixel-based observations. We further show that these learned skills achieve high performance on downstream tasks such as hurdling. Moreover, integrating PSD with an existing skill discovery method offers more diverse behaviors, thus broadening the agent’s repertoire.

Periodic Skill Discovery

A fundamental observation in nature is that nearly all forms of locomotion are inherently periodic. Rhythmic gaits of quadrupeds, the oscillatory motions of fish, and even human walking patterns share a distinct periodic structure. However, existing unsupervised skill discovery methods have rarely addressed the role of periodicity. To address this gap, we propose a novel unsupervised skill discovery objective for learning periodic behaviors, which we call Periodic Skill Discovery (PSD).

PSD captures the periodic structure of behaviors by mapping states into a circular latent space. By optimizing a constrained objective that encodes temporal distance, it enables agents to learn periodic behaviors with controllable periods across multiple timescales.

1. Circular Latent Representation

PSD trains an encoder \( \phi \) that maps each state \( s \) to a point on a circle of diameter \( L \), where \( L \) denotes the period variable. The objective encourages states \( s_t \) and \( s_{t+L} \) to lie at opposite points on the circle while maintaining uniform angular spacing between consecutive states:

\[
\begin{aligned}
\mathcal{J}_{\text{PSD}} = \mathbb{E}\bigl[ &\|\phi_L(s_{t+L}) - \phi_L(s_t)\| - k\,\|\phi_L(s_{t+L}) + \phi_L(s_t)\| \bigr] \\[4pt]
\text{s.t.}\quad & \|\phi_L(s_{t+L}) - \phi_L(s_t)\| \le L, \\[3pt]
& \|\phi_L(s_{t+1}) - \phi_L(s_t)\| \le L\,\sin\!\left(\tfrac{\pi}{2L}\right).
\end{aligned}
\]

These constraints ensure that the latent representation forms a regular \( 2L \)-gon on the circle, making each skill’s trajectory periodic with a period of \( 2L \) steps.
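The geometry behind these constraints can be checked numerically. The sketch below is not the authors' code: it assumes a two-dimensional latent space and uses a hypothetical helper, `ideal_latents`, to place \( 2L \) points uniformly on a circle of diameter \( L \). It then verifies that states \( L \) steps apart sit at distance \( L \) (opposite points, so their sum vanishes) and that consecutive states are separated by \( L \sin(\pi/(2L)) \), which is when both constraints hold with equality.

```python
import math

def ideal_latents(L):
    """Vertices of a regular 2L-gon on a circle of diameter L (radius L/2).

    Hypothetical helper for illustration; the latent dimension (2) is an assumption.
    """
    radius = L / 2.0
    # Consecutive vertices are separated by an angle of 2*pi / (2L) = pi / L.
    return [
        (radius * math.cos(math.pi * t / L), radius * math.sin(math.pi * t / L))
        for t in range(2 * L)
    ]

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

L = 5
z = ideal_latents(L)
n = len(z)  # n == 2L, the period in steps
ideal_step = L * math.sin(math.pi / (2 * L))

# Largest deviation of the single-step distance from L * sin(pi / (2L)).
max_step_err = max(abs(dist(z[t], z[(t + 1) % n]) - ideal_step) for t in range(n))
# Largest deviation of the L-step distance from the diameter L.
max_half_err = max(abs(dist(z[t], z[(t + L) % n]) - L) for t in range(n))
# Largest norm of phi(s_{t+L}) + phi(s_t); zero when the points are opposite.
max_sum_norm = max(
    math.hypot(z[t][0] + z[(t + L) % n][0], z[t][1] + z[(t + L) % n][1])
    for t in range(n)
)

print(max_step_err, max_half_err, max_sum_norm)
```

All three printed values should be at floating-point noise level, which is the configuration the objective rewards: the first term is at its constrained maximum \( L \) while the penalized sum term is zero.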

2. Single-step Intrinsic Reward

While a circular representation is being learned, the RL agent is jointly trained with a single-step intrinsic reward that encourages periodic behaviors. Since the circular latent space is designed to capture periodicity, rewarding the policy for moving along this circular space naturally promotes the learning of periodic behaviors.

Formally, the reward penalizes the deviation of the latent step length from the ideal single-step length \( L \sin\bigl(\tfrac{\pi}{2L}\bigr) \):

\[
r_{\text{PSD}}(s_t, s_{t+1}, L) = \exp\!\Big(-\kappa \big( \|\phi_L(s_{t+1}) - \phi_L(s_t)\| - L \sin\!\tfrac{\pi}{2L} \big)^2\Big).
\]

Maximizing \( r_{\text{PSD}} \) makes the policy follow a circular trajectory in latent space, yielding behaviors that naturally repeat every \( 2L \) steps.
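As a minimal sketch of this reward, the function below operates directly on latent vectors, assuming the encoder outputs \( \phi_L(s_t) \) and \( \phi_L(s_{t+1}) \) are already available as plain sequences of floats. The function name `psd_reward` and the default `kappa=1.0` are illustrative assumptions, not values taken from the paper.

```python
import math

def psd_reward(z_t, z_next, L, kappa=1.0):
    """Single-step reward from latent vectors z_t = phi_L(s_t), z_next = phi_L(s_{t+1}).

    kappa=1.0 is a placeholder value chosen for illustration only.
    """
    ideal_step = L * math.sin(math.pi / (2 * L))
    step = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_next, z_t)))
    return math.exp(-kappa * (step - ideal_step) ** 2)

L = 5
radius = L / 2.0
# Two consecutive vertices of the ideal 2L-gon (angle pi / L apart).
z0 = (radius, 0.0)
z1 = (radius * math.cos(math.pi / L), radius * math.sin(math.pi / L))

r_ideal = psd_reward(z0, z1, L)   # step length matches the ideal chord
r_stuck = psd_reward(z0, z0, L)   # latent does not move at all
print(r_ideal, r_stuck)
```

The reward peaks at 1 when the latent step equals the ideal chord length and decays for steps that are either too short or too long, so a policy that stalls in latent space is penalized just as one that jumps too far.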

3. Adaptive Sampling Method

To enable the agent to discover a maximally diverse range of periods without any prior knowledge of its inherent period ranges, we introduce an adaptive sampling method that dynamically adjusts the sampling range during training.

As shown in the figure above, the feasible period range of each agent (HalfCheetah on the left, Humanoid on the right) gradually expands as training progresses. The key idea is to evaluate the policy conditioned on the boundaries of the current sampling range. When the policy successfully maintains periodicity at a bound, the range is expanded in that direction. Conversely, if the policy fails to maintain periodicity, that bound is rejected and its previous value is restored. This mechanism lets each agent discover its own dynamically feasible period bounds, thereby broadening the range of achievable periods.
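The expand-or-restore rule can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' procedure: the periodicity check is abstracted into a hypothetical predicate `maintains_periodicity`, the expansion step `delta` is fixed, and the toy feasible range of 4 to 12 is invented purely for the demonstration.

```python
def update_bound(current, previous, maintains_periodicity, delta):
    """One update of a single bound of the sampling range for the period variable L.

    Returns (new_current, new_previous). `delta` is signed: positive for the
    upper bound, negative for the lower bound. All names are hypothetical.
    """
    if maintains_periodicity(current):
        # Bound accepted: remember it and propose a wider one.
        return current + delta, current
    # Bound rejected: restore the previously accepted value.
    return previous, previous

def toy_check(L):
    """Stand-in for evaluating the policy at period variable L (made-up range)."""
    return 4 <= L <= 12

low, prev_low = 7, 7
high, prev_high = 9, 9
for _ in range(10):
    low, prev_low = update_bound(low, prev_low, toy_check, delta=-1)
    high, prev_high = update_bound(high, prev_high, toy_check, delta=+1)

print("last accepted range:", (prev_low, prev_high))
```

In this toy run the accepted bounds settle at the edges of the made-up feasible range, while the proposed bound keeps probing one step beyond it. In practice the check would come from rolling out the policy at the boundary period, and the step size or probing schedule is a design choice that the text above does not specify.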

Results

1. State-based Environment: Latent space of PSD (right)

2. Pixel-based Environment: Top-view (left), Real-time observation (center), Latent space of PSD (right)

3. PSD with METRA (Park et al., 2023): Latent space of METRA (left), Latent space of PSD (right)

Closing Remarks

Our experiments focus primarily on locomotion tasks because they are well suited to showcasing multi-timescale behaviors, but the PSD framework is applicable to any domain that exhibits periodic structure. An interesting future direction is to extend PSD to non-periodic tasks, such as robotic manipulation, by generalizing the latent geometry beyond circular structures. Moreover, directly integrating frequency-domain analysis, such as Fourier representations, into the training process could further improve PSD's ability to capture temporal patterns.

BibTeX

@article{park2025periodic,
  title={Periodic Skill Discovery},
  author={Park, Jonghae and Cho, Daesol and Lee, Jusuk and Shim, Dongseok and Jang, Inkyu and Kim, H Jin},
  journal={arXiv preprint arXiv:2511.03187},
  year={2025}
}