This page organizes my papers into distinct research areas and provides an overview of my work. Naturally, it is selective in both the research areas and the papers listed. For a more complete picture, please see the publications page.

MODEL ACCELERATION

RLx2: Training a Sparse Deep Reinforcement Learning Model from Scratch [OpenReview]
Yiqin Tan*, Pihe Hu*, Ling Pan, Longbo Huang
arXiv preprint, 2022.
Abstract
Training deep reinforcement learning (DRL) models usually requires high computation costs. Therefore, compressing DRL models possesses immense potential for training acceleration and model deployment. However, existing methods that generate small models mainly adopt the knowledge-distillation-based approach of iteratively training a dense network, such that the training process still demands massive computing resources. Indeed, sparse training from scratch in DRL has not been well explored and is particularly challenging due to non-stationarity in bootstrap training. In this work, we propose a novel sparse DRL training framework, “the Rigged Reinforcement Learning Lottery” (RLx2), which is capable of training a DRL agent *using an ultra-sparse network throughout* for off-policy reinforcement learning. The systematic RLx2 framework contains three key components: gradient-based topology evolution, multi-step Temporal Difference (TD) targets, and a dynamic-capacity replay buffer. RLx2 enables efficient topology exploration and robust Q-value estimation simultaneously. We demonstrate state-of-the-art sparse training performance in several continuous control tasks using RLx2, showing \(7.5\times\)-\(20\times\) model compression with less than \(3\%\) performance degradation, and up to \(20\times\) and \(50\times\) FLOPs reduction for training and inference, respectively.
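For intuition, the sketch below shows the kind of drop-and-grow update that gradient-based topology evolution performs, in the spirit of RigL-style dynamic sparse training. It is a minimal sketch, not the exact RLx2 procedure: the function name, the drop fraction, and the NumPy formulation are all illustrative assumptions.

```python
import numpy as np

def drop_and_grow(weights, grads, mask, drop_frac=0.3):
    """One topology-evolution step on a sparse layer (illustrative):
    drop the active weights with the smallest magnitude, then regrow
    the same number of inactive connections where the dense gradient
    magnitude is largest (RigL-style)."""
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(mask == 0)
    n_update = int(drop_frac * active.size)

    # Drop: deactivate the active weights with the smallest |w|.
    drop_idx = active[np.argsort(np.abs(weights.flat[active]))[:n_update]]
    mask.flat[drop_idx] = 0
    weights.flat[drop_idx] = 0.0

    # Grow: activate the inactive connections with the largest |grad|,
    # initializing them to zero so the current policy is not perturbed.
    grow_idx = inactive[np.argsort(-np.abs(grads.flat[inactive]))[:n_update]]
    mask.flat[grow_idx] = 1
    weights.flat[grow_idx] = 0.0
    return weights, mask
```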

REINFORCEMENT LEARNING THEORY

1. Towards Minimax Optimal Reward-free Reinforcement Learning in Linear MDPs [OpenReview]
Pihe Hu*, Yu Chen*, Longbo Huang
ICLR, 2022.
Abstract
We study reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, an agent first interacts with the environment without access to the reward function during the exploration phase. In the subsequent planning phase, it is given a reward function and asked to output an \(\epsilon\)-optimal policy. We propose a novel algorithm, LSVI-RFE, for the linear MDP setting, where the transition probabilities and reward functions are linear in a feature mapping. We prove an \(\widetilde{O}(H^{4} d^{2}/\epsilon^2)\) sample complexity upper bound for LSVI-RFE, where \(H\) is the episode length and \(d\) is the feature dimension. We also establish a sample complexity lower bound of \(\Omega(H^{3} d^{2}/\epsilon^2)\). To the best of our knowledge, LSVI-RFE is the first computationally efficient algorithm to achieve minimax optimal sample complexity in linear MDP settings, up to an \(H\) factor and logarithmic factors. Our LSVI-RFE algorithm is based on a novel variance-aware exploration mechanism that avoids the overly conservative exploration of prior works. Our sharp bound relies on decoupling the UCB bonuses across the two phases and on a Bernstein-type self-normalized bound, which remove the extra dependence of the sample complexity on \(H\) and \(d\), respectively.
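For a rough sense of the machinery underneath this family of algorithms, here is a minimal sketch of one generic least-squares value iteration step in a linear MDP: ridge-regress the Bellman target on the features, then add an elliptical UCB bonus. The variance-aware mechanism that LSVI-RFE actually uses is more refined; `beta`, `lam`, and the function signature are assumptions.

```python
import numpy as np

def lsvi_step(features, rewards, next_values, beta, lam=1.0):
    """Generic optimistic LSVI step (sketch). `features` is an (n, d)
    array of phi(s, a) vectors for the visited state-action pairs;
    `next_values` holds the estimated V_{h+1}(s') for each sample."""
    n, d = features.shape
    Lambda = lam * np.eye(d) + features.T @ features   # regularized Gram matrix
    targets = rewards + next_values                    # Bellman targets r + V_{h+1}(s')
    w = np.linalg.solve(Lambda, features.T @ targets)  # ridge-regression weights
    # Elliptical UCB bonus: beta * sqrt(phi^T Lambda^{-1} phi) per sample.
    Lambda_inv = np.linalg.inv(Lambda)
    bonus = beta * np.sqrt(np.einsum('nd,dk,nk->n', features, Lambda_inv, features))
    return features @ w + bonus                        # optimistic Q-value estimates
```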

2. Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation [pdf]
Pihe Hu, Yu Chen, Longbo Huang
ICML, 2022.
Abstract
We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping \(\boldsymbol{\phi}(s,a)\). Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP) and propose a novel computationally efficient algorithm, LSVI-UCB\(^+\), which achieves an \(\widetilde{O}(Hd\sqrt{T})\) regret bound, where \(H\) is the episode length, \(d\) is the feature dimension, and \(T\) is the number of steps. LSVI-UCB\(^+\) builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials and a refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, closing the \(\sqrt{Hd}\) gap between the best known upper bound of \(\widetilde{O}(\sqrt{H^3d^3T})\) and the lower bound of \(\Omega(Hd\sqrt{T})\) for linear MDPs.
Erratum: an issue in constructing the over-optimistic value function is addressed by the "rare-switching" mechanism in [He et al., 2022], and the fixed version is given in [arXiv].
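The weighted ridge regression at the core of LSVI-UCB\(^+\) down-weights high-variance samples so that low-variance transitions dominate the fit. Below is a minimal sketch assuming per-sample variance estimates are already given; the paper's variance estimators and Bernstein-type bonus are constructed far more carefully.

```python
import numpy as np

def weighted_ridge(features, targets, variances, lam=1.0):
    """Variance-weighted ridge regression (sketch): solve
    theta = (lam*I + sum_n w_n phi_n phi_n^T)^{-1} sum_n w_n phi_n y_n
    with per-sample weights w_n = 1 / sigma_n^2."""
    w = 1.0 / np.maximum(variances, 1e-8)        # guard against zero variance
    Xw = features * w[:, None]                   # each row scaled by its weight
    Lambda = lam * np.eye(features.shape[1]) + features.T @ Xw
    theta = np.linalg.solve(Lambda, Xw.T @ targets)
    return theta, Lambda                         # Lambda also feeds the bonus term
```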

DEEP REINFORCEMENT LEARNING

Effective Multi-User Delay-Constrained Scheduling with Deep Recurrent Reinforcement Learning [pdf]
Pihe Hu, Ling Pan, Yu Chen, Zhixuan Fang, Longbo Huang
MobiHoc, 2022.
Abstract
Multi-user delay-constrained scheduling is important in many real-world applications, including wireless communication, live streaming, and cloud computing. Yet it poses a critical challenge, since the scheduler must make real-time decisions that guarantee delay and resource constraints simultaneously, without prior knowledge of the system dynamics, which can be time-varying and hard to estimate. Moreover, many practical scenarios suffer from partial observability, e.g., due to sensing noise or hidden correlations. To tackle these challenges, we propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient (RSD4), a data-driven method based on a Partially Observed Markov Decision Process (POMDP) formulation. RSD4 guarantees resource and delay constraints via a Lagrangian dual and delay-sensitive queues, respectively. It also efficiently handles partial observability with a memory mechanism enabled by a recurrent neural network (RNN), and introduces user-level decomposition and node-level merging to ensure scalability. Extensive experiments on simulated and real-world datasets demonstrate that RSD4 is robust to system dynamics and partially observable environments, and achieves superior performance over existing DRL and non-DRL-based methods.
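As a small illustration of how a Lagrangian dual can enforce an average resource constraint in this kind of scheduler, here is a sketch of a projected dual-ascent step; the agent would then be trained on the penalized reward \(r - \lambda \cdot \mathrm{cost}\). The names and the step size are illustrative assumptions, not the exact RSD4 update.

```python
def dual_ascent_step(lmbda, avg_resource_usage, budget, step=1e-3):
    """Projected gradient ascent on the Lagrange multiplier pricing
    resource usage: lmbda rises while the running-average usage
    exceeds the budget, and is projected back onto [0, inf)."""
    lmbda += step * (avg_resource_usage - budget)
    return max(lmbda, 0.0)

# Illustrative usage inside a training loop:
# reward_penalized = reward - lmbda * resource_cost
# lmbda = dual_ascent_step(lmbda, usage_tracker.mean(), budget)
```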

NETWORK OPTIMIZATION

1. Optimal Multi-User Scheduling for the Unbalanced Full-Duplex Buffer-Aided Relay Systems [pdf]
Cheng Li, Pihe Hu, Yao Yao, Bin Xia, Zhiyong Chen
IEEE TWC, 2019.
Abstract
Multi-user scheduling is challenging due to the channel-unbalance problem, which degrades system performance. In this paper, two optimal multi-user scheduling schemes that maximize the system throughput are proposed for the fixed and adaptive power transmission scenarios of the full-duplex (FD) multi-user buffer-aided relay system, respectively. An independent and non-identically distributed (i.ni.d.) model is used to characterize the unbalanced channels of the different links. In particular, the optimal weight factor of each pair is designed based on statistical channel state information in both scenarios. With these weight factors, the proposed schemes are able to balance the throughput gaps between different links. In addition, we propose an optimal power allocation scheme with closed-form expressions under an average power constraint for the adaptive power transmission scenario. By combining the optimal weight factors and the power allocation scheme, a novel optimal selection function is obtained to facilitate the selection process. For the specific case of i.ni.d. Rayleigh fading, the system throughput is further derived in both scenarios. The theoretical analysis is verified by numerical simulations, and the results demonstrate the superiority of the proposed schemes.
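A minimal sketch of what a weighted selection function looks like in this setting: in each slot, serve whichever \(S_i-R\) or \(R-D_i\) link has the largest weighted instantaneous rate. The paper derives the optimal weight factors from statistical channel state information (and pairs them with the power allocation in the adaptive scenario); here the weights are plain inputs, and all names are assumptions.

```python
import numpy as np

def select_link(rates_sr, rates_rd, weights_sr, weights_rd):
    """Weighted max-rate link selection (sketch). rates_* hold the
    instantaneous rates of the S_i-R and R-D_i links; weights_* are
    the per-link weight factors that compensate channel unbalance."""
    scores = np.concatenate([weights_sr * rates_sr, weights_rd * rates_rd])
    k = int(np.argmax(scores))
    n = len(rates_sr)
    return ('S->R', k) if k < n else ('R->D', k - n)
```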

2. Optimal multi-user scheduling of buffer-aided relay systems [pdf]
Pihe Hu, Cheng Li, Dingjie Xu, Bin Xia
IEEE ICC, 2019.
Abstract
Multi-user scheduling is a challenging problem in relaying scenarios. Traditional schemes, which are based on instantaneous signal-to-interference-plus-noise ratios (SINRs), cannot resolve the inherent disparities in quality between different links; hence, system performance is always limited by the weaker links. In this paper, taking a whole-system throughput perspective, we propose an optimal multi-user scheduling scheme for multi-user full-duplex (FD) buffer-aided relay systems. We first formulate the throughput maximization problem. Then, from the structure of the Karush-Kuhn-Tucker conditions, we obtain the optimal decision functions and the optimal weight factors of the different links for the proposed scheme. Simulation results show that the proposed scheme resolves not only the quality disparities between the \(S_i-R\) and \(R-D_i\) links, but also those between different \(S_i-R\) or \(R-D_i\) links, which can serve as guidance in the design of practical systems.
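To make the channel-unbalance problem concrete, here is a toy Monte Carlo sketch (all parameters invented): under i.ni.d. Rayleigh fading, plain unweighted max-rate selection concentrates almost all service on the strong links, which is exactly the disparity the optimal weight factors are designed to correct.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_slots = 4, 50_000
# i.ni.d. Rayleigh fading: each link has its own mean SNR.
mean_snr = rng.uniform(1.0, 10.0, 2 * n_pairs)   # S_i-R links, then R-D_i links

served = np.zeros(2 * n_pairs)
for _ in range(n_slots):
    snr = mean_snr * rng.exponential(size=2 * n_pairs)  # Rayleigh => exponential SNR
    served[np.argmax(np.log2(1.0 + snr))] += 1          # unweighted max-rate pick

print(served / n_slots)  # selection frequencies skew heavily toward strong links
```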