All datasets have been prepared. Both tracks will be hosted on CodaLab.


Human-centric video matting

This challenge aims at efficient and accurate portrait matting in videos, which can be applied to real video conferencing scenarios such as virtual background replacement. To this end, we collect 320 videos from real-world video conferencing scenarios using VooV Meeting and annotate the alpha matte at 5~6 fps. This portrait matting dataset covers a wide range of scenarios, including both indoor scenes (e.g., office, bedroom) and outdoor scenes (e.g., park, street). As in image matting, several popular metrics such as MSE and SAD are used to evaluate performance, with MSE serving as the key metric for ranking. The entire dataset is divided into training (220), validation (50) and testing (50) sets. The dataset can be downloaded from both Google Drive and Baidu Cloud (extract code: f1o7). More details can be found at

Split        Clips   Annotated Frames
Train          220   19,449
Validation      50   4,377
Test            50   4,183
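The MSE and SAD evaluation metrics mentioned above can be sketched as follows. Note that the normalization choices here (MSE averaged per pixel, SAD reported in thousands as is common in the matting literature) are assumptions, not the official evaluation protocol.

```python
import numpy as np

def matting_mse(pred, gt):
    """Mean squared error between predicted and ground-truth alpha mattes.

    Both inputs are float arrays with alpha values in [0, 1].
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean((pred - gt) ** 2))

def matting_sad(pred, gt):
    """Sum of absolute differences, divided by 1,000 (a convention from
    the matting literature; the challenge scaling may differ).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.abs(pred - gt).sum() / 1000.0)
```

For video, these per-frame scores would typically be averaged over all annotated frames of a clip.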


Human-centric video coding for action analysis

This challenge aims to develop new image/video pre-editing methods for human-centric frame reconstruction jointly with the related analytics, providing the technical basis for highly efficient video compression and transmission in conferencing scenarios. To this end, PKU-MMD is adopted as the evaluation dataset. The dataset features human-centric videos captured in indoor scenes. In this challenge, we select the video sequences of 10 action categories for training and testing. In the evaluation, we will calculate two kinds of metrics on the decoded results of the pre-edited videos: human-region frame reconstruction error and human analytics accuracy. For the former, we will calculate per-frame PSNR and SSIM over the regions in the bounding boxes that tightly enclose persons. For the latter, we will evaluate human analytics performance using state-of-the-art methods on the reconstructed videos.
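The human-region reconstruction error can be sketched as PSNR computed only inside a person bounding box. The box format `(x0, y0, x1, y1)` is an assumption here; the challenge may use a different convention, and SSIM would be computed over the same cropped region.

```python
import numpy as np

def bbox_psnr(recon, ref, bbox, max_val=255.0):
    """PSNR over the region enclosed by a person bounding box.

    recon, ref: arrays of identical shape (H, W) or (H, W, C).
    bbox: (x0, y0, x1, y1) pixel coordinates (an assumed format).
    """
    x0, y0, x1, y1 = bbox
    a = np.asarray(recon, dtype=np.float64)[y0:y1, x0:x1]
    b = np.asarray(ref, dtype=np.float64)[y0:y1, x0:x1]
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical crops: PSNR is unbounded
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

When a frame contains several persons, the per-box scores would be aggregated (e.g., averaged) per frame before averaging over the clip.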

We have released the training and validation splits of the dataset, both of which can be downloaded from Google Drive and Microsoft OneDrive. All videos and images are 512×512 in resolution. The properties of all splits are summarized below:

Split        Clips   Frames    Codec
Train        1,128   109,624   HEVC QP=1
Validation     125   13,411    HEVC QP=1
Test           176   16,861    JPEG QF=2

The training and validation sets are encoded with x265 at QP=1, which guarantees that all videos are visually lossless. The testing split is provided as individual frames, encoded as JPEG with QF=2 for the same reason. Participants are required to submit exactly the same number of frames, in either JPEG or PNG format, as the testing set provides. Competitors are allowed to pre-edit the video frames. We will then compress these frames under the same bit-rate constraint with existing codecs, e.g., HEVC, and calculate the evaluation metrics on the decoded frames. More details can be found at
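Before submitting, it is worth verifying locally that a submission folder satisfies the frame-count and format constraints described above. This is a hypothetical helper, not an official checker; the expected count would be taken from the released testing split (e.g., 16,861 frames).

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}  # JPEG or PNG, per the rules

def validate_submission(submission_dir, expected_count):
    """Check that a submission folder contains exactly expected_count
    frames, all in JPEG or PNG format. Returns the sorted frame paths.
    """
    root = Path(submission_dir)
    frames, bad = [], []
    for p in sorted(root.iterdir()):
        if not p.is_file():
            continue
        (frames if p.suffix.lower() in ALLOWED_SUFFIXES else bad).append(p)
    if bad:
        raise ValueError(f"unsupported file types: {[p.name for p in bad]}")
    if len(frames) != expected_count:
        raise ValueError(
            f"expected {expected_count} frames, found {len(frames)}"
        )
    return frames
```

A mismatched frame count or a stray file type fails fast with a descriptive error, which is cheaper than discovering the problem after upload.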