Invited Talks (YouTube Recordings)

Speaker | Topic | Recordings
Zhengyou Zhang, Tencent | Transcending Space Through Immersive Telecommunications | [YouTube]
Ira Kemelmacher-Shlizerman, U Washington | Future of Communication | [YouTube] [Bilibili]
Ming-Yu Liu, NVIDIA | Face-VID2VID: Neural Talking Head Synthesis For Video Conf | [YouTube]
Chuo-Ling Chang & Tingbo Hou, Google | Cross-Platform ML for Video Conf with MediaPipe | [YouTube] [Bilibili]
Catherine Qi Zhao, U Minnesota | Attention in AI Tasks | [YouTube] [Bilibili]
Sergi Caelles, Google Research | Video Object Segmentation for Video Conferencing | [YouTube] [Bilibili]
Lexing Xie, ANU | Image Captioning with Knowledge and Style | [YouTube] [Bilibili]

Session 1 Challenge Results (Live Zoom Link)

US Western | US Eastern | UK | Beijing | Speaker | Topic
6:20 - 6:30 | 9:20 - 9:30 | 14:20 - 14:30 | 21:20 - 21:30 | Chairs | Opening remarks
6:30 - 6:45 | 9:30 - 9:45 | 14:30 - 14:45 | 21:30 - 21:45 | Alibaba DAMO Academy | Track 1: Challenge Winner
6:45 - 7:00 | 9:45 - 10:00 | 14:45 - 15:00 | 21:45 - 22:00 | Bytedance | Track 2: Challenge Winner

Session 2 Invited Speakers Q&A and Panel (Live Zoom Link)

US Western | US Eastern | UK | Beijing | Participants | Topic
7:00 - 9:00 | 10:00 - 12:00 | 15:00 - 17:00 | 22:00 - 24:00 | Invited Speakers & Chairs | Q&A and Panel Discussion


CV/AI techniques are quickly taking a central role in driving this growth, enabling video conferencing applications that deliver more natural, contextual, and relevant meeting experiences. For example, high-quality video matting and synthesis are crucial to the now-essential virtual background feature; gaze correction and gesture tracking can deepen interactive user engagement; and automatic color and light correction can improve the user's visual appearance and self-image. All of these must be backed by efficient video compression/transmission and edge processing, which can likewise benefit from recent AI advances. These challenges have drawn increasing R&D attention; for example, NVIDIA recently released a fully accelerated platform for building video conferencing services with many advanced AI features.
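To make the virtual-background use case above concrete, here is a minimal sketch of the final compositing step, assuming a matting model has already produced a per-pixel alpha matte in [0, 1] (the model itself is outside this sketch; all arrays here are synthetic placeholders):

```python
import numpy as np

def composite(frame, background, alpha):
    """Blend a camera frame over a virtual background using an alpha matte.

    frame, background: (H, W, 3) float arrays; alpha: (H, W) in [0, 1].
    """
    alpha = alpha[..., np.newaxis]  # broadcast the matte over RGB channels
    return alpha * frame + (1.0 - alpha) * background

# Toy 2x2 example: a white "foreground" over a black virtual background,
# with one soft-edge pixel (alpha = 0.5) as matting models typically produce.
frame = np.ones((2, 2, 3))
background = np.zeros((2, 2, 3))
alpha = np.array([[1.0, 0.5],
                  [0.0, 1.0]])

out = composite(frame, background, alpha)
```

The soft (fractional) alpha values are what distinguish matting from hard segmentation and are what make hair and motion-blurred edges look natural in the composited output.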
While AI-based video collaboration appears to be entering mainstream adoption, we recognize that building the next-generation video conferencing system poses manifold interdisciplinary challenges and faces many technical gaps to close. Centered on this theme, this workshop aims to provide the first comprehensive forum for CVPR researchers to systematically discuss the relevant techniques that we, as a community, can contribute to. Examples include, but are not limited to:

  • Image display and quality enhancement for teleconferencing
  • Video compression and transmission for teleconferencing
  • Video object segmentation, matting and synthesis (for virtual background, etc.)
  • HCI (gesture recognition, head tracking, gaze tracking, etc.), AR and VR applications in video conferencing
  • Efficient video processing on the edge and IoT camera devices
  • Multi-modal information processing and fusion in video conferencing (audio transcription, image to text, video captioning, etc.)
  • Societal and ethical aspects: privacy intrusion & protection, attention engagement, fatigue avoidance, etc.
  • Emerging Applications where video conferencing would be the cornerstone: remote education, telemedicine, etc.
... and many more interesting features.
We aim to collectively address a core question: which CV techniques are, or will be, ready for the next-generation video conference, and how will they fundamentally change the experience of remote work, education, and more? We will bring together experts from interdisciplinary fields to discuss recent advances on these topics and to explore new directions. As one expected outcome, the workshop will produce a joint report defining the key CV problems, characterizing the technical demands and barriers, and discussing potential solutions.