Posts by Collection

Coursework

Highlights

Publications

LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

Published in IEEE F&G, 2019



We present a naturally-distributed large-scale benchmark for lip reading in the wild, named LRW-1000, which contains 1000 classes with 745,187 samples from more than 2000 individual speakers. To the best of our knowledge, it is the largest word-level lipreading dataset and the only public large-scale Mandarin lipreading dataset.

Recommended citation: Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., Long, K., Shan, S. and Chen, X., 2019. LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild. International Conference on Automatic Face and Gesture Recognition, 2019 (Oral). https://arxiv.org/pdf/1810.06990.pdf

Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition

Published in IEEE F&G, 2020



We revisit a fundamental yet somewhat overlooked problem in visual speech recognition: can VSR models benefit from reading speech in extraoral facial regions, i.e., beyond the lips? Through experiments on three benchmarks, we find that incorporating information from extraoral facial regions is consistently beneficial, even when using only the upper face. We also introduce a simple yet effective method based on Cutout to learn more discriminative features for face-based VSR, which shows clear improvements over existing state-of-the-art methods that use the lip region as input.

Recommended citation: Zhang, Y., Yang, S., Xiao, J., Shan, S. and Chen, X., 2020. Can We Read Speech Beyond the Lips? Rethinking RoI Selection for Deep Visual Speech Recognition. International Conference on Automatic Face and Gesture Recognition, 2020 (Oral). https://arxiv.org/pdf/2003.03206.pdf
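The Cutout-based idea mentioned above can be illustrated with a minimal sketch: zero out a randomly placed patch at the same location in every frame of a clip, so the model cannot over-rely on any single facial region. This is a hypothetical NumPy illustration of generic Cutout augmentation, not the paper's actual implementation; the function name, patch size, and masking policy are all assumptions.

```python
import numpy as np

def cutout_clip(frames, mask_size=20, rng=None):
    """Zero a random square patch, shared across all frames of a clip.

    A minimal sketch of Cutout-style augmentation for video input
    (frames: array of shape [T, H, W] or [T, H, W, C]); illustrative
    only, not the method's exact implementation from the paper.
    """
    rng = rng or np.random.default_rng()
    t, h, w = frames.shape[:3]
    # Sample the patch centre, then clip the patch to the frame bounds.
    cy = int(rng.integers(0, h))
    cx = int(rng.integers(0, w))
    y0, y1 = max(0, cy - mask_size // 2), min(h, cy + mask_size // 2)
    x0, x1 = max(0, cx - mask_size // 2), min(w, cx + mask_size // 2)
    out = frames.copy()
    out[:, y0:y1, x0:x1] = 0  # same patch location in every frame
    return out
```

Sharing the patch location across frames keeps the occlusion temporally consistent, which is what forces the model to read complementary cues from the unmasked facial regions.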

UniCon: Unified Context Network for Robust Active Speaker Detection

Published in ACM Multimedia 2021, 2021



We propose a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate’s pre-cropped face track separately and do not sufficiently consider relationships among the candidates. This potentially limits performance, especially in challenging scenarios with low-resolution faces, multiple candidates, etc. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate’s face, relational context to capture the visual relationships among the candidates and contrast audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in a unified process for robust and reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state-of-the-art by a large margin of about 15% absolute mean Average Precision (mAP) on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Overall, UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, the first result to surpass 90% on this challenging dataset at the time of submission.

Recommended citation: Zhang, Y., Liang, S., Liu, X., Wu, Z., Shan, S. and Chen, X., 2021. UniCon: Unified Context Network for Robust Active Speaker Detection. 29th ACM International Conference on Multimedia (ACM Multimedia 2021), Chengdu, China, October 20-24, 2021 (Oral). https://arxiv.org/pdf/2108.02607

Updates

June 2017

  • Sparse Matrix Library and Multi-key Sorting Demo

December 2017

  • Ship Identification with General-Purpose Object Detectors

Projects

A list of projects I am currently working on or have finished.

Translations

A list of the materials I am currently translating.