Note (10/02/19). The purpose of this page is to share some large-scale datasets that are not easy to obtain (due to YouTube’s tightened limits on systematic crawling, file sizes, download speeds, network restrictions, etc.). More and more datasets seem to be released as lists of YouTube IDs, which makes the material increasingly inaccessible to researchers, especially those in China. Feeling pretty upset about this, and having had quite a few frustrating experiences myself (one of which involved two heavy hard drives being mailed all the way across China!), I’ve decided to share some of my downloads here.

A large-scale ASL dataset from Microsoft that covers over 200 signers, with signer-independent sets, challenging and unconstrained recording conditions, and a large class count of 1000 signs.

Crawled in September 2019. A number of videos were already missing from the validation and training sets as of Oct. 2nd, 2019. A list of the missing videos can be found here.
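For anyone who wants to check their own copy against the official ID list, a missing-video list like the one mentioned above can be produced with a few lines of Python. This is only a minimal sketch: the function name `find_missing`, the one-file-per-video layout, and the `.mp4` suffix are all assumptions for illustration, not a description of how the files in the mirror are actually organised.

```python
from pathlib import Path


def find_missing(expected_ids, video_dir, suffix=".mp4"):
    """Return the expected video IDs that have no matching file in video_dir.

    Assumes (hypothetically) one file per video, named "<id><suffix>".
    """
    # Collect the IDs that are actually present on disk.
    present = {p.stem for p in Path(video_dir).iterdir() if p.suffix == suffix}
    # Anything expected but not present is reported as missing.
    return sorted(set(expected_ids) - present)
```

Calling `find_missing(ids_from_annotation_file, "path/to/videos")` would then return the IDs to write out as the missing list; how the IDs are read from the annotation file depends on the dataset's release format.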

CST Mirror: [Link]

NOTE (09/15/21): IIRC a little birdie told me there is a copy of the full data (though not all videos are saved at the raw resolution) on the RWTH FTP server : )


A large-scale dataset of manually annotated audio events from Google, consisting of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

Crawled from December 2018 to January 2019. Audio tracks only. A number of files are already missing.

CST Mirror: [Balanced Train] [Eval]

[Unbalanced 1-150k] [Unbalanced 150k-300k] [Unbalanced 300k-450k] [Unbalanced 450k-600k] [Unbalanced 600k-750k] [Unbalanced 750k-900k] [Unbalanced 900k-1050k] [Unbalanced 1050k-1200k] [Unbalanced 1200k-1350k] [Unbalanced 1350k-1500k] [Unbalanced 1500k-1650k] [Unbalanced 1650k-1800k] [Unbalanced 1800k-1950k] [Unbalanced 1950k-]

NOTE (09/15/21): Someone has uploaded their own copy to [Baidu Netdisk].


A large-scale, high-quality video recognition dataset from DeepMind with approximately 650,000 video clips that cover 700 human action classes.

Crawled in July 2019. A number of files are already missing.

Validation set: 53G, 34815 of 35000 clips

CST Mirror: [Link]

Test set: 105G, 69456 of 70000 clips

CST Mirror (incomplete, 7000 missing files): [Link]

Training set: N/A
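To put the clip counts above in perspective, here is a small sketch that turns them into a number of missing clips and a retrieval rate. The counts are the ones quoted for the crawl; the helper name `coverage` is just an illustrative choice. Note that these figures describe the crawl itself, and the separate "7000 missing files" remark applies to the incomplete test-set mirror.

```python
def coverage(downloaded, total):
    """Return (number of missing clips, percentage of clips retrieved)."""
    missing = total - downloaded
    return missing, 100.0 * downloaded / total


# Clip counts as quoted above for the crawled validation and test sets.
for name, got, total in [("validation", 34815, 35000), ("test", 69456, 70000)]:
    missing, pct = coverage(got, total)
    print(f"{name}: {missing} missing, {pct:.2f}% retrieved")
# validation: 185 missing, 99.47% retrieved
# test: 544 missing, 99.22% retrieved
```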

NOTE (09/15/21): The CVDF has made a copy available at [AWS]. This would be the ideal way to obtain the dataset now.


A new, large-scale audio-visual dataset from Google comprising speech video clips with no interfering background noise. The segments are 3-10 seconds long, and in each clip the audible sound in the soundtrack belongs to a single speaking person who is visible in the video. In total, the dataset contains roughly 4700 hours of video segments from a total of 290k YouTube videos, spanning a wide variety of people, languages and face poses.

Crawled from November 2018 to December 2018. Many files are corrupted due to an incompatibility between ffmpeg and multiprocessing. Coming soon.

NOTE (09/15/21): A different but nice version has been made available at [AcademicTorrents]. This would be the ideal way to obtain the dataset now.