| Name | Files | Added | Size | DLs | | |
|---|---|---|---|---|---|---|
| PMC Open Access Subset | 16 | 2020-05-24 | 84.14GB | 329 | 8+ | 1 |
| Synthetic Data for Text Localisation in Natural Images | 15 | 2021-11-15 | 73.50GB | 4,040 | 7 | 3 |
| Microsoft Academic Graph - 2016/02/05 | 1 | 2016-12-25 | 28.94GB | 269 | 3+ | 2 |
| OpenWebText (Gokaslan's distribution, 2019), GPT-2 Tokenized | 395 | 2019-06-01 | 16.02GB | 221 | 5 | 1 |
| Reading Text in the Wild with Convolutional Neural Networks | 1 | 2021-11-12 | 10.68GB | 48,392 | 35 | 1 |
| Enwiki Word2vec model 1000 Dimensions | 1 | 2015-04-09 | 8.63GB | 3,493 | 7 | 0 |
| WMT 2015 French/English parallel texts | 1 | 2018-10-16 | 2.60GB | 2,270 | 1+ | 1 |
| UN corpus - training-parallel-un.tgz (ES-EN, FR-EN) | 1 | 2019-02-04 | 2.37GB | 66 | 1+ | 1 |
| Flickr8k Dataset | 2 | 2019-03-09 | 1.12GB | 15,554 | 15+ | 2 |
| Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN) | 1 | 2019-02-04 | 918.31MB | 122 | 3+ | 1 |
| Amazon reviews - Polarity | 1 | 2018-10-16 | 688.34MB | 1,125 | 3+ | 1 |
| Europarl v7 - training-parallel-europarl-v7.tgz (CS-EN, DE-EN, ES-EN, FR-EN) | 1 | 2019-02-04 | 657.63MB | 53 | 3+ | 1 |
| Amazon reviews - Full | 1 | 2018-10-16 | 643.70MB | 1,353 | 4+ | 1 |
| 30M Factoid Question-Answer Corpus (30MQA) | 2 | 2018-11-29 | 529.34MB | 5,043 | 11+ | 1 |
| Yale YouTube Video Text | 1 | 2014-10-20 | 434.77MB | 8,299 | 7+ | 1 |
| Sogou news | 1 | 2018-10-16 | 384.27MB | 267 | 3+ | 1 |
| Lerman Twitter 2010 Dataset | 3 | 2014-08-15 | 292.17MB | 3,613 | 14+ | 2 |
| Structured Web Data Extraction Dataset (SWDE) | 1 | 2015-11-29 | 207.31MB | 3,933 | 8 | 1 |
| MovieLens 20M Dataset | 1 | 2016-12-16 | 198.70MB | 2,237 | 4+ | 1 |
| Yelp reviews - Full | 1 | 2018-10-16 | 196.15MB | 413 | 3+ | 1 |
| Wikitext-103 | 1 | 2018-10-16 | 190.20MB | 1,324 | 5+ | 1 |
| Yelp reviews - Polarity | 1 | 2018-10-16 | 166.37MB | 456 | 3+ | 1 |
| r/WritingPrompts, Text (2018) | 1 | 2019-06-19 | 87.47MB | 431 | 6 | 1 |
| DBPedia ontology | 1 | 2018-10-16 | 68.34MB | 156 | 3+ | 1 |
| Phishing corpus | 4555 | 2019-01-02 | 37.48MB | 1,103 | 6+ | 1 |
| IMDb Large Movie Review Dataset | 1 | 2018-10-16 | 26.40MB | 1,002 | 4+ | 1 |
| AG News | 1 | 2018-10-16 | 11.78MB | 225 | 3+ | 1 |
| Online News Popularity Data Set | 1 | 2016-02-11 | 7.48MB | 3,119 | 6+ | 1 |
| Wikitext-2 | 1 | 2018-10-16 | 4.07MB | 275 | 4+ | 1 |
| Indiana University - Chest X-Rays (XML Reports) | 1 | 2018-11-22 | 1.11MB | 49,812 | 22+ | 0 |
| SMS Spam Collection Data Set | 2 | 2015-11-28 | 695.38kB | 848 | 11+ | 0 |
| Sentiment Labelled Sentences Data Set | 1 | 2016-08-26 | 512.21kB | 542 | 8+ | 1 |

