laion400m-met-release (1265 files)
laion400m-embeddings/images/img_emb_268.npy | 1.02GB |
laion400m-embeddings/images/img_emb_267.npy | 1.02GB |
laion400m-embeddings/images/img_emb_266.npy | 1.02GB |
laion400m-embeddings/images/img_emb_265.npy | 1.02GB |
laion400m-embeddings/images/img_emb_264.npy | 1.02GB |
laion400m-embeddings/images/img_emb_263.npy | 1.02GB |
laion400m-embeddings/images/img_emb_262.npy | 1.02GB |
laion400m-embeddings/images/img_emb_261.npy | 1.02GB |
laion400m-embeddings/images/img_emb_260.npy | 1.02GB |
laion400m-embeddings/images/img_emb_26.npy | 1.02GB |
laion400m-embeddings/images/img_emb_259.npy | 1.02GB |
laion400m-embeddings/images/img_emb_258.npy | 1.02GB |
laion400m-embeddings/images/img_emb_257.npy | 1.02GB |
laion400m-embeddings/images/img_emb_256.npy | 1.02GB |
laion400m-embeddings/images/img_emb_255.npy | 1.02GB |
laion400m-embeddings/images/img_emb_254.npy | 1.02GB |
laion400m-embeddings/images/img_emb_253.npy | 1.02GB |
laion400m-embeddings/images/img_emb_252.npy | 1.02GB |
laion400m-embeddings/images/img_emb_251.npy | 1.02GB |
laion400m-embeddings/images/img_emb_250.npy | 1.02GB |
laion400m-embeddings/images/img_emb_25.npy | 1.02GB |
laion400m-embeddings/images/img_emb_249.npy | 1.02GB |
laion400m-embeddings/images/img_emb_248.npy | 1.02GB |
laion400m-embeddings/images/img_emb_247.npy | 1.02GB |
laion400m-embeddings/images/img_emb_246.npy | 1.02GB |
laion400m-embeddings/images/img_emb_245.npy | 1.02GB |
laion400m-embeddings/images/img_emb_244.npy | 1.02GB |
laion400m-embeddings/images/img_emb_243.npy | 1.02GB |
laion400m-embeddings/images/img_emb_242.npy | 1.02GB |
laion400m-embeddings/images/img_emb_241.npy | 1.02GB |
laion400m-embeddings/images/img_emb_240.npy | 1.02GB |
laion400m-embeddings/images/img_emb_24.npy | 1.02GB |
laion400m-embeddings/images/img_emb_239.npy | 1.02GB |
laion400m-embeddings/images/img_emb_238.npy | 1.02GB |
laion400m-embeddings/images/img_emb_237.npy | 1.02GB |
laion400m-embeddings/images/img_emb_236.npy | 1.02GB |
laion400m-embeddings/images/img_emb_235.npy | 1.02GB |
laion400m-embeddings/images/img_emb_234.npy | 1.02GB |
laion400m-embeddings/images/img_emb_233.npy | 1.02GB |
laion400m-embeddings/images/img_emb_232.npy | 1.02GB |
laion400m-embeddings/images/img_emb_231.npy | 1.02GB |
laion400m-embeddings/images/img_emb_230.npy | 1.02GB |
laion400m-embeddings/images/img_emb_23.npy | 1.02GB |
laion400m-embeddings/images/img_emb_229.npy | 1.02GB |
laion400m-embeddings/images/img_emb_228.npy | 1.02GB |
laion400m-embeddings/images/img_emb_227.npy | 1.02GB |
laion400m-embeddings/images/img_emb_226.npy | 1.02GB |
laion400m-embeddings/images/img_emb_225.npy | 1.02GB |
laion400m-embeddings/images/img_emb_224.npy | 1.02GB |
Type: Dataset
Tags:
Bibtex:
@article{, title= {LAION-400-MILLION OPEN DATASET}, journal= {}, author= {}, year= {}, url= {https://laion.ai/laion-400-open-dataset/}, abstract= {LAION-400M: The world’s largest openly available image-text-pair dataset with 400 million samples.

# Concept and Content
The LAION-400M dataset is completely open and freely accessible. All images and texts in the LAION-400M dataset have been filtered with OpenAI’s CLIP by calculating the cosine similarity between the text and image embeddings and dropping those with a similarity below 0.3. The threshold of 0.3 was determined through human evaluations and seems to be a good heuristic for estimating semantic image-text-content matching. The image-text pairs were extracted from the Common Crawl web data dump and come from random web pages crawled between 2014 and 2021.

# Download Information
You can find:
- The CLIP image embeddings (NumPy files)
- The parquet files
- KNN index of image embeddings

# LAION-400M Dataset Statistics
LAION-400M and future, even larger datasets are in fact datasets of datasets. For instance, LAION-400M can be filtered by image size into smaller datasets like this:
```
Number of unique samples               413M
Number with height or width >= 1024    26M
Number with height and width >= 1024   9.6M
Number with height or width >= 512     112M
Number with height and width >= 512    67M
Number with height or width >= 256     268M
Number with height and width >= 256    211M
```
Using the KNN index, specialized datasets can also be extracted by domain of interest. They are (or will be) sufficient in size to train domain-specialized models.

# Disclaimer & Content Warning
Our filtering protocol only removed NSFW images that were detected as illegal, but the dataset still has NSFW content, marked accordingly in the metadata. Please use the demo links with caution. You can extract a “safe” subset by filtering out samples marked as NSFW or via stricter CLIP filtering.

There is a certain degree of duplication because we used URL+text as the deduplication criterion. The same image with the same caption may sit at different URLs, causing duplicates. The same image with different captions is not, however, considered duplicated. Using KNN clustering should make it easy to further deduplicate by image content.

# License
We are distributing the metadata dataset (the parquet files) under the most open Creative Commons license, CC-BY 4.0. It poses no particular restriction. The images are under their own copyright.}, keywords= {}, terms= {}, license= {https://creativecommons.org/licenses/by/4.0/}, superseded= {} }
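The CLIP filtering rule described in the abstract (drop pairs whose image and text embeddings have a cosine similarity below 0.3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline LAION actually ran: the names `cosine_similarity` and `filter_pairs` are hypothetical, and the toy 3-dimensional vectors stand in for real CLIP embeddings. Only the 0.3 threshold comes from the description above.

```python
import math

SIMILARITY_THRESHOLD = 0.3  # value stated in the dataset description


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=SIMILARITY_THRESHOLD):
    """Keep (image_emb, text_emb) pairs whose similarity is not below threshold."""
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= threshold]


# Toy vectors standing in for real CLIP embeddings (hypothetical data).
toy_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction, similarity 1.0
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal, similarity 0.0
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # similarity about 0.707
]
kept = filter_pairs(toy_pairs)
print(len(kept))  # prints 2: the orthogonal pair is dropped
```

Raising `threshold` above 0.3 gives the "stricter CLIP filtering" that the disclaimer mentions as one way to obtain a smaller, cleaner subset.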