Hello dear Wikidata enthusiasts,
I'm one of the authors of a recent paper where we made heavy use of Wikidata to create a large-scale image-text dataset for training CLIP vision-language models. I thought this list might find the approach interesting.
What we did: We used SPARQL queries over the subclass-of (P279) and parent-taxon (P171) hierarchies to extract ~135k visual entities from 21 super-entity categories (animals, plants, tools, buildings, etc.). For each entity we collected its name, description, aliases, and sitelink count. We then extracted attributes from Wikidata related to color, partonomy, behavior, and other aspects. These entities and attributes were used to generate search queries for downloading images from the web.
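To give a feel for the kind of extraction described above, here is a minimal sketch. It is not the query or code from the paper: the function names `build_entity_query` and `build_search_queries`, the query layout, and the way names and attributes are joined into search strings are all assumptions made for illustration. The prefixes used (`wd:`, `wdt:`, `wikibase:`, `schema:`, `skos:`) are the ones predefined on the Wikidata Query Service.

```python
from textwrap import dedent


def build_entity_query(root_qid: str, limit: int = 100, lang: str = "en") -> str:
    """Build a SPARQL query for all items below `root_qid`.

    The property path follows subclass-of (P279) and parent-taxon (P171)
    edges transitively. Hypothetical helper, not the paper's actual query.
    """
    if not (root_qid.startswith("Q") and root_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata QID: {root_qid!r}")
    return dedent(f"""\
        SELECT ?item ?label ?description ?sitelinks
               (GROUP_CONCAT(DISTINCT ?alias; separator="|") AS ?aliases)
        WHERE {{
          ?item (wdt:P279|wdt:P171)* wd:{root_qid} ;
                wikibase:sitelinks ?sitelinks .
          OPTIONAL {{ ?item rdfs:label ?label . FILTER(LANG(?label) = "{lang}") }}
          OPTIONAL {{ ?item schema:description ?description . FILTER(LANG(?description) = "{lang}") }}
          OPTIONAL {{ ?item skos:altLabel ?alias . FILTER(LANG(?alias) = "{lang}") }}
        }}
        GROUP BY ?item ?label ?description ?sitelinks
        LIMIT {limit}
        """)


def build_search_queries(name: str, aliases: list[str], attributes: list[str]) -> list[str]:
    """Combine entity names with attribute phrases into web search strings.

    The simple "<name> <attribute>" template is an assumption for illustration.
    """
    names = [name, *aliases]
    queries = list(names)  # plain name and alias queries first
    for n in names:
        for attr in attributes:
            queries.append(f"{n} {attr}")
    return queries


if __name__ == "__main__":
    print(build_entity_query("Q729", limit=10))
    print(build_search_queries("red fox", ["Vulpes vulpes"], ["tail"]))
```

The example root `Q729` should be the item for "animal", but please check the QID before relying on it. On the public endpoint at https://query.wikidata.org/sparql a transitive query over a large root can time out, so a small `LIMIT` or a narrower root is advisable when experimenting.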
The resulting EntityNet dataset contains 33M images, 46M text descriptions, and 613k text labels linked back to Wikidata entities. The dataset and trained models are openly available on HuggingFace: https://huggingface.co/datasets/lmb-freiburg/entitynet
A key finding was that training on a mix of Wikidata's structured entity information and noisy web alt-texts works better than either alone. The knowledge graph metadata genuinely improves model quality.
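As a rough illustration of what "a mix" can mean in practice, one simple option is to choose, per image, between a Wikidata-derived text and the web alt-text. This is only a sketch under that assumption: the function `pick_caption`, the parameter `p_entity`, and the 50/50 default are hypothetical and are not taken from the paper, which describes the actual training recipe.

```python
import random


def pick_caption(entity_texts, alt_text, p_entity=0.5, rng=random):
    """Return one training caption for an image.

    With probability `p_entity`, pick one of the Wikidata-derived texts;
    otherwise use the web alt-text. If the alt-text is missing, fall back
    to a Wikidata-derived text whenever one exists.
    """
    if entity_texts and (not alt_text or rng.random() < p_entity):
        return rng.choice(entity_texts)
    return alt_text


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducible sampling
    texts = ["red fox, a species of mammal", "Vulpes vulpes"]
    for _ in range(3):
        print(pick_caption(texts, "fox in the snow, photo", rng=rng))
```

Setting `p_entity` to 0.0 or 1.0 recovers the two single-source baselines, which makes it easy to compare the mix against either source alone.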
Paper: https://arxiv.org/abs/2505.02746 | Code: https://github.com/lmb-freiburg/entitynet
Happy to answer any questions about how we used Wikidata or discuss potential improvements to the entity extraction.
Hi Simon,
Incredible! This looks like a great - maybe even groundbreaking - demonstration of how training on structured data (and Wikidata specifically) can make AI models (in this case, CLIP models) "smarter" than the standard use of just massive amounts of unstructured text - at (if I understand the paper correctly) maybe 1/1000th the cost.
An obvious question is: are you thinking of expanding EntityNet to cover anything besides living organisms?
Also - have you considered using images from Wikimedia Commons instead of, or in addition to, just the open web? These images also have structured data around them (both categories and, to a much lesser extent, semantic triples) instead of "noisy alt-texts". And of course it gets around the copyright issues that theoretically plague AI models.
-Yaron
On Thu, Apr 30, 2026 at 2:40 AM Simon Ging via Wikidata <wikidata@lists.wikimedia.org> wrote:
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/mes...
To unsubscribe send an email to wikidata-leave@lists.wikimedia.org
Hi Yaron,
Thanks for the kind words!
> are you thinking of expanding EntityNet to cover anything besides living organisms?
Just to clarify — EntityNet already covers a lot more than living organisms. The 21 super-entity categories are:
product, substance, physical tool, animal, plant, material, vehicle, geographical feature, food, architectural structure, anatomical structure, facility, physical activity, clothing, building, musical instrument, organ, furniture, body of water, weather, precipitation
Living organisms (here categorized as animal and plant) are just some of them, though they are where we ran our most detailed expert-domain evaluation (e.g. iNaturalist benchmarks). The full coverage spans natural and man-made objects pretty broadly. You can browse the 135k entities and their parent relations here:
https://huggingface.co/datasets/lmb-freiburg/entitynet/blob/main/entitynet-e...
> have you considered using images from Wikimedia Commons instead of, or in addition to, just the open web?
Yes, we did consider it, and you're right that the structured metadata (categories, depicts statements, etc.) would be a much cleaner training signal than alt-text scraped from random web pages. The reason we ultimately didn't go that route was uncertainty about scale. We ended up with ~33M unique images from open web search, and we weren't sure how many we'd get from Commons. Proper CLIP training from scratch requires at least tens of millions of image-text pairs.
Cheers,
Simon
On 30/04/2026 18:18, Yaron Koren wrote: