Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20220301 full revision history content run.
We are currently dumping 945 projects in total.
---------------------
Stats for csbwiktionary on date 20220301
Total size of page content dump files for articles, current content only:
2,005,634 bytes
Total size of page content dump files for all pages, current content only:
2,900,582 bytes
Total size of page content dump files for all pages, all revisions:
46,688,480 bytes
---------------------
Stats for enwiki on date 20220301
Total size of page content dump files for articles, current content only:
88,064,759,123 bytes
Total size of page content dump files for all pages, current content only:
193,569,131,075 bytes
Total size of page content dump files for all pages, all revisions:
24,339,335,307,700 bytes
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello data dump enthusiasts,
I've been working on an API (written in Python) for downloading Wikimedia data dump files, and I've just released its first version here: https://github.com/jon-edward/wiki_dump.
I'd love to know what you all think (how it could be improved, how you would be interested in using it, positive or negative remarks) if you can spare the time.
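For context, this is roughly the kind of task the library is meant to simplify; without it, fetching a single dump file looks something like the sketch below (plain requests; the enwiki pages-articles URL is just an example, not wiki_dump's own API):

    import requests

    # Sketch: fetch one dump file straight from the public mirror; the
    # "latest" URL pattern and file name are just an example.
    url = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
            for chunk in response.iter_content(chunk_size=1 << 20):
                out.write(chunk)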
All the best,
jon-edward
Hello.
I am doing some conversions for aarddict (https://aarddict.org/), an
offline Wikipedia and Wiktionary app. I use mw2slob and the NS0 files
found at https://dumps.wikimedia.org/other/enterprise_html/runs/ for
these conversions.
But in the Spanish Wikipedia, for example, the article
https://es.wikipedia.org/wiki/Anexo:Aves_de_Canarias does not seem to
be part of the tar.gz file.
And in the French Wiktionary, the article
https://fr.wiktionary.org/wiki/Conjugaison:espagnol/aumentar is also
missing from the respective tar.gz file.
Can they be found somewhere else, perhaps in NS6 or NS14? It seems to
me that articles/pages whose titles have a colon prefix, like Anexo: or
Conjugaison:, are not included. But why? And where could I find them?
Are they too big, or what is the reasoning for not including them?
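For reference, one way to check which namespace a title belongs to is
the standard MediaWiki query API; a quick Python sketch (requests is
assumed, with Anexo:Aves_de_Canarias as the example title):

    import requests

    # Sketch: ask the Spanish Wikipedia which namespace a title is in;
    # only namespace-0 pages would end up in an NS0 file, if I read
    # the file naming correctly.
    params = {
        "action": "query",
        "titles": "Anexo:Aves de Canarias",
        "format": "json",
    }
    resp = requests.get("https://es.wikipedia.org/w/api.php",
                        params=params, timeout=30)
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        print(page["title"], "is in namespace", page["ns"])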
Regards,
Erik
Hello
I'm looking through the dump files and am not sure 'what contains what'. Maybe there's a descriptive page that I've missed somewhere?
I'd like XML or HTML, without images, to build a crawl of UK local elections, either via keywords or by simulating a web crawler (or a mixture of both: some pruning, then crawling).
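For the keyword side, my rough idea is something like the following
Python sketch over a pages-articles XML dump (the file name and the
keywords are just placeholders):

    import bz2
    import xml.etree.ElementTree as ET

    # Sketch: stream a pages-articles dump and print pages whose text
    # mentions a keyword. Element tags carry the MediaWiki export XML
    # namespace, so match on the local tag name only.
    KEYWORDS = ("local election", "borough council")  # placeholders
    DUMP = "enwiki-latest-pages-articles.xml.bz2"     # placeholder

    def local(tag):
        return tag.rsplit("}", 1)[-1]

    with bz2.open(DUMP, "rb") as dump:
        for _event, elem in ET.iterparse(dump):
            if local(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                if local(child.tag) == "title":
                    title = child.text or ""
                elif local(child.tag) == "text":
                    text = child.text or ""
            if any(keyword in text.lower() for keyword in KEYWORDS):
                print(title)
            elem.clear()  # free the page subtree as we go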
Sorry about the question. Best regards, Hugh Barnard
---------
https://www.hughbarnard.org
Twitter: @hughbarnard
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20220201 full revision history content run.
We are currently dumping 945 projects in total.
---------------------
Stats for thwikibooks on date 20220201
Total size of page content dump files for articles, current content only:
29,982,138 bytes
Total size of page content dump files for all pages, current content only:
42,139,996 bytes
Total size of page content dump files for all pages, all revisions:
363,425,740 bytes
---------------------
Stats for enwiki on date 20220201
Total size of page content dump files for articles, current content only:
87,651,925,597 bytes
Total size of page content dump files for all pages, current content only:
192,724,767,532 bytes
Total size of page content dump files for all pages, all revisions:
24,174,134,255,220 bytes
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hi!
I am trying to find a dump of all imageinfo data [1] for all files on
Commons. I thought that the "Articles, templates, media/file descriptions,
and primary meta-pages" XML dump would contain that, given the
"media/file descriptions" part, but it seems this is not the case. Is
there a dump which contains that information? And what is "media/file
descriptions" then? Wiki pages of files?
[1] https://www.mediawiki.org/wiki/API:Imageinfo
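For comparison, this is the per-file query [1] whose output I would
like to have in bulk; a Python sketch (File:Example.jpg is just a
placeholder title):

    import requests

    # Sketch: fetch imageinfo for a single Commons file via the API.
    params = {
        "action": "query",
        "titles": "File:Example.jpg",
        "prop": "imageinfo",
        "iiprop": "url|size|mime|sha1",
        "format": "json",
    }
    resp = requests.get("https://commons.wikimedia.org/w/api.php",
                        params=params, timeout=30)
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        for info in page.get("imageinfo", []):
            print(info["url"], info["size"], info["mime"], info["sha1"])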
Mitar
--
http://mitar.tnode.com/
https://twitter.com/mitar_m
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20220101 full revision history content run.
We are currently dumping 945 projects in total.
---------------------
Stats for nnwiki on date 20220101
Total size of page content dump files for articles, current content only:
683,266,535 bytes
Total size of page content dump files for all pages, current content only:
743,779,759 bytes
Total size of page content dump files for all pages, all revisions:
15,451,351,380 bytes
---------------------
Stats for enwiki on date 20220101
Total size of page content dump files for articles, current content only:
87,092,585,288 bytes
Total size of page content dump files for all pages, current content only:
191,642,115,143 bytes
Total size of page content dump files for all pages, all revisions:
23,994,550,626,483 bytes
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hi!
I just published the first version of a Go package which provides
utilities for processing
Wikidata entities JSON dumps and Wikimedia Enterprise HTML dumps. It
processes them in parallel on multiple cores, so processing is rather
fast. I hope it will be useful to others, too.
https://gitlab.com/tozd/go/mediawiki
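For anyone unfamiliar with the input, the entities dump is a single
JSON array with one entity object per line; reading that format
directly (a Python sketch, not using the package) looks roughly like:

    import bz2
    import json

    # Sketch of the input format only (not of the Go package): the
    # entities dump is one large JSON array, one entity per line.
    with bz2.open("latest-all.json.bz2", "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            label = entity.get("labels", {}).get("en", {}).get("value", "")
            print(entity["id"], label)
            break  # first entity only, as a smoke test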
Any feedback is welcome.
Mitar
--
http://mitar.tnode.com/
https://twitter.com/mitar_m
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20211201 full revision history content run.
We are currently dumping 945 projects in total.
---------------------
Stats for tlhwiktionary on date --
Total size of page content dump files for articles, current content only:
No info available (no full run date found)
Total size of page content dump files for all pages, current content only:
No info available (no full run date found)
Total size of page content dump files for all pages, all revisions:
No info available (no full run date found)
---------------------
Stats for enwiki on date 20211201
Total size of page content dump files for articles, current content only:
86,591,331,146 bytes
Total size of page content dump files for all pages, current content only:
190,740,317,768 bytes
Total size of page content dump files for all pages, all revisions:
23,832,535,435,128 bytes
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
[my first post to this list - hello, everyone]
In conversation elsewhere, it was mentioned that the XML dumps for
Wikispecies don't include the Wikidata link (QID) of the corresponding
entries there. Is that so? Is it the same for Wikipedia's dumps?
If so, how feasible would it be to add QIDs to the dumps?
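For clarity, by "Wikidata link" I mean what the API exposes per page
as the wikibase_item page prop; a Python sketch (Panthera leo is an
arbitrary example title):

    import requests

    # Sketch: look up the connected Wikidata item (QID) for one
    # Wikispecies page via prop=pageprops.
    params = {
        "action": "query",
        "titles": "Panthera leo",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "format": "json",
    }
    resp = requests.get("https://species.wikimedia.org/w/api.php",
                        params=params, timeout=30)
    resp.raise_for_status()
    for page in resp.json()["query"]["pages"].values():
        print(page["title"], page.get("pageprops", {}).get("wikibase_item"))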
--
Andy Mabbett
@pigsonthewing
http://pigsonthewing.org.uk