Hiya,
I have a question about the Wikipedia XML database dump. Apologies if this
isn't the appropriate place to ask.
The Wikipedia page below gives the current number of articles in English
as 6,144,248:
https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
However, when I count the number of page elements in a recent dump
(excluding redirects), I get about 10 million.
I was just wondering what the reason for this might be?
Thank you in advance
--
*Yuki Kumagai*
Senior Engineer
CognitionX <https://cognitionx.com/>
Driving the acceleration and responsible deployment of AI
Stay up-to-date with our daily All Things AI
<https://confirmsubscription.com/h/d/13A269E463396CB2> newsletter
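One way to investigate a gap like this is to break the page count down by
the <ns> value of each page. The pages-articles dumps also contain pages
outside the main namespace (templates, categories and so on), so a raw count
of page elements is likely to be higher than the on-wiki article count. The
sketch below is only an illustration under stated assumptions: the function
names and the tiny SAMPLE document are made up for this example, and it
assumes each page element carries an <ns> child and that redirects carry a
<redirect> child, as in the current export schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up miniature dump used only to exercise the function below.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Foo</title><ns>0</ns><id>1</id></page>
  <page><title>Bar</title><ns>0</ns><id>2</id><redirect title="Foo" /></page>
  <page><title>Template:Baz</title><ns>10</ns><id>3</id></page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace-uri}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def count_pages_by_namespace(source):
    """Count non-redirect <page> elements, keyed by their <ns> value.

    `source` is a file name or a binary file-like object.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        ns = None
        is_redirect = False
        for child in elem:
            name = local_name(child.tag)
            if name == "ns":
                ns = int(child.text)
            elif name == "redirect":
                is_redirect = True
        if not is_redirect:
            counts[ns] += 1
        # Drop the page's children so a large dump does not sit in memory.
        elem.clear()
    return counts


if __name__ == "__main__":
    for ns, total in sorted(count_pages_by_namespace(io.BytesIO(SAMPLE)).items()):
        print(f"ns {ns}: {total:,} non-redirect pages")
```

On a real dump, comparing the ns 0 figure with the total should show how much
of the difference comes from non-article namespaces. Any remaining gap would
need a closer look at how the wiki itself defines an article.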
Hi!
I am currently working on a dump search and download tool for all Wikimedia
wikis. In order to find out which Wikimedia wikis exist I used Wikidata.
While comparing the list of wikis from Wikidata with the list of dumped
projects I found out that the following wikis are currently not being
dumped:
- alswikibooks (last dump 20180101)
- alswikiquote (last dump 20180101)
- alswiktionary (last dump 20180101)
- ecwikimedia (never dumped, private but not marked private in Wikidata?)
- fixcopyrightwiki (last dump 20200220)
- labswiki (never dumped?)
- labtestwiki (never dumped?)
- mowiki (last dump 20180101)
- mowiktionary (last dump 20180101)
- ru_sibwiki (last dump 20071011)
- ukwikiversity (never dumped?)
Is there an up-to-date, machine-readable list of currently dumped wikis
besides https://dumps.wikimedia.org/backup-index.html?
(Off-topic) Spoiler for dump searching tool on my laptop:
$ target/release/wdgrep "asdfdefased" /c/Users/xyz/wpdumps/dewiki-20200701-pages-articles-multistream.xml -v --ns 0
Searched 21437.064 MiB in 8.467969 seconds (2531.5474 MiB/s).
Best regards,
Count Count
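The comparison described above (wikis known from Wikidata versus wikis that
appear in the dump index) can be sketched roughly as follows. This is a
minimal illustration, not the tool mentioned in the message. It assumes the
index page links to each run with an href of the form "<wikiname>/<YYYYMMDD>";
SAMPLE_INDEX, DumpIndexParser, dumped_wikis and not_dumped are hypothetical
names and data invented for the example.

```python
import re
from html.parser import HTMLParser

# Made-up excerpt in the assumed shape of the index page.
SAMPLE_INDEX = """
<ul>
<li>2020-08-01 <a href="enwiki/20200801">enwiki</a>: Dump complete</li>
<li>2020-08-01 <a href="dewiki/20200801">dewiki</a>: Dump complete</li>
<li><a href="https://example.org/about">About</a></li>
</ul>
"""


class DumpIndexParser(HTMLParser):
    """Collect {wiki name: run date} from links that look like 'name/YYYYMMDD'."""

    def __init__(self):
        super().__init__()
        self.latest = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = re.fullmatch(r"(\w+)/(\d{8})/?", href)
        if match:
            self.latest[match.group(1)] = match.group(2)


def dumped_wikis(index_html):
    """Return a dict mapping each wiki found in the index HTML to its run date."""
    parser = DumpIndexParser()
    parser.feed(index_html)
    parser.close()
    return parser.latest


def not_dumped(known_wikis, index_html):
    """Return the sorted names in known_wikis that do not appear in the index."""
    return sorted(set(known_wikis) - set(dumped_wikis(index_html)))


if __name__ == "__main__":
    print(not_dumped(["enwiki", "dewiki", "mowiki"], SAMPLE_INDEX))
```

Scraping HTML is fragile, so a machine-readable source would be preferable
if one is available.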
Hi,
If you don't mind, please, starting next time, insert commas into those
huge counts. Without commas they are VERY difficult to read.
Thanks!
Sincerely,
Todd Shandelman
Austin, TX
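For anyone post-processing the figures themselves, adding thousands
separators is a one-liner in most languages. A minimal Python sketch follows;
with_commas is a hypothetical helper name, and the sample values are the
enwiki figures from the 20200701 update.

```python
def with_commas(n):
    """Format an integer with comma thousands separators."""
    return f"{n:,}"


if __name__ == "__main__":
    for size in (78326324425, 173926604054, 21045320844828):
        print(with_commas(size))
```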
On Sun, Aug 2, 2020, 07:01 <xmldatadumps-l-request(a)lists.wikimedia.org>
wrote:
> Message: 1
> Date: Sat, 01 Aug 2020 16:07:36 +0000
> From: noreply.xmldatadumps(a)wikimedia.org
> To: xmldatadumps-l(a)lists.wikimedia.org
> Subject: [Xmldatadumps-l] XML Dumps FAQ monthly update
> Message-ID: <20200801160736.AneN_%noreply.xmldatadumps(a)wikimedia.org>
>
>
> Greetings XML Dump users and contributors!
>
> This is your automatic monthly Dumps FAQ update email. This update
> contains figures for the 20200701 full revision history content run.
>
> We are currently dumping 916 projects in total.
>
>
> ---------------------
> Stats for lmowiki on date 20200701
>
> Total size of page content dump files for articles, current content only:
> 151410097
>
> Total size of page content dump files for all pages, current content only:
> 179774126
>
> Total size of page content dump files for all pages, all revisions:
> 3555369968
> ---------------------
> Stats for enwiki on date 20200701
>
> Total size of page content dump files for articles, current content only:
> 78326324425
>
> Total size of page content dump files for all pages, current content only:
> 173926604054
>
> Total size of page content dump files for all pages, all revisions:
> 21045320844828
> ---------------------
>
>
> Sincerely,
>
> Your friendly Wikimedia Dump Info Collector
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20200701 full revision history content run.
We are currently dumping 916 projects in total.
---------------------
Stats for lmowiki on date 20200701
Total size of page content dump files for articles, current content only:
151410097
Total size of page content dump files for all pages, current content only:
179774126
Total size of page content dump files for all pages, all revisions:
3555369968
---------------------
Stats for enwiki on date 20200701
Total size of page content dump files for articles, current content only:
78326324425
Total size of page content dump files for all pages, current content only:
173926604054
Total size of page content dump files for all pages, all revisions:
21045320844828
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector