Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures (sizes in bytes) for the 20240201 full revision history content run.
We are currently dumping 982 projects in total.
---------------------
Stats for itwikisource on date 20240201
Total size of page content dump files for articles, current content only:
1,885,904,585
Total size of page content dump files for all pages, current content only:
1,959,855,466
Total size of page content dump files for all pages, all revisions:
20,508,962,333
---------------------
Stats for enwiki on date 20240201
Total size of page content dump files for articles, current content only:
98,402,833,845
Total size of page content dump files for all pages, current content only:
202,673,661,885
Total size of page content dump files for all pages, all revisions:
28,190,423,319,469
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hey data dump friends! Sorry for the pseudo off-topic post, but I just wanted to let you know that the 2021.01.2 data dump is now located at the south pole of the Moon, preserved for up to 5 billion years in digital form as nickel DVD masters. It was landed by my foundation - www.archmission.org - on the Intuitive Machines IM-1 mission. So we now have a pretty good offsite backup in place :)
Nova Spivack
Hello M.Srilasya,
The XML data dumps of all the Wikipedias are free to download and use as
per the licensing discussed here <https://dumps.wikimedia.org/legal.html>.
So you can just download anything you'd like from the website here:
https://dumps.wikimedia.org/backup-index.html.
If you let me know a specific language you're interested in, I can point
you to the exact download link. But since you asked for a smaller download,
let me offer simplewiki, a smaller English-language wiki written in
"Simple English", yet big enough to be interesting for proof-of-concept
work:
All pages with complete page edit history (.bz2):
- simplewiki-20240201-pages-meta-history.xml.bz2 (2.9 GB)
  <https://dumps.wikimedia.org/simplewiki/20240201/simplewiki-20240201-pages-m…>
All pages, current versions only:
- simplewiki-20240201-pages-meta-current.xml.bz2 (356.7 MB)
  <https://dumps.wikimedia.org/simplewiki/20240201/simplewiki-20240201-pages-m…>
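If storage is tight, you can also stream the download straight into a
decompressor instead of saving the .bz2 archive first. A minimal sketch
(Python 3.8+, standard library only; the URL is assembled from the file
name above and the chunk size is just an example, not an official tool):

import bz2
import urllib.request

# Stream the compressed dump and decompress it on the fly, so the full
# .bz2 archive never has to be stored on disk.
URL = ("https://dumps.wikimedia.org/simplewiki/20240201/"
       "simplewiki-20240201-pages-meta-current.xml.bz2")

decompressor = bz2.BZ2Decompressor()
with urllib.request.urlopen(URL) as resp:
    while chunk := resp.read(1 << 20):           # read 1 MiB at a time
        xml_bytes = decompressor.decompress(chunk)
        # feed xml_bytes to an incremental XML parser here

From there you can hand the decompressed bytes to a streaming parser such
as xml.etree.ElementTree's iterparse rather than loading the whole dump
into memory.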
On Thu, Feb 22, 2024 at 1:10 AM 21131A0564 MANCHUKONDA SRILASYA <
21131a0564(a)gvpce.ac.in> wrote:
> Dear xmldatadumps owner,
> I'm a student working on a search engine project, for which I
> need the XML data dumps. I do not have much storage capacity, so I
> just need a small XML data dump that I can use for my project.
> I will make sure that I will not misuse the data provided by
> you. Please consider my request.
>
> Yours obediently,
> M.Srilasya
>
--
Xabriel J. Collazo Mojica (he/him, pronunciation
<https://commons.wikimedia.org/wiki/File:Xabriel_Collazo_Mojica_-_pronunciat…>
)
Sr Software Engineer
Wikimedia Foundation
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures (sizes in bytes) for the 20240101 full revision history content run.
We are currently dumping 982 projects in total.
---------------------
Stats for tnwiktionary on date 20240101
Total size of page content dump files for articles, current content only:
2,013,030
Total size of page content dump files for all pages, current content only:
2,691,260
Total size of page content dump files for all pages, all revisions:
12,375,755
---------------------
Stats for enwiki on date 20240101
Total size of page content dump files for articles, current content only:
97,866,837,150
Total size of page content dump files for all pages, current content only:
201,781,306,071
Total size of page content dump files for all pages, all revisions:
28,002,734,610,233
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
(*apologies for cross-posting*)
Hello,
This is a breaking change announcement relevant to those working with
Lexeme dumps.
In Lexeme dumps, non-empty "senses" and "forms" values are serialized as
arrays, while empty lists are currently serialized as objects. For example,
values with content are displayed in array
format: "senses":[{"id":"L4-S1",...}]
but empty values are treated as objects: "senses":{}
However, empty lists should be presented as arrays as well: "senses":[]
With this change, empty lists of forms and senses will be switched from
objects to arrays. This adjustment makes the dumps more consistent and
matches the way non-empty values are presented. We will roll this change
out on February 8th.
We anticipate the impact of this change to be minimal and harmless for most
use cases. Therefore, we haven't generated a test dump, as it would demand
substantial resources and time. If you have any questions or concerns about
this change, please don’t hesitate to reach out to us in this ticket (
T305660 <https://phabricator.wikimedia.org/T305660>).
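If your code consumes Lexeme dumps produced before the switch, a small
normalization step keeps both serializations readable. A minimal sketch
(plain Python; the helper name is hypothetical, not part of any Wikidata
library):

import json

def normalize_lexeme(entity):
    # Pre-change dumps serialize empty "senses"/"forms" lists as {} instead of [].
    for key in ("senses", "forms"):
        if entity.get(key) == {}:
            entity[key] = []
    return entity

old_style = json.loads('{"id": "L4", "senses": {}, "forms": {}}')
print(json.dumps(normalize_lexeme(old_style)))
# -> {"id": "L4", "senses": [], "forms": []}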
Cheers,
--
Mohammed S. Abdulai
*Community Communications Manager, Wikidata*
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0) 30 577 116 2466
https://wikimedia.de
Grab a spot in my calendar for a chat: calendly.com/masssly.
A lot is happening around Wikidata - Keep up to date!
<https://www.wikidata.org/wiki/Wikidata:Status_updates> Current news and
exciting stories about Wikimedia, Wikipedia and Free Knowledge in our
newsletter (in German): Subscribe now <https://www.wikimedia.de/newsletter/>
.
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us to achieve our vision!
https://spenden.wikimedia.de
Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V.
(Society for the Promotion of Free Knowledge). Registered in the register
of associations of the Amtsgericht Charlottenburg, VR 23855 B. Recognized
as charitable by the Finanzamt für Körperschaften I Berlin, tax number
27/029/42207. Executive Directors: Franziska Heine, Dr. Christian Humborg
Hello!
I am getting some unexpected messages, so I tried the following:
curl -s
https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-a…
| bzip2 -d | tail
and got this:
bzip2: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bzip2: Inappropriate ioctl for device
Input file = (stdin), output file = (stdout)
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
<parentid>1227967782</parentid>
<timestamp>2023-12-07T00:22:05Z</timestamp>
<contributor>
<username>Renamerr</username>
<id>2883061</id>
</contributor>
<comment>/* wbsetdescription-add:1|uk */ бактеріальний білок, наявний
у Listeria monocytogenes EGD-e,
[[:toollabs:quickstatements/#/batch/218434|batch #218434]]</comment>
<model>wikibase-item</model>
<format>application/json</format>
The first part is an error message, which I could also see when running my
PHP script from within the toolserver cloud (PHP 7.4, because the XMLReader
class simply core-dumps with the installed PHP 8.2; see T352886). The
second part is the output of the "tail" command.
Just as a crosscheck: I have no such problem with
curl -s
https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-meta-current.…
| bzip2 -d | tail
No error, and the last line is "</mediawiki>".
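For a quick integrity check on a locally saved copy, bzip2 -tvv works, and
so does a short script. A minimal sketch (Python standard library only;
assumes the dump has already been downloaded to a file) that confirms the
stream decompresses to the end and finishes with the closing tag:

import bz2
import sys

tail = b""
try:
    with bz2.open(sys.argv[1], "rb") as f:
        while chunk := f.read(1 << 20):          # decompress 1 MiB at a time
            tail = (tail + chunk)[-64:]          # keep only the last bytes
except (EOFError, OSError) as err:
    sys.exit(f"decompression failed: {err}")     # truncated or corrupt stream

if tail.rstrip().endswith(b"</mediawiki>"):
    print("dump looks complete")
else:
    print("decompressed fully, but the closing </mediawiki> tag is missing")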
Cheers,
Wolfgang
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures (sizes in bytes) for the 20231201 full revision history content run.
We are currently dumping 982 projects in total.
---------------------
Stats for eowikiquote on date 20231201
Total size of page content dump files for articles, current content only:
25,649,838
Total size of page content dump files for all pages, current content only:
26,954,003
Total size of page content dump files for all pages, all revisions:
830,995,797
---------------------
Stats for enwiki on date 20231201
Total size of page content dump files for articles, current content only:
97,431,055,664
Total size of page content dump files for all pages, current content only:
201,191,572,243
Total size of page content dump files for all pages, all revisions:
27,833,207,510,221
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector
Hello,
I'm currently looking for the latest Wikipedia data dumps that include the complete history of Wikipedia edits, for research purposes. I'm aware that a similar data dump exists for Wikidata edits, but I haven't been able to locate the same for Wikipedia. Despite checking https://dumps.wikimedia.org/, I couldn't find the latest dump featuring Wikipedia edits. I would greatly appreciate any help in this matter.
Cheers,
Hrishi
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures (sizes in bytes) for the 20231101 full revision history content run.
We are currently dumping 982 projects in total.
---------------------
Stats for snwiktionary on date 20231101
Total size of page content dump files for articles, current content only:
506,341
Total size of page content dump files for all pages, current content only:
607,858
Total size of page content dump files for all pages, all revisions:
741,521
---------------------
Stats for enwiki on date 20231101
Total size of page content dump files for articles, current content only:
96,959,256,118
Total size of page content dump files for all pages, current content only:
200,144,987,436
Total size of page content dump files for all pages, all revisions:
27,663,890,580,870
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector