help
I would like to unsubscribe.
On Tue, Jul 28, 2020 at 1:01 PM <xmldatadumps-l-request(a)lists.wikimedia.org>
wrote:
> Send Xmldatadumps-l mailing list submissions to
> xmldatadumps-l(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
> or, via email, send a message with subject or body 'help' to
> xmldatadumps-l-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> xmldatadumps-l-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Xmldatadumps-l digest..."
>
>
> Today's Topics:
>
> 1. Has anyone had success with data deduplication? (griffin tucker)
> 2. Re: Has anyone had success with data deduplication? (Count Count)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 28 Jul 2020 01:50:09 +0000
> From: griffin tucker <gtucker4.une(a)hotmail.com>
> To: "xmldatadumps-l(a)lists.wikimedia.org"
> <xmldatadumps-l(a)lists.wikimedia.org>
> Subject: [Xmldatadumps-l] Has anyone had success with data
> deduplication?
> Message-ID:
> <
> TY2PR03MB3997DB2177073F2871ABE5E2D2730(a)TY2PR03MB3997.apcprd03.prod.outlook.com
> >
>
> Content-Type: text/plain; charset="utf-8"
>
> I've tried using FreeNAS/TrueNAS with a data deduplication volume to store
> multiple sequential dumps, but it doesn't seem to save much space at all. I
> was hoping someone could point me in the right direction so that I can
> download multiple dumps without them taking up so much room (uncompressed).
>
> Has anyone tried anything similar and had success with data deduplication?
>
> Is there a guide?
>
I've tried using FreeNAS/TrueNAS with a data deduplication volume to store multiple sequential dumps, but it doesn't seem to save much space at all. I was hoping someone could point me in the right direction so that I can download multiple dumps without them taking up so much room (uncompressed).
Has anyone tried anything similar and had success with data deduplication?
Is there a guide?
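One likely explanation: ZFS-style block deduplication (which FreeNAS/TrueNAS uses) only reclaims space when entire filesystem blocks are byte-for-byte identical, and compressed dump files rarely line up block-for-block from one run to the next. A rough way to check is to hash fixed-size blocks of two dumps and count how many they share; here is a minimal Java sketch, where the 128 KiB block size and the command-line file arguments are illustrative assumptions rather than anything from the original setup.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hash every fixed-size block of a file and report how many block hashes two
// files have in common. Block-level dedup can only merge blocks that match exactly.
public class BlockOverlap {
    static final int BLOCK = 128 * 1024; // assumed block size (ZFS default recordsize)

    static Set<String> blockHashes(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Set<String> hashes = new HashSet<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            byte[] buf = new byte[BLOCK];
            int n;
            while ((n = in.readNBytes(buf, 0, BLOCK)) > 0) {
                md.reset();
                md.update(buf, 0, n);
                hashes.add(Base64.getEncoder().encodeToString(md.digest()));
            }
        }
        return hashes;
    }

    public static void main(String[] args) throws Exception {
        // Usage (hypothetical file names): java BlockOverlap dump1.xml dump2.xml
        Set<String> shared = blockHashes(args[0]);
        shared.retainAll(blockHashes(args[1]));
        System.out.println("Blocks shared between the two files: " + shared.size());
    }
}

If the shared count is close to zero for two consecutive dumps, block-level dedup simply has nothing to merge.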
Hi all,
I am trying to read the dump from
https://dumps.wikimedia.your.org/enwiki/20200701/enwiki-20200701-pages-arti…
using a Java XMLStreamReader, but it complains about the format. It looks
like the file does not contain an XML header (<?xml version="1.0"?> or
similar), and after unpacking the file and prepending the header, everything
parses fine.
Is there a good reason why headers are missing?
Cheers,
Alex
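For reference, the workaround Alex describes can be done in-stream rather than by editing the file. Here is a minimal Java sketch, assuming an already-unpacked .xml file; the local file name and the choice to print <title> elements are illustrative only.

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class DumpTitles {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to an already-unpacked dump file.
        String path = "enwiki-20200701-pages-articles.xml";

        // As noted above, the unpacked file lacks the <?xml ...?> declaration,
        // so prepend one before handing the stream to the parser.
        InputStream decl = new ByteArrayInputStream(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                        .getBytes(StandardCharsets.UTF_8));
        InputStream file = new BufferedInputStream(new FileInputStream(path));

        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new SequenceInputStream(decl, file));

        // Example use: print every page title as it streams past.
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(reader.getLocalName())) {
                System.out.println(reader.getElementText());
            }
        }
        reader.close();
    }
}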
Dear Rajakumaran Archulan,
Older dumps can often be found on the Internet Archive. The February 2017
full dumps for the English language Wikipedia are here:
https://archive.org/details/enwiki-20170201
A reminder for all new and older members of this list: comprehensive
documentation for dumps users is available on MetaWiki:
https://meta.wikimedia.org/wiki/Data_dumps
In the section "Getting the dumps" there are pointers for locating older
dumps that are no longer available on the Wikimedia dumps download host.
Ariel Glenn
ariel(a)wikimedia.org
On Mon, Jul 20, 2020 at 6:54 AM Rajakumaran Archulan <
archulan.16(a)cse.mrt.ac.lk> wrote:
> Dear sir/madam,
>
> I am a final-year undergrad in the Department of Computer Science &
> Engineering at the University of Moratuwa, Sri Lanka. We are in the process
> of building an evaluator for word embeddings for our final year project.
>
> We need the *Wikipedia dump of February 2017* for our research. We searched
> across the web for several hours but couldn't find it. We would be grateful
> if you could grant us access to the above corpus so that we can continue
> our research.
>
> Thank you!
>
> --
> *Best regards,*
> *R.Archulan*
> *Final year undergrad (16' Batch),*
> *Dept. of Computer Science & Engineering,*
> *Faculty of Engineering,*
> *University of Moratuwa, Sri Lanka.*
> *Mobile: (+94) 771761696*
> *Linkedin* <https://www.linkedin.com/in/archulan>
>
NOTE: I did not produce the HTML dumps; they are being managed by another
team.
If you are interested in weighing in on the output format, what's missing,
etc., here is the Phabricator task: https://phabricator.wikimedia.org/T257480
Your comments and suggestions would be welcome!
Ariel
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20200601 full revision history content run.
We are currently dumping 914 projects in total.
---------------------
Stats for bswiktionary on date 20200601
Total size of page content dump files for articles, current content only:
9873045
Total size of page content dump files for all pages, current content only:
11245466
Total size of page content dump files for all pages, all revisions:
87550070
---------------------
Stats for enwiki on date 20200601
Total size of page content dump files for articles, current content only:
77759894873
Total size of page content dump files for all pages, current content only:
172875370489
Total size of page content dump files for all pages, all revisions:
20881062786426
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector