help
I would like to unsubscribe.
On Tue, Jul 28, 2020 at 1:01 PM <xmldatadumps-l-request(a)lists.wikimedia.org>
wrote:
> Send Xmldatadumps-l mailing list submissions to
> xmldatadumps-l(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
> or, via email, send a message with subject or body 'help' to
> xmldatadumps-l-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> xmldatadumps-l-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Xmldatadumps-l digest..."
>
>
> Today's Topics:
>
> 1. Has anyone had success with data deduplication? (griffin tucker)
> 2. Re: Has anyone had success with data deduplication? (Count Count)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 28 Jul 2020 01:50:09 +0000
> From: griffin tucker <gtucker4.une(a)hotmail.com>
> To: "xmldatadumps-l(a)lists.wikimedia.org"
> <xmldatadumps-l(a)lists.wikimedia.org>
> Subject: [Xmldatadumps-l] Has anyone had success with data
> deduplication?
> Message-ID:
> <
> TY2PR03MB3997DB2177073F2871ABE5E2D2730(a)TY2PR03MB3997.apcprd03.prod.outlook.com
> >
>
> Content-Type: text/plain; charset="utf-8"
>
> I've tried using FreeNAS/TrueNAS with a data deduplication volume to store
> multiple sequential dumps, but it doesn't seem to save much space at all. I
> was hoping someone could point me in the right direction so that I can
> download multiple dumps without them taking up so much room (uncompressed).
>
> Has anyone tried anything similar and had success with data deduplication?
>
> Is there a guide?
>
I've tried using FreeNAS/TrueNAS with a data deduplication volume to store multiple sequential dumps, but it doesn't seem to save much space at all. I was hoping someone could point me in the right direction so that I can download multiple dumps without them taking up so much room (uncompressed).
Has anyone tried anything similar and had success with data deduplication?
Is there a guide?
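One likely explanation: ZFS-style block deduplication (which FreeNAS/TrueNAS uses) only reclaims space when entire filesystem blocks are byte-for-byte identical, and compressed dump files rarely line up block-for-block from one run to the next. A rough way to check is to hash fixed-size blocks of two dumps and count how many they share; here is a minimal Java sketch, where the 128 KiB block size and the command-line file arguments are illustrative assumptions rather than anything from the original setup.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hash every fixed-size block of a file and report how many block hashes two
// files have in common. Block-level dedup can only merge blocks that match exactly.
public class BlockOverlap {
    static final int BLOCK = 128 * 1024; // assumed block size (ZFS default recordsize)

    static Set<String> blockHashes(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Set<String> hashes = new HashSet<>();
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            byte[] buf = new byte[BLOCK];
            int n;
            while ((n = in.readNBytes(buf, 0, BLOCK)) > 0) {
                md.reset();
                md.update(buf, 0, n);
                hashes.add(Base64.getEncoder().encodeToString(md.digest()));
            }
        }
        return hashes;
    }

    public static void main(String[] args) throws Exception {
        // Usage (hypothetical file names): java BlockOverlap dump1.xml dump2.xml
        Set<String> shared = blockHashes(args[0]);
        shared.retainAll(blockHashes(args[1]));
        System.out.println("Blocks shared between the two files: " + shared.size());
    }
}

If the shared count is close to zero for two consecutive dumps, block-level dedup simply has nothing to merge.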
Hi all,
I am trying to read the dump from
https://dumps.wikimedia.your.org/enwiki/20200701/enwiki-20200701-pages-arti…
using a Java XMLStreamReader, but it complains about the format. It looks
like the file does not contain an XML header (<?xml version="1.0"?> or
similar), and after unpacking the file and prepending the header, everything
parses fine.
Is there a good reason why headers are missing?
Cheers,
Alex
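For reference, the workaround Alex describes can be done in-stream rather than by editing the file. Here is a minimal Java sketch, assuming an already-unpacked .xml file; the local file name and the choice to print <title> elements are illustrative only.

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class DumpTitles {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to an already-unpacked dump file.
        String path = "enwiki-20200701-pages-articles.xml";

        // As noted above, the unpacked file lacks the <?xml ...?> declaration,
        // so prepend one before handing the stream to the parser.
        InputStream decl = new ByteArrayInputStream(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                        .getBytes(StandardCharsets.UTF_8));
        InputStream file = new BufferedInputStream(new FileInputStream(path));

        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new SequenceInputStream(decl, file));

        // Example use: print every page title as it streams past.
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "title".equals(reader.getLocalName())) {
                System.out.println(reader.getElementText());
            }
        }
        reader.close();
    }
}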
Dear Rajakumaran Archulan,
Older dumps can often be found on the Internet Archive. The February 2017
full dumps for the English language Wikipedia are here:
https://archive.org/details/enwiki-20170201
A reminder for all new and older members of this list: comprehensive
documentation for dumps users is available on MetaWiki:
https://meta.wikimedia.org/wiki/Data_dumps
In the section "Getting the dumps" there are pointers for locating older
dumps that are no longer available on the Wikimedia dumps download host.
Ariel Glenn
ariel(a)wikimedia.org
On Mon, Jul 20, 2020 at 6:54 AM Rajakumaran Archulan <
archulan.16(a)cse.mrt.ac.lk> wrote:
> Dear sir/madam,
>
> I am a final-year undergrad in the Department of Computer Science &
> Engineering at the University of Moratuwa, Sri Lanka. We are in the process
> of building an evaluator for word embeddings for our final year project.
>
> We need the *Wikipedia dump of February 2017* for our research. We searched
> across the web for several hours but couldn't find it. We would be grateful
> if you could grant us access to the above corpus so that we can continue
> our research.
>
> Thank you!
>
> --
> *Best regards,*
> *R.Archulan*
> *Final year undergrad (16' Batch),*
> *Dept. of Computer Science & Engineering,*
> *Faculty of Engineering,*
> *University of Moratuwa, Sri Lanka.*
> *Mobile: (+94) 771761696*
> *Linkedin* <https://www.linkedin.com/in/archulan>
>
NOTE: I did not produce the HTML dumps; they are being managed by another
team.
If you are interested in weighing in on the output format, what's missing,
etc., here is the Phabricator task: https://phabricator.wikimedia.org/T257480
Your comments and suggestions would be welcome!
Ariel
Greetings XML Dump users and contributors!
This is your automatic monthly Dumps FAQ update email. This update
contains figures for the 20200601 full revision history content run.
We are currently dumping 914 projects in total.
---------------------
Stats for bswiktionary on date 20200601
Total size of page content dump files for articles, current content only:
9873045
Total size of page content dump files for all pages, current content only:
11245466
Total size of page content dump files for all pages, all revisions:
87550070
---------------------
Stats for enwiki on date 20200601
Total size of page content dump files for articles, current content only:
77759894873
Total size of page content dump files for all pages, current content only:
172875370489
Total size of page content dump files for all pages, all revisions:
20881062786426
---------------------
Sincerely,
Your friendly Wikimedia Dump Info Collector