Xmldatadumps-l November 2015

xmldatadumps-l@lists.wikimedia.org

4 participants
4 discussions

by Richard Jelinek

Hi,

I don't know if this issue came up already. In case it did and has been dismissed, I beg your pardon. In case it didn't...

I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on.

A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results from a few hours before that. The results indicate that bzip2 and pbzip2 are compatible in both directions: each one can create archives that the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives work well with pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
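
The compatibility described in the message above can be illustrated with a short sketch. It assumes, which the message does not state, that a parallel compressor works by compressing chunks of the input independently and concatenating the resulting bzip2 streams; the function names and the chunk size are made up for illustration and are not pbzip2's actual implementation.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_in_chunks(data: bytes, chunk_size: int = 900_000, workers: int = 4) -> bytes:
    """Compress each chunk as its own bzip2 stream and concatenate the streams.

    Hypothetical helper: chunk_size and workers are illustrative values only.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Each chunk is independent of the others, so the work can be spread
    # over several workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)


def decompress_all(blob: bytes) -> bytes:
    """Decompress a blob made of one or more concatenated bzip2 streams."""
    # bz2.decompress accepts multi-stream input (Python 3.3 and later).
    return bz2.decompress(blob)
```

A reader that understands concatenated streams recovers the original data unchanged, which matches the "harmless for regular users" point; a parallel reader could, under the same assumption, hand each stream to a separate CPU.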

8 years, 3 months

Redirect dump format

by Alan Said

Hi all,

I want to recreate the redirect graph from the redirect SQL file. Having loaded the file into a local MySQL database, I have a table with the fields listed in https://www.mediawiki.org/wiki/Manual:Redirect_table (except the rd_interwiki column).

For instance, the second row in the table contains the values:

rd_from  rd_namespace  rd_title
13       0             History_of_Afghanistan

I interpret this as page ID 13 in namespace 0 redirects to https://en.wikipedia.org/wiki/History_of_Afghanistan

Now, trying to use the ID in the rd_from column, I attempt to access the page information using the example at https://www.mediawiki.org/wiki/Manual:Page_table#page_id, i.e. loading the URL https://www.mediawiki.org/w/api.php?action=query&prop=info&pageids=13

The result tells me that pageid 13 is missing.

So, can anyone please tell me what I'm doing wrong?

Best,
Alan
--
Alan Said
Recorded Future
e: alansaid(a)acm.org
t: @alansaid
w: www.alansaid.com
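
One way to turn such rows into a graph is sketched below. It assumes that rd_from is the ID of the redirecting page and that rd_namespace and rd_title name the target, as the Redirect table manual describes. The `page_ids` lookup is a hypothetical mapping built from the page table (page_namespace, page_title to page_id), and the target ID used in the usage example is invented.

```python
from typing import Dict, Iterable, Optional, Tuple

RedirectRow = Tuple[int, int, str]  # (rd_from, rd_namespace, rd_title)
PageKey = Tuple[int, str]           # (page_namespace, page_title)


def build_redirect_graph(
    rows: Iterable[RedirectRow],
    page_ids: Dict[PageKey, int],
) -> Dict[int, Optional[int]]:
    """Map each redirecting page ID to the page ID of its target.

    Targets that are absent from the (hypothetical) page_ids lookup
    are recorded as None instead of being dropped.
    """
    graph: Dict[int, Optional[int]] = {}
    for rd_from, rd_namespace, rd_title in rows:
        graph[rd_from] = page_ids.get((rd_namespace, rd_title))
    return graph


# Usage: the row comes from the message; the target ID 999 is made up.
rows = [(13, 0, "History_of_Afghanistan")]
page_ids = {(0, "History_of_Afghanistan"): 999}
graph = build_redirect_graph(rows, page_ids)
```

Keeping unresolved targets as None makes it easy to spot redirects whose target is not in the local page table.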

8 years, 5 months

Re: [Xmldatadumps-l] [Wiki-research-l] Download of pageviews dataset

by Federico Leva (Nemo)

Cristian Consonni, 11/11/2015 15:09:
> I am working with a student on scientific citation on Wikipedia and,
> very simply put, we would like to use the pageview dataset to have a
> rough measure of how many times a paper was viewed thanks to
> Wikipedia.[*]
>
> The full dataset is, as of now, ~ 4.7TB in size.
>
> I have two questions:
> * if we download this dataset this would entail, from a first
> estimation, ~ 30 days of continuous download (assuming an average
> download speed of ~ 2MB/s, which was what we measured over the
> download of a month of data (~ 64GB)). Here at my University (Trento,
> Italy) this kind of download has to be notified to the IT
> department. I was wondering if this would be useful information for
> the WMF, too.

No need to notify such small downloads.

> * given the estimation above I was wondering if it is possible to
> obtain this data over FedEx Bandwidth (cit. [1]), i.e. via shipping of
> a physical disk. I know that in some fields (e.g. neuroscience) this
> is the standard way to exchange big datasets (in the order of TBs).

This assumes that some point of the network has faster download from that machine. The server is very slow for pretty much anyone except rare exceptions (https://phabricator.wikimedia.org/T45647 ), possibly even inside the cluster. Copying to a hard drive might take many days.

You have two more alternatives:
* scp from Labs, /public/dumps/pagecounts-all-sites/ (sometimes reaches 3-4 MB/s for me);
* archive.org for pagecounts-raw https://archive.org/search.php?query=wikipedia_visitor_stats (you can start the download of all months at once and use torrent, which will hopefully saturate your bandwidth because you will download from dozens of servers rather than one).

Nemo

> Thanks in advance for your help.
>
> Cristian
> [*] I know these are pageviews and not unique visitors, furthermore
> there is no guarantee that viewing a citation means anything. I am
> approaching this data the same way "impressions" versus
> "clickthroughs" are treated in the online advertising world.
> [1] https://what-if.xkcd.com/31/
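
The download estimate quoted in this message can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the message does not specify; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def download_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days of continuous transfer needed for size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


# Figures from the message, under the decimal-unit assumption above.
full_dataset = download_days(4.7e12, 2e6)  # ~4.7 TB at ~2 MB/s
at_scp_rate = download_days(4.7e12, 4e6)   # same size at the upper scp rate (4 MB/s)
```

Under these assumptions the full dataset takes about 27 days at 2 MB/s, in line with the "~ 30 days" estimate, and about half that at 4 MB/s.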

8 years, 5 months

Proposal: Splitting of dewiki and frwiki into smaller chunks

by Hydriz Scholz

Hi all,

If you are a user of the dewiki and frwiki dumps, please provide your input on the proposal for splitting the dewiki and frwiki dumps into smaller pieces, similar to that of enwiki. The proposal is available on Phabricator. [1] All comments are welcome!

Also, as a reminder, please provide your comments for the next generation of dumps on Phabricator. [2] This is your chance to propose changes to the dumps to suit your needs.

Thanks!

[1]: https://phabricator.wikimedia.org/T116907
[2]: https://phabricator.wikimedia.org/T114019

--
Best regards,
Hydriz Scholz

8 years, 5 months
