On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <randall(a)wawd.com> wrote:
> > Keeping the dumps in a text-based format doesn't make sense, because
> that can't be updated efficiently, which is the whole reason for the new
> dumps.
>
> First, glad to see there's motion here.
>
> It's definitely true that recompressing the entire history to .bz2 or .7z
> goes very, very slowly. Also, I don't know of an existing tool that lets
> you just insert new data here and there without compressing all of the
> unchanged data as well. Those point towards some sort of format change.
>
> I'm not sure a new format has to be sparse or indexed to get around those
> two big problems.
>
> For full-history dumps, delta coding (or the related idea of long-range
> redundancy compression) runs faster than bzip2 or 7z and produces good
> compression ratios on full-history dumps, based on some tests <https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3>
> . (I'm going to focus mostly on full-history dumps here because they're
> the hard case and one Ariel said is currently painful--not everything here
> will apply to latest-revs dumps.)
>
> For inserting data, you do seemingly need to break the file up into
> independently-compressed sections containing just one page's revision
> history or a fragment of it, so you can add new diff(s) to a page's
> revision history without decompressing and recompressing the previous
> revisions. (Removing previously-dumped revisions is another story, but it's
> rarer.) You'd be in new territory just doing that; I don't know of existing
> compression tools that really allow that.
>
> You could do those two things, though, while still keeping full-history
> dumps a once-every-so-often batch process that produces a sorted file. The
> time to rewrite the file, stripped of the big compression steps, could be
> bearable--a disk can read or write about 100 MB/s, so just copying the 70G
> of the .7z enwiki dumps is well under an hour; if the part bound by CPU and
> other steps is smallish, you're OK.
>
> A format like the proposed one, with revisions inserted wherever there's
> free space when they come in, will also eventually fragment the revision
> history for one page (I think Ariel alluded to this in some early notes).
> Unlike sequential read/writes, seeks are something HDDs are sadly pretty
> slow at (hence the excitement about solid-state disks); if thousands of
> revisions are coming in a day, it eventually becomes slow to read things in
> the old page/revision order, and you need fancy techniques to defrag (maybe
> a big external-memory sort <http://en.wikipedia.org/wiki/External_sorting>)
> or you need to only read the dump on fast hardware that can handle the
> seeks. Doing occasional batch jobs that produce sorted files could help
> avoid the fragmentation question.
>
These are some interesting ideas.
You're right that copying the whole dump is fast enough (it would
probably add about an hour to a process that currently takes several days).
But it would also pretty much force the use of delta compression. And while
I would like to use delta compression, I don't think it's a good idea to be
forced to use it, because I might not have the time for it or it might not
be good enough.
Because of that, I decided to stay with my indexed approach.
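To sketch the general shape of the indexed approach (a toy illustration in
Python only -- the real format will differ, and every field layout in this
example is made up): each page's revisions form an independently compressed
block, and a small side index maps page IDs to block offsets, so adding
revisions to one page means rewriting only that page's block and its index
entry:

import struct
import zlib

# Toy illustration only: one independently compressed block per page,
# plus an index of page_id -> (offset, length) in a side file.
def write_dump(path, pages):
    """pages: dict mapping page_id -> list of revision texts."""
    index = {}
    with open(path, "wb") as f:
        for page_id, revisions in sorted(pages.items()):
            block = zlib.compress("\x00".join(revisions).encode("utf-8"))
            index[page_id] = (f.tell(), len(block))
            f.write(block)
    with open(path + ".idx", "wb") as f:
        for page_id, (offset, length) in sorted(index.items()):
            f.write(struct.pack("<QQQ", page_id, offset, length))

def read_page(path, page_id):
    """Decompress one page's revisions without touching any other block."""
    with open(path + ".idx", "rb") as f:
        while True:
            entry = f.read(24)
            if not entry:
                return None
            pid, offset, length = struct.unpack("<QQQ", entry)
            if pid == page_id:
                break
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length)).decode("utf-8").split("\x00")

With a layout like that, delta-coding the revisions inside a block becomes an
optional optimization of the block contents, not something the overall format
depends on.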
> There's a great quote about the difficulty of "constructing a software
> design...to make it so simple that there are obviously no deficiencies."
> (Wikiquote came through with the full text/attribution, of course <http://en.wikiquote.org/wiki/C._A._R._Hoare>.)
> I admit it's tricky and people can disagree about what's simple enough or
> even what approach is simpler of two choices, but it's something to strive
> for.
>
> Anyway, I'm wary about going into the technical weeds of other folks'
> projects, because, hey, it's your project! I'm trying to map out the
> options in the hope that you could get a product you're happier with and
> maybe give you more time in a tight three-month schedule to improve on your
> work and not just complete it. Whatever you do, good luck and I'm
> interested to see the results!
>
Feel free to comment more. I am the one implementing the project, but
that's all. Input from others is always welcome.
Petr Onderka
Hello,
This is a reminder that the Language Engineering team will be hosting an
IRC office hour later today, i.e. July 10, 2013 at 1700 UTC/1000 PDT on
#wikimedia-office (Freenode).
Thanks
Runa
Agenda:
======
1. ULS Rollout
2. Other updates
3. Q/A - We shall be taking questions during the session. Questions
can also be sent to runa at wikimedia dot org or siebrand at wikimedia
dot org before the event and can be addressed during the office-hour.
---------- Forwarded message ----------
From: Runa Bhattacharjee <rbhattacharjee(a)wikimedia.org>
Date: Wed, Jul 3, 2013 at 10:11 PM
Subject: [Language Engineering] Office hour on July 10, 2013 at 1700
UTC/1000 PDT
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>, Wikimedia
Mailing List <wikimedia-l(a)lists.wikimedia.org>, MediaWiki
internationalisation <mediawiki-i18n(a)lists.wikimedia.org>
Hello,
The Wikimedia Language Engineering team [1] invites everyone to join
the team’s monthly office hour on July 10, 2013 at 1700 UTC/ 1000 PDT
on #wikimedia-office. During this session we will be talking about
some of our recent activities, including the Universal Language
Selector (ULS) rollout and updates from the ongoing projects.
See you all at the IRC office hour!
regards,
Runa
Event Details:
==========
Date: July 10, 2013 (Wednesday)
Time: 1700-1800 UTC, 1000-1100 AM PDT
IRC channel: #wikimedia-office on irc.freenode.net
Agenda:
1. ULS Rollout
2. Other updates
3. Q/A - We shall be taking questions during the session. Questions
can also be sent to runa at wikimedia dot org or siebrand at wikimedia
dot org before the event and can be addressed during the office-hour.
[1] http://wikimediafoundation.org/wiki/Language_Engineering_team
--
Language Engineering - Outreach and QA Coordinator
Wikimedia Foundation
I know it (commits in mirrored repositories not showing up on one's GitHub profile) has been annoying a couple of people other than me, so now that I've learned how to make it work I'll share the knowledge here.
tl;dr: Star the repositories. No, seriously. (And yes, you need to star each extension repo separately.)
(Is there a place on mw.org to put this tidbit?)
------- Forwarded message -------
From: "Brian Levine" <support(a)github.com> (GitHub Staff)
To: matma.rex(a)gmail.com
Cc:
Subject: Re: Commits in mirrored repositories not showing up on my profile
Date: Tue, 09 Jul 2013 06:47:19 +0200
Hi Bartosz
In order to link your commits to your GitHub account, you need to have some association with the repository other than authoring the commit. Usually, having push access gives you that connection. In this case, you don't have push permission, so we don't link you to the commit.
The easy solution here is for you to star the repository. If you star it - along with the other repositories that are giving you this problem - we'll see that you're connected to the repository and you'll get contribution credit for those commits.
Cheers
Brian
--
Matma Rex
Hello everyone,
in the framework of a GLAM project, we are looking for ways to
(1) identify the number of pages in a given category - including via
subcategories - on a given wiki
(2) get the pageview stats for all these pages, including on aggregate
(3) do the above across languages or projects
(4) estimate what outcomes to expect in terms of Wikipedia pageviews
and related metrics after an image donation of X files to a given
category on Commons.
I assume that part of it is available via the API but couldn't find
anything close enough.
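To make it concrete, here is roughly the direction I was picturing for (1)
and (2) -- an untested sketch; the categorymembers query should be close, but
the pageview source (stats.grok.se) and its JSON layout are guesses on my part:

import requests

API = "https://en.wikipedia.org/w/api.php"  # swap per language/project for (3)

def pages_in_category(category, depth=3):
    """Collect article titles in a category, recursing into subcategories.
    category needs the "Category:" prefix, e.g. "Category:Physics"."""
    titles, queue = set(), [(category, depth)]
    while queue:
        cat, d = queue.pop()
        params = {"action": "query", "list": "categorymembers",
                  "cmtitle": cat, "cmlimit": "max", "format": "json"}
        while True:
            data = requests.get(API, params=params).json()
            for m in data["query"]["categorymembers"]:
                if m["ns"] == 14 and d > 0:      # subcategory
                    queue.append((m["title"], d - 1))
                elif m["ns"] == 0:               # article
                    titles.add(m["title"])
            cont = data.get("query-continue", {}).get("categorymembers")
            if not cont:
                break
            params.update(cont)
    return titles

def monthly_views(title, month="201306", lang="en"):
    # Guessed endpoint: stats.grok.se seems to serve per-article monthly
    # view counts as JSON; summing daily_views would give the aggregate.
    url = "http://stats.grok.se/json/%s/%s/%s" % (lang, month, title.replace(" ", "_"))
    return sum(requests.get(url).json()["daily_views"].values())

Pointing the same code at each language's or project's api.php would
presumably cover (3).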
Any pointers would be appreciated.
Thanks and cheers,
Daniel
Good morning all,
I have a question about a problem that cropped up during my update from 1.18 to
1.21.1.
With one exception everything went smoothly during my update, but now all of my
images appear to be without thumbnails and are inadvertently using File protocol
links.
The generated source for embedded images looks like this:
<p>[<a rel="nofollow" class="external text"
href="File:ReportedTime.jpg%7C451px%7CReported">time per activity on
Project</a>]</p>
Generated from:
[[File:ReportedTime.jpg|451px|Reported time per activity on Project]]
Any idea what I may have done wrong? Is there a new setting that I may have
missed? Has anyone ever seen this sort of issue before?
Thank you,
Derric Atzrott
Computer Specialist
Alizee Pathology
Various parts of MediaWiki will apply tags to specific edits in recent
changes and histories.
For example, the recently introduced Visual Editor is adding Tag:
VisualEditor to all of its edits.
Are such tags included in the XML dumps of Wikipedia? It will be a
while before a new dump of enwiki is released, but once it is ready,
I'm wondering if we can use it to track the adoption of Visual Editor
by looking for such Tags in the dump file. Are the Tags included, and
if so, which dump files are they contained in?
I've looked briefly at the dump documentation and didn't see any
mention of Tags, and I don't recall noticing them during any of the
times I've worked with dump files in the past.
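(For comparison, the number I'm after is basically what the live API already
gives for recent edits -- rough, untested sketch below, and I'm assuming the
internal tag name is "visualeditor", which may be off -- but only a dump would
let me do this over the full history:)

import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_tagged_edits(tag="visualeditor", limit=500):
    """Count recent changes carrying a given tag via the live API."""
    params = {"action": "query", "list": "recentchanges", "rctag": tag,
              "rcprop": "ids|timestamp|tags", "rclimit": limit,
              "format": "json"}
    data = requests.get(API, params=params).json()
    return len(data["query"]["recentchanges"])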
-Robert Rohde
>Good morning all,
>
>I have a question about a problem that cropped up during my update from 1.18 to
1.21.1.
>
>With one exception everything went smoothly during my update, but now all of my
images appear to be without thumbnails and are inadvertently using File protocol
links.
>
>...snip...
>
>Any idea what I may have done wrong? Is there a new setting that I may have
missed? Has anyone ever seen this sort of issue before?
>
>Thank you,
>Derric Atzrott
I determined the issue. I had $wgUrlProtocols[] = "file:"; in my
LocalSettings.php from an earlier attempt to get File protocol links working
(it would have required me to write Firefox and Chrome extensions, so it was
abandoned as too much effort to manage securely).
Thank you,
Derric Atzrott
Here's a copy of a mail I just sent to stewards-l about the trial
deployment of global AbuseFilters:
Hello,
After a long time we're finally confident that the AbuseFilter extension
is in a state in which it can be used for global filters. Therefore I'm
happy to announce that from now on global AbuseFilters can be used on
(some) Wikimedia wikis.
The filters can be created and edited by Stewards using the normal
AbuseFilter interface on meta
( https://meta.wikimedia.org/wiki/Special:AbuseFilter ) and basically
work like local filters (you only have to set the "Global filter" flag).
Although global AbuseFilters have already been in use on Wikimedia Labs for
quite some time, we would like you to start using them slowly, preferably
with logging-only filters to prevent unforeseen damage.
Global filters are currently enabled on:
metawiki, testwiki, test2wiki, mediawikiwiki
(They will only filter changes on these wikis... Of course this trial
will later be extended to further wikis, and it's planned to cover all
wikis at some point.)
Please note that it's not yet possible to create custom warning
messages for global filters
( https://bugzilla.wikimedia.org/show_bug.cgi?id=45164 ) and that global
filters can't yet be enabled/disabled for certain wikis only
( https://bugzilla.wikimedia.org/show_bug.cgi?id=41172 ).
Cheers,
Marius Hoch (Hoo man)