Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30MB/s on my box,
which is still 8x faster than the status quo (going by a 1GB
benchmark).
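(If you want the same preset from a script rather than shelling out to
xz, the usual liblzma bindings expose it; a quick Python sketch, with a
made-up filename:)

    import lzma, shutil

    # preset=3 is the library equivalent of "xz -3": a ~4 MiB dictionary,
    # much faster than the default preset.
    with open("pages-meta-history.xml", "rb") as src, \
         lzma.open("pages-meta-history.xml.xz", "wb", preset=3) as dst:
        shutil.copyfileobj(src, dst)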
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
Hi folks,
This is an attempt to summarize the list of RFCs that are listed under
this cluster:
https://www.mediawiki.org/wiki/Architecture_Summit_2014/RFC_Clusters#HTML_t…
...and possibly get a conversation going about all of this in advance
of the Architecture Summit.
The main focus of all of these RFCs is HTML generation for user
interface elements. This category is not about wikitext templates or
anything to do with how we translate wikitext markup into HTML.
"Template Engine" is Chris Steipp's submission outlining the use of
Twig. From my conversations with Chris, it's not so much that he's
eager to adopt Twig specifically so much as standardize on
*something*, and make sure there's usage guidelines around it to avoid
common mistakes. He's seen many attempts at per-extension template
libraries that bloat our code and often start off with big security
vulnerabilities. There are many extensions that use Twig, and it
seems to be a popular choice for new work, so Chris is hoping to
standardize on it and put some usage standards around it.
"HTML templating library" is Ryan Kaldari's submission, promoting the
use of Mustache or something like it. His main motivation is to have
a Javascript template library for front-end work, but is hoping we
choose something that has both Javascript and PHP implementations so
that any PHP template system implementation is compatible with what we
use for Javascript templating.
"MVC Framework" is Owen Davis's description of Wikia's Nirvana
framework, which has been central to all of the user-facing work
they've been doing for the past 2-3 years. As Owen points out in the
talk page for this, it's really view-controller rather than full MVC.
A big plus of adopting this RFC is that it would make it much more
likely that Wikia-developed extensions (of which there are many) would
be of greater use to the larger MediaWiki community, and would
generally help facilitate greater collaboration between Wikia and the
rest of us.
"OutputPage Refactor" is another submission from Owen Davis, which
isn't really about templating, so much as taking the HTML generation
code out of OutputPage. Wikia has been maintaining a fork of
OutputPage for quite some time, so they already have quite a bit of
experience with the proposed changes. This is clustered with the
templating proposals, since I imagine the work that gets factored out
of OutputPage would need to be factored into whatever templating
system we choose.
The first three seem somewhat mutually exclusive, though our task
will likely be to come up with a fourth proposal that incorporates
many of the requirements of those three. The OutputPage Refactor
proposal, given some fleshing out, may not be controversial at all.
Where should we go from here? Can we make some substantial progress
on moving one or more of these RFCs over the next few weeks?
Rob
Hoi,
At this moment Wikipedia "red links" provide no information whatsoever.
This is not cool.
In Wikidata we often have labels for the missing (= red link) articles.
We can and do present information from Wikidata in a reasonable,
informative way in the "Reasonator". We also provide additional search
information on many Wikipedias.
In the Reasonator we have now implemented "red lines" [1]. They indicate
when a label does not exist in the primary language that is in use.
What we are considering is creating a template {{Reasonator}} that will
present information based on what is available in Wikidata. Such a
template would be a stand-in until an article is actually written. What
we would provide is information presented in the same way as we provide
it at this moment in time [2].
This may open up a can of worms; Reasonator is NOT using any caching.
There may be lots of other reasons why you might think this proposal is
evil. All the technical objections have some merit, but you have to
consider that the other side of the equation is that we are not "sharing
in the sum of all knowledge" even when we have much of the missing,
requested information available to us.
One saving (technical) grace: Reasonator loads roughly as quickly as
Wikidata does.
As this is advance warning, I hope that you can help with the issues
that will come up. I hope that you will consider the impact this will
have on our traffic and measure to what extent it grows our data.
The Reasonator pages will not show up prettily on mobile phones... but
neither does Wikidata, by the way. It does not consider Wikipedia Zero.
There may be more issues that require attention. But again, it beats not
serving the information that we have to those who are requesting it.
Thanks,
GerardM
[1]
http://ultimategerardm.blogspot.nl/2014/01/reasonator-is-red-lining-your-da…
[2] http://tools.wmflabs.org/reasonator/test/?lang=oc&q=35610
Hello everyone,
It’s with great pleasure that I’m announcing that Sam Smith[1] has joined the Wikimedia Foundation as a Software Engineer in Features Engineering. He'll be working with the Growth team.[2]
Before joining us, Sam was a member of the Last.fm web team (web-slingers), where he helped to build the new catalogue pages, the Last.fm Spotify app, the new (the only) user on-boarding flow, and helped immortalise his favorite band (Maybeshewill) in most of their unit test cases. Before that he worked on everything from Java job schedulers, to Pascal Windows installers, to Microsoft server sysadmining, to Zend Framework and symfony migrations. Ask him which was the worst; my money is on the PHP migration[3]. He received his master's degree in physics from the University of Warwick.
Sam is based in London (the capital of England, not the city in Ontario or the settlement in Kiribati). He lives in Surrey Quays with his wife, Lisa, and his 19-month-old son, George. His hobbies include juggling (he's juggled for 6 years), unicycling (one day he's going to attempt the distance record… one day!), climbing (specifically bouldering; he's actually really afraid of heights), coffee (it's not really a hobby, it's an obsession), and playing Lineage 1[4]. Ask him to do some unicycling up a boulder while drinking coffee and playing Lineage… now that would be juggling!
His first official day is today, Tuesday, January 21, 2014. (What? On time? He signed his contract last year, so I had a lot of time to prepare. Having said that, I didn't start this e-mail until this morning, so balance has been restored to the force.)
Please join me in a not-belated welcome of Sam Smith to the Wikimedia Foundation. :-)
Take care,
terry
P.S. Because Jared is demanding we include a picture for new staff and contractors, here is one[5]:
[1] https://github.com/phuedx
[2] https://www.mediawiki.org/wiki/Growth
[3] http://www.codinghorror.com/blog/2012/06/the-php-singularity.html
[4] http://en.wikipedia.org/wiki/Lineage_(video_game)
[5] http://en.wikipedia.org/wiki/Heston_Blumenthal
https://commons.wikimedia.org/wiki/File:Heston_Blumenthal_Chef_Whites.jpg
terry chay 최태리
Director of Features Engineering
Wikimedia Foundation
“Imagine a world in which every single human being can freely share in the sum of all knowledge. That's our commitment.”
p: +1 (415) 839-6885 x6832
m: +1 (408) 480-8902
e: tchay(a)wikimedia.org
i: http://terrychay.com/
w: http://meta.wikimedia.org/wiki/User:Tychay
aim: terrychay
On Tue, Jan 21, 2014 at 2:19 PM, Randall Farmer <randall(a)wawd.com> wrote:
> > For external uses like XML dumps integrating the compression
> > strategy into LZMA would however be very attractive. This would also
> > benefit other users of LZMA compression like HBase.
>
> For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
>
> That has a 4 MB buffer, compression ratios within 15-25% of
> current 7zip (or histzip), and goes at 30MB/s on my box,
> which is still 8x faster than the status quo (going by a 1GB
> benchmark).
>
> Re: trying to get long-range matching into LZMA, first, I
> couldn't confidently hack on liblzma. Second, Igor might
> not want to do anything as niche-specific as this (but who
> knows!). Third, even with a faster matching strategy, the
> LZMA *format* seems to require some intricate stuff (range
> coding) that may be a blocker to getting the ideal speeds
> (honestly not sure).
>
> In any case, I left a note on the 7-Zip boards as folks have
> suggested: https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
>
> Thanks for the reply,
> Randall
>
>
Hi, everyone.
tl;dr: New tool compresses full-history XML at 100MB/s, not 4MB/s, with the
same avg compression ratio as 7zip. Can anyone help me test more or
experimentally deploy?
As I understand it, compressing full-history dumps for English Wikipedia and
other big wikis takes a lot of resources: enwiki history is about 10TB
unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
over a day of server time. There's been talk about ways to speed that up in
the past.[1]
It turns out that for history dumps in particular, you can compress many
times faster if you do a first pass that just trims the long chunks of text
that didn't change between revisions. A program called rzip[2] does this
(and rzip's _very_ cool, but fatally for us it can't stream input or
output). The general approach is sometimes called Bentley-McIlroy
compression.[3]
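To make the idea concrete, here's a toy sketch of that kind of
long-range matching (this is not histzip's actual code; the block size,
table handling and output format are simplified placeholders): hash
fixed-size blocks of earlier input, and when the current position
matches an earlier block, extend the match and emit a
(distance, length) copy instead of literals.

    BLOCK = 64  # fingerprint granularity; real tools use a rolling hash
                # and bound the table to a few-MB window so they can stream

    def long_range_compress(data):
        table = {}   # block contents -> position where the block was first seen
        out = []     # stream of ("lit", byte) and ("copy", distance, length)
        i = 0
        while i + BLOCK <= len(data):
            block = data[i:i + BLOCK]
            match = table.get(block)
            if match is not None:
                # An identical block occurred earlier: extend the match forward.
                length = BLOCK
                while i + length < len(data) and data[match + length] == data[i + length]:
                    length += 1
                out.append(("copy", i - match, length))
                i += length
            else:
                table[block] = i
                out.append(("lit", data[i:i + 1]))
                i += 1
        out.extend(("lit", data[j:j + 1]) for j in range(i, len(data)))
        return out

rzip and the Bentley-McIlroy paper do essentially this with a rolling
hash over a bounded window, then hand the shrunken stream to a
conventional compressor for the short-range redundancy.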
So I wrote something I'm calling histzip.[4] It compresses long repeated
sections using a history buffer of a few MB. If you pipe history XML
through histzip to bzip2, the whole process can go ~100 MB/s/core, so we're
talking an hour or three to pack enwiki on a big box. While it compresses,
it also self-tests by unpacking its output and comparing checksums against
the original. I've done a couple of test runs on last month's
full-history dumps without checksum errors or crashes. On the last full
run I did, the whole dump compressed to about 1% smaller than 7zip's
output; the exact ratios varied from file to file (I think it does
relatively better on pages with many revisions) but were within +/- 10%
of 7zip's in general.
Also, less excitingly, histzip is a reasonably cheap way to make the
daily incremental dumps about 30% smaller.
Technical data dump aside: *How could I get this more thoroughly
tested, then maybe added to the dump process, perhaps with an eye to
eventually replacing 7zip as the alternate, non-bzip2 compressor?* Who
do I talk to to get started? (I'd dealt with Ariel Glenn before, but
haven't seen activity from Ariel lately, and in any case maybe playing
with a new tool falls under Labs or some heading other than dumps
devops.) Am I nuts to even be asking about this? Are there things that
would definitely need to change for integration to be possible?
Basically, I'm trying to get this from a tech demo to something with
real-world utility.
Best,
Randall
[1] Some past discussion/experiments are captured at
http://www.mediawiki.org/wiki/Dbzip2, and some old scripts I wrote are at
https://git.wikimedia.org/commit/operations%2Fdumps/11e9b23b4bc76bf3d89e1fb…
[2] http://rzip.samba.org/
[3]
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8470&rep=rep1&t…
[4] https://github.com/twotwotwo/histzip
Hi all,
I am writing a little application to make edits on the (Italian)
Wikipedia using OAuth; in particular, I use flask_mwoauth[1].
This tool is intended to insert "wikipedia" tags in OpenStreetMap and
coordinates in Wikipedia.
== A Little Context ==
The Italian OpenStreetMap community is very eager to collaborate
with Wikipedia, especially in the last few months, since an Italian
OpenStreetMap contributor (User:Groppo) created this tool:
Wikipedia-tags-in-OSM
<http://geodati.fmach.it/gfoss_geodata/osm/wtosm/index.html>
As you can imagine from its name, this tool was born with the idea of
helping OSM users tag objects in OSM with the corresponding
wikipedia=* tag (more info about the "wikipedia" tag in OSM here[1]).
The source code (GPL) is here:
<https://github.com/simone-f/wikipedia-tags-in-osm>
The idea is based on a tool by User:Kolossos from de.wiki[2] which
does something similar, but without several features, this OAuth thing
being one of them (and, not least from my point of view, that tool is
in PHP, a language I don't know very well, so I am much happier
contributing to a Python project).
I have contributed to the tool over the past few weeks and, since the
tool also signals when an object is tagged in OSM but has no
coordinates on the Italian Wikipedia (those are the articles with a
little purple (W)), I wanted to make the process of adding coordinates
from OSM to Wikipedia easier.
Right now, if you click on the little (W), you see a popup explaining
what you have to do; here, instead, is what I have done so far, as I
described it to the Italian OSM community:
<https://lists.openstreetmap.org/pipermail/talk-it/2014-January/040981.html>
(the e-mail is in Italian but I think the screenshots are understandable)
I need to polish it and fix several little (and not-so-little)
details, but the core of the idea is there.
== The Problem ==
So for now I have two consumers registered, both on mediawiki.org.
One was for my first test and I called it test-app; it only allows
edits on mediawiki.org. The other is called wtosm and allows editing
of it.wikipedia.org.
Now I have shared the code of my contribution here (it's also in the
e-mail above, but here is the link again):
<https://github.com/CristianCantoro/wikipedia-tags-in-osm/tree/wikimap>
Of course I am not sharing the consumer token and secret, but I was
wondering: if anybody wants to contribute code to the project, how
should they proceed? As it is now, it seems that they need to register
their own consumer to get keys, which seems a little illogical as a
process to me.
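The best I can think of right now is keeping the credentials out of
the repository entirely, so that every contributor registers their own
consumer and puts the keys in environment variables or a git-ignored
config file. A minimal sketch of that pattern (the variable names and
the base_url are made up for this example, and the MWOAuth arguments
are the ones shown in the flask_mwoauth README, so please double-check
them against the version you use):

    import os
    from flask import Flask
    from flask_mwoauth import MWOAuth

    app = Flask(__name__)
    app.secret_key = os.environ["WTOSM_SECRET_KEY"]

    # Each contributor exports their own consumer's credentials;
    # nothing secret is ever committed to the repository.
    mwoauth = MWOAuth(
        base_url="https://it.wikipedia.org/w",
        consumer_key=os.environ["WTOSM_CONSUMER_KEY"],
        consumer_secret=os.environ["WTOSM_CONSUMER_SECRET"],
    )
    app.register_blueprint(mwoauth.bp)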
Do you have any idea of what is The Right Way(TM) to do this?
Thank you in advance.
Cristian
Sorry about the borked line wrapping in the previous message - I'm
resending it so you can read it properly!
----
This is a proposal to try and bring order to the messy area of interwiki
linking and interwiki prefixes, particularly for non-WMF users of
MediaWiki.
At the moment, anyone who installs MediaWiki gets a default interwiki
table that is hopelessly out of date. Some of the URLs listed there
have seemingly been broken for 7 years [1]. Meanwhile, WMF wikis have
access to a nice, updated interwiki map, stored on Meta, that is
difficult for anyone else to use. Clearly something needs to be done.
What I propose we do to improve the situation is along the lines of
bug 58369:
1. Split the existing interwiki map on Meta [2] into a "global
interwiki map", located on MediaWiki.org (draft at [3]), and a
"WMF-specific interwiki map" on Meta (draft at [4]).
Wikimedia-specific interwiki prefixes, like bugzilla:, gerrit:, and
irc: would be located in the map on Meta, whereas general-purpose
interwikis, like orthodoxwiki: and wikisource: would go to the
"global map" at MediaWiki.org.
2. Create a bot, similar to l10n-bot, that periodically updates the
default interwiki data in mediawiki/core based on the contents of
the global map. (Right now, the default map is duplicated in two
different formats [5] [6], which is quite messy.)
3. Write a version of the rebuildInterwiki.php maintenance script [7]
that can be bundled with MediaWiki, and which can be run by server
admins to pull in new entries to their interwiki table from the
global map (a rough sketch of this follows below).
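To make (2) and (3) a bit more concrete, here is a rough sketch of the
"pull" side, assuming the global map ends up as a simple two-column
wikitable on MediaWiki.org; the page title, the table format and the
output here are placeholders, not a finished design:

    import re
    import urllib.request

    # Hypothetical location of the global interwiki map, fetched as raw wikitext.
    MAP_URL = ("https://www.mediawiki.org/w/index.php"
               "?title=Interwiki_map&action=raw")

    def fetch_global_map():
        wikitext = urllib.request.urlopen(MAP_URL).read().decode("utf-8")
        entries = {}
        # Assumes table rows of the form:  | prefix || http://example.org/wiki/$1
        for prefix, url in re.findall(r"^\|\s*(\S+)\s*\|\|\s*(\S+)\s*$",
                                      wikitext, re.MULTILINE):
            entries[prefix.lower()] = url
        return entries

    if __name__ == "__main__":
        # A rebuildInterwiki-style maintenance script would take this list and
        # insert or update rows in the local interwiki table.
        for prefix, url in sorted(fetch_global_map().items()):
            print(prefix + "\t" + url)

The bot in (2) would do essentially the same thing, but write the
result into the default interwiki data shipped with mediawiki/core.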
This way, fresh installations of MediaWiki get a set of current, useful
interwiki prefixes, and they have the ability to pull in updates as
required. It also has the benefit of separating out the WMF-specific
stuff from the global MediaWiki logic, which is a win for external users
of MW.
Two other things it would be nice to do:
* Define a proper scope for the interwiki map. At the moment it is a
bit unclear what should and shouldn't be there. The fact that we
currently have a Linux users' group from New Zealand and someone's
personal blog on the map suggests the scope of the map has not been
well thought out over the years.
My suggested criterion at [3] is:
"Most well-established and active wikis should have interwiki
prefixes, regardless of whether or not they are using MediaWiki
software.
Sites that are not wikis may be acceptable in some cases,
particularly if they are very commonly linked to (e.g. Google,
OEIS)."
* Take this opportunity to CLEAN UP the global interwiki map!
** Many of the links are long dead.
** Many new wikis have sprung up in the last few years that deserve to
be added.
** Broken prefixes can be moved to the WMF-specific map so existing
links on WMF sites can be cleaned up and dealt with appropriately.
** We could add API URLs to fill the iw_api column in the database
(currently empty by default).
I'm interested to hear your thoughts on these ideas.
Sorry for the long message, but I really think this topic has been
neglected for such a long time.
TTO
----
PS. I am aware of an RFC on MediaWiki.org relating to this, but I
can't see it gaining traction any time soon. This proposal would be a
more lightweight way of dealing with the problem at hand.
[1] https://gerrit.wikimedia.org/r/#/c/84303/
[2] https://meta.wikimedia.org/wiki/Interwiki_map
[3]
https://www.mediawiki.org/wiki/User:This,_that_and_the_other/Interwiki_map
[4]
https://meta.wikimedia.org/wiki/User:This,_that_and_the_other/Local_interwi…
[5]
http://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/maintenance%2Fint…
[6]
http://git.wikimedia.org/blob/mediawiki%2Fcore.git/master/maintenance%2Fint…
[7]
https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FWikimediaMaintenanc…