I want to crawl around 800,000 flagged revisions from the German
Wikipedia, in order to make a dump containing only flagged revisions.
For this, I obviously need to spider Wikipedia.
What are the limits (especially rate limits) here, what User-Agent
should I use, and what other caveats do I need to watch out for?
PS: I already have a list of revisions, created on the Toolserver. I
used the following query: "select fp_stable,fp_page_id from
flaggedpages where fp_reviewed=1;". Is it correct that this gives me a
list of all articles with flagged revisions, with fp_stable being the
revid of the most recent flagged revision of each article?
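For reference, here is a minimal sketch of the kind of fetch loop I have
in mind (the User-Agent string is a placeholder, and I'm assuming a file
"revids.txt" with one fp_stable value per line):

    import json
    import time
    import urllib.parse
    import urllib.request

    API = "https://de.wikipedia.org/w/api.php"
    # Placeholder User-Agent; use something descriptive with contact details.
    UA = "FlaggedRevsDump/0.1 (contact: you@example.org)"

    def fetch(revids):
        """Fetch the content of a batch of revisions in one API request."""
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "revisions",
            "rvprop": "ids|content",
            "format": "json",
            "maxlag": 5,  # back off when the servers are lagged
            "revids": "|".join(str(r) for r in revids),
        })
        req = urllib.request.Request(API + "?" + params,
                                     headers={"User-Agent": UA})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    revids = [int(line) for line in open("revids.txt") if line.strip()]
    for i in range(0, len(revids), 50):   # the API takes revids in batches
        data = fetch(revids[i:i + 50])
        # ... append data["query"]["pages"] to the dump file here ...
        time.sleep(1)                     # be polite: roughly one request/second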
On enwiki, the secure server (i.e. secure.wikimedia.org) is currently
listed as using: 22.214.171.124–126.96.36.199
It seems unlikely that the server really uses or needs such a large range.
In addition, we received a report that 188.8.131.52 is operating as
a TOR exit node. Since Wikipedia policy is to prohibit anonymous editing
and account creation from TOR nodes, it would be nice to clarify this.
In the getInitialPageText function in SpecialUpload.php, hardcoded headings
are added to the license and file description provided by the user. It
would be great if those hardcoded headings could be changed to a
MediaWiki message that can be altered on-wiki. On Commons, which is
multilingual, for example, the headings are more information clutter
than useful structuring tools. The file description heading will always
be in the user's language and the license heading always in the content
language, English (I have no idea why they differ, but that's what the
source code does). That adds more confusion than benefit, especially
since both the description and the license are wrapped in templates by
default and thus don't need headings to be distinguishable.
It would be nice if that could be fixed.
I also want to bring my last message about "localize transcluded image
description pages" back to mind. My proposal is easy to implement,
uncontroversial (so I think), and it would provide a big gain in
usability. At least it's relatively easy to implement for somebody with
commit access, which I don't have. I would appreciate it if somebody
could change the code. Thank you.
Quick note -- we're now testing Werdna's AbuseFilter extension on
test.wikipedia.org. AbuseFilter will make it easier for on-wiki admins
to set up automatic detection and tagging in response to common suspect
edit patterns.
A lot of this kind of filtering is being done in client-side bot tools
today, but it can be hard to coordinate what's going on, and responses
are usually limited to heavy-handed reversions... By building the
filters into the wiki, actions can range from simply tagging an edit for
human attention to emergency desysopping, depending on what's appropriate.
* Docs: http://www.mediawiki.org/wiki/Extension:AbuseFilter
* Take a peek: http://test.mediawiki.org/wiki/Special:AbuseFilter
Currently, sysops can define filter rules on Test Wikipedia, but with
some limitations on what the system can do in response:
* Tagging with visibility in RecentChanges/History/Contribs/etc. isn't
implemented yet (it needs some support in MediaWiki core that hasn't been
written yet).
* Filter-triggered blocks, range blocks, and removal from groups are
disabled, since we don't want people going crazy just yet. ;)
Werdna will be polishing up the capabilities and interface of
AbuseFilter over the next couple of weeks ... in response
to your help testing and providing feedback. Go check it out! :D
With all the discussion on foundation-l about contributors and
attribution, I have noted that while there are two different
implementations for blaming MediaWiki articles, neither of them seems to
be available. There are some example results, but not the tools themselves.
The implementations I am aware of are:
* Roman Nosov's (svn user roman) blamemap extension (2006-2007), which was
* Greg Hewgill's wikiblame (2008)
Is the code available and I have just missed it? Do we have any other
implementations?
I don't think it would be a _bad_ idea to support server-side
transcoding; it of course gives more flexibility to keep the original
file and then lets us target different output formats in the future. It
would also let us support camera video uploads, etc.
But there are logistical issues. It adds a bit of complexity and cost to
the server-side setup. Additionally, we are interested in working with
archive.org, which already offers free transcoding to Ogg from arbitrary
uploaded formats for freely licensed content. They have 2100+
transcode/storage CPU units and petabytes of storage. Commons has on the
order of 40 TB of storage, and all of Wikimedia's (already busy) servers
together are around 400 units ... It makes sense to encourage long-form
video contributions to be supported via a partnership with archive.org,
especially once we have them integrated as an archive provider.
Firefogg ideally is not "complex" for end users. It's a one-click
extension install; the user does not have to know anything about the
encoding settings, so the settings are identical to what we would
request server-side.
Using an extension also lets us control the upload system, so we can
have it upload in 1 MB chunks, for example. That way we can improve
usability around multi-hundred-megabyte POST uploads by giving progress
indicators, supporting resumed uploads, etc.
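To make the idea concrete, here is a rough sketch of a chunked upload
loop (this is not Firefogg's actual protocol; the endpoint and the
offset parameter are made up for illustration):

    import os
    import urllib.request

    CHUNK = 1024 * 1024  # 1 MB per POST
    UPLOAD_URL = "https://example.org/chunked-upload"  # hypothetical endpoint

    def upload(path):
        size = os.path.getsize(path)
        offset = 0
        with open(path, "rb") as f:
            while offset < size:
                chunk = f.read(CHUNK)
                req = urllib.request.Request(
                    UPLOAD_URL + "?offset=%d" % offset,
                    data=chunk,
                    headers={"Content-Type": "application/octet-stream"},
                )
                urllib.request.urlopen(req)  # server acknowledges the chunk
                offset += len(chunk)
                print("uploaded %d of %d bytes" % (offset, size))

Each chunk is a small, independently retried POST, so the client can show
real progress and resume from the last acknowledged offset instead of
restarting a multi-hundred-megabyte request.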
> Would it be worth providing a simple http-upload to a server-side transcoder
> for these relatively small files that are low-quality to begin with?
Yes, I would support that effort; I'm just focused on the Firefogg stuff
right now. If you have time to push forward on this, we can try to get
something set up.
> wouldn't it be more efficient to let
> an infrastructure like the one I created encode _all_ versions used for
> streaming, whether for desktops or mobile devices, from a single
> archival-quality upload?
Yes, it may well be better to just upload the HQ version and have the
server do the transcode. Your transcode infrastructure could be very
useful for that, but we will have to see how the logistical issues
mentioned above play out.
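As a rough sketch of what the server-side step could look like (this
assumes the ffmpeg2theora command-line tool is installed; the paths are
placeholders):

    # Sketch of a server-side transcode step run from a job queue after
    # the original upload has been stored.
    import subprocess

    def transcode_to_ogg(src, dst):
        """Derive an Ogg Theora file from a high-quality original."""
        subprocess.check_call(["ffmpeg2theora", src, "-o", dst])

    transcode_to_ogg("/srv/uploads/original.mov", "/srv/transcoded/original.ogv")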
Through a message on another list, I found that when one tries to
reach Wikipedia (or at least wikipedia-en) specifying the User-Agent
as "Python-urllib/1.17", the server gives a "403 Forbidden" response,
together with the content of the page.
1. Why is this User-Agent getting this response? If I remember
correctly, this block was installed in the early days of the
pywikipediabot, when Brion wanted to block it because it had a
programming error causing it to fetch each page twice (sometimes even
more?). If that is the actual reason, I see no reason why it should
still be active years later.
2. If this User-Agent is really to be blocked, why do we still provide
the content of the page that is forbidden?
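For anyone who wants to reproduce this, a quick sketch comparing the
blocked User-Agent with a descriptive one (the page chosen is arbitrary):

    import urllib.error
    import urllib.request

    URL = "https://en.wikipedia.org/wiki/Main_Page"

    for ua in ("Python-urllib/1.17", "MyTool/0.1 (contact: you@example.org)"):
        req = urllib.request.Request(URL, headers={"User-Agent": ua})
        try:
            resp = urllib.request.urlopen(req)
            print(ua, "->", resp.getcode(), len(resp.read()), "bytes")
        except urllib.error.HTTPError as e:
            # the 403 response still carries a body, which is the point at issue
            print(ua, "->", e.code, len(e.read()), "bytes")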
André Engels, andreengels(a)gmail.com
A few months ago I successfully downloaded the November 2006 HTML version of Wikipedia (about 6 GB, expanding to 90 GB) and the October 2008 xml.bz2 file (4.1 GB, converted to a 7.1 GB Wikitaxi file).
I have just downloaded the June 2008 HTML version in .tar.7z format and extracted it into .tar format (14.3 GB to 230 GB). I now have no idea what to do next. I ran WinRAR on it, and it gave up after more than 6 million files.
1. How do I actually access all this information? I use the Wikitaxi version, but only the HTML version allows access to, for instance, categories, so the latest version would be useful.
2. Is there any way to recompress it to a reasonable size such that I can still access it without it occupying nearly all my disk?
3. Or, failing that, is there any way to access the original .tar.7z file, as BzReader can access .xml.bz2 files?
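One way to at least read individual pages out of the 230 GB .tar without
unpacking all six million files is Python's tarfile module in streaming
mode. This is only a sketch; the file names are made up and would need
to match the dump's actual layout:

    # Read a single page out of the big .tar without extracting everything.
    # Streaming mode ("r|") avoids building an index of all the members.
    import tarfile

    DUMP = "wikipedia-en-html.tar"            # placeholder file name
    WANTED = "articles/e/x/a/Example.html"    # placeholder member path

    with tarfile.open(DUMP, mode="r|") as tar:
        for member in tar:
            if member.name == WANTED:
                page = tar.extractfile(member).read().decode("utf-8", "replace")
                print(page[:500])
                break

As far as I know, the .tar.7z itself cannot be read with random access
the way BzReader reads .xml.bz2 files, so it would still need to be
decompressed to the plain .tar first.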