> Revision: 48735
> Author: midom
> Date: 2009-03-24 10:44:24 +0000 (Tue, 24 Mar 2009)
>
> Log Message:
> -----------
> change limit to reflect one in interface. :)
>
> Modified Paths:
> --------------
> trunk/phase3/includes/specials/SpecialRecentchanges.php
>
> Modified: trunk/phase3/includes/specials/SpecialRecentchanges.php
> ===================================================================
> --- trunk/phase3/includes/specials/SpecialRecentchanges.php 2009-03-24 09:59:13 UTC (rev 48734)
> +++ trunk/phase3/includes/specials/SpecialRecentchanges.php 2009-03-24 10:44:24 UTC (rev 48735)
> @@ -55,7 +55,7 @@
> $this->parseParameters( $parameters, $opts );
> }
> - $opts->validateIntBounds( 'limit', 0, 5000 );
> + $opts->validateIntBounds( 'limit', 0, 500 );
> return $opts;
> }
Was this necessary for performance reasons? A lot of people were using recent changes lists longer than 500 entries; some wikis even offered them as options in the RC interface (see http://hu.wikipedia.org/wiki/MediaWiki:Recentchangestext for example). If it was only changed for aesthetic purposes, please change it back, or make it a site option.
>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>> The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along since 10/15/2008.
> The current dump system is not sustainable on very large wikis and
> is being replaced. You'll hear about it when we have the new one in
> place. :)
> -- brion
Following up on this thread: http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
Brion,
Can you offer any general timeline estimates (weeks, months, 1/2
year)? Are there any alternatives to retrieving the article data
beyond directly crawling
the site? I know this is verboten but we are in dire need of
retrieving this data and don't know of any alternatives. The current
estimate of end of year is
too long for us to wait. Unfortunately, Wikipedia is a favored source for students to plagiarize from, which makes out-of-date content a real issue.
Is there any way to help this process along? We can donate disk
drives, developer time, ...? There is another possibility
that we could offer but I would need to talk with someone at the
wikimedia foundation offline. Is there anyone I could
contact?
Thanks for any information and/or direction you can give.
Christian
Occasionally I visit Ohloh.net to satisfy my stats addiction.
One of the things Ohloh analyses in the source code is license
information[1]. On the Ohloh MediaWiki page[2] an analysis summary is
displayed. It contains the following warnings (number of files added by me
from [1]):
# Mozilla Public License 1.0 may conflict with GPL (253 files)
# PHP License may conflict with GPL (7 files)
# Apache Software License may conflict with GPL (1 file)
# Artistic License may conflict with GPL (7 files)
# Common Development and Distribution License may conflict with GPL (1 file)
# Apache License 2.0 may conflict with GPL (7 files)
I am wondering if any of these warnings really point to a licensing issue. If they do, I think we need to pursue this and get it sorted out. Can anyone shed some light on this?
Cheers! Siebrand
[1] http://www.ohloh.net/p/mediawiki/analyses/latest
[2] http://www.ohloh.net/p/mediawiki
Allow me to forward to the list non-subscriber Suresh's reply:
>>>>> "S" == Suresh Ramasubramanian <suresh(a)hserus.net> writes:
S> I'm not particularly short of disk space or memory, thanks. But as
S> Dan mentions, it does sound like a needless waste - and the volume
S> of dud entries is certainly going to scale far higher up when you
S> try it on, say, wikipedia.org or mediawiki.org
S> srs
I asked my pal about his small wiki
http://www.hserus.net/wiki/index.php/Main_Page .
He has even more of those rows, revolving uselessly on his disks...
>>>>> "S" == Suresh Ramasubramanian <suresh(a)hserus.net> writes:
S> Interesting
mysql> SELECT COUNT(*) FROM archive WHERE ar_namespace = 8 AND ar_user_text ='MediaWiki default';
S> +----------+
S> | COUNT(*) |
S> +----------+
S> | 1796 |
S> +----------+
mysql> SELECT COUNT(*) FROM logging WHERE log_namespace = 8 AND log_comment = 'No longer required';
S> +----------+
S> | COUNT(*) |
S> +----------+
S> | 1638 |
S> +----------+
Gentlemen, if your personal MediaWiki wiki has been around since early 2007, you might want to clean out the thousands of MediaWiki: namespace rows that were left in the database by maintenance/deleteDefaultMessages.php. Wouldn't it make you feel good to clean out thousands of wasted rows, leaving behind, e.g. on a small wiki, perhaps just a few hundred rows that are actually related to us? I don't know why the design decision was made to just leave those MediaWiki: namespace items sitting in the archive and text tables. But OK, we proceed to clean them out by hand. I hope I got this right:
$ mysqlshow --count myDatabase > before.txt
$ mysql myDatabase
SELECT COUNT(*) FROM archive WHERE ar_namespace = 8 AND ar_user_text = 'MediaWiki default';
COUNT(*)
1518
DELETE FROM archive WHERE ar_namespace = 8 AND ar_user_text = 'MediaWiki default';
$ php purgeOldText.php --purge
Purge Old Text
Searching for active text records in revisions table...done.
Searching for active text records in archive table...done.
Searching for inactive text records...done.
1518 inactive items found.
Deleting...done.
$ mysql myDatabase
SELECT COUNT(*) FROM logging WHERE log_comment = 'No longer required' AND log_namespace = 8;
COUNT(*)
1510
SELECT MIN(log_timestamp),MAX(log_timestamp) FROM logging WHERE log_comment = 'No longer required' AND log_namespace = 8;
MIN(log_timestamp) MAX(log_timestamp)
20070226185326 20070226194040
DELETE FROM logging WHERE log_comment = 'No longer required' AND log_namespace = 8;
$ mysqlshow --count myDatabase|diff before.txt -|sed '/|/!d'
< | archive | 15 | 2206 |
> | archive | 15 | 688 |
< | logging | 10 | 2597 |
> | logging | 10 | 1087 |
< | text | 3 | 4466 |
> | text | 3 | 2948 |
Hi,
after reading the following sections:
http://wikitech.wikimedia.org/view/Data_dump_redesign#Follow_up
http://en.wikipedia.org/wiki/Wikipedia_database#Dealing_with_compressed_fil…
http://meta.wikimedia.org/wiki/Data_dumps#bzip2
http://www.mediawiki.org/wiki/Mwdumper#Usage
http://www.mediawiki.org/wiki/Dbzip2#Development_status
and skimming the January, February and March archives of this year (all of
which may be outdated and/or incomplete, and then I'll sound like an
idiot), I'd like to say the following:
** 1. If the export process uses dbzip2 to compress the dump, and dbzip2's MO is to compress input blocks independently and then bit-shift the resulting compressed blocks (= single-block bzip2 streams) back into a single multi-block bzip2 stream, so that the resulting file is bit-identical to what bzip2 would produce, then the export process wastes CPU time. bunzip2 can decompress concatenated bzip2 streams, so in exchange for a small size penalty, the dumper could just concatenate the single-block bzip2 streams, saving a lot of cycles.
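A quick way to see this behaviour on the command line (just a toy sketch with made-up file contents, not the real dumper pipeline):

$ printf 'part1\n' | bzip2 > chunks.bz2    # first single-block stream
$ printf 'part2\n' | bzip2 >> chunks.bz2   # second stream, simply appended
$ bunzip2 -c chunks.bz2                    # stock bunzip2 reads through both streams
part1
part2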
** 2. If dump.bz2 were single-block, many-stream (as opposed to the current many-block, single-stream layout), then people on the importing end could speed up *decompression* with pbzip2.
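The import might then look something like this (a sketch only; it assumes pbzip2 is installed, that dump.bz2 really is multi-stream, and that importDump.php is fed the XML on stdin):

$ pbzip2 -dc -p4 dump.bz2 | php maintenance/importDump.php

pbzip2 only parallelizes the decompression of multi-stream files, which is exactly why the stream layout matters here.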
** 3. Even if dump.bz2 stays single-stream, *or* it becomes multi-stream *but* is available only from a pipe or socket, decompression can still be sped up by way of lbzip2 (which I wrote, and am promoting here). Since it's written in strict adherence to the Single UNIX Specification, Version 2, it's available on Cygwin too, and should work on the Mac.
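For instance, something along these lines (a sketch; the exact flags depend on the lbzip2 version, so please check its README):

$ cat dump.bz2 | lbzip2 -d | php maintenance/importDump.php

lbzip2 distributes the block-header search and the decompression over all cores even though the input arrives on stdin.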
Depending on the circumstances (number of cores, availability of dump.bz2 as a regular file or just a pipe, etc.), different bunzip2 implementations are best.
For example, on my dual-core desktop, even
7za e -tbzip2 -so dump.bz2
performs best in some cases (which -- I guess -- parallelizes the different stages of the decompression).
For my more complete analysis (with explicit points on (my imagination of)
dbzip2), please see
http://lists.debian.org/debian-mentors/2009/02/msg00135.html
** 4. Thanassis Tsiodras' offline reader, available under
http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
uses, according to the section "Seeking in the dump file", bzip2recover to split the bzip2 blocks out of the single bzip2 stream. The page states:

    This process is fast (since it involves almost no CPU calculations)

While this may be true relative to other dump-processing operations, bzip2recover is, in fact, not much more than a huge single-threaded bit-shifter, which even makes two passes over the dump. (IIRC, the first pass shifts over the whole dump to find bzip2 block delimiters, then the second pass shifts the blocks found previously into byte-aligned, separate bzip2 streams.)
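For reference, the bzip2recover step that the offline reader relies on looks roughly like this (a sketch; the output names follow bzip2recover's rec#####<name> pattern, block 42 is just an example, and you need enough free disk space for the per-block files):

$ bzip2recover dump.bz2          # two passes over the whole file; writes rec00001dump.bz2, rec00002dump.bz2, ...
$ bunzip2 -c rec00042dump.bz2    # later: decompress a single block (at most 900 kB of text at the default block size) on demand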
Since lbzip2's multiple-workers decompressor distributes the search for
bzip2 block headers over all cores, a list of bzip2 block bit positions
(or the separate files themselves) could be created faster, by hacking a
bit on lbzip2 (as in "print positions, omit decompression").
Or dbzip2 itself could enable efficient seeking in the compressed dump by
saving named bit positions in a separate text file.
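Such an index could be as simple as a tab-separated text file (a purely hypothetical format, shown only to illustrate the idea):

# page title <TAB> bit offset of the bzip2 block containing it
Aardvark	271828182
Abacus	314159265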
-o-
My purpose with this mail is two-fold:
- To promote lbzip2. I honestly believe it can help dump importers. I'm
also promoting, with obviously less bias, pbzip2 and 7za, because in some
decompression situations they beat lbzip2, and I feel their usefulness
isn't emphasized enough in the links above. (If parallel decompression for
importDump.php and/or MWDumper is a widely solved problem, then I'm sorry
for the noise.)
- To ask a question. Can someone please describe the current (and planned)
way of compressing/decompressing the dump? (If I'd had more recent info on
this, perhaps I wouldn't have bothered the list with this post. I'm also
just plain curious.)
Thanks,
lacos
http://phptest11.atw.hu/
http://lacos.web.elte.hu/pub/lbzip2/
Hi there,
I was asked to set up a MediaWiki, but some strange (though, in their case, obvious) requests have been made, and I'm not sure what the best integration approach is. Here they are briefly:
- They have a list of authors and animators and would like to sort them in a category by last name first. All the articles currently created begin with the first name followed by the last name. I realised that there is a magic word called {{DEFAULTSORT:}} which lets you define a page's sort value. The problem is, I'm not sure how to display the inverted name on the category page. http://meta.wikimedia.org/wiki/Help:Category doesn't tell me much beyond the default sort value.
- The wiki is going to be based on a massive bibliographical system. I've searched high and low for good bibliography extensions, but they pretty much do not do what we want. What I've decided to do is have the bibliography entries listed on each article under a Bibliography header. I am writing an extension which finds all pages with such a header and parses all the bibliographical entries, which are listed in MLA format. I'm wondering if there is a much better way of doing this, because otherwise I have to consolidate duplicate entries with a lot of regexes (ack).
That's about it for now, there are other minor things, but before I bother
the mailing list I'm going to do my research and figure them out with the
help files. For now these are the only things I can't seem to solve on my
own.
Regards,
David
We're a mentoring organization for the Google Summer of Code again this
year, and we're dead set on making it our awesomest summer ever!
One key thing though is making sure that students and potential students
have access to a mentor who can answer their questions and just help
steer them into becoming an active member of our development community.
If you're an experienced MediaWiki developer and would like to help out
with selecting and mentoring student projects, please give us a shout!
We'll take you even if you live in the southern hemisphere. ;)
We need folks who'll be available online fairly regularly over the summer and are knowledgeable about MediaWiki -- not necessarily knowing every piece of it, but knowing where to look so you can help the students help themselves.
If you're interested, don't forget to apply soon! Student submissions close next week and we'll need to start selecting then...
http://socghop.appspot.com/org/apply_mentor/google/gsoc2009
-- brion