Re-posting, the original seems to have been lost in cyberspace:
Magnus - I checked out your tool, but it looks like you're using a query
against the categorylinks table? Have you played with setting up a new table
for categories and fulltext indexing it? Use group_concat to get all of a
page's categories into one field, then create a fulltext index on that field.
You get much better performance than using the categorylinks table (kind of
Are you pinging a live database, or a copy made from a dump? (please excuse my
ignorance if this is common knowledge)
I'm working on dummying up a UI using the same approach (fulltext index of
categories) on wikidweb and will write back when I've got something worth
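
The denormalization step described above can be sketched roughly as follows. This is only an illustration: SQLite stands in for MySQL so the example runs anywhere, the `page_categories` table and both function names are made up, and the sample rows are invented. The MySQL-specific FULLTEXT statements appear only as comments, and the sketch filters in Python where MySQL would use the index.

```python
import sqlite3


def build_page_categories(conn):
    """Collapse categorylinks into one row per page, categories space-joined."""
    # In MySQL the aggregate would read GROUP_CONCAT(cl_to SEPARATOR ' '),
    # followed by something like:
    #   ALTER TABLE page_categories ADD FULLTEXT INDEX (categories);
    conn.executescript("""
        DROP TABLE IF EXISTS page_categories;
        CREATE TABLE page_categories AS
            SELECT cl_from AS page_id,
                   group_concat(cl_to, ' ') AS categories
            FROM categorylinks
            GROUP BY cl_from;
    """)


def pages_in_all(conn, wanted):
    """Return ids of pages that belong to every category in `wanted`."""
    # With a MySQL FULLTEXT index this would be a single query such as:
    #   SELECT page_id FROM page_categories
    #   WHERE MATCH(categories) AGAINST('+Physicists +Living_people' IN BOOLEAN MODE);
    wanted = set(wanted)
    rows = conn.execute("SELECT page_id, categories FROM page_categories")
    return sorted(pid for pid, cats in rows if wanted <= set(cats.split()))


# Invented sample data using the categorylinks column names cl_from / cl_to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?)",
    [
        (1, "Physicists"), (1, "Living_people"),
        (2, "Physicists"),
        (3, "Living_people"), (3, "Chemists"),
    ],
)
build_page_categories(conn)
```

The point of the single-field layout is that a category intersection becomes one lookup per page instead of one join against categorylinks per category.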
If the Database class weren't so MySQL-centric and all
queries were abstracted, a child database class
could, in theory, retrieve data from any type of source,
as long as it returned the format expected by the
software: SQL results.
This is all theory, though, since it's pie in the sky :)
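
As a rough illustration of that idea (not MediaWiki's actual Database class, which is PHP), here is a minimal Python sketch. Every name in it is hypothetical: callers ask an abstract interface for rows and never write SQL, so a subclass is free to fetch the rows from anywhere as long as it returns them in the expected shape.

```python
from abc import ABC, abstractmethod


class Database(ABC):
    """Hypothetical storage interface: callers request rows, not SQL."""

    @abstractmethod
    def select_revisions(self, page_title):
        """Return a list of row dicts with 'page_title', 'rev_id' and 'text'."""


class InMemoryDatabase(Database):
    """A non-SQL backend; it only has to return rows in the expected format."""

    def __init__(self, revisions):
        self._revisions = list(revisions)

    def select_revisions(self, page_title):
        return [dict(r) for r in self._revisions if r["page_title"] == page_title]


def latest_text(db, page_title):
    """Application code that works with any Database subclass."""
    rows = db.select_revisions(page_title)
    if not rows:
        return None
    return max(rows, key=lambda row: row["rev_id"])["text"]


# Invented sample rows.
db = InMemoryDatabase([
    {"page_title": "Main_Page", "rev_id": 1, "text": "first draft"},
    {"page_title": "Main_Page", "rev_id": 2, "text": "second draft"},
    {"page_title": "Sandbox", "rev_id": 3, "text": "test"},
])
```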
On Dec 4, 2008 6:22 PM, "David Gerard" <dgerard(a)gmail.com> wrote:
Peer-to-peer git repositories. Imagine a MediaWiki with the data
stored in git, and updates distributed peer-to-peer.
"Imagine if Wikipedia could be mirrored locally, run on a local
mirror, where content was pushed and pulled, GPG-Digitally-signed;
content shared via peer-to-peer instead of overloading the Wikipedia
This would certainly go some way to solving the "a good dump is all
but impossible" problem ...
(so, anyone hacked up a git backend for MediaWiki revisions rather
than MySQL? :-) )
Wikitech-l mailing list
Of interest (and related to the recent wikitech-l threads on uploading
large files) is this new Firefox extension that handles transcoding
from * to Ogg/Theora and uploading: http://firefogg.org/
Not only is this something we may wish to use, but it demonstrates the
viability of a complex uploading widget built as a Firefox extension. The
extension is GPLed, so it could serve as a starting point for a
Commons-uploader Firefox extension.
Polish Wiktionary is very interested in having an option to narrow search results to a category (a single one is enough).
I've read https://bugzilla.wikimedia.org/show_bug.cgi?id=2285. Searching within a category using the incategory: keyword works on English Wikipedia, which, as I have heard, has a new internal search engine. Is it possible for other projects (more precisely: Polish Wiktionary) to get this search engine?
Jabber id: derbeth(a)jabber.wp.pl
Wikisłownik (Polish Wiktionary) is more than a dictionary! Check it out: http://pl.wiktionary.org/
I'm also forwarding this to the wikitech-l list.
On Dec 3, 2008, at 8:46 PM, Thomas Larsen wrote:
> Hi all,
> The current <ref>...</ref>...<references/> system produces nice
> references, but it is flawed--all the text contained in a given
> reference appears in the text that the reference is linked from. For
> example:
> It was a sunny day on Wednesday<ref>David Smith. ''History of
> Wednesdays.'' History Magazine, 2019.</ref>. The next day, Thursday, was cloudy.
> == References and notes ==
> (That's a very simple example, too. References start to become a lot
> larger once they start to include other information and/or are
> produced via a template.)
> One way I could conceive of correcting the problem is to have a
> reference tag that provides only a _link_ to the note via a label and
> another type of reference tag that actually _defines_ and _displays_
> the note. For example:
> It was a sunny day on Wednesday<ref id="smith"/>. The next day,
> Thursday, was cloudy.
> == References and notes ==
> <reference id="smith">David Smith. ''History of Wednesdays.''
> History Magazine, 2019.</reference>
> This makes the raw wikitext easier to read, since the text of the
> actual reference is in the _references_ section instead of in the
> page's primary content.
> I think this could work ...
> --Thomas Larsen
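
To make the proposal quoted above concrete, here is a small sketch of how such markup might be rendered. It is not how the Cite extension works; the function name, the regular expressions and the `[n]` / `n.` output format are all assumptions made for illustration. Labels are numbered in the order they are first cited, each `<ref id="..."/>` becomes a numbered marker, and each `<reference id="...">` block is displayed where it is defined.

```python
import re

# Patterns for the two proposed tags (illustrative, not a real parser).
REF_LINK = re.compile(r'<ref id="([^"]+)"\s*/>')
REF_DEF = re.compile(r'<reference id="([^"]+)">(.*?)</reference>', re.DOTALL)


def render_references(wikitext):
    """Replace citation links with [n] and note definitions with 'n. text'."""
    numbers = {}

    def link(match):
        label = match.group(1)
        # Number each label the first time it is cited.
        numbers.setdefault(label, len(numbers) + 1)
        return "[{}]".format(numbers[label])

    def note(match):
        label = match.group(1)
        text = " ".join(match.group(2).split())  # collapse line wrapping
        if label not in numbers:
            return ""  # defined but never cited: show nothing
        return "{}. {}".format(numbers[label], text)

    linked = REF_LINK.sub(link, wikitext)
    return REF_DEF.sub(note, linked)


source = (
    'It was a sunny day on Wednesday<ref id="smith"/>. The next day was cloudy.\n'
    '<reference id="smith">David Smith. History Magazine,\n2019.</reference>'
)
rendered = render_references(source)
```

Because all links are numbered before any definition is rendered, the definitions can sit anywhere in the page, which is what keeps the long reference text out of the prose.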
From the CNET interview with Brion:
> The text alone is less 500 MB compressed.
That statement struck me, as I wouldn't think that big wikis could fit
on that, much less all wikis.
So I went and spent some CPU on calculations:
I first looked at dewiki:
$ 7z e -so dewiki-20081011-pages-meta-history.xml.7z | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' | bzip2 -9 | wc -c
325915907 bytes = 310.8 MB
Not bad for a 5.1 GB 7z file. :)
Then I moved on to enwiki, beginning with the current versions:
$ bzcat enwiki-20081008-pages-meta-current.xml.bz2 | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' | bzip2 -9 | wc -c
253648578 bytes = 241.898 MB
Again, a gigantic file (7.8 GB bz2) was reduced to less than 500 MB.
Maybe it *can* be done after all. There are many more revisions, but
the compression ratio is greater.
So I had to turn to the beast: the enwiki history files. As there
hasn't been a successful enwiki history dump in recent months, I
used an old dump I had, which is nearly a year old and takes up 18 GB.
$ 7z e -so enwiki-20080103-pages-meta-history.xml.7z | sed -n 's/\s*<text xml:space="preserve">\([^<]*\)\(<\/text>\)\?/\1/gp' | bzip2 -9 | wc -c
1092104465 bytes = 1041.5 MB = 1.01 GB
So, where did those 'less than 500MB' numbers come from? Also note that
I used bzip2 instead of gzip, so external storage will be using much
more space (plus indexes, ids...).
Nonetheless, it is impressive how much the size of *already
compressed files* is reduced just by stripping the metadata.
As a comparison, dewiki-20081011-stub-meta-history.xml.gz containing the
remaining metadata is 1.7GB. 1.7 GB + 310.8 MB is still much less than
the 51.4 GB of dewiki-20081011-pages-meta-history.xml.bz2!
Maybe we should investigate new ways of storing the dumps compressed.
Could we achieve similar gains by increasing the bzip2 window size to
counteract the noise of revision metadata?
Or perhaps I used the wrong regex, and large chunks of data were not
taken into account?
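
On the regex question: sed works line by line, and with `-n` and the `p` flag it only prints lines where the substitution matched, so for a `<text>` element that spans several lines only the line with the opening tag is counted. A rough cross-check, assuming the goal is to measure all revision text, is to parse the XML and compress the full content of each `<text>` element. The sketch below does that with Python's standard library. The function names and the sample document are invented for illustration, and it has not been run against a real dump.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_revision_text(xml_stream):
    """Yield the full content of every <text> element, multi-line ones included."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        # Dump elements carry an XML namespace, so compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] == "text":
            yield elem.text or ""
        elem.clear()  # drop the element's content once it has been handled


def bzip2_size(chunks):
    """Return the size in bytes of all chunks compressed as one bzip2 -9 stream."""
    compressor = bz2.BZ2Compressor(9)
    total = sum(len(compressor.compress(c.encode("utf-8"))) for c in chunks)
    return total + len(compressor.flush())


def to_mib(num_bytes):
    """Convert bytes to units of 2**20 bytes, as used for the figures above."""
    return num_bytes / 2**20


# Tiny made-up document in the general shape of a pages-meta-history dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><revision><text xml:space="preserve">first line
second line</text></revision>
<revision><text xml:space="preserve">another revision</text></revision></page>
</mediawiki>"""

texts = list(iter_revision_text(io.BytesIO(SAMPLE)))
size = bzip2_size(texts)
```

For a real dump, the decompressed stream (for example the standard output of `7z e -so` or `bzcat`) could be passed in as a binary file object in place of `io.BytesIO(SAMPLE)`.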
Names with non-Latin characters in the donation comments are broken
and outputting as question marks. Some people are understandably
unhappy that their names are not appearing next to their donations.
For example, see <
(Thanks to [[ja:user:Aotake]] for pointing it out in #wikimedia.)
Jesse Plamondon-Willard (Pathoschild)
As per Michael's earlier e-mail:
We're very grateful to the Stanton Foundation for this important
investment in Wikipedia's user-friendliness. We're aware of the UNICEF
research as well and we'll survey the existing improvements as part of
this project. A few points beyond the press release:
'''When will this project begin, and when will it finish?'''
The project will begin in January 2009. It will wrap up in April 2010.
'''What is its overall scope?'''
The project scope will include the following:
* user testing designed to identify the most common barriers to entry
for first-time writers, and
* a series of improvements to the MediaWiki interface, including
improvements to issues identified through user testing and a focus on
hiding complex elements of the user interface from people who don't
use them. (Specifically, we'll focus on complex syntax like templates,
references, tables, etc.)
'''What does the Wikimedia Foundation consider to be wrong with the
editing interface right now?'''
When it was first developed, MediaWiki was considered reasonably
user-friendly. At that time, software wasn't as flexible and
user-focused as it is today. It's logical that by today's standards,
MediaWiki may not seem to be as streamlined or user-friendly as other
We have never systematically examined the editing interface to determine
what kinds of challenges new contributors face, but we do know of
certain common problems. For example, many people have difficulty
creating new articles, uploading images, and editing templates,
footnotes, and tables. We hope to make improvements in those areas.
'''Who are the new contributors you are hoping to attract?'''
We are hoping to attract new contributors who are just as smart and
knowledgeable as the people who have always written for Wikipedia and
its sister projects, but who, to date, have been unable or reluctant
to participate because of the barriers posed by the interface. There
are countless individuals who read Wikipedia and would be great
writers/editors, but are daunted by complex wiki syntax. They may not
even realize that they can edit Wikipedia. They are the people we are
targeting with this project.
'''What is the nature of the interface improvements that will be made
in this project?'''
In phase 1 (until late summer 2009), we will focus on reducing or
eliminating common, simple barriers to entry. A possible example
would be, "making the edit button more visible." These will be
identified through systematic user testing, but also by surveying
existing research. In phase 2 (until early 2010), we will shift our
attention to identifying complex pieces of "wiki code" (the formatting
language used to write Wikipedia articles) and making them less
visible to first-time contributors and/or helping them achieve the
respective functionality (such as adding tables) more easily.
'''When can we expect to see the first changes to the Wikipedia interface?'''
We hope to demonstrate a first series of improvements by mid-2009,
with production deployment following shortly thereafter.
'''How can the Wikimedia volunteer community be involved in this project?'''
The project will be open and participatory throughout. Every major
report will be publicly shared, and all code will be developed through
our existing, public version control system. Volunteer developers and
testers will be encouraged to contribute throughout the process.
'''Are the positions created for this project just temporary?'''
We will allocate at least two existing, budgeted developer positions
to this project, and additional hires will be employed for the
duration of the grant.
'''Why don't these funds count towards your overall fundraising goals?'''
The majority of the funding for this project will go towards costs not
included in our 2008-09 budget. While we anticipate that the project
will offset some of our operating costs, we also want to retain
flexibility to reallocate funding inside the project budget as
'''Are you going to localize these changes in all the languages of
Wikipedia and the other projects?'''
All code will be ready for internationalization.
'''Are you going to be looking at the entire editing/contribution
process or just the software?'''
This project focuses on technical solutions, but the user testing will
aim to capture problems experienced throughout the editing process.
Deputy Director, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
After some more thought on the origins of stub articles and a
better overview of the contents of the Swedish Wikipedia, it is
clear that very few individuals are responsible for creating large
numbers of stubs, a few years back. Now, depending on religion
(mergists, deletionists...), these should either be deleted,
improved, merged or put on lists of necessary quality
improvements. Either way, it's a lot of work and it would have
been better to have stopped those individuals back then. At least
we want to stop such individuals today, so the same mistake isn't
repeated while the old mess is being cleaned up.
What we want is to foster a spirit of writing better articles,
improving the one you started, before you start the next one.
But instead of increased patrolling and speedy deletions, this
could be implemented in the MediaWiki software. If a user (logged
in or IP address) tries to create a new page, their recent
contribution history could be checked, and if any of their five
most recently created articles (except redirects) are shorter
than, say, 300 bytes, they would simply be unable to create
another article. This would be a very soft kind of blocking (as
soon as you have improved your existing article, you can start the
next one), each case being completely an affair between the user
and the software, not involving opinions of individual admins.
Such an extension (is there an "article creation hook"?) could be
fully parameterized, so each community could decide where to set
the limits (5 recently created articles, 300 bytes), and what
message to show to the user who violates these limits.
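
The rule proposed above could be sketched like this. It is only an illustration of the check itself, not a MediaWiki extension: the `CreatedPage` record and the function name are hypothetical, and the two parameters default to the values suggested in this message (5 recently created articles, 300 bytes).

```python
from dataclasses import dataclass


@dataclass
class CreatedPage:
    """Hypothetical record of a page a user created, with its current size."""
    title: str
    size_bytes: int
    is_redirect: bool = False


def may_create_page(created_pages, recent_count=5, min_bytes=300):
    """Return True if the user may start another article.

    `created_pages` lists the pages the user created, newest first.
    Redirects are skipped, and creation is refused while any of the
    `recent_count` most recent remaining pages is under `min_bytes`.
    """
    recent = [p for p in created_pages if not p.is_redirect][:recent_count]
    return all(p.size_bytes >= min_bytes for p in recent)


# Invented history: one recent stub of 120 bytes blocks further creation.
history = [
    CreatedPage("Stub about a village", 120),
    CreatedPage("Old redirect", 30, is_redirect=True),
    CreatedPage("Decent article", 2400),
]
```

Once the stub is expanded past the threshold, the same check passes again, which is the "soft block" behaviour described above.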
Has this been suggested before? Has it been implemented? Would
it be a really bad idea to suggest this?
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
We have a wonderful wiki based on MediaWiki, but our legal department
said we should include some terms and conditions in the registration
process. We tried to find an extension that provides a simple checkbox
with some text on the registration page, but we could not find anything.
We don't really want to touch the MediaWiki files, so I would prefer a
solution in the skin files or with an extension. We tried to find an
existing MediaWiki site with terms and conditions, but there does not seem to
Here are my questions:
Do you know of an extension that adds a terms and conditions
checkbox to the registration page?
Do you know of a theme that includes such a terms and conditions checkbox?
Which files do I need to touch to add such a feature? Where is the
registration page defined, and where is it processed?
Thank you very much