On Sat, Feb 23, 2008, Tim Starling <tstarling(a)wikimedia.org> wrote:
> But it seems to me, if you look at data storage software already in use,
> Lucene is much better suited for computing intersections than MySQL.
Tim, aren't you kind of the point guy for the lucene search? Would you be
up for setting up a categories index? I don't know how the update works (I
think, from what I've read, that it does a big index regeneration on some
kind of schedule, but I really don't know).
I think it could be implemented as either a separate index, or as a new
field on the current index.
I'd be happy to help, but I'm totally unfamiliar with the code, and don't
really want to set up Java on my server for testing... I've created lucene
indexes on the categories table before, but not in any way that even
approaches a production type environment. Maybe that still leaves some
opportunity to help though.
On Sun, Feb 24, 2008 at 3:48 AM, Samuel Wantman <sam(a)wantman.net> wrote:
> Fabulous! Can we do it?
Why not? You don't have to ask here to use bots. Use whatever bots
you like, and if one becomes a problem (which is rare) someone will
> If it literally can happen "right
> now", I'd actually prefer it be left turned off until us category wonks can
> generate a plan.
I get the feeling we're talking about a different feature here. The
feature to hide categories is already enabled and is not going to be
turned off (although it needs to be improved somewhat). Everything
else is user-level stuff that isn't really relevant to this list, as
far as I can tell.
Sorry for the spam-like email, but this is just an email to let you all know
about the Wikimania 2008 Call for Participation. I have included this
below, and the page can also be found on the Wikimania wiki @
http://wikimania2008.wikimedia.org/wiki/Call_for_Participation. Please do
forward this onto any local project mailing lists or anyone else who may be
interested in this. Also there are translations available on wiki.
== Call for Participation ==
Wikimania is an annual global event devoted to Wikimedia projects around the
globe (including Wikipedia, Wikibooks, Wikisource, Wikinews, Wiktionary,
Wikiversity, Wikiquote, Wikispecies, and Wikimedia Commons) and for its
editors and users to gather, meet each other, exchange ideas, and report on
research and projects. It is a community event, which is also open to the
public and to researchers. This year's conference will be held from '''July
17-19, 2008''' in Alexandria, Egypt at the new Library of Alexandria
For more information, please visit the Wikimania 2008 Home page at
We are accepting submissions for presentations, workshops, panels, posters,
open spaces, and artistic artifacts. Please carefully follow the submission
guidelines below. Submissions can be sent via the following link:
=== Important dates ===
* 1 February – 16 March : Submission
* 17 March – 30 April : Review, feedback and notification of acceptance
* 17 – 19 July 2008 : '''Wikimania'''
=== Conference Tracks ===
Submissions should address one or more of the following themes:
; Wikimedia Communities : Interesting projects and particularities within
the communities; policy creation within individual projects; conflict
resolution and community dynamics; reputation and identity;
multi-lingualism, languages and cultures; social studies. We explicitly
invite you to discuss your local Wikimedia project's community.
; Free Knowledge : Open access to information; ways to gather and distribute
free knowledge, usage of the Wikimedia projects in education, journalism,
research; ways to improve content quality and usability; copyright laws and
other legal areas that interfere with Wikimedia projects. Free Content in
; Technical infrastructure : Issues related to MediaWiki development and
extensions; Wikimedia's technical infrastructure; new ideas for development
(including case studies from other wikis or similar projects).
; Scientific track : Academic papers about massively collaborative work,
open and free content creation, community dynamics, the social or economic
aspects of the Wikimedia projects, and other topics related to Wikimedia
projects. Papers submitted to the scientific track will be peer reviewed by
a reviewing committee regarding their novelty, rigour, and estimated impact,
and accepted or rejected based on these reviews. The papers will be
published in proceedings afterwards, and depending on the number and the
quality of the submissions, a journal special issue may be pursued.
Scientific track papers must be in English, and must not exceed 7,500 words
(or 15 pages LNCS).
Your topic must be related either to the Wikimedia projects and their
communities, or to the creation of free content in general.
=== Types of Submissions ===
We are seeking submissions for
* presentations (10–30 minute talks with discussion afterwards)
* workshops/open discussions (60–120 minute session with a discussion leader
and more involvement of the audience)
* panels (group of 2-5 speakers discussing a specific subject)
* posters (printed presentations or visual displays that can stand on their
* artistic artifacts (plays, competitions, comedy, visualizations, or other
representations of some aspect of the projects)
In addition there will be the possibility to give [[lightning talks]] (5 minute
short presentations). These will be organized on the Wikimania 2008 wiki
without need to submit via the submission system.
=== Submission Guidelines ===
Wikimania is organized by volunteers, so please help us minimize wasted
effort by submitting via the submission system and following these
guidelines. All submissions MUST explicitly include the following:
# an English "Event title"
# a short English "Abstract" of your event in 50 to 100 words. The abstract
will be used for the public schedule.
# the "Track" your submission fits in best (Wikimedia Communities, Free
Knowledge, Technical infrastructure, or Scientific)
# the "Event type" (presentation, workshop, panel, poster, artistic...)
# information about the speaker (full name, email, a short description of at
least 2 sentences...)
# for submissions to the scientific track: set "Submission of paper for
proceedings" to "yes" and upload a paper instead of the "Description" below
as "Attachment". Papers must be in English, and must not exceed 7,500 words.
In addition you can add some more information like a subtitle of the
event, an image (will be resized to 128x128px) and private "Submission
notes" for reviewers and conference organisation. In particular you should
* a more detailed "Description" of your event in English or Arabic. The
description is essential for review: please give an overview of the areas to
be covered or taught. The better you describe your submission, the more
likely it will get accepted. State clearly the relevance to the Wikimedia
projects and whether submission concerns a specific wiki project. You can
also include links. The description will later be used for the public
schedule but you can edit it before.
* special requirements (such as equipment for a workshop or panel) if needed
* the language used for presentation
* whether you want to submit a paper for proceedings
* whether you want to submit presentation slides
* whether the presentation is intended to be a specific length
* the target audience you are going to reach and what previous knowledge is
* images or sketches of the poster or artistic artifact if available
* for panel submissions a suggested moderator and short biographies of each
In the "Submission notes" you should tell us whether you will attend
Wikimania (a) surely, (b) probably, (c) only if your submission is accepted,
or (d) only if we provide travel and/or accommodation. You can also add
yourself to the public list of attendees at the Wikimania 2008 wiki:
Please note that all submissions must be dual licensed under the GNU Free
Documentation License version 1.2 or later ''and'' the Creative Commons
Attribution License! By submitting for Wikimania 2008 you agree to this
For more information see the submission guidelines at
Once you are sure you have included all of the required information, please
send your submission before the respective deadline through our
== See also ==
* About the venue: http://wikimania2008.wikimedia.org/wiki/Venue
* Brainstorming page for program ideas:
* Editable list of attendees:
Hidden categories should appear in the category namespace, but not
As for being "Hidden" or "Admin", I can envision uses for both.
Admin categories could have a separate collapsible listing, while hidden
categories might have some other uses. Since we've also been discussing
the problems of implementing "Category Intersection", an interim
solution could be repopulating parent categories and "hiding"
intersection categories. Fully populated parent categories are the norm
in some projects like German Wikipedia and they also appear sometimes in
English Wikipedia (eg. Category:Operas). I have a proposal posted
currently about fully populating "Index" categories at en:Wikipedia
talk:Categorization, and it would be much improved if the intersection
categories could be hidden. The primary reason we have been deleting
intersection categories is because they clutter articles. If they
didn't clutter articles, they wouldn't be a problem.
Perhaps the non-hidden categories could be expanded with a [+] the same
way subcategories are expanded. For example, if someone is listed under
"Methodist", clicking on the plus might add the hidden categories
"American methodist" or "Methodist presidents". This would require
searching to see if any of the hidden categories are descendants of the
clicked on category. This pseudo categorization intersection system
would also be an incentive to get ready for a real implementation. For
Category Intersection to work, hundreds of categories will need to be
Along those lines, I'm wondering about yet another interim step toward
full category intersection. A while back, several of us editors on
English Wikipedia worked on a design for an interface for implementing
Category Intersection (it is at en:WP:CI). We envision check boxes
next to each category listing in an article, and then a button that
queries the intersection. If making the query were to create a hidden
category and automatically categorize all the articles that result from
the query, the next time the request is made it could just display the
results, just like any other category. There might be a timer that
resets (every week?) that would force another query to update the
category. This way each intersection query would happen fairly
infrequently -- as infrequently as need be to keep from overloading the
There would need to be a naming convention for the automatically
generated categories, perhaps using a double colon -- so the
intersection of Category:Mozart and Category:Operas would generate
Category:Mozart::Operas. I don't think we'd want these auto-generated
categories to be orphan categories. The category could be
automatically put in a maintenance category, or better yet, a child
category of each parent could be created to hold all automatically
created categories. If the category is called "Operas" this holding
category could be called "Intersections with Operas" or "Operas and..."
If the query is worth keeping it could be recategorized by an editor
(eg. Category:Operas by composer). It would probably be useful to be
able to see how often the query was requested. If intersection
categories get renamed, a category redirect should be able to get the
user (and future queries) to the correct place.
If any of these intersection queries cause problems, an administrator
could protect the category page. The next time the query is requested,
the blocked page would keep the query from being run. The user would
see the reason for the blocked query posted on the category page. This
would prevent two or more huge categories from being intersected (eg.
Category:Living People intersected with Category:Films). If the CPU
time was analyzed for each query automatically, the blocks might be able
to happen automatically.
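
A rough sketch of the scheme described above: a double-colon name, a weekly refresh timer, and protection of pages whose queries cause problems. This is only an illustration in Python, not MediaWiki code. The class, its method names, the in-memory dictionaries, and the choice to sort the category names (so that both orderings map to the same generated category) are all assumptions made for the example.

```python
import time

WEEK_SECONDS = 7 * 24 * 3600  # the "every week?" timer from the proposal


def intersection_name(*categories):
    """Build the generated name, e.g. 'Category:Mozart::Operas'.

    Sorting is an assumption: it makes Mozart+Operas and Operas+Mozart
    share one generated category.
    """
    return "Category:" + "::".join(sorted(categories))


class IntersectionCache:
    """Hypothetical cache of intersection results, keyed by generated name."""

    def __init__(self, members, ttl=WEEK_SECONDS, clock=time.time):
        self.members = members   # category name -> set of article titles
        self.ttl = ttl
        self.clock = clock
        self.cache = {}          # generated name -> (timestamp, articles)
        self.protected = {}      # generated name -> reason shown to the user
        self.queries_run = 0     # how many real intersections were computed

    def protect(self, name, reason):
        """An administrator blocks an expensive intersection."""
        self.protected[name] = reason

    def query(self, *categories):
        name = intersection_name(*categories)
        if name in self.protected:
            # The protected page keeps the query from being run.
            raise PermissionError(self.protected[name])
        now = self.clock()
        hit = self.cache.get(name)
        if hit is not None and now - hit[0] < self.ttl:
            return name, hit[1]  # still fresh: display the stored result
        articles = set.intersection(*(self.members[c] for c in categories))
        self.queries_run += 1
        self.cache[name] = (now, articles)
        return name, articles
```

With this shape, repeated requests for the same pair within the timer window cost one dictionary lookup, and the `queries_run` counter is one place where the "how often was this requested" statistic could hang.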
-- Samuel Wantman
I need to extract only the text from a Wikipedia page. I.e., I
need to remove all wiki markup, section headings etc, to extract only
the text a reader will read.
For example, for the text :
'''Paris''' ([[Help:IPA|pronounced]] /paʁi/ in French; /ˈpaɹɪs/ in
English) is the [[communes of France|capital city]] of [[France]]. It
is situated on the [[Seine|River Seine]], in northern France, at the
heart of the [[Île-de-France (region)|Île-de-France]] [[Regions of
France|region]] (aka "Paris Region"; in French: ''Région Parisienne''
or ''RP''). The City of Paris has an estimated population of 2,167,994
within its administrative limits (January 2006)."
I need to get the following after extraction:
Paris (pronounced /paʁi/ in French; /ˈpaɹɪs/ in English) is the
capital city of France. It is situated on the River Seine, in northern
France, at the heart of the Île-de-France region (aka "Paris Region";
in French: ''Région Parisienne'' or ''RP''). The City of Paris has an
estimated population of 2,167,994 within its administrative limits
Using Pywikipediabot framework, I can get the raw text, but not the
text-sans-markups. Since I need to do some textual analysis on the
article contents, I need to get rid of all the extra markups, citation
tags or other templates.
So, what is the best/easiest way to do this? Thanks in advance.
Dept of Computer Science
University of Illinois at Urbana-Champaign
201 N Goodwin Avenue
Urbana IL 61801
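
One possible starting point for the question above is a handful of regular expressions applied to the raw wikitext. The sketch below is an assumption-laden approximation, not a real wikitext parser and not part of Pywikipediabot: the function name `strip_wiki_markup` is made up for the example, and it only handles citation tags, templates, simple links, heading lines, and bold/italic quotes.

```python
import re


def strip_wiki_markup(text):
    """Approximate the reader-visible text of a wikitext string."""
    # Drop self-closing <ref .../> tags, then <ref>...</ref> citations.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop templates, innermost first, until none are left.
    while True:
        stripped = re.sub(r"\{\{[^{}]*\}\}", "", text)
        if stripped == text:
            break
        text = stripped
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)
    # Remove whole section-heading lines such as "== History ==".
    text = re.sub(r"^={2,}[^\n]*={2,}[ \t]*\n?", "", text, flags=re.MULTILINE)
    # Remove bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)
    return text
```

Known gaps in this sketch: links with more than one pipe (image links with captions, for instance), tables, HTML comments, and parser functions are left untouched, so real articles will need more rules or a proper parser.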
> We don't have to move off MySQL, we just have to use a different
> system for this one feature. That's perfectly plausible; we use
> Lucene for search.
Ah, something I actually know something about. This is the third or fourth
time, to my knowledge, that we've discussed category intersection in depth.
Last year (I think it was last year) I did a bunch of pretty extensive
testing, including running MySQL queries against the categories table using
various methods (joins, subselects, you name it) and the consensus was that
it was way too slow (queries against large categories were awful - Living
People was a test case).
So, I also loaded the categories into the cur table (I'm using an old
schema) and created a field holding all the categories with underscores for
spaces in the categories (like it appears in the url). This made MySQL's
fulltext index see the whole category as one word. This performed *much*
faster, and you could use boolean queries to get fancy.
I also created a lucene index which I queried with zend_search_lucene. This
actually performed pretty comparably to the MySQL fulltext index. It's all
in the archives somewhere. I think either of those solutions would probably
be okay, but if it's wildly popular the load might be a bit much. I didn't
get (that I recall) any really conclusive opinions from the group or the
But, based on all that, here's my suggestion: create a new lucene index of
categories using all the existing tools, and do boolean queries against
that. I think it's the path of least resistance, and the performance should
be quite acceptable (pretty much by definition).
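
To make the idea concrete, here is a toy version of what both the fulltext trick and a Lucene category index boil down to: each category becomes a single term (spaces replaced by underscores), and a boolean query is an intersection of posting sets. This is a plain-Python illustration only. It does not use MySQL, Lucene, or zend_search_lucene, and the class and method names are invented for the sketch.

```python
from collections import defaultdict


def category_token(name):
    """Join a category name with underscores so it indexes as one term."""
    return name.strip().replace(" ", "_")


class CategoryIndex:
    """Hypothetical in-memory inverted index: category term -> page titles."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_page(self, title, categories):
        for category in categories:
            self.postings[category_token(category)].add(title)

    def search(self, all_of=(), none_of=()):
        """Pages in every `all_of` category and in no `none_of` category."""
        if not all_of:
            return set()
        # Start from the smallest posting set so huge categories
        # (the Living People case) are only ever filtered, never copied.
        sets = sorted(
            (self.postings.get(category_token(c), set()) for c in all_of),
            key=len,
        )
        result = set(sets[0])
        for postings in sets[1:]:
            result &= postings
        for category in none_of:
            result -= self.postings.get(category_token(category), set())
        return result
```

A query such as `index.search(all_of=["Living people", "American film actors"])` then touches only the posting sets for those two terms, which is roughly why an index of this shape does so much better than joins over the categories table.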
On a related topic, anybody on the list mess around with clucene? I'm still
playing with it off and on... (I'm a novice at c/c++) seems like a good
choice for a high performance web based search (doesn't have the overhead of
On Fri, Feb 22, 2008 at 1:49 PM, <vasilievvv(a)svn.wikimedia.org> wrote:
> Remove "(not written yet)" text from title (introduced in r31140). It broke many bots which used title attribute of link for getting its target.
Any bot that uses title for that purpose seems to me to be seriously
broken. Isn't that exactly what href is for? Normalizing href is not
I've got a parser extension (using $wgExtensionFunctions) which needs to insert real square brackets within an anchor's href attribute in the HTML output page. The problem is that all my square brackets within the href attributes get converted to %5B and %5D when I return my HTML to the Parser. How can I suppress this behaviour of MediaWiki?
Thanks a lot
If you are interested in why I would like to do such a thing, here's an example where it's necessary to use square brackets in href attributes.
A few requests regarding implementation of i18n for the extension Collection:
* please use wfLoadExtensionMessages which takes care of proper fallback and is used by most of the extensions in the repo
* please use a unique message ID prefix to avoid conflicts with message IDs of other extensions (e.g. "coll-")
If you want I can make those changes for you. Please let me know.
Also, please create an entry for your account 'jojo' in http://svn.wikimedia.org/viewvc/mediawiki/USERINFO/ so more personal contact is possible.
From: mediawiki-cvs-bounces(a)lists.wikimedia.org [mailto:firstname.lastname@example.org] On behalf of jojo(a)mayflower.knams.wikimedia.org
Sent: Friday, 22 February 2008 11:24
Subject: [MediaWiki-CVS] SVN:  trunk/extensions
Date: 2008-02-22 10:24:20 +0000 (Fri, 22 Feb 2008)
--- trunk/extensions/Collection/Collection.i18n.php (rev 0)
+++ trunk/extensions/Collection/Collection.i18n.php 2008-02-22 10:24:20 UTC (rev 31180)
@@ -0,0 +1,105 @@
+ 'collection' => 'Collection',
+ 'collections' => 'Collections',