Okay, it's not quite done, and it's still really crude, but starting to take
shape - I've got some basic intersections functionality running on Wikidweb
- I hacked skin.php and added links to the special intersections page. The
intersections are using a MyISAM fulltext index. I'm not using 'boolean
mode' queries, as this seems to give more interesting results.
(look at any article page at http://wikidweb.com)
The UI on the special page itself is really ugly and needs lots of work, and
once this is all done it will have to be ported to an up-to-date version of
MediaWiki (I'm way down rev), but *it's a start*.
Comments, suggestions, criticisms, and offers to help all welcome.
Best Regards,
Aerik
--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!
(feel free to bash me if we had this variant already, I couldn't find
it in the list archives)
Task: On German Wikipedia (yay atomic categories!), find women who
were born in 1901 and died in 1986.
Runtime : Toolserver, <2 sec
Query:
SELECT * FROM (
  SELECT page_title, count(cl_to) AS cnt
  FROM page, categorylinks
  WHERE page_id = cl_from
    AND cl_to IN ( "Frau", "Geboren_1901", "Gestorben_1986" )
  GROUP BY cl_from
) AS tbl1
WHERE tbl1.cnt = 3;
Trying to "poison" the query by also looking in all GFDL images
("GFDL-Bild", ~60K entries in category) increases runtime to 3 sec.,
so not that bad.
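
For anyone who wants to try the counting trick without Toolserver access, here is a minimal sketch using Python's bundled sqlite3 instead of MySQL. It uses HAVING in place of the outer SELECT, which should be equivalent. The table and column names follow the query above; the function names and the three demo page titles are made up for illustration.

```python
import sqlite3


def pages_in_all_categories(conn, categories):
    """Return titles of pages that are in every one of `categories`."""
    wanted = sorted(set(categories))
    if not wanted:
        return []
    placeholders = ",".join("?" for _ in wanted)
    sql = f"""
        SELECT page_title
        FROM page
        JOIN categorylinks ON page_id = cl_from
        WHERE cl_to IN ({placeholders})
        GROUP BY page_id, page_title
        HAVING COUNT(DISTINCT cl_to) = ?
        ORDER BY page_title
    """
    # A page qualifies only if it matched as many distinct categories as we asked for.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


def build_demo_db():
    """In-memory toy database; the page titles are invented placeholders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    """)
    conn.executemany(
        "INSERT INTO page VALUES (?, ?)",
        [(1, "Anna_Beispiel"), (2, "Berta_Muster"), (3, "Carl_Test")],
    )
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "Frau"), (1, "Geboren_1901"), (1, "Gestorben_1986"),
            (2, "Frau"), (2, "Geboren_1901"),
            (3, "Geboren_1901"), (3, "Gestorben_1986"),
        ],
    )
    return conn


if __name__ == "__main__":
    db = build_demo_db()
    print(pages_in_all_categories(db, ["Frau", "Geboren_1901", "Gestorben_1986"]))
```

COUNT(DISTINCT ...) guards against a page being linked to the same category twice; whether that can happen on a real wiki is not something this sketch checks.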
I've implemented this as a tool now:
http://toolserver.org/~magnus/category_intersection.php
Queries seem to take a little longer there (2-4 sec) compared to the
command line.
Articles on en.wikipedia with "1905 births" and "1967 deaths" took <0.4 sec.
OTOH, looking for images on Commons in "GFDL" and "Buildings in
Berlin" took ~2min. Might be the giant GFDL category, or the
toolserver, or both. I'll try to fiddle with it some more utilising
cat_pages/cat_files.
Magnus
Just a quick check - I did the BBC Radio 4 Today show about the
[[:en:Virgin Killer]] blocking. Both presenters asked afterwards how
to get their crappy articles fixed - I said "email
info(a)wikimedia.org". Bet you they email info(a)wikipedia.org - does that
address redirect in the obvious and sensible fashion? Same for
info-en@, etc? I realise these addresses are deprecated, but you know
people are going to go there first.
- d.
Magnus - I checked out your tool, but it looks like you're using a query
against the categorylinks table? Have you played with setting up a new
table for categories and fulltext indexing it? Use group_concat to get all
of a page's categories into one field, then create a fulltext index on that
field. You get much better performance than using the categorylinks table
(kind of weird, eh?)
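
A rough sketch of the denormalising step, again with Python's sqlite3 as a stand-in for MySQL. The table name page_categories, the column name cats and the helper functions are assumptions for illustration only. SQLite has no MyISAM-style fulltext index, so the index itself appears only as a comment, and a plain Python token check stands in for the fulltext match.

```python
import sqlite3


def build_category_text_table(conn):
    """Create page_categories(page_id, cats): one row per page, names space-separated."""
    conn.executescript("""
        DROP TABLE IF EXISTS page_categories;
        CREATE TABLE page_categories AS
            SELECT cl_from AS page_id, group_concat(cl_to, ' ') AS cats
            FROM categorylinks
            GROUP BY cl_from;
    """)
    # On MySQL the next step would be something like
    #   ALTER TABLE page_categories ADD FULLTEXT (cats);
    # with lookups going through MATCH (cats) AGAINST ('...').
    # That part is not reproduced in this sketch.


def pages_with_all(conn, categories):
    """Stand-in for the fulltext match: pages whose cats field holds every wanted name."""
    wanted = set(categories)
    rows = conn.execute("SELECT page_id, cats FROM page_categories")
    return sorted(pid for pid, cats in rows if wanted <= set(cats.split(" ")))


def demo_connection():
    """In-memory toy categorylinks table with three invented pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "Frau"), (1, "Geboren_1901"), (1, "Gestorben_1986"),
            (2, "Frau"), (2, "Geboren_1901"),
            (3, "Geboren_1901"), (3, "Gestorben_1986"),
        ],
    )
    return conn


if __name__ == "__main__":
    db = demo_connection()
    build_category_text_table(db)
    print(pages_with_all(db, ["Frau", "Geboren_1901", "Gestorben_1986"]))
```

Splitting on spaces works here only because the stored category names use underscores rather than spaces.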
Are you pinging a live database, or a copy made from a dump? (please excuse
my ignorance if this is common knowledge)
I'm working on dummying up a UI using the same approach (fulltext index of
categories) on wikidweb and will write back when I've got something worth
looking at...
Best Regards,
Aerik
Dear all,
I am an administrator, bureaucrat and checkuser on the Romanian
Wikipedia. I have contacted the current owner of domain wikipedia.ro
asking him to consider donating the domain to Wikimedia Foundation, Inc.
(there is no local chapter in Romania). He's considering the option, but
in the meantime he has offered to allow us use of the domain -- in
other words, he asked for the appropriate nameserver IP addresses he
should associate with that domain (obviously, that should lead to the
content currently served at ro.wikipedia.org). Could that be arranged?
If so, please provide the respective IP addresses so I can pass them on
to him.
On a different note, should he agree to donate the domain outright to
the WMF, can the WMF take ownership, or is that against any policy?
Best regards,
Gutza
http://advogato.org/article/994.html
Peer-to-peer git repositories. Imagine a MediaWiki with the data
stored in git, and updates distributed peer-to-peer.
"Imagine if Wikipedia could be mirrored locally, run on a local
mirror, where content was pushed and pulled, GPG-Digitally-signed;
content shared via peer-to-peer instead of overloading the Wikipedia
servers."
This would certainly go some way to solving the "a good dump is all
but impossible" problem ...
(so, anyone hacked up a git backend for MediaWiki revisions rather
than MySQL? :-) )
- d.
On Wed, 03 Dec 2008 17:05:39 +0100, Roan Kattouw <roan.kattouw(a)home.nl>
wrote:
>
>
> Daniel Schwen schreef:
> > So how does this take care of deep indexing non-atomic categories?
> >
> Err.. what? Please explain what you mean by that.
I think he means finding stuff that's already buried in sub-sub categories
when you query on a parent category. For example, querying for an
intersection of [[Category:Deceased people]] and [[Category:Presidents of
the United States]] won't find the guys listed in [[Category:Deceased
Presidents of the United States]] without recategorizing those entries.
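
To make the problem concrete, here is a small sketch of the difference between a flat intersection and one that first walks down the subcategory tree. The two dictionaries (subcats, members), the function names and the depth limit are all assumptions for illustration; nothing here reflects how MediaWiki or any extension actually does it.

```python
from collections import deque


def with_subcategories(root, subcats, max_depth=5):
    """Return `root` plus every category reachable below it, up to max_depth levels."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in subcats.get(cat, ()):
            if child not in seen:  # category graphs can contain cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return seen


def deep_intersection(cat_a, cat_b, subcats, members, max_depth=5):
    """Pages found under both categories when subcategories are included."""
    def pages_under(root):
        pages = set()
        for cat in with_subcategories(root, subcats, max_depth):
            pages |= set(members.get(cat, ()))
        return pages

    return pages_under(cat_a) & pages_under(cat_b)


# Toy data: the combined category sits below both parents.
subcats = {
    "Deceased people": ["Deceased Presidents of the United States"],
    "Presidents of the United States": ["Deceased Presidents of the United States"],
}
members = {
    "Deceased people": ["Ada Lovelace"],
    "Presidents of the United States": ["Living_President_Placeholder"],
    "Deceased Presidents of the United States": ["Abraham Lincoln"],
}

if __name__ == "__main__":
    flat = set(members["Deceased people"]) & set(members["Presidents of the United States"])
    print(flat)  # nothing is filed directly in both parents
    print(deep_intersection("Deceased people", "Presidents of the United States",
                            subcats, members))
```

The depth limit matters on a real wiki, since walking far enough down a large category tends to pull in unrelated pages.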
>
> > =>How will this extension be even remotely useful for let's say commons?
> >
> Without addressing Commons in particular, having an efficient way to get
> pages in the intersection of multiple categories would allow wikis to
> delete a category such as [[Category:Deceased Presidents of the United
> States]] and replace it by, say, [[Intersection:Deceased Presidents of
> the United States]], which would list all articles in
> [[Category:Deceased people]] and [[Category:Presidents of the United
> States]]. My extension alone doesn't make that possible, but it makes
> implementing such a feature considerably easier.
> > This discussion is far from over. The basic problems are _not_ solved.
> >
> Would you care to elaborate on what those unsolved problems are?
I thought we were 90% of the way there when you wrote this extension, having
reasonably solved the efficiency (speed) issues with the fulltext and Lucene
based approaches, and the view of the atomic categories problem was that it
would be solved by people, not tech. In other words, I thought we all
assumed that once people were empowered with category intersections, they'd
make categories that make use of them. If not, then that's a problem to
solve, but not an obstacle to implementing category intersection. My input
would be to implement intersections, see what happens, and look at other
functionality for intersections v.2.
>
> > I'm sure this thread will die out soon.
> > Half of the participants will again be soothed by the promise of some easy
> > solution just barely beyond the horizon, while the half that realizes that
> > said solution _cannot possibly work_ without a radical reform of the category
> > system will again be too annoyed (I'm getting there already) to continue
> > discussing.
> It would be nice if you didn't judge people as naive right away.
>
Seconded.
But it sounds like maybe those of us who'd like to see this happen should
discuss a UI (or several) for it. I was thinking the most intuitive
interface was a sort of "browse" type function, where for any given group
of categories (could just be one category), you have two result sets:
related categories (other categories of pages in the starting category),
and articles at the intersection of the group. The articles are what we
generally think of, but the related categories gives us an intuitive way to
navigate through category intersections.
The articles in the group of categories are the problem we've already solved
(mostly): they are the result from the fulltext or Lucene search. The
related categories problem is harder, I think, as the most obvious way to
get to that is to get all the categories belonging to those articles, and
then collapse them and rank them. For large result sets, this can get time
consuming again, and we would not want to (I think) build the related
categories only with the first page of results. OTOH... if we took the
first 100 results of a given category intersection, then queried the
categorylinks table for all the categories belonging to those articles, and
collapsed that... that would be a pretty good estimate of the related
categories. It wouldn't give all of them, but it would be a nice set of
sample data.
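
A minimal sketch of that sampling idea, using Python's sqlite3 in place of MySQL. The function name, the sample_size parameter and the ordering by page id are assumptions; a real implementation would take the sample from whatever order the search backend returns.

```python
import sqlite3
from collections import Counter


def related_categories(conn, categories, sample_size=100):
    """Estimate related categories from the first `sample_size` intersection hits."""
    wanted = sorted(set(categories))
    if not wanted:
        return []
    marks = ",".join("?" * len(wanted))
    # Step 1: a bounded sample of pages that are in every wanted category.
    sample = [row[0] for row in conn.execute(
        f"""SELECT cl_from FROM categorylinks
            WHERE cl_to IN ({marks})
            GROUP BY cl_from
            HAVING COUNT(DISTINCT cl_to) = ?
            ORDER BY cl_from
            LIMIT ?""",
        (*wanted, len(wanted), sample_size),
    )]
    if not sample:
        return []
    # Step 2: every category of the sampled pages, collapsed and ranked by frequency.
    page_marks = ",".join("?" * len(sample))
    rows = conn.execute(
        f"SELECT cl_to FROM categorylinks WHERE cl_from IN ({page_marks})", sample
    )
    counts = Counter(cat for (cat,) in rows if cat not in wanted)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def demo_connection():
    """In-memory toy categorylinks table; category names are single letters."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT)")
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?)",
        [
            (1, "A"), (1, "B"), (1, "X"), (1, "Y"),
            (2, "A"), (2, "B"), (2, "X"),
            (3, "A"), (3, "Z"),
            (4, "B"), (4, "Y"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(related_categories(demo_connection(), ["A", "B"]))
```

The counts describe the sample only, so a rare related category can be missed entirely, which is the trade-off described above.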
What do you think?
Onto a soap box for a minute: the fact that this topic hasn't died in 4
years tells me that it's a really needed feature. Once implemented it
will give people a great tool to more efficiently find information. Looking
at things that are happening around the web with tags, Google adopting ideas
from Wikia search, semantic web stuff, I'm thinking that we are really at
the beginning of a movement to add structured metadata to information on the
net. In concert with all the wonderful algorithms that try to guess what a
given web page is about, we are doing things to explicitly state what a web
page is about, providing users a much better chance at being able to find
it. Developing category intersections for Wikipedia would be a milestone in
that movement.
Aerik
On Wed, Dec 3, 2008 at 8:46 PM, Thomas Larsen <larsen.thomas.h(a)gmail.com> wrote:
> Hi all,
>
> The current <ref>...</ref>...<references/> system produces nice
> references, but it is flawed--all the text contained in a given
> reference appears in the text that the reference is linked from. For
> example:
[snip]
> One way I could conceive of correcting the problem is to have a
> reference tag that provides only a _link_ to the note via a label and
> another type of reference tag that actually _defines_ and _displays_
> the note. For example:
[snip]
That's a lot like what we used to do, the problem is that references
were *constantly* orphaned, scrambled, etc. The references were often
nonsense.
My view is that the current behavior is bad mostly because it makes the
text very hard to read in the edit box: you get this wall of meaningless
markup.
Instead I propose: Have JavaScript mediate the edit box so that inline
references are converted to little red [R] text; moving your cursor
into the [R] area by clicking or arrow-keying causes it to expand to
display the full reference. You can add references by simply typing
them like normal and then they'll collapse when you navigate away, or
you can press some "insert reference" button that pops up a dialog
that asks for the relevant information which then types the completed
reference for you.
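
The collapse/expand step could be prototyped along these lines. This is only a sketch of the text transform, written in Python for easy testing; the real thing would live in browser JavaScript and would need cursor handling that is not shown. The function names and the numbered [R1], [R2] markers (instead of a bare [R]) are assumptions made so the round trip is unambiguous.

```python
import re

# Matches <ref ...>...</ref> pairs; self-closing <ref name="x"/> tags are left alone.
# Simplification: attribute values containing "/" or ">" are not handled.
REF_RE = re.compile(r"<ref\b[^>/]*>.*?</ref>", re.DOTALL)
MARKER_RE = re.compile(r"\[R(\d+)\]")


def collapse_refs(wikitext):
    """Replace each inline reference with a short marker; return (text, stored refs)."""
    refs = []

    def stash(match):
        refs.append(match.group(0))
        return f"[R{len(refs)}]"

    return REF_RE.sub(stash, wikitext), refs


def expand_refs(collapsed, refs):
    """Put the stored references back in place of their markers."""
    def restore(match):
        index = int(match.group(1)) - 1
        # Leave unknown markers untouched rather than guessing.
        return refs[index] if 0 <= index < len(refs) else match.group(0)

    return MARKER_RE.sub(restore, collapsed)


if __name__ == "__main__":
    text = ('Water is wet.<ref name="a">Example source A</ref> '
            'Fire is hot.<ref>Example source B</ref>')
    short, stored = collapse_refs(text)
    print(short)
    print(expand_refs(short, stored) == text)
```

One known gap: article text that already contains something like "[R1]" would be mistaken for a marker on expansion, so a real script would need a marker that cannot occur in wikitext.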
This type of hiding could also be applied to other common inline
markup and dramatically improve usability.
This type of edit box mediation has been done by other edit-helper
userscripts, so it's certainly possible.
Thoughts?