G'day All,
There's a link suggesting tool I'm temporarily putting out there for you all to have a play with and to give feedback and comments on.
What it does is it takes an article of your choosing from the English Wikipedia, and suggests bits of text in that article that could potentially be linked. You can then accept or reject those individual suggestions, and then save your changes back to the Wikipedia.
It tries to do this in a reasonably pleasant UI, where you see the list of suggestions, and then simply select "yes", "no", or "don't know" for each suggestion, and click "Preview with Added Links".
Quick overview of the UI:
* On the landing page, you type the name of the article that you want to see links for. It should appear in the list as you type (this bit uses suggestion searching).
* Press enter or click the relevant link, then wait for up to 10 seconds for it to fetch the current version and suggest links, and then you'll be presented with a list of possible links that you can make.
* To go through the list, you can either use the mouse, or you can use the keyboard.
* For the keyboard, the keys are: Up arrow, Down arrow, "y" for yes, "n" for no, and "s" or "d" for skip/don't know.
* "Yes" adds the link, "no" doesn't, and "don't know" doesn't add the link either; but "Don't know" will make the exact same link suggestion in future, whereas "yes" and "no" bring closure in that the same suggestion will no longer be made for that page in future.
* If you don't make any choice for a suggestion, that's treated the same as choosing "don't know".
* Each suggestion has a link that opens in a new tab/window, so if you want to determine whether something is an appropriate link or not, you can just click its link.
If you want to play with it now, it's at: http://can-we-link-it.nickj.org/
Some caveats to be aware of:
* Currently only works for the English Wikipedia. Although I haven't tried it yet, conceptually similar languages like French should probably work (i.e. left-to-right, spaces between words to separate out ideas [no or quite limited compound words], generally the same characters used in both article text and article names for the same idea, etc). No idea if this can be made to work for languages which differ substantially from this.
* This site will disappear in a few days. It's just a temporary experiment to see what happens, and is currently running on a development box which has other duties to perform.
* Super-alpha status. It may blow up, eat your homework, key your vehicle, trash your favourite article, etc.
* The tool will work much better if you have JavaScript turned on, and the front page won't work at all if you have JavaScript turned off.
* It's SLOOOW (e.g. might take 7 seconds to generate suggestions for a 32 KB page). It doesn't inherently have to be slow, but it is currently: partially because it's behind a DSL link, but mostly because it's not very efficient yet. I'd rather put out an early version with some rough edges and slowness than wait until getting something perfect (which I may never get around to doing).
* Currently the suggested links will include links to disambiguation pages. It shouldn't really do this (i.e. disambig pages should ideally be excluded from the results).
* Saving suggestions back to the Wikipedia is a less than optimal process. Currently it goes to an intermediate page, which saves the user's choices to a local database, and then uses a JavaScript form submission to transfer the user to a preview on the Wikipedia with links added; ideally this intermediate step could be skipped. Also it would be nicer to go to "Show Changes" rather than show a preview, but that's not possible currently because Show Changes is protected by an edit token, so you'll have to manually click the "Show Changes" button if you want to see a highlight of what's been changed (a request that this be changed has been logged as bug #7369).
* Someone reported that they saw the "null edit summary" detector complain at them when using this; not sure why this happens, as there is a default edit summary supplied.
Other things:
* It has "learning from its mistakes" functionality, in that a suggestion which is regularly rejected will no longer be suggested. The current cut-offs are that a suggestion must be rejected at least 5 times, and also rejected 50% or more of the time; once this threshold has been crossed the suggestion will no longer be made. Thus, the bad cruft should hopefully be progressively filtered out as the tool is used more, and what remains should hopefully be mostly useful.
* There are some hidden switches, which you can add to the URL that shows the link suggestions, if you want to fiddle with stuff:
** The first is to add "&exhaustive" to the end of the URL, in which case it will stop trying to be smarter about suggesting links based on grammatical structure (e.g. by excluding single word links), and will be exhaustive about showing you the links it finds. This will result in roughly 4 times as many links being found.
** The second switch is that you can specify the number of characters to include in the "context" before and after the suggested link. The default is 60 characters, but you can set this to anything between 0 and 100 characters inclusive, such as by adding "&context=20" to the URL for 20 characters of context.
** Lastly you can specify to just check the wiki syntax. It performs some very simplistic checks on the wiki syntax automatically, that are all about balance (e.g. checks number of [[ equals number of ]] and so forth), and if an article's syntax looks invalid then it'll tell you what's wrong, but deliberately won't give you the link suggestions until you've fixed the syntax on the Wikipedia :-) However, if you don't want link suggestions, and only want syntax checking, then tack "&onlyCheckSyntax" onto the URL.
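The "learning from its mistakes" cut-offs described above can be sketched roughly as follows. This is only an illustration in Python, not the tool's actual PHP: the function name and the choice to compute the rejection rate over yes-plus-no votes are assumptions, while the two thresholds (at least 5 rejections, 50% or more) come from the description.

```python
MIN_REJECTIONS = 5        # from the post: rejected at least 5 times
MIN_REJECTION_RATE = 0.5  # from the post: rejected 50% or more of the time


def is_suppressed(yes_votes: int, no_votes: int) -> bool:
    """Return True once a suggestion has crossed both rejection cut-offs.

    Assumption: the rate is no / (yes + no). "Don't know" answers are not
    votes either way, so they are simply not passed in.
    """
    total = yes_votes + no_votes
    if total == 0:
        return False
    return no_votes >= MIN_REJECTIONS and no_votes / total >= MIN_REJECTION_RATE
```

Requiring both conditions means a handful of early "no" votes cannot kill a suggestion that most people accept, and a suggestion with only a few votes in total is never dropped.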
I also want to give a big thank you to Julien Lemoine for writing his Suggestion Search daemon / server, which this tool uses (or rather, abuses) in a rather cruel way to determine what's a valid article name and what's not :-) Also the front page uses a modified version of his web form to help you find the right page that you want to suggest links for.
All the best, Nick.
On 21/09/06, Nick Jenkins nickpj@gmail.com wrote:
There's a link suggesting tool I'm temporarily putting out there for you all to have a play with and to give feedback and comments on. If you want to play with it now, it's at: http://can-we-link-it.nickj.org/
That's fantastic! I just tried it on [[Xenu]] and it's IMO improved the article.
One caveat: quite a lot of the suggested links were of random phrases to pop-culture articles on e.g. album titles of that name. These are not so useful for articles that aren't pop-culture.
Also: it took a reload to get the PHP pages to load.
But this is a really nice tool.
- d.
Hi All,
Thank you all for the great feedback! It is appreciated and welcome.
I'll combine all the responses related to this into one email, to make it easier to read:
One caveat: quite a lot of the suggested links were of random phrases to pop-culture articles on e.g. album titles of that name. These are not so useful for articles that aren't pop-culture.
Yes: Films, songs, albums, plays, and books are all culprits because sometimes artists do enjoy misappropriating phrases from common usage.
Here's a hit parade of some of the worst offenders, showing the links that would be suggested for a completely unlinked bit of text: "Look [[To The West|to the west]]! That guy [[In the Middle|in the middle]] [[In A Car|in a car]], [[he is]] [[A Single|a single]]. [[Of Course|of course]] [[this time]] he [[Has Been|has been]] [[On the Cover|on the cover]] of Vogue."
The bad news is that every single one of those suggestions is completely useless, as they're all about songs or albums or other gumpf. The good news however is that they were all on the way out (and half of them had already been eliminated based on people's votes), but to help speed things along I've now given the rest of them the heave-ho.
So, with progressive use the annoying pop-culture ones should drop out.
Also: it took a reload to get the PHP pages to load.
Hmmm, I thought I was the only one seeing that, and only intermittently. I'm really not sure why that's happening, although it doesn't seem to be coming from my local PHP or Apache, as Apache doesn't register anything in the error logs or access logs when it happens. My best guess is that either the DNS is slightly screwy, or that my ISP is doing something with a proxy that's interfering (they're big on transparent proxying, and originally I wasn't sending any headers to prevent caching, and that error did happen previously when I was setting it up, so _maybe_ something is caching that original error).
I dread the misuse this tool will have... simply because there will be some who use it to link everything that has a matching article.. which is clearly a bad idea..
Well, like everything else on the Wikipedia it requires some degree of judgement as to what's appropriate to include in an article. If people are daft about it, then those edits should be reverted, just like any other bad edit.
Don't use pipe links for [[plural word]]s.
Sounds good, that's been added now, and it should transform any proposed link variant of the form [[X|Y]], where X is a substring of Y, and where X and Y contain an equal number of spaces, into [[X]]rest-of-Y.
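As a rough illustration of the [[X|Y]] to [[X]]rest-of-Y rule just described, here is a minimal Python sketch (the tool itself is PHP, and this is not its code). The function name is hypothetical, and it assumes "X is a substring of Y" means "Y starts with X, ignoring the case of the first letter", which is how MediaWiki treats titles.

```python
def collapse_piped_link(target: str, label: str) -> str:
    """Render [[target|label]] as [[target]]suffix when the rule applies.

    Rule from the post: target is contained in label and both contain the
    same number of spaces. Assumption: "contained" means label starts with
    target, ignoring the case of the first letter.
    """
    piped = f"[[{target}|{label}]]"
    # An empty target must never be collapsed, or the result would be [[]].
    if not target or len(label) < len(target):
        return piped
    if target.count(" ") != label.count(" "):
        return piped
    head, suffix = label[:len(target)], label[len(target):]
    same_first = head[0].lower() == target[0].lower()
    same_rest = head[1:] == target[1:]
    if not (same_first and same_rest):
        return piped
    # Keep the label's own capitalisation inside the brackets.
    return f"[[{head}]]{suffix}"
```

For example, `collapse_piped_link("Aircraft carrier", "aircraft carriers")` gives `[[aircraft carrier]]s`, while a target with a different number of spaces is left as a piped link. The sketch does not check that the suffix consists only of letters, which a stricter version would probably want.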
Why not have "don't know" selected for each item when the page loads - it will make things clearer, and then you only need to change the ones you are sure about.
Good idea, that's more explicit about what's going on. It has been changed to this now.
First of all, it seems to do a string search-and-replace for the *first* instance of each string. That's no good: it tried replacing "The '''[[C (programming language)|C programming language]]''' is a very widely used programming language" with "The '''[[C ([[programming language]])|C programming language]]''' is a very widely used programming language" rather than "The '''[[C (programming language)|C programming language]]''' is a very widely used [[programming language]]".
Guilty as charged. :-)
When it looks for things to link (suggester.php), it was finding and suggesting this:
"The '''[[C (programming language)|C programming language]]''' is a very widely used programming language"
                                                                                     ^^^^^^^^^^^^^^^^^^^^
... however the bit that actually does the linking (post.php) was just using a replace-first-instance approach, which would link on this:
"The '''[[C (programming language)|C programming language]]'''"
             ^^^^^^^^^^^^^^^^^^^^
It'll transmit the offset to start at now, which will avoid the above problem, and should be fine as long as people don't start getting into rapid conflicting-edit situations, involving articles with duplicated bits of text which they said yes to link to, some of which is enclosed in wiki syntax and some of which isn't. If this turns out to be a problem the whole article can just be reparsed, but I'll start with the simpler approach and revise if needed.
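The difference between the replace-first-instance approach and the offset-based one can be sketched like this. It is an illustrative Python version only; the function names are made up, and the real suggester.php / post.php code may pass and validate the offset differently.

```python
def link_first_instance(text: str, phrase: str) -> str:
    """The old behaviour: link whichever occurrence happens to come first."""
    return text.replace(phrase, f"[[{phrase}]]", 1)


def link_at_offset(text: str, phrase: str, offset: int) -> str:
    """Link the occurrence of phrase that starts exactly at offset.

    Returns the text unchanged if the phrase is not found at that offset,
    for example because the article was edited in the meantime.
    """
    if offset < 0 or not text.startswith(phrase, offset):
        return text
    end = offset + len(phrase)
    return f"{text[:offset]}[[{phrase}]]{text[end:]}"
```

With the sentence from the bug report, `link_first_instance` produces the broken `[[C ([[programming language]])|...` form, whereas `link_at_offset` called with the offset of the final "programming language" links only that occurrence. Checking the text at the offset before replacing is what makes a stale offset harmless rather than destructive.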
It suggested linking "source code" to "[[Source Code|source code]]", for instance, rather than just "[[source code]]".
Good catch, fixed now - Thank you.
One improvement would be to change the edit summary since external links aren't rendered in edit summaries anyway - either just show the link as plain text (instead of wikitext) or make a page on the wiki where you can explain it since internal links will work.
Good idea, and done - I've added a user subpage ([[:en:User:Nickj/Can We Link It]]) with the overview information on it, and so now people can read about it and then get to the tool's page with 2 clicks from the edit summary, rather than having to copy and paste the URL.
Is the code for this available anywhere, and is it available under the GPL?
Yes and yes. It's at: http://files.nickj.org/MediaWiki/suggest-links.zip
A couple of other smaller fixes/tweaks, in addition to the above:
* Section linking should work now if it's unhappy about wiki syntax (i.e. links to "page#Section", rather than "page&section=section")
* In the page's <title>, replaced the underscores with spaces.
* On the suggester output page, added a header that says "<h1>Link Suggestions for:<a href='page_name'>page name</a></h1>"
* Added a link back to the landing page when there are no suggestions, or the article name given did not exist, or all suggestions were rejected.
* Updated some of the tables' columns from latin1 encoding to utf8 encoding to prevent a MySQL "Illegal mix of collations" error from occurring.
* Made syntax checking ignore seemingly mismatched ''' and '' occurring on the same line, as this can be valid syntax. E.g.: ''France'''s, or ''''77'''
* Updated TcpQuery backend to be running the new 0.44 version.
All the best, Nick.
2006/9/22, Nick Jenkins nickpj@gmail.com:
Sounds good, that's been added now, and it should transform any proposed link variant of the form [[X|Y]], where X is a substring of Y, and where X and Y contain an equal number of spaces, into [[X]]rest-of-Y.
Something must have gone wrong there - it now turns it into [[]]rest-of-Y..., see http://en.wikipedia.org/w/index.php?title=2005&diff=77172402&oldid=7... - and look for 'gas cylinder', 'aircraft carrier' or 'sodomy law'.
Why not have "don't know" selected for each item when the page loads - it will make things clearer, and then you only need to change the ones you are sure about.
Good idea, that's more explicit about what's going on. It has been changed to this now.
If I understand correctly, "Don't know" actually means "Don't link this, but don't let my decision count as a vote against linking in the future".
It suggested linking "source code" to "[[Source Code|source code]]", for instance, rather than just "[[source code]]".
Good catch, fixed now - Thank you.
I had a similar one, [[Second in Command|second in command]] instead of [[second-in-command|second in command]] ([[second in command]] is a redirect to that second one).
Something must have gone wrong there - it now turns it into [[]]rest-of-Y..., see http://en.wikipedia.org/w/index.php?title=2005&diff=77172402&oldid=7...
- and look for 'gas cylinder', 'aircraft carrier' or 'sodomy law'.
Thanks for the heads up, and sorry about that: it was a silly copy-and-paste error in the last thing I changed. I have fixed the tool and the 2005 page now.
If I understand correctly, "Don't know" actually means "Don't link this, but don't let my decision count as a vote against linking in the future".
Exactly. It doesn't link it, and doesn't vote for it or against it. Basically it does nothing, and affects no future link suggestions, until a user explicitly chooses either yes or no, and clicks the "preview with added links" button.
I had a similar one, [[Second in Command|second in command]] instead of [[second-in-command|second in command]] ([[second in command]] is a redirect to that second one).
That should hopefully be okay now, I think.
I.e. this test text:
had a similar one, second in command instead of
Now suggests this:
had a similar one, [[Second-in-command|second in command]] instead of
a suggestion might be to link differently if the words in question are in a different style from surrounding text (ie italics vs non-italics) or in some kind of quotation marks.
Well it won't suggest linking anything enclosed in italics or any other form of wiki text (e.g. {{something linkable}}, [[something else linkable]], ''something else linkable'', etc are all out).
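One way to get the "leave anything inside wiki markup alone" behaviour is to mark the protected spans first and then discard any candidate that overlaps one. The following Python sketch is an assumption about how that could be done, not a description of the tool's code; the regex is deliberately simplistic (single-line, no nesting) and the names are hypothetical.

```python
import re
from typing import List

# Spans to leave alone: [[links]], {{templates}}, and text wrapped in two or
# more apostrophes (italics or bold). Nesting and multi-line templates are
# not handled in this sketch.
PROTECTED = re.compile(r"\[\[.*?\]\]|\{\{.*?\}\}|'{2,}.*?'{2,}")


def linkable_offsets(text: str, phrase: str) -> List[int]:
    """Return start offsets of phrase that do not overlap protected markup."""
    protected = [m.span() for m in PROTECTED.finditer(text)]
    offsets = []
    start = text.find(phrase)
    while start != -1:
        end = start + len(phrase)
        overlaps = any(start < p_end and end > p_start
                       for p_start, p_end in protected)
        if not overlaps:
            offsets.append(start)
        start = text.find(phrase, start + 1)
    return offsets
```

Given text such as "''precompiled headers'', a system ... Building the precompiled header", only the second, unformatted "precompiled header" is returned as a candidate.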
It's debatable whether italics or bolded text should be linked, but my gut feeling is that if the writer thought there was something special about that phrase then we should probably assume we should leave it alone.
Adding links for things in quotation marks is possible, although I'm happy to see mostly dubious links just get voted off the island...
Can you make it so that changing any of the radio buttons doesn't lose their keyboard focus in Firefox? Otherwise it's extremely hard, to the point of being impossible, to use this properly.
I'm assuming that you're probably using tab + left-arrow / right-arrow for navigating around the page and changing radio button choices?
The trade-off is that the up arrow and down arrow keys are currently being used to allow moving the selected/highlighted row up and down; however, Firefox also binds those keys to changing the selected radio button if a radio button has the focus. In other words, without removing the focus, pressing up arrow and down arrow will both change the current row, AND change the selected radio button (which is very annoying, and that's why it explicitly drops the focus).
So I can stop fiddling with the focus, if we lose the up-arrow / down-arrow key functionality (or shift it to different keys). Would that work okay for you?
preg_replace("/\[\[([$linkprefixchars]*)([^]|]+)([$linktrailchars]*)|$1]]/", "$1[[$2]]$3", $text), where $linkprefixchars and $linktrailchars are grabbed from the appropriate language file. (Regex is untested and might contain typos and/or other errors, but you probably get the point: try to use the exact matching technique that MediaWiki itself uses.)
Cool, I'll have a look at using this, although to a certain extent it's currently largely independent of MediaWiki (i.e. there may be no appropriate language file to grab). And if it gets it wrong and suggests [[Aircraft carrier|aircraft carriers]], it's probably not as good as [[aircraft carrier]]s, but it's not a catastrophe.
Does it already tag the edit comment with "[via Suggestor]"?
Yes, it supplies a default edit summary of: "Adding a few internal links from a [[User:Nickj/Can We Link It|link suggesting tool]]".
All the best, Nick.
On 9/22/06, Nick Jenkins nickpj@gmail.com wrote:
It's debatable whether italics or bolded text should be linked, but my gut feeling is that if the writer thought there was something special about that phrase then we should probably assume we should leave it alone.
I don't think so. New terms that are being introduced are often italicized, for instance, and those are often pretty useful to link (since they'll tend to be specialized terms that will likely be unfamiliar to the reader). Likewise, not infrequently a subset of the (bolded) initial mention of the article title will be worth linking, like '''criticisms of the [[C programming language]]'''.
Plus, you have to account for the fact that bold and italics are probably used in different ways in different languages. A few centuries back it was common to italicize all sorts of odd words in English, like (IIRC) nationalities.
Cool, I'll have a look at using this, although to a certain extent it's currently largely independent of MediaWiki (i.e. there may be no appropriate language file to grab).
Well, there's always a language file to grab; if nothing else, it will default to English. If you mean that your thingy might not have all language files handy, well, the GPL's free for a reason, right? :) Just package them all, or the appropriate subsets of them. I assume there's some way to easily check a wiki's content language.
And if it gets it wrong and suggests [[Aircraft carrier|aircraft carriers]], it's probably not as good as [[aircraft carrier]]s, but it's not a catastrophe.
True, but no reason not to be perfectionist.
On 9/22/06, Nick Jenkins nickpj@gmail.com wrote:
The bad news is that every single one of those suggestions is completely useless, as they're all about songs or albums or other gumpf. The good news however is that they were all on the way out (and half of them had already been eliminated based on people's votes), but to help speed things along I've now given the rest of them the heave-ho.
I haven't tried this tool, but a suggestion might be to link differently if the words in question are in a different style from surrounding text (ie italics vs non-italics) or in some kind of quotation marks. Thus, "Prague is in the middle of Europe": no link. "The song "In the Middle" was crap": link. etc.
Steve
On 9/22/06, Nick Jenkins nickpj@gmail.com wrote:
Don't use pipe links for [[plural word]]s.
Sounds good, that's been added now, and it should transform any proposed link variant of the form [[X|Y]], where X is a substring of Y, and where X and Y contain an equal number of spaces, into [[X]]rest-of-Y.
Better solution: preg_replace("/\[\[([$linkprefixchars]*)([^]|]+)([$linktrailchars]*)|$1]]/", "$1[[$2]]$3", $text), where $linkprefixchars and $linktrailchars are grabbed from the appropriate language file. (Regex is untested and might contain typos and/or other errors, but you probably get the point: try to use the exact matching technique that MediaWiki itself uses.)
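Since the regex above is flagged as untested, here is a small working sketch of the same idea in Python rather than PHP. It covers only the link-trail half (no link prefix), and it hard-codes lowercase ASCII letters as the trail characters; that class is an assumption standing in for whatever the language file would supply.

```python
import re

# Assumption: an English-style link trail of lowercase ASCII letters. The
# real character classes would come from the per-language files mentioned
# above.
LINK_TRAIL_CHARS = "a-z"

PIPED_WITH_TRAIL = re.compile(
    r"\[\[([^\]|]+)\|\1([" + LINK_TRAIL_CHARS + r"]+)\]\]"
)


def collapse_trailing_letters(text: str) -> str:
    """Rewrite [[X|Xtrail]] as [[X]]trail for every such link in text."""
    return PIPED_WITH_TRAIL.sub(r"[[\1]]\2", text)
```

The backreference `\1` requires the label to begin with exactly the link target, so `[[aircraft carrier|aircraft carriers]]` becomes `[[aircraft carrier]]s`, while a link whose label differs in any other way (including first-letter case) is left untouched.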
On Fri, Sep 22, 2006 at 11:54:12PM +1000, Nick Jenkins wrote:
I dread the misuse this tool will have... simply because there will be some who use it to link everything that has a matching article.. which is clearly a bad idea..
Well, like everything else on the Wikipedia it requires some degree of judgement as to what's appropriate to include in an article. If people are daft about it, then those edits should be reverted, just like any other bad edit.
Does it already tag the edit comment with "[via Suggestor]"?
Cheers, -- jra
On 9/21/06, Nick Jenkins nickpj@gmail.com wrote:
There's a link suggesting tool I'm temporarily putting out there for you all to have a play with and to give feedback and comments on.
[snip]
I dread the misuse this tool will have... simply because there will be some who use it to link everything that has a matching article.. which is clearly a bad idea..
In any case, a suggestion: Don't use pipe links for [[plural word]]s.
"Nick Jenkins" nickpj@gmail.com wrote in message news:NABBJGECPILFNLFEAPIKKELOIOAA.nickpj@gmail.com...
- "Yes" adds the link, "no" doesn't, and "don't know" doesn't add the link
either; but "Don't know" will make the exact same link
suggestion in future, whereas "yes" and "no" bring closure in that the
same suggestion will no longer be made for that page in
future.
- If you don't make any choice for a suggestion, that's treated the same
as choosing "don't know".
Why not have "don't know" selected for each item when the page loads - it will make things clearer, and then you only need to change the ones you are sure about.
- Mark Clements (HappyDog)
Very neat. I tried it on [[C programming language, criticism]] and I think it built the web in very useful and interesting ways. While arguably linking "computer scientist" and "programming language" was a bit excessive, certainly someone with a more tenuous grasp of C would benefit from linking things like "pointer" and "null character" (even if they could never fully get the criticism just from reading Wikipedia), and articles like "High and low level (description)", "Type I and type II errors", and "classic example" (!) either have potential or are already interesting.
Now, the bugs. :) First of all, it seems to do a string search-and-replace for the *first* instance of each string. That's no good: it tried replacing "The '''[[C (programming language)|C programming language]]''' is a very widely used programming language" with "The '''[[C ([[programming language]])|C programming language]]''' is a very widely used programming language" rather than "The '''[[C (programming language)|C programming language]]''' is a very widely used [[programming language]]".
Second, there were a few odd suggestions. It suggested linking "source code" to "[[Source Code|source code]]", for instance, rather than just "[[source code]]". And in the string "''precompiled headers'', a system where declarations are stored in an intermediate format that is quick to parse. Building the precompiled header", it suggested replacement of the second instance of "precompiled header", not the first (presumably confused by the apostrophes -- although in this case the first bug I mentioned canceled this out :P).
All in all, a very neat tool. Here's my diff (I did tweak the output slightly, but these changes were basically all suggested): http://en.wikipedia.org/w/index.php?title=C_programming_language%2C_criticis...
Nick Jenkins wrote:
It tries to do this in a reasonably pleasant UI, where you see the list of suggestions, and then simply select "yes", "no", or "don't know" for each suggestion, and click "Preview with Added Links".
Can you make it so that changing any of the radio buttons doesn't lose their keyboard focus in Firefox? Otherwise it's extremely hard, to the point of being impossible, to use this properly.
Thanks, Timwi
wikitech-l@lists.wikimedia.org