Hi folks,
On <http://meta.wikimedia.org/wiki/Stop_word_list>, it's claimed that
for full-text searching, English Wikipedia uses the MySQL 4.0.20 stop
word list with a few modifications. I gather that this is no longer the
case; when did it stop?
Thanks,
Reid
Hi,
I haven't seen a complete dump of the English Wikipedia for a while, and I'm a
little worried:
I am a PhD student from New Zealand, and I'm completely dependent on this data -
particularly the SQL tables (page, pagelinks, etc.). With the last two dumps
failing, and others being removed, there isn't a single complete dump of the
English Wikipedia data left available for download - and there hasn't been a new
dump since the beginning of April.
Has something happened?
Cheers,
Dave
Hi folks,
In our ongoing research here at UMN, we've discovered some reverts that
introduce apparent character set problems; what seems to happen is that
some Unicode characters are replaced by a character I don't recognize
followed by a hexadecimal number. For example:
http://en.wikipedia.org/w/index.php?title=Dog&diff=58851026&oldid=58821211
What I see is that a sequence of five characters I don't have glyphs for,
which show up as five boxes containing the numbers "010337 01033F
01033D 010333 010343", is replaced with the sequence
"?df37?df3f?df3d?df33?df43", where ? is not a question mark but a
black diamond with a white question mark in it (presumably U+FFFD, the
Unicode replacement character, rather than a zero byte).
Do any of you have pointers to information about what is going on?
We are trying to devise a workaround that would make revisions like
this compare as identical.
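For reference, here is a rough sketch of the kind of normalisation we have in
mind before comparing two revision texts. Assumptions: the corruption always
takes the form of a U+FFFD replacement character followed by the four hex
digits of the lost low surrogate, and the function and variable names below
are purely illustrative.

function normalizeAstralDamage( $text ) {
	# Collapse supplementary-plane characters (4-byte UTF-8 sequences)
	# to a placeholder.
	$text = preg_replace( '/[\xF0-\xF4][\x80-\xBF]{3}/', '?', $text );
	# Collapse the mangled form: U+FFFD (EF BF BD in UTF-8) followed by
	# four hex digits, as seen in the diff above.
	$text = preg_replace( '/\xEF\xBF\xBD[0-9a-fA-F]{4}/', '?', $text );
	return $text;
}

# Revisions that differ only by this corruption then compare as equal:
$same = ( normalizeAstralDamage( $oldText ) === normalizeAstralDamage( $newText ) );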
Many thanks,
Reid
vyznev(a)svn.wikimedia.org wrote:
> Revision: 22159
> Author: vyznev
> Date: 2007-05-13 23:07:06 -0700 (Sun, 13 May 2007)
>
> Log Message:
> -----------
> All the MediaWiki: pages linked to from Special:Allmessages have at least a default value, there's no point in showing any of them as redlinks.
It is now impossible to see whether a page exists in the MediaWiki namespace
whose localized revision is identical to the default.
That makes it quite hard to recognize which messages have to be
deleted from the MediaWiki namespace after the messages have been committed
to the MessagesXx.php file.
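For what it's worth, the check that now has to be done by hand looks roughly
like this (a sketch only; the message key is just an example, and
Language::getMessage() is used to fetch the file default rather than the
possibly customised message):

global $wgContLang;
$key = 'aboutsite';  # example message key
$title = Title::makeTitle( NS_MEDIAWIKI, $wgContLang->ucfirst( $key ) );
if ( $title->exists() ) {
	$revision = Revision::newFromTitle( $title );
	# Redundant if the wiki page just repeats the MessagesXx.php default.
	$isRedundant = $revision
		&& $revision->getText() === $wgContLang->getMessage( $key );
}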
Raymond.
Is this possible?
I want that, when a category is created, a default article structure is
automatically created within it.
An example, because my English is very poor:
I create a category "Exception", and it automatically creates the articles
Description_Exception, News_Exception, Source_Exception, etc.
Is that very complicated?
Thanks!
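One way something like this might be done is with a small extension hooked
into page saves. A rough, untested sketch only: the stub titles come from the
example above, the function name is made up, and nothing like this ships with
MediaWiki by default.

$wgHooks['ArticleSaveComplete'][] = 'wfCreateCategoryStubs';

function wfCreateCategoryStubs( &$article, &$user, $text, $summary ) {
	$title = $article->getTitle();
	# Only act when a page in the Category namespace has been saved.
	if ( $title->getNamespace() != NS_CATEGORY ) {
		return true;
	}
	$name = $title->getText();  # e.g. "Exception"
	foreach ( array( 'Description', 'News', 'Source' ) as $prefix ) {
		$stub = Title::newFromText( "{$prefix}_{$name}" );
		if ( $stub && !$stub->exists() ) {
			$stubArticle = new Article( $stub );
			$stubArticle->doEdit( "[[Category:{$name}]]",
				"Auto-created for new category {$name}" );
		}
	}
	return true;
}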
An automated run of parserTests.php showed the following failures:
This is MediaWiki version 1.11alpha (r22359).
Reading tests from "maintenance/parserTests.txt"...
Reading tests from "extensions/Cite/citeParserTests.txt"...
Reading tests from "extensions/Poem/poemParserTests.txt"...
18 still FAILING test(s) :(
* URL-encoding in URL functions (single parameter) [Has never passed]
* URL-encoding in URL functions (multiple parameters) [Has never passed]
* Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html) [Has never passed]
* Link containing double-single-quotes '' (bug 4598) [Has never passed]
* message transform: <noinclude> in transcluded template (bug 4926) [Has never passed]
* message transform: <onlyinclude> in transcluded template (bug 4926) [Has never passed]
* BUG 1887, part 2: A <math> with a thumbnail- math enabled [Has never passed]
* HTML bullet list, unclosed tags (bug 5497) [Has never passed]
* HTML ordered list, unclosed tags (bug 5497) [Has never passed]
* HTML nested bullet list, open tags (bug 5497) [Has never passed]
* HTML nested ordered list, open tags (bug 5497) [Has never passed]
* Fuzz testing: image with bogus manual thumbnail [Introduced between 08-Apr-2007 07:15:22, 1.10alpha (r21099) and 25-Apr-2007 07:15:46, 1.10alpha (r21547)]
* Inline HTML vs wiki block nesting [Has never passed]
* Mixing markup for italics and bold [Has never passed]
* dt/dd/dl test [Has never passed]
* Images with the "|" character in the comment [Has never passed]
* Parents of subpages, two levels up, without trailing slash or name. [Has never passed]
* Parents of subpages, two levels up, with lots of extra trailing slashes. [Has never passed]
Passed 495 of 513 tests (96.49%)... 18 tests failed!
Thanks for the reactions.
This is getting a bit editorial for this mailing list, and perhaps the
editorial part should move to the project pages. Technically, what we
have is a decent script which eats a list of archived versions of
articles and puts out a cleaned static tree, obeying manually alterable
delete instructions. It is very easy to restore content or to run this on
another list of articles if you have one.
But anyway, Matthew makes a fair point; I should have thought through
exactly what our process was. Please bear in mind this is a motley crew of
volunteers, not professional editors. The process was: chuck
everything into a funnel, get a volunteer to read it and throw the
irrelevant stuff out; then sort by school topic, then go and get other
articles to fill holes in the curriculum. However, very US-centric content
and fringe content got thrown out (see the list at
http://en.wikipedia.org/w/index.php?title=Wikipedia:Wikipedia_CD_Selection&…
for discarded articles), including things like baseball players. I
hadn't really noticed many FAs going from the 800 articles I did
personally, but overall no doubt this included FA/GA stuff, plus a tonne
of Pokemon characters. In our defence, the current collection of FAs and GAs
is very skewed.
On the other questions:
1) "How many good articles does Wikipedia have?" I concede I could be
completely wrong. There are a vast number of key school topics (such as
classic novels) where the content is hugely disappointing, and we kept
hitting poor-quality articles when trying to fill holes. We also had a
quick go at comparing with EB articles and were saddened. But Walkerma
thinks 50,000 good articles could be found, and he could be right; it
could be far more.
2) Censorship: let's not get this out of proportion. There were a small
number of articles where we thought the content might cause issues. We could
have left all of these articles out with no sweat; no one would have noticed.
There are plenty of places a 15-year-old can go for things not in this
collection. There is plenty of content which an 8-year-old won't
understand. We have taken out a small amount of content to allow the
appeal to widen downwards in schools. You go and get your list of archived
articles, chosen your way, and we will knock off a static copy for your
choice, with no section deletes: no problem.
3) I am happy to be guided on citations, but part of the problem is that
the formatting is so variable in Wikipedia itself that we were struggling
with it. WP chooses to nofollow citations, so I guess we all agree this
part of the content is unreliable? Anyway, it's done in so many different
ways that we thought it needed to come out.
BozMo
============
Matthew Brown wrote:
> On 5/22/07, Andrew Cates <andrew(a)catesfamily.org.uk> wrote:
>> It contains all Good & Featured content (except adult content).
>
> Not true, unless 'adult content' means not only content deemed
> unsuitable for children but also content deemed uninteresting or
> excluded by some other mechanism. Since I could only be bothered to go
> through the FA process once, I of course looked to see if "my" FA was
> included, and it wasn't.
>
> Which is no problem, it's on a nerdy topic of little general interest,
> but this seemed to diverge from what you said, so I thought I'd bring
> it up before other people ;)
>
> -Matt
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
aaron(a)svn.wikimedia.org wrote:
> + print( "Flagging bot account edits...\n" );
> +
> + # Fill in the rc_bot field
> + $sql = "SELECT DISTINCT rc_user FROM $recentchanges " .
> + "LEFT JOIN $usergroups ON rc_user=ug_user " .
> + "WHERE ug_group='bot'";
This is fragile, as there's no guarantee that the "bot" group is the
only one that has bot privileges, or indeed that it exists at all.
Instead, you should look up which group(s), if any, have the 'bot'
permission in $wgGroupPermissions, then search for those groups.
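A rough sketch of what that lookup might look like ($dbw, $recentchanges and
$usergroups are assumed to be the same variables used in the patch; this is a
suggestion, not the committed code):

global $wgGroupPermissions;
# Collect every group that grants the 'bot' permission instead of
# hard-coding the 'bot' group name.
$botGroups = array();
foreach ( $wgGroupPermissions as $group => $permissions ) {
	if ( !empty( $permissions['bot'] ) ) {
		$botGroups[] = $group;
	}
}
if ( count( $botGroups ) ) {
	$quoted = array();
	foreach ( $botGroups as $group ) {
		$quoted[] = $dbw->addQuotes( $group );
	}
	$sql = "SELECT DISTINCT rc_user FROM $recentchanges " .
		"LEFT JOIN $usergroups ON rc_user=ug_user " .
		"WHERE ug_group IN (" . implode( ',', $quoted ) . ")";
}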
- -- brion vibber (brion @ wikimedia.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFGUwf2wRnhpk1wk44RAghvAJ9rHGFxPpDuJ52svK8ykrw3OkAsMwCeJcxg
jI74ZZYmW0C7JF+bsqc3t9o=
=8fBO
-----END PGP SIGNATURE-----
An automated run of parserTests.php showed the following failures:
This is MediaWiki version 1.11alpha (r22317).
Reading tests from "maintenance/parserTests.txt"...
Reading tests from "extensions/Cite/citeParserTests.txt"...
Reading tests from "extensions/Poem/poemParserTests.txt"...
18 still FAILING test(s) :(
* URL-encoding in URL functions (single parameter) [Has never passed]
* URL-encoding in URL functions (multiple parameters) [Has never passed]
* Table security: embedded pipes (http://mail.wikipedia.org/pipermail/wikitech-l/2006-April/034637.html) [Has never passed]
* Link containing double-single-quotes '' (bug 4598) [Has never passed]
* message transform: <noinclude> in transcluded template (bug 4926) [Has never passed]
* message transform: <onlyinclude> in transcluded template (bug 4926) [Has never passed]
* BUG 1887, part 2: A <math> with a thumbnail- math enabled [Has never passed]
* HTML bullet list, unclosed tags (bug 5497) [Has never passed]
* HTML ordered list, unclosed tags (bug 5497) [Has never passed]
* HTML nested bullet list, open tags (bug 5497) [Has never passed]
* HTML nested ordered list, open tags (bug 5497) [Has never passed]
* Fuzz testing: image with bogus manual thumbnail [Introduced between 08-Apr-2007 07:15:22, 1.10alpha (r21099) and 25-Apr-2007 07:15:46, 1.10alpha (r21547)]
* Inline HTML vs wiki block nesting [Has never passed]
* Mixing markup for italics and bold [Has never passed]
* dt/dd/dl test [Has never passed]
* Images with the "|" character in the comment [Has never passed]
* Parents of subpages, two levels up, without trailing slash or name. [Has never passed]
* Parents of subpages, two levels up, with lots of extra trailing slashes. [Has never passed]
Passed 495 of 513 tests (96.49%)... 18 tests failed!
As far as I can tell, importDump does not mark imported pages as
coming from a bot, even when the user is marked as a bot in the database. Is
that correct? Is there a way to indicate a bot revision in the XML,
or do I need to do this in the DB afterward?
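If it has to happen in the database afterward, I imagine something along these
lines would do it (a sketch only, untested; it assumes the imported edits end
up in recentchanges, e.g. after running rebuildrecentchanges.php, and it
hard-codes the 'bot' group with the same caveat raised earlier in this thread):

$dbw = wfGetDB( DB_MASTER );
$rc = $dbw->tableName( 'recentchanges' );
$ug = $dbw->tableName( 'user_groups' );
# Flag every recentchanges row whose author is in the 'bot' group.
$dbw->query(
	"UPDATE $rc, $ug SET rc_bot = 1 " .
	"WHERE rc_user = ug_user AND ug_group = 'bot'"
);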
=====================================
Jim Hu
Associate Professor
Dept. of Biochemistry and Biophysics
2128 TAMU
Texas A&M Univ.
College Station, TX 77843-2128
979-862-4054