Sorry, now correctly cross posted. Emmanuel
-------- Original Message --------
Subject: WMF XML dump title case problem
Date: Sun, 26 Jun 2011 17:07:19 +0200
From: Emmanuel Engelhart emmanuel@engelhart.org
To: Mailing list for Wikimedia CH wikimediach-l@lists.wikimedia.org, offline-l@lists.wikimedia.org
Hi
Titles should be stored in the table "page" with a first letter uppercased. http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restr...
Unfortunately, it seems that we have XML dumps (and consequently mwdumper generated SQL) containing titles with a first letter lowercased.
For example:
$ wget http://download.wikimedia.org/mywiktionary/20110617/mywiktionary-20110617-pa...
$ bzip2 -d -c mywiktionary-20110617-pages-articles.xml.bz2 | grep "<title>" | grep tationery | more
<title>stationery</title>
<title>stationery shop</title>
Is that a bug?
Regards Emmanuel
Emmanuel Engelhart wrote:
Is that a bug?
No.
You're trying to apply the English Wikipedia's rules to the Burmese Wiktionary. Wiktionaries have $wgCapitalLinks set to false.[1]
MZMcBride
On 06/26/2011 05:22 PM, MZMcBride wrote:
No.
You're trying to apply the English Wikipedia's rules to the Burmese Wiktionary. Wiktionaries have $wgCapitalLinks set to false.[1]
MZMcBride
[1] http://www.mediawiki.org/wiki/Manual:$wgCapitalLinks
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Thank you both for your explanations. Everything is clear to me now. Emmanuel
Emmanuel Engelhart wrote:
Titles should be stored in the table "page" with a first letter uppercased... Unfortunately, it seems that we have XML dumps (and consequently mwdumper generated SQL) containing titles with a first letter lowercased. For example: $wget http://download.wikimedia.org/mywiktionary/20110617/mywiktionary-20110617-pa...
Wiktionary is different. Its users requested reconfiguration so that words are stored in the database with their exact capitalization. The Wikipedia-style first-letter capitalization (which caused pretty severe problems for a dictionary) is *not* performed there.
See also http://en.wiktionary.org/wiki/Wiktionary:Capitalization .
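To make the difference concrete, here is a minimal sketch of how an import script might treat titles under the two configurations. The function name normalize_title is hypothetical, and the setting strings "first-letter" and "case-sensitive" are assumed to be the values carried by the <case> element in the dump header. Python's str.upper() only approximates MediaWiki's language-aware capitalization, so treat this as an illustration rather than the actual MediaWiki logic.

```python
def normalize_title(title, case_setting):
    """Return the title as a wiki with the given case setting would store it.

    Hypothetical helper: case_setting is assumed to be either
    "first-letter" (Wikipedia-style) or "case-sensitive" (Wiktionary-style).
    """
    if case_setting == "first-letter" and title:
        # Wikipedia-style: only the first character is uppercased,
        # the rest of the title is left untouched.
        return title[0].upper() + title[1:]
    # Wiktionary-style: the title is stored exactly as written.
    return title


if __name__ == "__main__":
    for setting in ("first-letter", "case-sensitive"):
        print(setting, "->", normalize_title("stationery shop", setting))
```

With this in place, a consumer of the dump would compare titles only after normalizing them with the setting of the wiki the dump came from, instead of assuming an uppercase first letter everywhere.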
Emmanuel Engelhart wrote:
Is that a bug?
No. Those titles are fully case sensitive. Look at the top of the file: <case>case-sensitive</case>
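For anyone who wants to check this programmatically rather than by eye, the following is a rough sketch that reads the <case> element from the header of a dump. The function names and the sample header are made up for illustration; the sample's namespace URI is an assumption, and the code ignores namespaces so it should not depend on the export schema version.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Illustrative header only, not copied from a real dump.
SAMPLE_HEADER = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">'
    b"<siteinfo><sitename>Wiktionary</sitename>"
    b"<case>case-sensitive</case></siteinfo>"
    b"<page><title>stationery</title></page>"
    b"</mediawiki>"
)


def read_case_setting(stream):
    """Return the text of the <case> element in the siteinfo header, or None."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags look like "{namespace}case"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "case":
            return elem.text
        if local_name == "siteinfo":
            # Header finished without a <case> element; stop before the pages.
            return None
    return None


def read_case_setting_from_dump(path):
    """Same check for a bzip2-compressed dump file on disk."""
    with bz2.open(path, "rb") as stream:
        return read_case_setting(stream)


if __name__ == "__main__":
    print(read_case_setting(io.BytesIO(SAMPLE_HEADER)))
```

Because the parser stops as soon as the header has been read, this stays cheap even on a very large dump.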