Sorry, now correctly cross posted. Emmanuel
-------- Original Message --------
Subject: WMF XML dump title case problem
Date: Sun, 26 Jun 2011 17:07:19 +0200
From: Emmanuel Engelhart emmanuel@engelhart.org
To: Mailing list for Wikimedia CH wikimediach-l@lists.wikimedia.org, offline-l@lists.wikimedia.org
Hi
Titles should be stored in the "page" table with the first letter uppercased: http://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_%28technical_restr...
Unfortunately, it seems that we have XML dumps (and consequently mwdumper-generated SQL) containing titles whose first letter is lowercase.
For example:
$ wget http://download.wikimedia.org/mywiktionary/20110617/mywiktionary-20110617-pa...
$ bzip2 -d -c mywiktionary-20110617-pages-articles.xml.bz2 | grep "<title>" | grep tationery | more
<title>stationery</title>
<title>stationery shop</title>
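To list every such title rather than grep for one word, something like the following Python sketch might do. It is only an illustration: the function names lowercase_titles and scan_dump are made up here, and it assumes each <title> element sits on a single line, as in the grep output above.

```python
import bz2
import re
from typing import Iterable, Iterator

# Matches a <title> element that fits on one line (an assumption about the dump layout).
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def lowercase_titles(lines: Iterable[str]) -> Iterator[str]:
    """Yield titles whose first character is a lowercase letter."""
    for line in lines:
        match = TITLE_RE.search(line)
        if match is None:
            continue
        title = match.group(1)
        # title[:1] is safe on an empty title; "".islower() is False.
        if title[:1].islower():
            yield title


def scan_dump(path: str) -> Iterator[str]:
    """Stream a bzip2-compressed XML dump without unpacking it to disk."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        yield from lowercase_titles(dump)
```

Usage would be along the lines of `for title in scan_dump("mywiktionary-20110617-pages-articles.xml.bz2"): print(title)`.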
Is that a bug?
Regards Emmanuel
Emmanuel Engelhart wrote:
> Titles should be stored in the table "page" with a first letter uppercased...
> Unfortunately, it seems that we have XML dumps (and consequently mwdumper generated SQL) containing titles with a first letter lowercased.
> For example:
> $wget http://download.wikimedia.org/mywiktionary/20110617/mywiktionary-20110617-pa...
Wiktionary is different. Its users requested reconfiguration so that words are stored in the database with their exact capitalization. The Wikipedia-style first-letter capitalization (which caused pretty severe problems for a dictionary) is *not* performed there.
See also http://en.wiktionary.org/wiki/Wiktionary:Capitalization .
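A tool that reads dumps can tell the two configurations apart instead of assuming Wikipedia-style titles. As far as I know, the <siteinfo> header at the top of a dump carries a <case> element, with "first-letter" for Wikipedia-style wikis and "case-sensitive" for wikis configured like Wiktionary. The Python sketch below relies on that assumption; dump_case_setting and expected_title are hypothetical helper names, and str.upper() only approximates MediaWiki's own language-aware first-letter capitalization.

```python
import re
from typing import Iterable, Optional

# Assumed layout: <case>first-letter</case> or <case>case-sensitive</case>
# on a single line inside the dump's <siteinfo> header.
CASE_RE = re.compile(r"<case>([^<]+)</case>")


def dump_case_setting(lines: Iterable[str]) -> Optional[str]:
    """Return the <case> value from the dump header, or None if absent."""
    for line in lines:
        match = CASE_RE.search(line)
        if match is not None:
            return match.group(1).strip()
        if "</siteinfo>" in line:
            break  # header is over; no need to read the page data
    return None


def expected_title(title: str, case_setting: Optional[str]) -> str:
    """Return the title as the wiki would be expected to store it."""
    if case_setting == "first-letter":
        # Approximation of MediaWiki's first-letter capitalization.
        return title[:1].upper() + title[1:]
    # "case-sensitive" (or unknown): keep the exact capitalization.
    return title
```

With this, "stationery" stays as it is for a case-sensitive wiki and would only be expected as "Stationery" on a first-letter wiki.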