Hi Erik, I'm cross-posting this message to wikisource-l in case anyone there wants to give some input.
The word counts at http://stats.wikimedia.org/wikisource/EN/TablesDatabaseWords.htm seem to be inaccurate. Apparently your tool counts only words in the main namespace. That may work for projects like Wikipedia, with their very long talk pages in the Project: namespace on some subjects (such as deletion requests), but it doesn't work for Wikisource, for two main reasons:
1) Some subdomains have custom namespaces for short biographies and lists of works by author (en, it, pt and others); some keep these in the main namespace (fr, de, es and others). This is a minor issue, since the number of words on those pages is small.
2) Some Wikisources (de, fr and en, according to http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics ) have a large amount of content in a custom namespace devoted to the ProofreadPage extension ( http://www.mediawiki.org/wiki/Extension:Proofread_Page ). This content is displayed in the main namespace through page transclusion (see http://en.wikisource.org/w/index.php?title=35_Sonnets&action=edit for an example).
Is it possible to include the custom namespaces for all Wikisources in your automated calculation tool?
[[:m:User:555]]
Follow up: I see nl wikisource was not a good example.
In fr the namespaces Page and Livre are indeed defined, unlike at nl. Hence these namespaces are counted separately.
<case>first-letter</case>
<namespaces>
  <namespace key="-2">Média</namespace>
  <namespace key="-1">Spécial</namespace>
  <namespace key="0" />
  <namespace key="1">Discuter</namespace>
  <namespace key="2">Utilisateur</namespace>
  <namespace key="3">Discussion Utilisateur</namespace>
  <namespace key="4">Wikisource</namespace>
  <namespace key="5">Discussion Wikisource</namespace>
  <namespace key="6">Image</namespace>
  <namespace key="7">Discussion Image</namespace>
  <namespace key="8">MediaWiki</namespace>
  <namespace key="9">Discussion MediaWiki</namespace>
  <namespace key="10">Modèle</namespace>
  <namespace key="11">Discussion Modèle</namespace>
  <namespace key="12">Aide</namespace>
  <namespace key="13">Discussion Aide</namespace>
  <namespace key="14">Catégorie</namespace>
  <namespace key="15">Discussion Catégorie</namespace>
  <namespace key="100">Transwiki</namespace>
  <namespace key="101">Discussion Transwiki</namespace>
  <namespace key="104">Page</namespace>
  <namespace key="105">Discussion Page</namespace>
  <namespace key="112">Livre</namespace>
  <namespace key="113">Discussion Livre</namespace>
</namespaces>
</siteinfo>
In http://stats.wikimedia.org/wikisource/EN/TablesWikipediaFR.htm in the first table you will see counts for namespace 0 only.
In the table 'Database records per namespace' on the same page you can see counts for 104 separately. (112 is missing).
Do you want the total of all three namespaces (0+104+112) in column E in the first table? I can treat 104 and 112 as namespace 0 then. It will influence counts in all columns.
Treating them as one in the first table and separately in 'Database records per namespace' is doable. They will show up in 'Database records per namespace' in separate columns then.
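To make the proposal concrete, here is a minimal Python sketch of the idea: one merged total over namespaces 0, 104 and 112, alongside separate per-namespace counts. It is not the code of the stats tool. The function names, the whitespace-based word count and the (namespace key, wikitext) input shape are all assumptions made for illustration, and the keys 104 and 112 are simply the fr values quoted above.

```python
from collections import Counter

# Assumption: these are the fr.wikisource keys quoted above (0 = main,
# 104 = Page, 112 = Livre). Other wikis may use different keys.
CONTENT_NAMESPACES = {0, 104, 112}


def count_words(text):
    """Very rough word count: split on whitespace (illustrative only)."""
    return len(text.split())


def summarize(pages, merged=CONTENT_NAMESPACES):
    """Return (merged_total, per_namespace_counts).

    pages is an iterable of (namespace_key, wikitext) pairs, a
    hypothetical input shape chosen for this sketch.
    """
    per_namespace = Counter()
    for ns, text in pages:
        per_namespace[ns] += count_words(text)
    # The merged figure treats every content namespace as if it were 0.
    merged_total = sum(n for ns, n in per_namespace.items() if ns in merged)
    return merged_total, dict(per_namespace)
```

The per-namespace dictionary keeps 104 and 112 visible on their own, while the merged total is what would feed the first table.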
-----------
If defined, will the numbers for the keys always be 104 and 112? In that case I'd rather not harvest the codes from an HTML page or through the API, if they are directly available in the dump.
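If the keys turn out not to be fixed, one option is to look them up by name in the siteinfo block of each dump. The following Python sketch shows that lookup under stated assumptions: the sample XML is a shortened version of the fr excerpt above, the function names are hypothetical, and the set of localized names to search for would have to be supplied per wiki.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the fr siteinfo excerpt quoted above.
SAMPLE = """<siteinfo>
<namespaces>
<namespace key="0" />
<namespace key="104">Page</namespace>
<namespace key="112">Livre</namespace>
</namespaces>
</siteinfo>"""


def namespace_keys(siteinfo_xml):
    """Map each namespace key (int) to its name ('' for the main namespace)."""
    root = ET.fromstring(siteinfo_xml)
    result = {}
    for el in root.iter():
        # Drop a possible '{xmlns}' prefix so the tag name can be compared.
        if el.tag.rsplit("}", 1)[-1] == "namespace":
            result[int(el.get("key"))] = el.text or ""
    return result


def keys_for(names, namespaces):
    """Return the sorted keys whose namespace name is in the given set."""
    wanted = set(names)
    return sorted(k for k, name in namespaces.items() if name in wanted)
```

For the sample, keys_for({"Page", "Livre"}, namespace_keys(SAMPLE)) gives the two keys without hard-coding 104 or 112.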
Cheers Erik Zachte
Luiz Augusto wrote:
Hi Erik, I'm cross-posting this message to wikisource-l in case anyone there wants to give some input.
The word counts at http://stats.wikimedia.org/wikisource/EN/TablesDatabaseWords.htm seem to be inaccurate. Apparently your tool counts only words in the main namespace. That may work for projects like Wikipedia, with their very long talk pages in the Project: namespace on some subjects (such as deletion requests), but it doesn't work for Wikisource, for two main reasons:
- Some subdomains have custom namespaces for short biographies and lists of works by author (en, it, pt and others); some keep these in the main namespace (fr, de, es and others). This is a minor issue, since the number of words on those pages is small.
- Some Wikisources (de, fr and en, according to http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics ) have a large amount of content in a custom namespace devoted to the ProofreadPage extension ( http://www.mediawiki.org/wiki/Extension:Proofread_Page ). This content is displayed in the main namespace through page transclusion (see http://en.wikisource.org/w/index.php?title=35_Sonnets&action=edit for an example).
Is it possible to include the custom namespaces for all Wikisources in your automated calculation tool?
[[:m:User:555]]
On third thought: would it not be somewhat arbitrary to count each page in the article counts? In Wikipedia an article is a logical unit, like a chapter in Wikisource
(I looked at http://en.wikisource.org/wiki/M._K._Gandhi:_Indian_Patriot_in_South_Africa/C... as an example)
In Wikipedia we do not count paragraphs in articles separately. Would counting pages as part of the 'article count' not be similar?
Or do I misunderstand your request?
Erik Zachte
wikisource-l@lists.wikimedia.org