On 07.09.2015 19:37, Daniel Kinzler wrote:
On 07.09.2015 at 18:05, Emilio J. Rodríguez-Posada wrote:
Wow, that is a big difference. Almost 4 million.
I think that MediaWiki doesn't count pages without any [[link]]. Is that the reason?
No, that only applies to Wikitext.
Here is the relevant code from ItemContent:
public function isCountable( $hasLinks = null ) {
    return !$this->isRedirect() && !$this->getItem()->isEmpty();
}
And the relevant code from Item:
public function isEmpty() {
    return $this->fingerprint->isEmpty()
        && $this->statements->isEmpty()
        && $this->siteLinks->isEmpty();
}
So all pages that are not empty (have labels or descriptions or aliases or statements or sitelinks), and are not redirects, should be counted.
Is it possible that the difference of 3,694,285 is mainly redirects? Which dump were you referring to, Markus?
The XML dump contains redirects, and so does the RDF dump. The JSON dump doesn't... so if you were referring to the JSON dump, that would imply we have 3.7 million empty (useless) items.
Or, of course, the counting mechanism is just broken. Which is quite possible.
This could of course also be the case for my Java program, but I reconfirmed:
$ zgrep -c '{"typ' 20150831.json.gz
18483096
I am using the JSON dump, so redirects are not possible. I would not detect duplicate items if any occurred, though.
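For reference, a minimal Java sketch of such a line-based count (this is not my actual program; the file name and the one-entity-per-line layout of the dump are assumptions):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class DumpLineCount {
    public static void main(String[] args) throws Exception {
        long count = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream("20150831.json.gz")),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each entity sits on its own line starting with {"type":...;
                // the surrounding "[" and "]" lines do not match.
                if (line.contains("{\"typ")) {
                    count++;
                }
            }
        }
        System.out.println(count);
    }
}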
It seems that there are indeed a number of empty or almost empty items, apparently created by merges, e.g.:
https://www.wikidata.org/wiki/Q10031183
Some of them do retain a minimal amount of data, though, e.g.:
https://www.wikidata.org/wiki/Q6237652
(would this count as "empty"?)
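To make that question concrete, here is a rough sketch of the emptiness test I have in mind, mirroring Item::isEmpty() at the JSON level (field names follow the Wikidata JSON format; Jackson is just one possible parser, and the class and method names are made up for illustration):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EmptyItemCheck {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // An item counts as "empty" if it has no labels, descriptions, aliases,
    // statements ("claims" in the JSON) and no sitelinks.
    static boolean isEmpty(String entityJson) throws Exception {
        JsonNode entity = MAPPER.readTree(entityJson);
        return isEmptyField(entity, "labels")
                && isEmptyField(entity, "descriptions")
                && isEmptyField(entity, "aliases")
                && isEmptyField(entity, "claims")
                && isEmptyField(entity, "sitelinks");
    }

    // A field is empty if it is missing or has no members; the dump sometimes
    // writes empty maps as [] instead of {}, which size() == 0 also covers.
    private static boolean isEmptyField(JsonNode entity, String name) {
        JsonNode field = entity.get(name);
        return field == null || field.size() == 0;
    }
}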
I'll count how many of each we have. Back in 30min.
Markus