Hi all,
The main page of Wikidata shows an item count that is getting increasingly out of sync with reality. The 31 Aug dump contains 18,483,096 items, while the front page says that there are 14,788,811 now. I think this is caused by how MediaWiki counts "articles" (which is not what we are dealing with).
Or maybe this is intended? But if we prominently publish a number that is 25% off the "raw" data, we should at least explain which criteria were used to produce it. What counts as a "proper" item on Wikidata?
Cheers,
Markus
Wow, that is a big difference. Almost 4 million.
I think that MediaWiki doesn't count pages without any [[link]]. Is that the reason?
On 07.09.2015 18:05, Emilio J. Rodríguez-Posada wrote:
Wow, that is a big difference. Almost 4 million.
I think that MediaWiki doesn't count pages without any [[link]]. Is that the reason?
No, that only applies to Wikitext.
Here is the relevant code from ItemContent:
public function isCountable( $hasLinks = null ) {
    return !$this->isRedirect() && !$this->getItem()->isEmpty();
}
And the relevant code from Item:
public function isEmpty() {
    return $this->fingerprint->isEmpty()
        && $this->statements->isEmpty()
        && $this->siteLinks->isEmpty();
}
So all pages that are not empty (have labels or descriptions or aliases or statements or sitelinks), and are not redirects, should be counted.
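For comparison, here is roughly what the same check looks like when applied to a record from the JSON dump (a Python sketch using the field names of the Wikibase JSON format, not the actual code path):

    def is_countable(entity):
        # Mirrors ItemContent::isCountable(): redirects are never counted
        # (they do not appear in the JSON dump anyway).
        if "redirect" in entity:
            return False
        # Mirrors Item::isEmpty(): an item counts iff it has at least one
        # label, description, alias, statement ("claims"), or sitelink.
        return any(entity.get(key) for key in
                   ("labels", "descriptions", "aliases", "claims", "sitelinks"))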
Is it possible that the difference of 3,694,285 is mainly redirects? Which dump were you referring to, Markus? The XML dump contains redirects, and so does the RDF dump. The JSON dump doesn't... so if you were referring to the JSON dump, that would imply we have 3.7 million empty (useless) items.
Or, of course, the counting mechanism is just broken. Which is quite possible.
I know that over the past 9 months I have created 500,000 redirects. Other than that, I would guess that maybe 100,000 other redirects have been created; at most 500,000 more, meaning 1,000,000 in total.
Such a big difference does seem rather odd to me...
Hi!
Is it possible that the difference of 3,694,285 is mainly redirects? Which dump were you referring to, Markus?
We have 725,691 redirects according to the SPARQL engine. We do have a sizeable number of entities that have no statements (alas!), but I have a hard time believing we have ~3 million of those without even a single label. Unless some bot has gone wild here. The problem is that if an entity has no sitelinks, no labels, and no statements, I don't think it would even be in the SPARQL engine, so I can't query for it.
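For reference, that count can be reproduced against the query service; in the RDF export, a redirected item shows up as an owl:sameAs link to its target. A rough Python sketch:

    import requests

    # Count redirects via the Wikidata Query Service (assumes redirects are
    # exported as owl:sameAs triples, as in the current RDF dumps).
    QUERY = """
    SELECT (COUNT(*) AS ?redirects) WHERE {
      ?from owl:sameAs ?to .
    }
    """

    resp = requests.get("https://query.wikidata.org/sparql",
                        params={"query": QUERY, "format": "json"})
    resp.raise_for_status()
    print(resp.json()["results"]["bindings"][0]["redirects"]["value"])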
Hi!
OTOH, we have 14,590,233 entities having P31 or P279, which, given historical statistics on unclassified entities, suggests that 14,788,811 is way too low, unless we have become spectacularly good at catching up with classification (which may have happened, I don't know; did it?).
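(The 14,590,233 figure can be checked with the same request pattern as the sketch above, swapping the QUERY string for something along these lines:)

    SELECT (COUNT(DISTINCT ?item) AS ?classified) WHERE {
      { ?item wdt:P31 [] } UNION { ?item wdt:P279 [] }
    }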
On 07.09.2015 19:37, Daniel Kinzler wrote:
Or, of course, the counting mechanism is just broken. Which is quite possible.
This could of course also be the case for my Java program, but I reconfirmed:
$ zgrep -c '{"typ' 20150831.json.gz
18483096
I am using the JSON dump, so redirects are not possible. This count would not detect duplicate items, though, if any occurred.
It seems that there are indeed a number of empty or almost empty items, apparently created by merges, e.g.:
https://www.wikidata.org/wiki/Q10031183
Some of them do have some minimal amount of remaining data though, e.g.,
https://www.wikidata.org/wiki/Q6237652
(would this count as "empty"?)
I'll count how many of each we have. Back in 30min.
Markus
On 07.09.2015 21:48, Markus Krötzsch wrote: ...
I'll count how many of each we have. Back in 30min.
This does not seem to be the explanation after all. I could only find 33 items in total that have no data at all. If I also count items that have nothing but descriptions or aliases, I get 589.
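The check is along these lines (a rough Python sketch of the criteria; my actual code is Java):

    import gzip
    import json

    no_data = 0       # items with no data at all
    nearly_empty = 0  # items with nothing but descriptions and/or aliases

    with gzip.open("20150831.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):  # skip the JSON array brackets
                continue
            entity = json.loads(line)
            if entity.get("type") != "item":
                continue
            fields = {key for key in ("labels", "descriptions", "aliases",
                                      "claims", "sitelinks") if entity.get(key)}
            if not fields:
                no_data += 1
            elif fields <= {"descriptions", "aliases"}:
                nearly_empty += 1

    print(no_data, no_data + nearly_empty)  # expect 33 and 589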
Will check for duplicates next.
Markus
How many items have no sitelinks at all (regardless of labels, properties, etc.)? That might be a more substantial number...
Andrew.
On 07.09.2015 22:10, Markus Krötzsch wrote:
Will check for duplicates next.
Update: there are no duplicate items in the dump.
Markus
Thanks for investigating, Markus!
On 08.09.2015 00:39, Daniel Kinzler wrote:
Thanks for investigating, Markus!
Unfortunately, none of my results can explain the missing 4 million items; they just tell us what is *not* the problem. Is there anything else that should be checked, or do you think the problem is simply somewhere in MediaWiki's counting (i.e., the 4 million items are not special at all, just overlooked for some reason)?
Markus
Can the counter be reset to show the updated figure?
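If the incrementally maintained site statistics have simply drifted, MediaWiki core ships a maintenance script that recomputes the article count from the database; presumably someone with shell access could run something like:

    $ php maintenance/updateArticleCount.php --update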