On Tue, Feb 23, 2016 at 1:28 AM, Markus Krötzsch <
markus(a)semantic-mediawiki.org> wrote:
On 22.02.2016 18:28, Tom Morris wrote:
On Sun, Feb 21, 2016 at 4:25 PM, Markus Krötzsch
<markus(a)semantic-mediawiki.org> wrote:
On 21.02.2016 20:37, Tom Morris wrote:
On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch
<markus(a)semantic-mediawiki.org> wrote:
On 18.02.2016 15:59, Lydia Pintscher wrote:
Thomas, Denny, Sebastian, Thomas, and I have published a paper which was
accepted for the industry track at WWW 2016. It covers the migration from
Freebase to Wikidata. You can now read it here:
http://research.google.com/pubs/archive/44818.pdf
Is it possible that you have actually used the flawed statistics from the
Wikidata main page regarding the size of the project? 14.5M items in Aug
2015 seems far too low a number. Our RDF exports from mid August already
contained more than 18.4M items. It would be nice to get this fixed at
some point. There are currently almost 20M items, and the main page still
shows only 16.5M.
Numbers are off throughout the paper. They also quote 48M instead of 58M
topics for Freebase and mischaracterize some other key points. The key
number is that 3.2 billion facts for 58 million topics have generated
106,220 new statements for Wikidata. If my calculator had more decimal
places, I could tell you what percentage that is.
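[For the record, the percentage is easy enough to work out. A quick sketch using the two figures quoted above (106,220 new statements and ~3.2 billion Freebase facts); both numbers come from this thread, not from independent verification:]

```python
# Figures as quoted in this thread.
new_statements = 106220         # new Wikidata statements via the tool
freebase_facts = 3200000000     # ~3.2 billion Freebase facts

pct = new_statements / freebase_facts * 100
print("%.5f%%" % pct)  # roughly 0.00332% of Freebase facts
```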
Obviously, any tool can only import statements for which we have items and
properties at all, so the number of importable facts is much lower.
Obviously, but "much lower" from 3.2B is probably something like
50M-300M, not 0.1M.
That estimate might be a bit off. The paper contains a detailed discussion
of this aspect.
Or the paper might be off. Addressing the flaws in the paper would require
a full paper in its own right.
I don't mean to imply that the numbers are the only thing that's important;
they are just one measure of how much value has been extracted from the
Freebase data. Still, the relative magnitudes of the numbers are startling.
The total number of statements that could be translated from Freebase to
Wikidata is given as 17M, of which only 14M were new. So this seems to be
the current upper bound of what you could import with PS or any other tool.
That's the upper bound using that particular methodology. Only 4.5M of the
20M Wikidata topics were mapped when, given that Wikidata items have to
appear in a Wikipedia and that Freebase includes all of English Wikipedia,
one would expect a much higher percentage to be mappable.
The authors mention that this already includes more than 90% of the
"reviewed" content of Freebase that refers to Wikidata items. The paper
seems to suggest that these mapped+reviewed statements were already
imported directly -- maybe Lydia could clarify if this was the case.
More clarity and information is always welcome, but since this is mentioned
as a possible future work item in Section 7, I'm guessing it wasn't done
yet.
It seems that if you want to go to the dimensions that you refer to
(50M/300M/3200M) you would need to map more Wikidata items to Freebase
topics in some way. The paper gives several techniques that were used to
obtain mappings that are already more than what we have stored in Wikidata
now. So it is probably not the lack of mappings but the lack of items that
is the limit here. Data can only be imported if we have a page at all ;-)
If it's true that only 25% of Wikidata items appear in Freebase, I'd be
amazed (and I'd like to see an analysis of what makes up that other 75%).
Btw. where do the 100K imported statements come from that you mentioned
here? I was also interested in that number but I could not find it in the
paper.
The paper says in section 4, "At the time of writing (January, 2016), the
tool has been used by more than a hundred users who performed about 90,000
approval or rejection actions." which probably means ~80,000 new statements
(since ~10% get rejected). My 106K number is from the current dashboard
<https://tools.wmflabs.org/wikidata-primary-sources/status.html>.
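[A rough sanity check on that ~80,000 estimate, assuming the ~10% rejection rate quoted above applies uniformly to all 90,000 actions:]

```python
actions = 90000        # approval/rejection actions reported in the paper
rejection_rate = 0.10  # approximate rejection rate mentioned above

approved = actions * (1 - rejection_rate)
print(int(approved))  # 81000, i.e. roughly 80,000 new statements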
Tom