On 22.02.2016 18:28, Tom Morris wrote:
> On Sun, Feb 21, 2016 at 4:25 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
>> On 21.02.2016 20:37, Tom Morris wrote:
>>> On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
>>>> On 18.02.2016 15:59, Lydia Pintscher wrote:
>>>>> Thomas, Denny, Sebastian, Thomas, and I have published a paper which was accepted for the industry track at WWW 2016. It covers the migration from Freebase to Wikidata. You can now read it here: http://research.google.com/pubs/archive/44818.pdf
>>>>
>>>> Is it possible that you have actually used the flawed statistics from the Wikidata main page regarding the size of the project? 14.5M items in Aug 2015 seems far too low a number. Our RDF exports from mid August already contained more than 18.4M items. It would be nice to get this fixed at some point. There are currently almost 20M items, and the main page still shows only 16.5M.
>>>
>>> Numbers are off throughout the paper. They also quote 48M instead of 58M topics for Freebase and mischaracterize some other key points. The key number is that 3.2 billion facts for 58 million topics have generated 106,220 new statements for Wikidata. If my calculator had more decimal places, I could tell you what percentage that is.
>>
>> Obviously, any tool can only import statements for which we have items and properties at all, so the number of importable facts is much lower.
> Obviously, but "much lower" from 3.2B is probably something like 50M-300M, not 0.1M.
That estimate might be a bit off. The paper contains a detailed discussion of this aspect. The total number of statements that could be translated from Freebase to Wikidata is given as 17M, of which only 14M were new. So this seems to be the current upper bound of what you could import with PS or any other tool. The authors mention that this already includes more than 90% of the "reviewed" content of Freebase that refers to Wikidata items. The paper seems to suggest that these mapped+reviewed statements were already imported directly -- maybe Lydia could clarify if this was the case.
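For what it's worth, the ratios under discussion are quick to work out. A back-of-the-envelope sketch, using only the figures quoted in this thread (the 106,220 count is Tom's number, which I could not verify in the paper):

```python
# Figures as quoted in this thread (the imported count is unverified)
freebase_facts = 3_200_000_000  # total Freebase facts (3.2B)
translatable = 17_000_000       # statements translatable to Wikidata, per the paper
new_statements = 14_000_000     # of those, not already in Wikidata
imported = 106_220              # statements imported so far (Tom's figure)

# Share of all Freebase facts, and share of the realistic upper bound
share_of_all = imported / freebase_facts
share_of_new = imported / new_statements

print(f"of all facts: {share_of_all:.6%}")
print(f"of the 14M upper bound: {share_of_new:.2%}")
```

Measured against the 14M upper bound rather than the raw 3.2B, the import looks less dramatic, but it is still well below one percent.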
It seems that if you want to get to the dimensions you refer to (50M/300M/3200M), you would need to map more Wikidata items to Freebase topics in some way. The paper describes several techniques that were used to obtain mappings, and these already go beyond what we have stored in Wikidata now. So it is probably not the lack of mappings but the lack of items that is the limit here. Data can only be imported if we have a page at all ;-)
Btw., where does the figure of 100K imported statements that you mentioned come from? I was also interested in that number, but I could not find it in the paper.
Markus