Denny/Thomas - Thanks for publishing these artefacts. I'll look forward to the report with the metrics. Are there plans for next steps, or is this the end of the project as far as the two of you are concerned?
Comments on individual items inline below:
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić vrandecic@google.com wrote:
The scripts that were created and used can be found here:
Oh no! Not PHP!! :-) One thing that concerns me is that the scripts seem to work on the Freebase RDF dump, which is a derivative artefact produced by a lossy transform. I assumed that one of the reasons for having this work hosted at Google was that it would allow direct access to the Freebase graphd quads. Is that not what happened? The graphd graph carries a bunch of provenance information which is very valuable for quality analysis and which gets lost during the RDF transformation.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-miss...
The actual missing statements, including URLs for sources, are in this file. This was filtered against statements already existing in Wikidata, and the statements are mapped to Wikidata IDs. This contains about 14.3M statements (214MB gzipped, 831MB unzipped). These are created using the mappings below in addition to the mappings already in Wikidata. The quality of these statements is rather mixed.
From my brief exposure and the comments of others, the quality seems highly problematic, but the issue seems mainly to be with the URLs proposed, which are of unknown provenance. Presumably in whatever Google database these were derived from, they were tagged with the tool/pipeline that produced them and some kind of relevance probability. Including this information in the data set would help pick the most relevant URLs to present and also help identify low-quality sources as voting feedback is collected. Also, filtering the URLs for known unacceptable citations (485K IMDB references, BBC Music entries which consist solely of EN Wikipedia snippets, etc.) would cut down on a lot of the noise.
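To make that last point concrete, here is the kind of domain-level filtering I have in mind - a rough sketch only, nothing from the actual pipeline. The file names, the assumption that the source URL is the last tab-separated field of each statement line, and the blocked patterns are all mine:

    # Rough sketch of blocklist filtering for proposed source URLs. File names and
    # the column layout (URL as last tab-separated field) are illustrative
    # assumptions, not part of the published dataset spec.
    import gzip
    from urllib.parse import urlparse

    def blocked(url):
        """True if the URL matches a known-unacceptable source pattern."""
        p = urlparse(url)
        host = p.hostname or ""
        if host.endswith("imdb.com"):
            return True                       # user-generated content, not citable
        if host.endswith("bbc.co.uk") and p.path.startswith("/music/artists/"):
            return True                       # BBC Music bios that mirror EN Wikipedia
        return False

    kept = dropped = 0
    with gzip.open("freebase-mapped-missing.tsv.gz", "rt") as src, \
            open("freebase-mapped-missing.filtered.tsv", "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split("\t")
            url = fields[-1] if fields and fields[-1].startswith("http") else None
            if url and blocked(url):
                dropped += 1
            else:
                kept += 1
                dst.write(line)

    print(f"kept {kept} statements, dropped {dropped} with blocklisted source URLs")

A per-domain score weighted by voting feedback would be better in the long run, but even a static blocklist like this would remove a big chunk of the noise.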
Some quick stats in addition to the 14.3M statements: 2.3M entities, 183 properties, 284K different web sites.
Additional datasets that we know meet a higher quality bar have been previously released and uploaded directly to Wikidata by Tpt, following community consultation.
Is there a pointer to these?
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.p...
Contains additional mappings between Freebase MIDs and Wikidata QIDs, which are not available in Wikidata. These are mappings based on statistical methods and single interwiki links. Unlike the first set of mappings we had created and published previously (which required multiple interwiki links at least), these mappings are expected to have a lower quality - sufficient for a manual process, but probably not sufficient for an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB unzipped).
I was really excited when I saw this because the first step in the Freebase migration project should be to increase the number of topic mappings between the two databases, and 3.4M would almost double the number of existing mappings. Then I looked at the first 10K Q numbers and found that, of the 7,500 "new" mappings, almost 6,700 were already in Wikidata.
Fortunately, when I took a bigger sample, things improved. For a 4% sample, it looks like just under 30% are already in Wikidata, so if the quality of the remainder is good, that would yield an additional 2.4M mappings, which is great! Interestingly, there were also a smattering of Wikidata 404s (25), redirects (71), and values which conflicted with Wikidata (530); a cursory analysis of the latter showed that they were mostly the result of merges on the Freebase end (so the entity now has two MIDs).
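For what it's worth, the sampling check was along these lines - a minimal sketch rather than the exact script. The two-column MID/QID layout, the use of P646 as the Freebase ID property, and the lumping together of deleted and redirected items are my assumptions:

    # Sample the additional-mapping file and classify each proposed MID -> QID pair
    # against live Wikidata. Assumes tab-separated (MID, QID) rows and that P646 is
    # the "Freebase ID" property. Items not returned under the requested id are
    # counted as deleted-or-redirected rather than resolved further.
    import csv
    import random
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def sample_pairs(path, rate=0.04):
        with open(path, newline="") as f:
            return [(m, q) for m, q in csv.reader(f, delimiter="\t")
                    if random.random() < rate]

    def classify(pairs):
        counts = {"new": 0, "already mapped": 0, "conflict": 0,
                  "deleted or redirected": 0}
        for i in range(0, len(pairs), 50):          # wbgetentities takes up to 50 ids
            batch = pairs[i:i + 50]
            r = requests.get(API, params={
                "action": "wbgetentities",
                "ids": "|".join(q for _, q in batch),
                "props": "claims",
                "format": "json",
            }).json()
            for mid, qid in batch:
                entity = r.get("entities", {}).get(qid)
                if entity is None or "missing" in entity:
                    counts["deleted or redirected"] += 1
                    continue
                existing = [c["mainsnak"]["datavalue"]["value"]
                            for c in entity.get("claims", {}).get("P646", [])
                            if "datavalue" in c["mainsnak"]]
                if not existing:
                    counts["new"] += 1
                elif mid in existing:
                    counts["already mapped"] += 1
                else:
                    counts["conflict"] += 1
        return counts

    print(classify(sample_pairs("additional-mapping.tsv")))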
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels....
This file includes labels and aliases for Wikidata items which seem to be currently missing. The quality of these labels is undetermined. The file contains about 860k labels in about 160 languages, with 33 languages having more than 10k labels each (14MB gzipped, 32MB unzipped).
Their provenance is available in the Freebase graph. The most likely source is other-language Wikipedias, but this could easily be confirmed.
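One cheap way to confirm that hypothesis for a sample is to compare each proposed label with the item's sitelink title on that language's Wikipedia. A sketch, assuming a tab-separated (QID, language code, label) layout and ignoring the handful of language codes whose site ids don't follow the <lang>wiki pattern; run it over a small sample, since it makes one API call per row:

    # Check whether proposed labels simply echo the item's article title on the
    # corresponding Wikipedia. File layout and site-id derivation are simplifying
    # assumptions for this sketch.
    import csv
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def sitelink_title(qid, lang):
        """Return the item's article title on the <lang> Wikipedia, or None."""
        r = requests.get(API, params={"action": "wbgetentities", "ids": qid,
                                      "props": "sitelinks", "format": "json"}).json()
        links = r.get("entities", {}).get(qid, {}).get("sitelinks", {})
        return links.get(lang + "wiki", {}).get("title")

    matches = total = 0
    with open("freebase-new-labels.tsv", newline="") as f:
        for qid, lang, label in csv.reader(f, delimiter="\t"):
            total += 1
            title = sitelink_title(qid, lang)
            if title and title.casefold() == label.casefold():
                matches += 1

    print(f"{matches}/{total} proposed labels match the corresponding Wikipedia title")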
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-mi...
This is an interesting file as it includes a quality signal for the statements in Freebase. What you will find here are ordered pairs of Freebase MIDs and properties, each indicating that the given pair went through a review process and likely has a higher quality on average. This is only for those pairs that are missing from Wikidata. The file includes about 1.4M pairs, and this can be used for importing part of the data directly (6MB gzipped, 52MB unzipped).
This appears to be a dump of the instances of the property /freebase/valuenotation/is_reviewed, but it's not usable as-is because of the intended semantics of the property. The property indicates that *when the triple was written* the reviewer asserted that the then-current value of the named property was correct. This means that you need to use the creation date of the triple to extract the right property value from the graph for the named property (and, because there's no write protection, probably only reviewers who are members of groups like "Staff IC" or "Staff OD" should be counted).
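To illustrate what extracting the "value as reviewed" would have to look like, assuming access to timestamped graphd-style tuples (the field names, tuple layout, and reviewer groups below are invented for the sketch; none of this is recoverable from the RDF dump):

    # Illustrative only: given (subject, predicate, value, timestamp, creator) tuples
    # in the style of graphd primitives, recover the value that was current when a
    # trusted reviewer wrote the is_reviewed assertion.
    from collections import namedtuple

    Quad = namedtuple("Quad", "subject predicate value timestamp creator")

    REVIEWED = "/freebase/valuenotation/is_reviewed"
    TRUSTED_REVIEWERS = {"Staff IC", "Staff OD"}   # stand-ins for the reviewer groups

    def reviewed_values(quads):
        """Yield (subject, property, value) where the value is the one the reviewer
        actually saw, i.e. the latest write at or before the review timestamp."""
        by_subject = {}
        for q in quads:
            by_subject.setdefault(q.subject, []).append(q)
        for subject, qs in by_subject.items():
            for review in qs:
                if review.predicate != REVIEWED or review.creator not in TRUSTED_REVIEWERS:
                    continue
                prop = review.value        # is_reviewed names the property that was checked
                candidates = [q for q in qs
                              if q.predicate == prop and q.timestamp <= review.timestamp]
                if candidates:
                    current = max(candidates, key=lambda q: q.timestamp)
                    yield subject, prop, current.value

Anything written to the property after the review timestamp would, correctly, not inherit the reviewed stamp.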
Additionally, Tpt and I created in the last few days of his internship a few visualizations of the current data in Wikidata and in Freebase.
What are the visualizations designed to show? What, if any, insights did you derive from them?
Thanks again for the work and the interesting data sets. I'll look forward to the full report.
Tom