Denny/Thomas - Thanks for publishing these artefacts. I'll look forward to
the report with the metrics. Are there plans for next steps or is this the
end of the project as far as the two of you go?
Comments on individual items inline below:
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrandecic(a)google.com>
wrote:
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
Oh no! Not PHP!! :-) One thing that concerns me is that the scripts seem
to work on the Freebase RDF dump, which is a derivative artefact subject to a
lossy transform. I assumed that one of the reasons for having this work
hosted at Google was that it would allow direct access to the Freebase
graphd quads. Is that not what happened? There's a bunch of provenance
information which is very valuable for quality analysis in the graphd graph
which gets lost during the RDF transformation.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-mis…
The actual missing statements, including URLs for sources, are in this
file. This was filtered against statements already existing in Wikidata,
and the statements are mapped to Wikidata IDs. This contains about 14.3M
statements (214MB gzipped, 831MB unzipped). These are created using the
mappings below in addition to the mappings already in Wikidata. The quality
of these statements is rather mixed.
From my brief exposure and the comments of others, the quality seems highly
problematic, but the issue seems to mainly be with the URLs proposed, which
are of unknown provenance. Presumably in whatever Google database these
were derived from, they were tagged with the tool/pipeline that produced
them and some type of probability of relevance. Including this information
in the data set would help pick the most relevant URLs to present and also
help identify low-quality sources as voting feedback is collected. Also,
filtering the URLs for known unacceptable citations (485K IMDB references,
BBC Music entries which consist solely of EN Wikipedia snippets, etc.) would
cut down on a lot of the noise.
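
A minimal sketch of the kind of URL filter I have in mind, in Python. The
rule list and the assumption that the source URL is the last field of each
row are hypothetical; the real blocklist would come from community policy
and the real layout from the published file.

```python
from urllib.parse import urlparse

# Hypothetical rules: (domain, path prefix). An empty prefix blocks the whole site.
BLOCKED = [
    ("imdb.com", ""),
    ("bbc.co.uk", "/music/"),
]


def is_blocked(url, rules=BLOCKED):
    """Return True if the URL matches a blocked domain (or subdomain) and path prefix."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    for domain, prefix in rules:
        same_site = host == domain or host.endswith("." + domain)
        if same_site and parts.path.startswith(prefix):
            return True
    return False


def filter_statements(rows, url_index=-1):
    """Yield only the rows whose source URL is not blocked.

    Each row is a sequence of fields; the URL position is an assumption.
    """
    for row in rows:
        if not is_blocked(row[url_index]):
            yield row
```

Matching on the parsed host rather than on a substring avoids dropping
unrelated sites that merely mention a blocked domain in their path.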
Some quick stats in addition to the 14.3M statements: 2.3M entities, 183
properties, 284K different web sites.
Additional datasets that we know meet a higher quality bar have been
previously released and uploaded directly to Wikidata by Tpt, following
community consultation.
Is there a pointer to these?
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.…
Contains additional mappings between Freebase MIDs and Wikidata QIDs,
which are not available in Wikidata. These are mappings based on
statistical methods and single interwiki links. Unlike the first set of
mappings we had created and published previously (which required multiple
interwiki links at least), these mappings are expected to have a lower
quality - sufficient for a manual process, but probably not sufficient for
an automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB
unzipped).
I was really excited when I saw this because the first step in the Freebase
migration project should be to increase the number of topic mappings
between the two databases and 3.4M would almost double the number of
existing mappings. Then I looked at the first 10K Q numbers and found that, of
the 7,500 "new" mappings, almost 6,700 were already in Wikidata.
Fortunately, when I took a bigger sample, things improved. For a 4% sample,
it looks like just under 30% are already in Wikidata, so if the quality of the
remainder is good, that would yield an additional 2.4M mappings, which is
great! Interestingly, there was also a smattering of Wikidata 404s (25),
redirects (71), and values which conflicted with Wikidata (530). A cursory
analysis of the latter showed that they were mostly the result of merges on
the Freebase end (so the entity now has two MIDs).
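
For anyone who wants to repeat the check, here is a rough Python sketch of
the bucketing and the extrapolation. The lookup structures (a dict from QID
to the MIDs already on the item, plus sets of redirected and missing QIDs)
are assumptions for illustration; they would have to be built from a
Wikidata dump or the API.

```python
from collections import Counter


def classify(mid, qid, wikidata_mids, redirects, missing):
    """Bucket one proposed (MID, QID) mapping against the current Wikidata state."""
    if qid in missing:
        return "not_found"        # item returns a 404
    if qid in redirects:
        return "redirect"         # item was merged into another item
    existing = wikidata_mids.get(qid, set())
    if mid in existing:
        return "already_present"  # mapping is not actually new
    if existing:
        return "conflict"         # item already carries a different MID
    return "new"


def summarize(sample, wikidata_mids, redirects=frozenset(), missing=frozenset()):
    """Count the buckets over an iterable of (MID, QID) pairs."""
    return Counter(
        classify(mid, qid, wikidata_mids, redirects, missing) for mid, qid in sample
    )


def estimate_new(total, counts):
    """Scale the sample's share of new mappings up to the full file."""
    sampled = sum(counts.values())
    return round(total * counts["new"] / sampled) if sampled else 0
```

With roughly 70% of a sample landing in the "new" bucket, estimate_new
applied to the 3.4M total gives a figure in line with the 2.4M above.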
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels…
This file includes labels and aliases for Wikidata items which seem to be
currently missing. The quality of these labels is undetermined. The file
contains about 860k labels in about 160 languages, with 33 languages having
more than 10k labels each (14MB gzipped, 32MB unzipped).
Their provenance is available in the Freebase graph. The most likely
source is other language Wikipedias, but this could be easily confirmed.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-m…
This is an interesting file as it includes a quality signal for the
statements in Freebase. What you will find here are ordered pairs of
Freebase mids and properties, each indicating that the given pair were
going through a review process and likely have a higher quality on average.
This is only for those pairs that are missing from Wikidata. The file
includes about 1.4M pairs, and this can be used for importing part of the
data directly (6MB gzipped, 52MB unzipped).
This appears to be a dump of the instances for the property
/freebase/valuenotation/is_reviewed but it's not usable as is because of
the intended semantics of the property. The property indicates that *when
the triple was written* the reviewer asserts that the current value of the
named property is correct. This means that you need to use the creation
date of the triple to extract the right property value from the graph for
the named property (and because there's no write protection, probably only
reviewers who are members of groups like "Staff IC" or "Staff OD" should be
counted).
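
As a rough illustration of that lookup, here is a Python sketch. The data
shapes are hypothetical (a list of timestamped writes per MID/property pair,
a review record with reviewer and timestamp, and a reviewer-to-groups
mapping); the actual graphd representation is different, and deletions are
ignored for simplicity.

```python
from datetime import datetime

# Group names taken from the message above; treat the set as an assumption.
TRUSTED_GROUPS = {"Staff IC", "Staff OD"}


def value_at(history, when):
    """Return the value in effect at `when`, or None if nothing was written yet.

    history: list of (timestamp, value) writes for one (MID, property), any order.
    """
    current = None
    for written, value in sorted(history, key=lambda write: write[0]):
        if written > when:
            break
        current = value
    return current


def reviewed_value(review, history, reviewer_groups):
    """Return the value the reviewer vouched for, or None if the reviewer is untrusted.

    review: dict with 'reviewer' and 'timestamp' keys (hypothetical shape).
    reviewer_groups: dict mapping reviewer id to a set of group names.
    """
    groups = reviewer_groups.get(review["reviewer"], set())
    if not groups & TRUSTED_GROUPS:
        return None
    return value_at(history, review["timestamp"])
```

The point is that the value to import is the one that was current when the
review triple was created, which may differ from the value in the latest dump.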
Additionally, Tpt and I created in the last few days of his internship a
few visualizations of the current data in Wikidata and in Freebase.
What are the visualizations designed to show? What, if any, insights did
you derive from them?
Thanks again for the work and the interesting data sets. I'll look forward
to the full report.
Tom