On Fri, Oct 2, 2015 at 11:59 AM, Tom Morris <tfmorris(a)gmail.com> wrote:
Denny/Thomas - Thanks for publishing these artefacts. I'll look forward
to the report with the metrics.
This is now, finally, available:
http://static.googleusercontent.com/media/research.google.com/en//pubs/arch…
Are there plans for next steps or is this the end of the project as far as
the two of you go?
I'm going to assume that the lack of answer to this question over the last
four months, the lack of updates on the project, and the fact no one is
even bothering to respond to issues
<https://github.com/google/primarysources/issues> means that this project
is dead and abandoned. That's pretty sad. For an internship, it sounds
like a cool project and a decent result. As an actual serious attempt to
make productive use of the Freebase data, it's a weak, half-hearted effort
by Google.
Is there any interest in the Wikidata community for making use of the
Freebase data now that Google has abandoned their effort, or is there too
much negative sentiment against it to make it worth the effort?
Tom
p.s. I'm surprised that none of the stuff mentioned below is addressed in
the paper. Was it already submitted by the beginning of October?
Comments on individual items inline below:
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrandecic(a)google.com>
wrote:
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
Oh no! Not PHP!! :-) One thing that concerns me is that the scripts seem
to work on the Freebase RDF dump which is a derivative artefact subject to a
lossy transform. I assumed that one of the reasons for having this work
hosted at Google was that it would allow direct access to the Freebase
graphd quads. Is that not what happened? There's a bunch of provenance
information which is very valuable for quality analysis in the graphd graph
which gets lost during the RDF transformation.
This isn't addressed in the paper and represents a significant loss of
provenance information.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-mis…
The actual missing statements, including URLs for sources, are in this
file. This was filtered against statements already existing in Wikidata,
and the statements are mapped to Wikidata IDs. This contains about 14.3M
statements (214MB gzipped, 831MB unzipped). These are created using the
mappings below in addition to the mappings already in Wikidata. The quality
of these statements is rather mixed.
From my brief exposure and the comments of others, the quality seems
highly problematic, but the issue seems to mainly be with the URLs
proposed, which are of unknown provenance. Presumably in whatever Google
database these were derived from, they were tagged with the tool/pipeline
that produced them and some type of probability of relevance. Including
this information in the data set would help pick the most relevant URLs to
present and also help identify low-quality sources as voting feedback is
collected. Also, filtering the URLs for known unacceptable citations (485K
IMDB references, BBC Music entries which consist solely of EN Wikipedia
snippets, etc.) would cut down on a lot of the noise.
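
As an illustration of the kind of URL filtering suggested above, here is a minimal sketch in Python. The rule list, the function names, and the (statement, url) pair layout are assumptions made for the example, not the format of the published file or anything the Primary Sources tool is known to do.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of (domain, path prefix) rules. The real list
# would have to come from community consensus on unacceptable citations.
BLOCKED_RULES = [
    ("imdb.com", "/"),         # all IMDB references
    ("bbc.co.uk", "/music/"),  # BBC Music pages only, not the whole site
]


def is_blocked(url, rules=BLOCKED_RULES):
    """Return True if the URL matches a blocked domain and path prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    for domain, prefix in rules:
        # Match the domain itself or any subdomain such as "www.".
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and path.startswith(prefix):
            return True
    return False


def filter_sources(statements, rules=BLOCKED_RULES):
    """Yield only the (statement, url) pairs whose source URL is not blocked."""
    for statement, url in statements:
        if not is_blocked(url, rules):
            yield statement, url
```

Matching on a path prefix as well as the domain keeps the BBC rule narrow, so other bbc.co.uk pages would still be offered as sources.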
Some quick stats in addition to the 14.3M statements: 2.3M entities, 183
properties, 284K different web sites.
Additional datasets that we know meet a higher quality bar have been
previously released and uploaded directly to Wikidata by Tpt, following
community consultation.
Is there a pointer to these?
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.…
Contains additional mappings between Freebase MIDs and Wikidata QIDs,
which are not available in Wikidata. These are mappings based on
statistical methods and single interwiki links. Unlike the first set of
mappings we had created and published previously (which required multiple
interwiki links at least), these mappings are expected to have a lower
quality - sufficient for a manual process, but probably not sufficient for
an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB
unzipped).
I was really excited when I saw this because the first step in the
Freebase migration project should be to increase the number of topic
mappings between the two databases and 3.4M would almost double the number
of existing mappings. Then I looked at the first 10K Q numbers and found
that of the 7,500 "new" mappings, almost 6,700 were already in Wikidata.
Fortunately when I took a bigger sample, things improved. For a 4%
sample, it looks like just under 30% are already in Wikidata, so if the
quality of the remainder is good, that would yield an additional 2.4M
mappings, which is great! Interestingly there was also a smattering of
Wikidata 404s (25), redirects (71), and values which conflicted with
Wikidata (530). A cursory analysis of the latter showed that they were
mostly the result of merges on the Freebase end (so the entity now has
two MIDs).
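
For anyone who wants to repeat this kind of check, a rough sketch of the comparison in Python follows. The input shapes (pairs of QID and MID, and a dict of MIDs already recorded per item) and the function names are assumptions for illustration. It does not handle the 404 and redirect cases, which need a lookup against Wikidata itself.

```python
from collections import Counter


def classify_mappings(candidates, existing):
    """Count candidate (qid, mid) pairs by how they relate to existing data.

    candidates: iterable of (qid, mid) pairs from the proposed mapping file.
    existing:   dict of qid -> set of Freebase MIDs already on that item.
    """
    counts = Counter()
    for qid, mid in candidates:
        known = existing.get(qid)
        if not known:
            counts["new"] += 1              # item has no Freebase MID yet
        elif mid in known:
            counts["already_present"] += 1  # same mapping already recorded
        else:
            counts["conflict"] += 1         # item has a different MID
    return counts


def estimate_new(total, share_already_present):
    """Extrapolate the number of genuinely new mappings from a sample share."""
    return round(total * (1 - share_already_present))
```

With the figures quoted above, estimate_new(3_400_000, 0.30) gives 2,380,000, which is where the roughly 2.4M estimate comes from.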
It's not clear to me if these additional mappings are being used in the
Primary Sources tool (or anywhere else). Are they?
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels…
This file includes labels and aliases for Wikidata items which seem to be
currently missing. The quality of these labels is undetermined. The file
contains about 860k labels in about 160 languages, with 33 languages having
more than 10k labels each (14MB gzipped, 32MB unzipped).
Their provenance is available in the Freebase graph. The most likely
source is other language Wikipedias, but this could be easily confirmed.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-m…
This is an interesting file as it includes a quality signal for the
statements in Freebase. What you will find here are ordered pairs of
Freebase mids and properties, each indicating that the given pair were
going through a review process and likely have a higher quality on average.
This is only for those pairs that are missing from Wikidata. The file
includes about 1.4M pairs, and this can be used for importing part of the
data directly (6MB gzipped, 52MB unzipped).
This appears to be a dump of the instances for the property
/freebase/valuenotation/is_reviewed but it's not usable as is because of
the intended semantics of the property. The property indicates that *when
the triple was written* the reviewer asserts that the current value of
the named property is correct. This means that you need to use the creation
date of the triple to extract the right property value from the graph for
the named property (and because there's no write protection, probably only
reviewers who are members of groups like "Staff IC" or "Staff OD" should be
counted).
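
To make the lookup described above concrete, here is a minimal sketch in Python. It assumes the write history of a property is available as a list of (timestamp, value) pairs sorted by time, and that the reviewer's group memberships are known. Both are hypothetical shapes chosen for the example, not the actual graphd or RDF dump format.

```python
from bisect import bisect_right

# Group names taken from the text above; treat the exact set as an assumption.
TRUSTED_GROUPS = {"Staff IC", "Staff OD"}


def value_at(history, when):
    """Return the value in effect at time `when`, or None if none existed yet.

    history: list of (timestamp, value) pairs sorted by timestamp ascending.
    """
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, when)  # number of writes at or before `when`
    return history[i - 1][1] if i else None


def reviewed_value(history, review_time, reviewer_groups, trusted=TRUSTED_GROUPS):
    """Return the value that was reviewed, or None if the reviewer is untrusted.

    review_time is the creation time of the is_reviewed triple, so the value
    returned is the one the reviewer saw, not whatever is current today.
    """
    if not set(reviewer_groups) & set(trusted):
        return None
    return value_at(history, review_time)
```

The point of the sketch is that a value written after the review triple was created is not covered by the review, so it should not inherit the quality signal.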