Hey everyone :)
Thomas, Denny, Sebastian, Thomas, and I have published a paper which was accepted for the industry track at WWW 2016. It covers the migration from Freebase to Wikidata. You can now read it here: http://research.google.com/pubs/archive/44818.pdf
Cheers Lydia
Lydia Pintscher, 18/02/2016 15:59:
Thomas, Denny, Sebastian, Thomas, and I have published a paper which was accepted for the industry track at WWW 2016. It covers the migration from Freebase to Wikidata. You can now read it here: http://research.google.com/pubs/archive/44818.pdf
Nice!
Concluding, in a fairly short amount of time, we have been able to provide the Wikidata community with more than 14 million new Wikidata statements using a customizable
I must admit that, despite knowing the context, I wasn't able to understand whether this is the number of "mapped"/"translated" statements or the number of statements actually added via the primary sources tool. I assume the latter given paragraph 5.3:
after removing duplicates and facts already contained in Wikidata, we obtain 14 million new statements. If all these statements were added to Wikidata, we would see a 21% increase of the number of statements in Wikidata.
Nemo
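A quick back-of-the-envelope check, using only the two figures quoted above:

    # rough sanity check of the quoted numbers
    new_statements = 14_000_000       # "14 million new statements"
    claimed_increase = 0.21           # "a 21% increase"
    implied_existing = new_statements / claimed_increase
    print(f"implied statements already in Wikidata: {implied_existing:,.0f}")
    # -> roughly 66,700,000

So the excerpt implies on the order of 67 million statements in Wikidata at the time of the paper's snapshot.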
Congratulations on a fantastic project and your acceptance at WWW 2016.
Make a great day, Max Klein ‽ http://notconfusing.com/
On Thu, Feb 18, 2016 at 10:54 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Lydia Pintscher, 18/02/2016 15:59:
Thomas, Denny, Sebastian, Thomas, and I have published a paper which was accepted for the industry track at WWW 2016. It covers the migration from Freebase to Wikidata. You can now read it here: http://research.google.com/pubs/archive/44818.pdf
Nice!
Concluding, in a fairly short amount of time, we have been able to provide the Wikidata community with more than 14 million new Wikidata statements using a customizable
I must admit that, despite knowing the context, I wasn't able to understand whether this is the number of "mapped"/"translated" statements or the number of statements actually added via the primary sources tool. I assume the latter given paragraph 5.3:
after removing duplicates and facts already contained in Wikidata, we obtain 14 million new statements. If all these statements were added to Wikidata, we would see a 21% increase of the number of statements in Wikidata.
I was confused about that too. "the [Primary Sources] tool has been used by more than a hundred users who performed about 90,000 approval or rejection actions. More than 14 million statements have been uploaded in total." I think that means that ≤ 90,000 items or statements were added, out of the 14 million available to be added through the Primary Sources tool.
Nemo
Hoi, I add statements from the primary sources tool by entering them myself rather than approving them in the tool (Primary Sources takes more time).
I am still of the strongest opinion that, given the extremely disappointing number of added statements, the Primary Sources tool is a failure.
It is sad that all the good work of Freebase is lost in this way. It is sad that we cannot even discuss this and consider alternatives. Thanks, GerardM
On 21.02.2016 16:00, Gerard Meijssen wrote:
Hoi, I add statements from the primary sources tool by entering them myself rather than approving them in the tool (Primary Sources takes more time).
I am still of the strongest opinion that, given the extremely disappointing number of added statements, the Primary Sources tool is a failure.
What is the number of added statements you refer to?
Markus
It is sad that all the good work of Freebase is lost in this way. It is sad that we cannot even discuss this and consider alternatives. Thanks, GerardM
Congrats everybody.
Now that we're at it, I'll highlight a usability problem I have with the primary sources tool: it reloads the page every time we approve something, which can take a lot of time for heavy pages and is a real blocker to approving a large number of claims :)
2016-02-21 16:33 GMT+02:00 Thomas Douillard thomas.douillard@gmail.com:
Now that we're at it, I'll highlight a usability problem I have with the primary sources tool: it reloads the page every time we approve something, which can take a lot of time for heavy pages and is a real blocker to approving a large number of claims :)
+1 Konstantinos Stampoulis geraki@geraki.gr http://www.geraki.gr
---- Contribute to Wikipedia. https://el.wikipedia.org ----------------------------------------------------------------------- The above opinions are personal and represent only myself. This message is to be considered confidential only if I have explicitly asked for that; otherwise you may use it in any public discussion. I have nothing to hide. :-)
Konstantinos Stampoulis, 21/02/2016 17:45:
Now that we're at it, I'll highlight a usability problem I have with the primary sources tool: it reloads the page every time we approve something, which can take a lot of time for heavy pages and is a real blocker to approving a large number of claims :)
+1
Yeah, this is a known issue: https://github.com/google/primarysources/issues/58
The gadget in itself isn't usable for mass additions, only for occasional use: https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool#Too_slow
Nemo
On 18.02.2016 15:59, Lydia Pintscher wrote:
Hey everyone :)
Thomas, Denny, Sebastian, Thomas, and I have published a paper which was accepted for the industry track at WWW 2016. It covers the migration from Freebase to Wikidata. You can now read it here: http://research.google.com/pubs/archive/44818.pdf
Congratulations!
Is it possible that you have actually used the flawed statistics from the Wikidata main page regarding the size of the project? 14.5M items in Aug 2015 seems far too low a number. Our RDF exports from mid August already contained more than 18.4M items. It would be nice to get this fixed at some point. There are currently almost 20M items, and the main page still shows only 16.5M.
Markus
On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 18.02.2016 15:59, Lydia Pintscher wrote:
Thomas, Denny, Sebastian, Thomas, and I have published a paper which was accepted for the industry track at WWW 2016. It covers the migration from Freebase to Wikidata. You can now read it here: http://research.google.com/pubs/archive/44818.pdf
Is it possible that you have actually used the flawed statistics from the Wikidata main page regarding the size of the project? 14.5M items in Aug 2015 seems far too low a number. Our RDF exports from mid August already contained more than 18.4M items. It would be nice to get this fixed at some point. There are currently almost 20M items, and the main page still shows only 16.5M.
Numbers are off throughout the paper. They also quote 48M instead of 58M topics for Freebase and mischaracterize some other key points. The key number is that 3.2 billion facts for 58 million topics have generated 106,220 new statements for Wikidata. If my calculator had more decimal places, I could tell you what percentage that is.
Tom
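The percentage Tom is alluding to works out as follows; this is just the arithmetic on the figures he cites (the 106,220 is the dashboard count he refers to later in the thread):

    # share of Freebase facts that ended up as Primary Sources approvals
    approved_statements = 106_220           # dashboard figure cited above
    freebase_facts = 3_200_000_000          # "3.2 billion facts"
    print(f"{approved_statements / freebase_facts:.5%}")
    # -> about 0.00332%

That is roughly three thousandths of one percent.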
On 21.02.2016 20:37, Tom Morris wrote:
Numbers are off throughout the paper. They also quote 48M instead of 58M topics for Freebase and mischaracterize some other key points. The key number is that 3.2 billion facts for 58 million topics have generated 106,220 new statements for Wikidata. If my calculator had more decimal places, I could tell you what percentage that is.
Obviously, any tool can only import statements for which we have items and properties at all, so the number of importable facts is much lower. I don't think anyone at Google could change this (they cannot override notability criteria, and they cannot even lead discussions to propose new content).
Markus
On Sun, Feb 21, 2016 at 4:25 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Obviously, any tool can only import statements for which we have items and properties at all, so the number of importable facts is much lower.
Obviously, but "much lower" from 3.2B is probably something like 50M-300M, not 0.1M.
Tom
On 22.02.2016 18:28, Tom Morris wrote:
Obviously, but "much lower" from 3.2B is probably something like 50M-300M, not 0.1M.
That estimate might be a bit off. The paper contains a detailed discussion of this aspect. The total number of statements that could be translated from Freebase to Wikidata is given as 17M, of which only 14M were new. So this seems to be the current upper bound of what you could import with PS or any other tool. The authors mention that this already includes more than 90% of the "reviewed" content of Freebase that refers to Wikidata items. The paper seems to suggest that these mapped+reviewed statements were already imported directly -- maybe Lydia could clarify if this was the case.
It seems that if you want to go to the dimensions that you refer to (50M/300M/3200M) you would need to map more Wikidata items to Freebase topics in some way. The paper gives several techniques that were used to obtain mappings that are already more than what we have stored in Wikidata now. So it is probably not the lack of mappings but the lack of items that is the limit here. Data can only be imported if we have a page at all ;-)
Btw., where does the figure of 100K imported statements that you mentioned here come from? I was also interested in that number but I could not find it in the paper.
Markus
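For anyone who wants to probe the mapping coverage being discussed here, a minimal sketch (not from the paper) of how one could count the Wikidata items that carry a Freebase ID (property P646) via the public Wikidata Query Service. This only approximates the coverage discussed above, since the paper's mappings were computed with additional techniques and are not all stored in Wikidata:

    import requests

    # Count Wikidata items that have a Freebase ID (P646) using the public
    # SPARQL endpoint. A full count can be slow and may hit the query timeout.
    QUERY = """
    SELECT (COUNT(?item) AS ?mapped) WHERE {
      ?item wdt:P646 ?freebaseId .
    }
    """

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "freebase-mapping-check/0.1 (example script)"},
    )
    response.raise_for_status()
    count = response.json()["results"]["bindings"][0]["mapped"]["value"]
    print(f"Wikidata items with a Freebase ID: {count}")

The result can be compared against the 4.5M mapped topics Tom mentions later in the thread.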
On Tue, Feb 23, 2016 at 1:28 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 22.02.2016 18:28, Tom Morris wrote:
Obviously, but "much lower" from 3.2B is probably something like 50M-300M, not 0.1M.
That estimate might be a bit off. The paper contains a detailed discussion of this aspect.
Or the paper might be off. Addressing the flaws in the paper would require a full paper in its own right.
I don't mean to imply that numbers are the only thing that's important, because they're just one measure of how much value has been extracted from the Freebase data, but the relative magnitudes of the numbers are startling.
The total number of statements that could be translated from Freebase to Wikidata is given as 17M, of which only 14M were new. So this seems to be the current upper bound of what you could import with PS or any other tool.
That's the upper bound using that particular methodology. Only 4.5M of the 20M Wikidata topics were mapped, when, given that Wikidata items have to appear in a Wikipedia and that Freebase includes all of English Wikipedia, one would expect a much higher percentage to be mappable.
The authors mention that this already includes more than 90% of the "reviewed" content of Freebase that refers to Wikidata items. The paper seems to suggest that these mapped+reviewed statements were already imported directly -- maybe Lydia could clarify if this was the case.
More clarity and information is always welcome, but since this is mentioned as a possible future work item in Section 7, I'm guessing it wasn't done yet.
It seems that if you want to go to the dimensions that you refer to (50M/300M/3200M) you would need to map more Wikidata items to Freebase topics in some way. The paper gives several techniques that were used to obtain mappings that are already more than what we have stored in Wikidata now. So it is probably not the lack of mappings but the lack of items that is the limit here. Data can only be imported if we have a page at all ;-)
If it's true that only 25% of Wikidata items appear in Freebase, I'd be amazed (and I'd like to see an analysis of what makes up that other 75%).
Btw. where do the 100K imported statements come from that you mentioned here? I was also interested in that number but I could not find it in the paper.
The paper says in section 4, "At the time of writing (January, 2016), the tool has been used by more than a hundred users who performed about 90,000 approval or rejection actions." which probably means ~80,000 new statements (since ~10% get rejected). My 106K number is from the current dashboard https://tools.wmflabs.org/wikidata-primary-sources/status.html.
Tom
On 23.02.2016 16:30, Tom Morris wrote: ...
Or the paper might be off. Addressing the flaws in the paper would require a full paper in its own right.
Criticising papers is good academic practice. Doing so without factual support, however, is not. You may be right, but you should try to produce a bit more evidence than your intuition.
[...]
The paper says in section 4, "At the time of writing (January, 2016), the tool has been used by more than a hundred users who performed about 90,000 approval or rejection actions." which probably means ~80,000 new statements (since ~10% get rejected). My 106K number is from the current dashboard https://tools.wmflabs.org/wikidata-primary-sources/status.html.
As Gerard has pointed out before, he prefers to re-enter statements instead of approving them. This means that the real number of "imported" statements is higher than what is shown in the dashboard (how much so depends on how many statements Gerard and others with this approach have added). It seems that one should rather analyse the number of statements that are already in Wikidata than just the ones that were approved directly.
Markus
Hi!
As Gerard has pointed out before, he prefers to re-enter statements instead of approving them. This means that the real number of "imported" statements is higher than what is shown in the dashboard (how much so depends on how many statements Gerard and others with this approach have added). It seems that one should rather analyse the number of statements
Yes, I do that sometimes too - if there is a statement saying "spouse: X" on Wikidata, and a statement in Freebase saying the same but with a start date, or the Freebase one has a more precise date than the Wikidata one, such as a full date instead of just a year, I will modify the original statement and reject the Freebase one. I'm not sure this is the best practice with regard to tracking numbers, but it's the easiest, and even if my personal numbers do not matter too much, I imagine other people do this too. So a rejection does not really mean the data was not entered - it may mean it was entered in a different way. Sometimes, also, while the data is already there, the reference is not, so the reference gets added.
On Tue, Feb 23, 2016 at 1:52 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
As Gerard has pointed out before, he prefers to re-enter statements instead of approving them. This means that the real number of "imported" statements is higher than what is shown in the dashboard (how much so depends on how many statements Gerard and others with this approach have added). It seems that one should rather analyse the number of statements
Yes, I do that sometimes too - if there is a statement saying "spouse: X" on Wikidata, and a statement in Freebase saying the same but with a start date, or the Freebase one has a more precise date than the Wikidata one, such as a full date instead of just a year, I will modify the original statement and reject the Freebase one.
I filed a bug report for this yesterday: https://github.com/google/primarysources/issues/73 I'll add the information about more precise qualifiers, since I didn't address that part.
I'm not sure this is the best practice with regard to tracking numbers but it's easiest and even if my personal numbers do not matter too much I imagine other people do this too. So rejection does not really mean the data was not entered - it may mean it was entered in a different way. Sometimes also while the data is already there, the reference is not, so the reference gets added.
Even if you don't care about your personal numbers, I'd argue that not being able to track the quality of data sources feeding the Primary Sources tool is an issue. It's valuable to measure quality not only for entire data sets, but also for particular slices of them, since data sources, at least large ones like Freebase, are rarely homogeneous in quality.
It's also clearly an issue that the tool is so awkward that people are working around it instead of having it help them.
Tom
Hi all,
thank you for the interest in the primary sources tool!
I wanted to make sure that there are no false expectations. Google has committed to deliver the initial tool. Thanks to Thomas P’s internship and support from Thomas S and Sebastian, and with the release of the data, the code, the paper, and all services running on Wikimedia infrastructure, we have achieved that milestone. The tool was developed as open source, in order to allow the community to continue to mold it and to invest in it as the community sees warranted.
I am particularly thankful to Marco Fossati for his work in creating further datasets. Thomas S has in the last few days cleaned up the issue list and merged pull requests. Thank you, Thomas! We are all very thankful for the pull requests, in particular to Thomas P, Wieland Hoffmann, and Tom Morris. In general, we plan to keep the tool up as far as our time allows, and continue to merge such requests, but we have no concrete plans of extending its functionality right now.
We are very grateful to everyone contributing to the project or using the tool. If anyone wants to take over the project, we would invite you to contribute a bit for a while, and then let's discuss it. I would be thrilled to see this tool develop.
As a reminder, a lot of data has been released under CC0. We invite all to play around with the data and see if there are slices of the data that can be directly uploaded to Wikidata, as Gerard suggests.
If there are any questions, we’ll try to answer them. Again, thanks everyone!
Cheers, Denny
Hi all,
Providing partial answers to some of the questions raised in this thread:
Regarding the hard-refresh-upon-approve/reject issue in the tool's front-end: there are technical reasons for this, which I will hopefully elaborate on in the GitHub issue (https://github.com/google/primarysources/issues/58). A reminder that the tool is meant for ad-hoc usage with manual source inspection, not mass insertion.
Regarding the numbers from the paper, Thomas P.-T. and Denny V. are the core contacts.
Regarding the usage dashboard: Sebastian S. is running it in the tool's back-end. Note that the top-users stats are currently under discussion (https://github.com/google/primarysources/issues/67).
Cheers, Tom