Thanks @Benjamin for the pointers!
I completely agree with @Tom.
I've also been researching techniques for crowdsourcing micro-tasks,
mostly for NLP activities like frame semantics annotation:
http://www.aclweb.org/anthology/P13-2130
http://ceur-ws.org/Vol-1030/paper-03.pdf
I found that a crowd of paid workers can really make a difference,
even for such difficult and subjective tasks.
So here are my 2 cents to get the best out of it:
1. Take extreme care with quality-check mechanisms: for instance, the
CrowdFlower.com platform has a facility that automatically discards
untrusted workers (a minimal sketch of such a check follows below);
2. The micro-task must be atomic, i.e., it must not contain multiple sub-tasks;
3. UI design is always crucial: use simple words, give clear examples, and
avoid screen scrolling.
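
To make point 1 concrete, here is a minimal sketch of a gold-question check
(the data layout, the worker answers, and the 0.7 accuracy threshold are all
assumptions, not taken from any particular platform):

from collections import defaultdict

# Gold tasks are micro-tasks whose correct answer is known in advance;
# the task IDs and answers below are made up for illustration.
GOLD_ANSWERS = {"task-7": "Male", "task-21": "Other"}
MIN_ACCURACY = 0.7  # assumed threshold for keeping a worker

def trusted_workers(judgments):
    """Return the workers whose accuracy on gold tasks meets the threshold."""
    correct, seen = defaultdict(int), defaultdict(int)
    for worker, task, answer in judgments:
        if task in GOLD_ANSWERS:
            seen[worker] += 1
            correct[worker] += int(answer == GOLD_ANSWERS[task])
    return {w for w in seen if correct[w] / seen[w] >= MIN_ACCURACY}

judgments = [
    ("alice", "task-7", "Male"), ("alice", "task-21", "Other"),
    ("bob", "task-7", "Female"), ("bob", "task-21", "Skip"),
]
print(trusted_workers(judgments))  # {'alice'}: bob failed both gold questions

Judgments from workers outside that set would simply be discarded before
aggregation.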
Cheers!
On 7/18/15 2:00 PM, wikidata-request(a)lists.wikimedia.org wrote:
> Date: Fri, 17 Jul 2015 13:42:55 -0400
> From: Tom Morris<tfmorris(a)gmail.com>
> To: "Discussion list for the Wikidata project."
> <wikidata(a)lists.wikimedia.org>
> Subject: Re: [Wikidata] Freebase is dead, long live :BaseKB
> Message-ID:
> <CAE9vqEF3+xQtkbBiiV0Co2Lz8_HMvKhTh=wHyZ2UP0PK=CmQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> 3,000 judgments per person per day sounds high to me, particularly on a
> sustained basis, but it really depends on the type of task. Some of the
> tasks were very simple, with custom high-performance, single-purpose "games"
> designed around them. For example, Genderizer presented a person's
> information and allowed choices of Male, Female, Other, and Skip. Using
> arrow key bindings for the four choices to allow quick selection without
> moving one's hand, pipelining (preloading the next topic in the background),
> and allowing votes to be undone in case of error were all features which
> allowed voters to make choices very quickly.
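
[A rough sketch of the bookkeeping such a tool needs; Genderizer's internals
are not public, so the key-to-choice mapping, prefetch depth, and example data
below are all assumed:]

import queue
import threading

# Assumed key-to-choice mapping; the real tool used arrow keys.
CHOICES = {"left": "Male", "right": "Female", "up": "Other", "down": "Skip"}

class JudgmentSession:
    """Toy model of a rapid-judgment loop: prefetching plus an undo stack."""

    def __init__(self, topics):
        self.prefetch = queue.Queue(maxsize=2)  # next topics load in the background
        self.history = []                       # (topic, choice) pairs, for undo
        threading.Thread(target=self._loader, args=(topics,), daemon=True).start()

    def _loader(self, topics):
        for topic in topics:         # stand-in for fetching topic data over the network
            self.prefetch.put(topic)

    def judge(self, key):
        topic = self.prefetch.get()  # already loaded, so no waiting between votes
        self.history.append((topic, CHOICES[key]))

    def undo(self):
        return self.history.pop() if self.history else None

session = JudgmentSession(["Q42", "Q7259", "Q1339"])
session.judge("left")
session.judge("right")
print(session.undo())    # ('Q7259', 'Female'): the mistaken vote is withdrawn
print(session.history)   # [('Q42', 'Male')]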
>
> The figures quoted in the paper below (18 seconds per judgment) work out to
> more like 1,600 judgments per eight-hour day. They collected 2.3 million
> judgments over the course of a year from 555 volunteers (1.05 million
> judgments) and 84 paid workers (1.25 million).
>
> On Fri, Jul 17, 2015 at 12:35 PM, Benjamin Good<ben.mcgee.good(a)gmail.com>
> wrote:
>
>> They wrote a really insightful paper about how their processes for
>> large-scale data curation worked. Among many other things, they
>> investigated mechanical turk 'micro tasks' versus hourly workers and
>> generally found the latter to be more cost effective.
>>
>> "The Anatomy of a Large-Scale Human Computation Engine"
>> http://wiki.freebase.com/images/e/e0/Hcomp10-anatomy.pdf
>>
> The full citation, in case someone needs to track it down, is:
>
> Kochhar, Shailesh, Stefano Mazzocchi, and Praveen Paritosh. "The anatomy of
> a large-scale human computation engine." *Proceedings of the ACM SIGKDD
> Workshop on Human Computation*. ACM, 2010.
>
> There's also a slide presentation by the same name which presents some
> additional information:
> http://www.slideshare.net/brixofglory/rabj-freebase-all-5049845
>
> Praveen Paritosh has written a number of papers on the topic of human
> computation, if you're interested in that (I am!):
> https://scholar.google.com/citations?user=_wX4sFYAAAAJ&hl=en&oi=sra
--
Marco Fossati
http://about.me/marco.fossati
Twitter: @hjfocs
Skype: hell_j
Hey folks :)
I've just added a meetup for us at Wikimania:
https://wikimania2015.wikimedia.org/wiki/Wikidata_Meetup
Hope to see many of you there.
Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Society for the Promotion of Free Knowledge (e. V.)
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognised as a non-profit by the
Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
In our wiki discussions, I increasingly see people linking Wikidata
items when wishing to explain a concept (as one would do on a Wiktionary
or Wikipedia entry) without "[[d:Q169207|forcing]]" the multilingual
audience to a specific target language.
Am I the only one seeing this trend? Do we all think it's a good thing?
(Though a minor internal one.)
To be upfront, I ask because I face unexpected opposition to the
adoption of this custom elsewhere: https://phabricator.wikimedia.org/T89719
Nemo
Hi all
We are planning on holding a meetup[1
<https://wikimania2015.wikimedia.org/wiki/Wiktionary-Wikidata_Meetup>] at
Wikimania for discussing Wiktionary-Wikidata/Wikibase and its place in the
Linked Open Data world.
The recent proposal[2
<https://www.wikidata.org/wiki/Wikidata:Wiktionary/Development/Proposals/201…>]
has again made this a current topic and we believe a meetup at Wikimania
would be a great opportunity to discuss this and related issues. It would
therefore be especially interesting if people from both projects would want
to show up.
In addition to looking at how Wiktionary will be integrated into Wikidata,
we are also interested in seeing how a structured Wiktionary fits into the
greater world of open data. Our personal background for this is
a collaboration between Wikimedia Sverige and the Swedish Centre for
Terminology[3 <http://tnc.se/the-swedish-centre-for-terminology.html>] who
are looking at making their resources available as Linked Open Data and
thinking about how Wikidata could fit as a node in this, and also about
whether Wikibase might be a suitable platform. That said, we would be
surprised if that were the only example out there. If you've had similar
thoughts or know of other connections, we would love to hear about them.
The meetup will take place on Friday 17 July 17:30-20:00. Location TBD.
Sign up on the meetup page[1
<https://wikimania2015.wikimedia.org/wiki/Wiktionary-Wikidata_Meetup>] if
you are interested in attending and post a note on the discussion page if
there is something specific you have been thinking about.
Regards,
André Costa / Lokal_Profil
P.S. Sorry for the short notice
[1] https://wikimania2015.wikimedia.org/wiki/Wiktionary-Wikidata_Meetup
[2]
https://www.wikidata.org/wiki/Wikidata:Wiktionary/Development/Proposals/201…
[3] http://tnc.se/the-swedish-centre-for-terminology.html
André Costa | GLAM-tekniker, Wikimedia Sverige | Andre.Costa(a)wikimedia.se |
+46 (0)733-964574
Support free knowledge, become a member of Wikimedia Sverige.
Read more at blimedlem.wikimedia.se
>> Note that Freebase did a lot of human curation and we know they could get
>> about 3000 verifications of facts by "non-experts" a day who were paid for
>> their efforts. That scales out to almost a million facts per FTE per year.
Where can I find out more about how they were able to do such high-volume
human curation? 3000/day is a huge number.
On Thu, Jul 16, 2015 at 5:01 AM, <wikidata-request(a)lists.wikimedia.org>
wrote:
> Date: Wed, 15 Jul 2015 15:25:27 -0400
> From: Paul Houle <ontology2(a)gmail.com>
> To: "Discussion list for the Wikidata project."
> <wikidata-l(a)lists.wikimedia.org>
> Subject: [Wikidata] Freebase is dead, long live :BaseKB
> Message-ID:
> <
> CAE__kdQt55E7k7xHMeuBCu9QrwRKoMU_60NDuYgcTHNkC7DFHA(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> For those who are interested in the project of getting something out of
> Freebase for use in Wikidata or somewhere else, I'd like to point out
>
> http://basekb.com/gold/
>
> this is a completely workable solution for running queries out of Freebase
> after the MQL API goes dark.
>
> I have been watching the discussion about the trouble of moving Freebase
> data to Wikidata, and I'd like to share some thoughts.
>
> First, quality is in the eye of the beholder, and if somebody defines
> quality as a matter of citing your sources, then that is their definition
> of 'quality' and they can attain it. You might have some other definition
> of quality and be appalled that Wikidata has so little to say about a topic
> that has caused much controversy and suffering:
>
> https://www.wikidata.org/wiki/Q284451
>
> There are ways to attain that too.
>
> Part of the answer is that different products are going to be used in
> different places. For instance, one person might need 100% coverage of
> books he wants to talk about, another one might want a really great
> database of ski areas, etc.
>
> Note that Freebase did a lot of human curation and we know they could get
> about 3000 verifications of facts by "non-experts" a day who were paid for
> their efforts. That scales out to almost a million facts per FTE per year.
>
>
>
> --
> Paul Houle
>
> *Applying Schemas for Natural Language Processing, Distributed Systems,
> Classification and Text Mining and Data Lakes*
>
> (607) 539 6254 paul.houle on Skype ontology2(a)gmail.com
> https://legalentityidentifier.info/lei/lookup/
> <http://legalentityidentifier.info/lei/lookup/>
>
Hi.
I recently started following mediawiki/extensions/Wikibase on Gerrit,
and was quite astonished to find that nearly all of the 100 most recently
updated changes appear to be owned by WMDE employees (the exceptions being
one change by Legoktm and some from L10n-bot). This is not the case, for
example, with mediawiki/core.
While this may be desired by the Wikidata team for corporate reasons, I
feel that encouraging code review by volunteers would empower both
Wikidata and third-party communities with new ways of contributing to
the project and raise awareness of the development team's goals in the
long term.
The messy naming conventions play a role too: for example, Extension:Wikibase
<https://www.mediawiki.org/w/index.php?title=Extension:Wikibase&redirect=no>
is supposed to host technical documentation but instead redirects to the
Wikibase <https://www.mediawiki.org/wiki/Wikibase> portal, with actual
documentation split into Extension:Wikibase Repository
<https://www.mediawiki.org/wiki/Extension:Wikibase_Repository> and
Extension:Wikibase Client
<https://www.mediawiki.org/wiki/Extension:Wikibase_Client>, apparently
ignoring the fact that the code is actually developed in a single
repository (correct me if I'm wrong). Just to add some more confusion,
there's also Extension:Wikidata build
<https://www.mediawiki.org/wiki/Extension:Wikidata_build> with no
documentation.
And what about wmde on GitHub <https://github.com/wmde> with countless
creatively-named repos? They make life even harder for potential
contributors.
Finally, the ever-changing client-side APIs make gadget development a
pain in the ass.
Sorry if this sounds like a slap in the face, but it had to be said.
For those who are interested in the project of getting something out of
Freebase for use in Wikidata or somewhere else, I'd like to point out
http://basekb.com/gold/
this is a completely workable solution for running queries out of Freebase
after the MQL API goes dark.
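
As a sketch of what such a query could look like once a :BaseKB dump has been
loaded into a local triple store (the endpoint URL, the Fuseki-style setup,
and the exact namespace prefix below are assumptions; adjust them to your
installation):

from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed local SPARQL endpoint with a :BaseKB dump loaded,
# e.g. via Apache Jena Fuseki.
sparql = SPARQLWrapper("http://localhost:3030/basekb/query")
sparql.setReturnFormat(JSON)

# List the English names of a handful of topics; the Freebase-style
# namespace prefix may differ in your copy of :BaseKB.
sparql.setQuery("""
    PREFIX ns: <http://rdf.basekb.com/ns/>
    SELECT ?topic ?name WHERE {
        ?topic ns:type.object.name ?name .
        FILTER(lang(?name) = "en")
    } LIMIT 10
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["topic"]["value"], row["name"]["value"])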
I have been watching the discussion about the trouble of moving Freebase data
to Wikidata, and I'd like to share some thoughts.
First, quality is in the eye of the beholder, and if somebody defines
quality as a matter of citing your sources, then that is their definition
of 'quality' and they can attain it. You might have some other definition
of quality and be appalled that Wikidata has so little to say about a topic
that has caused much controversy and suffering:
https://www.wikidata.org/wiki/Q284451
There are ways to attain that too.
Part of the answer is that different products are going to be used in
different places. For instance, one person might need 100% coverage of
books he wants to talk about, another one might want a really great
database of ski areas, etc.
Note that Freebase did a lot of human curation and we know they could get
about 3000 verifications of facts by "non-experts" a day who were paid for
their efforts. That scales out to almost a million facts per FTE per year.
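(Back-of-the-envelope, assuming roughly 300 working days a year: 3,000
verifications/day x 300 days ≈ 900,000 facts per full-time worker per year,
hence "almost a million".)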
--
Paul Houle
*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*
(607) 539 6254 paul.houle on Skype ontology2(a)gmail.com
https://legalentityidentifier.info/lei/lookup/
<http://legalentityidentifier.info/lei/lookup/>
We currently rely on the Wikidata Query API to identify whether or not a
set of claims exists on a given property. Some of our previous bot runs have
created duplicates because recent additions hadn't made it to the WDQ API yet.
In our efforts to prevent the creation of duplicate entries, I am trying to
better understand the WDQ API.
The WDQ API documentation [1] states: "Also, the data used here is from
WikiData 'dumps', so it can be a few hours old." However, when I check the
data dumps, they are updated either weekly (JSON dumps) or daily
(incremental XML dumps) [2].
Also, sometimes the WDQ API seems to pick up newly added claims instantly,
in the sense that they are immediately available through the WDQ API.
How often is the WDQ API really updated? Is it possible to query Wikidata
live with WDQ, and if not, are there alternatives that would allow this?
Regards,
Andra Waagmeester
[1] https://wdq.wmflabs.org/api_documentation.html
[2] https://www.wikidata.org/wiki/Wikidata:Database_download
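
One possible alternative for the duplicate check above is to ask the live
wikidata.org API directly instead of WDQ, so there is no dump lag. A minimal
sketch using the wbgetclaims module (the item, property, and value are just
examples, and the exact shape of the datavalue should be double-checked):

import requests

API = "https://www.wikidata.org/w/api.php"

def has_claim(item_id, property_id, target_qid):
    """Ask the live API whether an item already carries a given item-valued claim."""
    params = {
        "action": "wbgetclaims",
        "entity": item_id,
        "property": property_id,
        "format": "json",
    }
    claims = requests.get(API, params=params).json().get("claims", {})
    for claim in claims.get(property_id, []):
        value = claim["mainsnak"].get("datavalue", {}).get("value", {})
        if not isinstance(value, dict):
            continue  # non-item datavalues (strings, times, ...) cannot match a QID
        # Item values carry a numeric-id (and, in newer responses, an "id" like "Q5").
        if value.get("id") == target_qid or value.get("numeric-id") == int(target_qid[1:]):
            return True
    return False

# Example: does Douglas Adams (Q42) already state "instance of (P31) human (Q5)"?
print(has_claim("Q42", "P31", "Q5"))  # True, straight from the live database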