---------- Forwarded message ----------
From: "Marco Fossati" <fossati(a)fbk.eu>
Date: 11 Nov 2016 1:23 PM
Subject: Fwd: Re: [wikicite-discuss] Entity tagging and fact extraction
(from a scholarly publisher perspective)
To: "Marco Fossati" <fossati(a)spaziodati.eu>

---------- Forwarded message ----------
From: "Marco Fossati" <fossati(a)fbk.eu>
Date: 11 Nov 2016 1:18 PM
Subject: Re: [wikicite-discuss] Entity tagging and fact extraction (from a
scholarly publisher perspective)
To: "Andrew Smeall" <andrew.smeall(a)hindawi.com>
Cc: "Dario Taraborelli" <dtaraborelli(a)wikimedia.org>, "Benjamin Good" <
ben.mcgee.good(a)gmail.com>, "Discussion list for the Wikidata project." <
wikidata(a)lists.wikimedia.org>, "wikicite-discuss" <
wikicite-discuss(a)wikimedia.org>, "Daniel Mietchen" <
Just a couple of thoughts, which are in line with Dario's first message:
1. The primary sources tool lets third-party providers release *full
datasets* fairly quickly. It is conceived to (a) ease the ingestion
of *non-curated* data and (b) let the community directly decide which
statements should be included, instead of requiring potentially complex
a priori curation.
Important: the datasets should comply with the Wikidata vocabulary/ontology.
2. I see the mix'n'match tool as a way to *link* datasets with Wikidata via
ID mappings, thus only requiring statements that say "Wikidata entity X
links to the third party dataset entity Y".
This is pretty much what the linked data community has been doing so far.
No need to comply with the Wikidata vocabulary/ontology.
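To make the contrast concrete, here is a minimal sketch of the two kinds of
statements (not from the original thread; the item IDs, the external-ID
property "Pxxxx" and the value "ABC-42" are placeholders, while P921, main
subject, is a real Wikidata property):

    # Sketch: full-statement ingestion vs. ID mapping for a third-party dataset.
    # "Q..." / "Pxxxx" / "ABC-42" are placeholders, not real identifiers.

    # (1) Primary sources tool: full statements expressed with Wikidata's own
    #     vocabulary, e.g. "article Q123 has main subject (P921) Q456".
    full_statement = {
        "subject": "Q123",   # Wikidata item for the article
        "property": "P921",  # main subject
        "value": "Q456",     # Wikidata item for the topic
    }

    # (2) Mix'n'match: one mapping statement per entity, via an
    #     external-identifier property: "Q123 has <dataset> ID ABC-42".
    id_mapping = {
        "subject": "Q123",
        "property": "Pxxxx",  # external-ID property for the third-party dataset
        "value": "ABC-42",    # the entity's ID in that dataset
    }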
On 11 Nov 2016 10:27 AM, "Andrew Smeall" <andrew.smeall(a)hindawi.com> wrote:
> Regarding the topics/vocabularies issue:
> A challenge we're working on is finding a set of controlled vocabularies
> for all the subject areas we cover.
> We do use MeSH for those subjects, but this only applies to about 40% of
> our papers. In Engineering, for example, we've had more trouble finding an
> open taxonomy with the same level of depth as MeSH. For most internal
> applications, we need 100% coverage of all subjects.
> Machine learning for concept tagging is trendy now, partly because it
> doesn't require a preset vocabulary, but we are somewhat against this
> approach because we want to control the mapping of terms, and a taxonomic
> hierarchy can be useful. The current ML tools I've seen can match against a
> controlled vocabulary, but then they need the publisher to supply the terms.
> The temptation to build a new vocabulary is strong, because it's the
> fastest way to get to something that is non-proprietary and universal. We
> can merge existing open vocabularies like MeSH and PLOS to get most of the
> way there, but we then need to extend that with concepts from our corpus.
> Thanks Daniel and Benjamin for your responses. Any other feedback would be
> great, and I'm always happy to delve into issues from the publisher
> perspective if that can be helpful.
> On Fri, Nov 11, 2016 at 4:54 PM, Dario Taraborelli <
> dtaraborelli(a)wikimedia.org> wrote:
>> Benjamin – agreed, I too see Wikidata as mainly a place to hold all the
>> mappings. Once we support federated queries in WDQS, the benefit of ID
>> mapping (over extensive data ingestion) will become even more apparent.
>> Hope Andrew and other interested parties can pick up this thread.
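As a rough sketch of the kind of federated query this alludes to (assuming
the Python SPARQLWrapper library; the publisher endpoint, its vocabulary and
the external-ID property Pxxxx are hypothetical placeholders, and WDQS only
federates with endpoints it explicitly allows):

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Join ID mappings stored in Wikidata with data living in a (hypothetical)
    # third-party SPARQL endpoint, instead of ingesting that data into Wikidata.
    QUERY = """
    PREFIX ex: <https://example.org/ns#>
    SELECT ?item ?externalId ?remoteLabel WHERE {
      ?item wdt:Pxxxx ?externalId .            # mapping statement in Wikidata
      SERVICE <https://example.org/sparql> {   # placeholder publisher endpoint
        ?remote ex:identifier ?externalId ;
                rdfs:label ?remoteLabel .
      }
    }
    LIMIT 10
    """

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()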
>> On Wed, Nov 2, 2016 at 12:11 PM, Benjamin Good <ben.mcgee.good(a)gmail.com> wrote:
>>> One message you can send is that they can and should use existing
>>> controlled vocabularies and ontologies to construct the metadata they want
>>> to share. For example, MeSH descriptors would be a good way for them to
>>> organize the 'primary topic' assertions for their articles and would make
>>> it easy to find the corresponding items in Wikidata when uploading. Our
>>> group will be continuing to expand coverage of identifiers and concepts
>>> from vocabularies like that in Wikidata - and any help there from
>>> publishers would be appreciated!
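As one hedged illustration of the kind of lookup this enables (assuming the
SPARQLWrapper library; the descriptor ID "D012345" is a placeholder, while
P486, the MeSH descriptor property, and P921, main subject, are real), a
publisher holding MeSH terms could resolve each descriptor to its Wikidata
item before asserting main subject statements:

    from SPARQLWrapper import SPARQLWrapper, JSON

    def items_for_mesh_descriptor(mesh_id):
        """Return the Wikidata item URI(s) carrying a given MeSH ID (P486)."""
        sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
        sparql.setQuery(
            'SELECT ?item WHERE { ?item wdt:P486 "%s" . }' % mesh_id)
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        return [b["item"]["value"] for b in bindings]

    # Placeholder descriptor ID; real uploads would use the publisher's terms.
    print(items_for_mesh_descriptor("D012345"))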
>>> My view here is that Wikidata can be a bridge to the terminologies and
>>> datasets that live outside it - not really a replacement for them. So, if
>>> they have good practices about using shared vocabularies already, it should
>>> (eventually) be relatively easy to move relevant assertions into the
>>> Wikidata graph while maintaining interoperability and integration with
>>> external software systems.
>>> On Wed, Nov 2, 2016 at 8:31 AM, 'Daniel Mietchen' via wikicite-discuss <
>>> wikicite-discuss(a)wikimedia.org> wrote:
>>>> I'm traveling ( https://twitter.com/EvoMRI/status/793736211009536000
>>>> ), so just in brief:
>>>> In terms of markup, some general comments are in
>>>> https://www.ncbi.nlm.nih.gov/books/NBK159964/ , which is not specific
>>>> to Hindawi but partly applies to them too.
>>>> A problem specific to Hindawi (cf.
>>>> https://commons.wikimedia.org/wiki/Category:Media_from_Hindawi) is the
>>>> bundling of the descriptions of all supplementary files, which
>>>> translates into uploads like
>>>> (with descriptions for nine files)
>>>> and eight files with no description, e.g.
>>>> There are other problems in their JATS, and it would be good if they
>>>> would participate in
>>>> http://jats4r.org/ . Happy to dig deeper with Andrew or whoever is interested.
>>>> Where they are ahead of the curve is licensing information, so they
>>>> could help us set up workflows to get that info into Wikidata.
>>>> In terms of triple suggestions to Wikidata:
>>>> - as far as article metadata is concerned, I would prefer to
>>>> concentrate on integrating our workflows with the major repositories
>>>> of metadata, to which publishers are already posting. They could help
>>>> us by using more identifiers (e.g. for authors, affiliations, funders
>>>> etc.), potentially even from Wikidata (e.g. for keywords / P921, for
>>>> both journals and articles) and by contributing to the development of
>>>> tools (e.g. a bot that goes through the CrossRef database every day
>>>> and creates Wikidata items for newly published papers; see the sketch
>>>> after this list).
>>>> - if they have ways to extract statements from their publication
>>>> corpus, it would be good if they would let us/ ContentMine/ StrepHit
>>>> etc. know, so we could discuss how to move this forward.
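A hedged sketch of the daily CrossRef bot idea mentioned above (the CrossRef
REST API call is real; the Wikidata write step is deliberately left as a
stub, since item creation and duplicate checks against DOI (P356) would need
their own, agreed-upon workflow):

    import datetime
    import requests

    def new_crossref_works(rows=100):
        """Fetch works indexed by CrossRef since yesterday."""
        since = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"filter": "from-index-date:" + since, "rows": rows},
        )
        resp.raise_for_status()
        return resp.json()["message"]["items"]

    for work in new_crossref_works():
        doi = work.get("DOI")
        title = (work.get("title") or [""])[0]
        # Stub: check whether an item with this DOI already exists and
        # create one otherwise; the actual Wikidata write is out of scope here.
        print(doi, title)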
>>>> On Wed, Nov 2, 2016 at 1:42 PM, Dario Taraborelli
>>>> <dtaraborelli(a)wikimedia.org> wrote:
>>>> > I'm at the Crossref LIVE 16 event in London, where I just gave a talk
>>>> > on WikiCite and Wikidata targeted at scholarly publishers.
>>>> > Besides Crossref and DataCite people, I talked to a bunch of folks
>>>> > interested in collaborating on Wikidata integration, particularly from
>>>> > PLOS and Springer Nature. I started an interesting discussion with Andrew,
>>>> > who runs strategic projects at Hindawi, and I wanted to open it up to
>>>> > everyone on the lists.
>>>> > Andrew asked me whether, aside from efforts like ContentMine and
>>>> > StrepHit, there are any recommendations for publishers (especially OA
>>>> > publishers) to mark up their contents and facilitate information
>>>> > extraction and matching, or even push triples to Wikidata to be
>>>> > considered for inclusion.
>>>> > I don't think we have a recommended workflow for data providers to
>>>> > facilitate triple suggestions to Wikidata, other than leveraging the
>>>> > Primary Sources Tool. However, aligning keywords and terms with the
>>>> > corresponding Wikidata items via ID mapping sounds like a good first
>>>> > step. I pointed Andrew to Mix'n'Match as a handy way of mapping
>>>> > identifiers, but if you have other ideas on how to best support 2-way
>>>> > integration of Wikidata with scholarly contents, please chime in.
>>>> > Dario
>>>> > --
>>>> > Dario Taraborelli Head of Research, Wikimedia Foundation
>>>> > wikimediafoundation.org • nitens.org • @readermeter
>> *Dario Taraborelli* Head of Research, Wikimedia Foundation
>> wikimediafoundation.org • nitens.org • @readermeter
> Andrew Smeall
> Head of Strategic Projects
> Hindawi Publishing Corporation
> Kirkman House
> 12-14 Whitfield Street, 3rd Floor
> London, W1T 2RF
> United Kingdom
A conversation that happened on Twitter the other day suggests that we're
not doing a good enough job at structuring documentation on wiki.
While I think the subpages of [[m:WikiCite]]
<https://meta.wikimedia.org/wiki/WikiCite> are fairly well organized, [[WD:WikiProject
Source]] <https://www.wikidata.org/wiki/WD:WikiProject Source> badly needs
some reorganization:
- the WikiProject's *landing page* contains a lot of unstructured and
fairly outdated information: some of this information could be moved to
subpages, and we could use a navigation template like other WikiProjects do.
- *data modeling proposals* for different types of works are currently
only captured via a template (Template:Bibliographical_properties),
buried on the WP's talk page: we should have a dedicated page to host data
models, ideally a big table listing and annotating properties for different
types of works, as well as their mappings to existing bibliographic models.
- other important proposals (such as the use
of *stated in* (P248) to represent the *provenance of citation data* for
statements using *cites* (P2860)), as well as the documentation of specific *data
import strategies* (the Zika corpus, the OA review literature, PMCID
references in enwiki), are similarly buried in the talk page and hard to
find: this may raise a few eyebrows in the Wikidata community if we don't
make it clear how and why this data is being imported or represented.
If someone on this list is willing to spend some time and help with some
documentation/design effort, it would be tremendously useful, especially to
people who are not yet regularly following WikiCite and WP:Source Metadata:
we need to create an inclusive environment, and readability/navigability for
newbies is the first important step.
Perhaps of interest to other lists.
---------- Forwarded message ----------
From: Yuri Astrakhan <yastrakhan(a)wikimedia.org>
Date: Wed, Nov 9, 2016 at 9:46 AM
Subject: [Wikitech-l] Localizable data for Graphs and Templates on Commons
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Following the localizable maps example, here is a structured tabular data
example that also supports localization and shared data, and that can be used
directly from graphs or from Lua scripts on any wiki. Note that the
graph itself is on the English wiki (labs), but the data comes from Commons.
Feel free to add translations.
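As a rough illustration of how such a shared, localizable table can be read
(a sketch only: on-wiki consumers would use the Graph extension or Lua, and
"Data:Example.tab" is a placeholder page name, not the page Yuri refers to):

    import requests

    # Fetch the raw JSON of a (hypothetical) tabular Data: page on Commons.
    resp = requests.get(
        "https://commons.wikimedia.org/w/index.php",
        params={"title": "Data:Example.tab", "action": "raw"},
    )
    resp.raise_for_status()
    table = resp.json()

    # Field titles are per-language dictionaries, e.g. {"en": "...", "de": "..."},
    # which is what makes the table localizable.
    for field in table["schema"]["fields"]:
        titles = field.get("title", {})
        print(field["name"], titles.get("en") or next(iter(titles.values()), ""))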
On Mon, Nov 7, 2016 at 11:46 PM Yuri Astrakhan <yastrakhan(a)wikimedia.org>
> I would like to show one of the projects that Interactive team has been
> hacking on: localizable maps data (GeoJSON), stored on Commons, and usable
> from multiple wikis. I hope we can get it polished and enabled in
> production soon enough - so far, lab's beta cluster only:
My name is Nazanin and I am a Wikipedia editor. I studied statistics in my
undergraduate program and recently began my master's program, also in
statistics. Given my interest in Wikimedia projects, I would like to choose
a Wikidata-related project as my master's thesis. I have already worked on
descriptive statistics, and I look forward to working on Monte Carlo methods
for forecasting, simulation and modelling. How can I help with Wikidata,
and how could we have a productive cooperation?
All the best,
Wikimedia is among the 17 organizations in Google Code-in (GCI) 2016!
GCI starts on November 28th. It's a contest in which 13-17 year old students
work on small tasks, and a great opportunity to let new contributors make
progress and to get help with smaller tasks on your To-Do list!
There are currently 23 open Wikidata tasks marked as easy:
(and a good bunch of them are already marked for GCI, thanks!)
What we want you to do:
BECOME A MENTOR:
1. Go to https://www.mediawiki.org/wiki/Google_Code-in_2016 and add
yourself to the mentor's table.
2. Get an invitation email to register on the contest site.
PROVIDE SMALL TASKS:
We want your tasks in the following areas: code, outreach/research,
documentation/training, quality assurance, user interface/design.
1. Create a Phabricator task (which would take you 2-3h to complete) or
pick an existing Phabricator task you'd mentor.
2. Add the "Google-Code-In-2016" project tag.
3. Add a comment "I will mentor this in #GCI2016".
Looking for task ideas? Check the "easy" tasks in Phabricator:
https://www.mediawiki.org/wiki/Annoying_little_bugs offers links.
Make sure to cover expectations and deliverables in your task.
And once the contest starts on Nov 28, be ready to answer questions and
review the students' work.
Any questions? Just ask, we're happy to help.
Thank you for your help broadening our contributor base!
Andre Klapper | Wikimedia Bugwrangler
Over the past years a lot of people have rightfully complained about
how we are handling rounding and uncertainty in quantity values. Until
now, entering 124 m meant it was parsed, stored and displayed as 124
+/-1 m. People then often tried to change this to 124 +/-0 m in order
to prevent the uncertainty from being shown. This is often incorrect.
People also disagreed with the default of +/-1 instead of +/-0.5.
We have sat down and discussed at length how to improve this situation
and have now prepared two changes:
1) If no uncertainty is explicitly entered, we do not try to guess it
and no uncertainty will be displayed. But if an uncertainty is given
it will always be displayed, even if it is +/-0.
2) In order to apply correct rounding (in particular for unit
conversion) we still need to know the uncertainty interval. If no
uncertainty was explicitly given, we will still need to guess it. We
have now halved the default uncertainty interval to be consistent
with the rounding interval. So, for example, 124 would be treated as
124 +/-0.5 when applying rounding.
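For illustration only, here is a small Python restatement of those two
changes (this is not Wikibase code; it just encodes the display rule and the
new +/-0.5 rounding default described above):

    def display(amount, uncertainty=None):
        """Show an uncertainty only if one was explicitly entered."""
        if uncertainty is None:
            return str(amount)                     # no guessed uncertainty shown
        return "%s +/-%s" % (amount, uncertainty)  # shown even if it is 0

    def rounding_interval(uncertainty=None):
        """Interval used internally for rounding and unit conversion."""
        return 0.5 if uncertainty is None else uncertainty

    print(display(124))         # "124"
    print(display(124, 0))      # "124 +/-0"
    print(rounding_interval())  # 0.5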
We now have a lot of quantity values that are marked as exact (+/-0)
while their uncertainty is not explicitly known. With the above
changes all of these will be shown with +/-0. We also have a lot of
values that have the old default precision (+/-1), which prevents the
new default from being used.
Therefore we suggest the following two bot runs:
* Remove +/-0 from quantity values.
* Remove the precision if it is equal to the old default precision.
Both should be applied to properties that represent measured
quantities. We should avoid applying them to properties that represent
conversions or similar exactly-defined values. We can prepare the bot runs.
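A rough sketch of what such a cleanup could look for, operating on the JSON
serialization of a quantity value (amount/lowerBound/upperBound are part of
the real format; the decision logic and the choice of affected properties are
only illustrative and would need community agreement):

    from decimal import Decimal

    def strip_default_bounds(value):
        """Drop bounds that encode +/-0 or the old +/-1 default; keep others."""
        lower, upper = value.get("lowerBound"), value.get("upperBound")
        if lower is None or upper is None:
            return value                           # already has no bounds
        amount = Decimal(value["amount"])
        lower, upper = Decimal(lower), Decimal(upper)
        is_exact = lower == amount == upper                            # +/-0
        is_old_default = (amount - lower == 1) and (upper - amount == 1)
        if is_exact or is_old_default:
            return {k: v for k, v in value.items()
                    if k not in ("lowerBound", "upperBound")}
        return value

    print(strip_default_bounds(
        {"amount": "+124", "unit": "1",
         "lowerBound": "+123", "upperBound": "+125"}))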
These changes require a breaking change to the API and JSON binding. Daniel
will send a separate email about that in a minute.
We plan to make this change on the live system on November 15th. It
can be tested on the beta before that. I will let you know when.
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 Nz. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number
27/029/42207.