Re: [Wikidata] [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)

11 Nov 2016

But the Mix'n'Match tool doesn't work fully -- at least not for Wiki Markup
when I tested it the last time. This functionality is critical.

- Erika

*Erika Herzog*
Wikipedia *User:BrillLyle <https://en.wikipedia.org/wiki/User:BrillLyle>*

On Fri, Nov 11, 2016 at 4:18 PM, Marco Fossati &lt;fossati(a)fbk.eu&gt; wrote:

...
  Hi everyone,

 Just a couple of thoughts, which are in line with Dario's first message:
 1. the primary sources tool lets third party providers release *full
 datasets* in a rather quick way. It is conceived to (a) ease the ingestion
 of *non-curated* data and to (b) make the community directly decide which
 statements should be included, instead of eventually complex a priori
 discussions.
 Important: the datasets should comply with the Wikidata
 vocabulary/ontology.

 2. I see the mix'n'match tool as a way to *link* datasets with Wikidata
 via ID mappings, thus only requiring statements that say "Wikidata entity X
 links to the third party dataset entity Y".
 This is pretty much what the linked data community has been doing so far.
 No need to comply with the Wikidata vocabulary/ontology.

 Best,

 Marco

 Il 11 nov 2016 10:27 AM, "Andrew Smeall" &lt;andrew.smeall(a)hindawi.com&gt; ha
 scritto:

  Regarding the topics/vocabularies issue:

 A challenge we're working on is finding a set of controlled vocabularies
 for all the subject areas we cover.

 We do use MeSH for those subjects, but this only applies to about 40% of
 our papers. In Engineering, for example, we've had more trouble finding an
 open taxonomy with the same level of depth as MeSH. For most internal
 applications, we need 100% coverage of all subjects.

 Machine learning for concept tagging is trendy now, partly because it
 doesn't require a preset vocabulary, but we are somewhat against this
 approach because we want to control the mapping of terms and a taxonomic
 hierarchy can be useful. The current ML tools I've seen can match to a
 controlled vocabulary, but then they need the publisher to supply the terms.

 The temptation to build a new vocabulary is strong, because it's the
 fastest way to get to something that is non-proprietary and universal. We
 can merge existing open vocabularies like MeSH and PLOS to get most of the
 way there, but we then need to extend that with concepts from our corpus.

 Thanks Daniel and Benjamin for your responses. Any other feedback would
 be great, and I'm always happy to delve into issues from the publisher
 perspective if that can be helpful.

 On Fri, Nov 11, 2016 at 4:54 PM, Dario Taraborelli <
 dtaraborelli(a)wikimedia.org&gt; wrote:

  Benjamin – agreed, I too see Wikidata as mainly a
place to hold all the
 mappings. Once we support federated queries in WDQS, the benefit of ID
 mapping (over extensive data ingestion) will become even more apparent.

 Hope Andrew and other interested parties can pick up this thread.

 On Wed, Nov 2, 2016 at 12:11 PM, Benjamin Good &lt;ben.mcgee.good(a)gmail.com
  wrote: 
  Dario,

 One message you can send is that they can and should use existing
 controlled vocabularies and ontologies to construct the metadata they want
 to share.  For example, MeSH descriptors would be a good way for them to
 organize the 'primary topic' assertions for their articles and would make
 it easy to find the corresponding items in Wikidata when uploading.  Our
 group will be continuing to expand coverage of identifiers and concepts
 from vocabularies like that in Wikidata - and any help there from
 publishers would be appreciated!

 My view here is that Wikidata can be a bridge to the terminologies and
 datasets that live outside it - not really a replacement for them.  So, if
 they have good practices about using shared vocabularies already, it should
 (eventually) be relatively easy to move relevant assertions into the
 WIkidata graph while maintaining interoperability and integration with
 external software systems.

 -Ben

 On Wed, Nov 2, 2016 at 8:31 AM, 'Daniel Mietchen' via wikicite-discuss
 &lt;wikicite-discuss(a)wikimedia.org&gt; wrote:

> I'm traveling ( https://twitter.com/EvoMRI/status/793736211009536000
> ), so just in brief:
> In terms of markup, some general comments are in
> https://www.ncbi.nlm.nih.gov/books/NBK159964/ , which is not specific
> to Hindawi but partly applies to them too.
>
> A problem specific to Hindawi (cf.
> https://commons.wikimedia.org/wiki/Category:Media_from_Hindawi) is the
> bundling of the descriptions of all supplementary files, which
> translates into uploads like
> https://commons.wikimedia.org/wiki/File:Evolution-of-Coronar
> y-Flow-in-an-Experimental-Slow-Flow-Model-in-Swines-Angiogra
> phic-and-623986.f1.ogv
> (with descriptions for nine files)
> and eight files with no description, e.g.
> https://commons.wikimedia.org/wiki/File:Evolution-of-Coronar
> y-Flow-in-an-Experimental-Slow-Flow-Model-in-Swines-Angiogra
> phic-and-623986.f2.ogv
> .
>
> There are other problems in their JATS, and it would be good if they
> would participate in
> http://jats4r.org/ . Happy to dig deeper with Andrew or whoever is
> interested.
>
> Where they are ahead of the curve is licensing information, so they
> could help us set up workflows to get that info into Wikidata.
>
> In terms of triple suggestions to Wikidata:
> - as long as article metadata is concerned, I would prefer to
> concentrate on integrating our workflows with the major repositories
> of metadata, to which publishers are already posting. They could help
> us by using more identifiers (e.g. for authors, affiliations, funders
> etc.), potentially even from Wikidata (e.g. for keywords/ P921, for
> both journals and articles) and by contributing to the development of
> tools (e.g. a bot that goes through the CrossRef database every day
> and creates Wikidata items for newly published papers).
> - if they have ways to extract statements from their publication
> corpus, it would be good if they would let us/ ContentMine/ StrepHit
> etc. know, so we could discuss how to move this forward.
> d.
>
> On Wed, Nov 2, 2016 at 1:42 PM, Dario Taraborelli
> &lt;dtaraborelli(a)wikimedia.org&gt; wrote:
> > I'm at the Crossref LIVE 16 event in London where I just gave a
> presentation
> > on WikiCite and Wikidata targeted at scholarly publishers.
> >
> > Beside Crossref and Datacite people, I talked to a bunch of folks
> interested
> > in collaborating on Wikidata integration, particularly from PLOS,
> Hindawi
> > and Springer Nature. I started an interesting discussion with Andrew
> Smeall,
> > who runs strategic projects at Hindawi, and I wanted to open it up to
> > everyone on the lists.
> >
> > Andrew asked me if – aside from efforts like ContentMine and
> StrepHit –
> > there are any recommendations for publishers (especially OA
> publishers) to
> > mark up their contents and facilitate information extraction and
> entity
> > matching or even push triples to Wikidata to be considered for
> ingestion.
> >
> > I don't think we have a recommended workflow for data providers for
> > facilitating triple suggestions to Wikidata, other than leveraging
> the
> > Primary Sources Tool. However, aligning keywords and terms with the
> > corresponding Wikidata items via ID mapping sounds like a good first
> step. I
> > pointed Andrew to Mix'n'Match as a handy way of mapping identifiers,
> but if
> > you have other ideas on how to best support 2-way integration of
> Wikidata
> > with scholarly contents, please chime in.
> >
> > Dario
> >
> > --
> >
> > Dario Taraborelli  Head of Research, Wikimedia Foundation
> > wikimediafoundation.org • nitens.org • @readermeter
> >
> > --
> > WikiCite 2016 – May 26-26, 2016, Berlin
> > Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
> > Twitter: https://twitter.com/wikicite16
> > ---
> > You received this message because you are subscribed to the Google
> Groups
> > "wikicite-discuss" group.
> > To unsubscribe from this group and stop receiving emails from it,
> send an
> > email to wikicite-discuss+unsubscribe(a)wikimedia.org.
>
> --
> WikiCite 2016 – May 26-26, 2016, Berlin
> Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
> Twitter: https://twitter.com/wikicite16
> ---
> You received this message because you are subscribed to the Google
> Groups "wikicite-discuss" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to wikicite-discuss+unsubscribe(a)wikimedia.org.
>
>

 --

 *Dario Taraborelli  *Head of Research, Wikimedia Foundation
 wikimediafoundation.org • nitens.org • @readermeter
 <http://twitter.com/readermeter>

 --
 WikiCite 2016 – May 26-26, 2016, Berlin
 Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
 Twitter: https://twitter.com/wikicite16
 ---
 You received this message because you are subscribed to the Google
 Groups "wikicite-discuss" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to wikicite-discuss+unsubscribe(a)wikimedia.org.

 --
 ------------------------------
 Andrew Smeall
 Head of Strategic Projects

 Hindawi Publishing Corporation
 Kirkman House
 12-14 Whitfield Street, 3rd Floor
 London, W1T 2RF
 United Kingdom
 ------------------------------

 --
 WikiCite 2016 – May 26-26, 2016, Berlin
 Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
 Twitter: https://twitter.com/wikicite16
 ---
 You received this message because you are subscribed to the Google Groups
 "wikicite-discuss" group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to wikicite-discuss+unsubscribe(a)wikimedia.org.
  --
 WikiCite 2016 – May 26-26, 2016, Berlin
 Meta: https://meta.wikimedia.org/wiki/WikiCite_2016
 Twitter: https://twitter.com/wikicite16
 ---
 You received this message because you are subscribed to the Google Groups
 "wikicite-discuss" group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to wikicite-discuss+unsubscribe(a)wikimedia.org.

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

Re: [Wikidata] [wikicite-discuss] Entity tagging and fact extraction (from a scholarly publisher perspective)