How well does Wikidata as a whole fit the role of an open vocabulary
for content tagging?
On Fri, Nov 11, 2016 at 10:29 AM Andrew Smeall <andrew.smeall(a)hindawi.com>
wrote:
Regarding the topics/vocabularies issue:
A challenge we're working on is finding a set of controlled vocabularies
for all the subject areas we cover.
We do use MeSH where it applies, but that covers only about 40% of
our papers. In Engineering, for example, we've had more trouble finding an
open taxonomy with the same depth as MeSH. For most internal
applications, we need 100% coverage of all subjects.
Machine learning for concept tagging is trendy now, partly because it
doesn't require a preset vocabulary, but we are wary of this
approach: we want to control the mapping of terms, and a taxonomic
hierarchy can be useful. The ML tools I've seen can match against a
controlled vocabulary, but they still need the publisher to supply the terms.
The temptation to build a new vocabulary is strong, because that is the
fastest way to get to something non-proprietary and universal. We
can merge existing open vocabularies like MeSH and the PLOS thesaurus to get
most of the way there, but we would then need to extend the result with
concepts from our corpus.
Thanks Daniel and Benjamin for your responses. Any other feedback would be
great, and I'm always happy to delve into issues from the publisher
perspective if that can be helpful.
On Fri, Nov 11, 2016 at 4:54 PM, Dario Taraborelli <
dtaraborelli(a)wikimedia.org> wrote:
Benjamin – agreed, I too see Wikidata as mainly a place to hold all the
mappings. Once we support federated queries in WDQS, the benefit of ID
mapping (over extensive data ingestion) will become even more apparent.
Hope Andrew and other interested parties can pick up this thread.
On Wed, Nov 2, 2016 at 12:11 PM, Benjamin Good <ben.mcgee.good(a)gmail.com>
wrote:
Dario,
One message you can send is that they can and should use existing
controlled vocabularies and ontologies to construct the metadata they want
to share. For example, MeSH descriptors would be a good way for them to
organize the 'primary topic' assertions for their articles and would make
it easy to find the corresponding items in Wikidata when uploading. Our
group will be continuing to expand coverage of identifiers and concepts
from vocabularies like that in Wikidata - and any help there from
publishers would be appreciated!
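Concretely, once MeSH descriptors are attached to articles, finding the corresponding Wikidata items is a single lookup against P486 (Wikidata's "MeSH descriptor ID" property). A minimal sketch, with a helper name and example descriptor IDs of our choosing; the generated query can be run against the WDQS endpoint at https://query.wikidata.org/sparql:

```python
def build_mesh_lookup_query(mesh_ids):
    """Build a WDQS SPARQL query that maps MeSH descriptor IDs
    (Wikidata property P486) to their Wikidata items."""
    values = " ".join(f'"{m}"' for m in mesh_ids)
    return (
        "SELECT ?item ?meshId WHERE {\n"
        f"  VALUES ?meshId {{ {values} }}\n"
        "  ?item wdt:P486 ?meshId .\n"
        "}"
    )

# Example: D009369 (Neoplasms) and D003920 (Diabetes Mellitus)
query = build_mesh_lookup_query(["D009369", "D003920"])
```

Batching descriptors through VALUES like this keeps the mapping step to one round-trip per article rather than one per keyword.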
My view here is that Wikidata can be a bridge to the terminologies and
datasets that live outside it - not really a replacement for them. So, if
they have good practices about using shared vocabularies already, it should
(eventually) be relatively easy to move relevant assertions into the
Wikidata graph while maintaining interoperability and integration with
external software systems.
-Ben
On Wed, Nov 2, 2016 at 8:31 AM, 'Daniel Mietchen' via wikicite-discuss <
wikicite-discuss(a)wikimedia.org> wrote:
I'm traveling (https://twitter.com/EvoMRI/status/793736211009536000),
so just in brief:
In terms of markup, some general comments are in
https://www.ncbi.nlm.nih.gov/books/NBK159964/ , which is not specific
to Hindawi but partly applies to them too.
A problem specific to Hindawi (cf.
https://commons.wikimedia.org/wiki/Category:Media_from_Hindawi) is the
bundling of the descriptions of all supplementary files, which
translates into uploads like
https://commons.wikimedia.org/wiki/File:Evolution-of-Coronary-Flow-in-an-Ex…
(with descriptions for nine files)
and eight files with no description, e.g.
https://commons.wikimedia.org/wiki/File:Evolution-of-Coronary-Flow-in-an-Ex…
.
There are other problems in their JATS, and it would be good if they
would participate in
http://jats4r.org/ . Happy to dig deeper with Andrew or whoever is
interested.
Where they are ahead of the curve is licensing information, so they
could help us set up workflows to get that info into Wikidata.
In terms of triple suggestions to Wikidata:
- as far as article metadata is concerned, I would prefer to
concentrate on integrating our workflows with the major repositories
of metadata, to which publishers are already posting. They could help
us by using more identifiers (e.g. for authors, affiliations, funders,
etc.), potentially even from Wikidata (e.g. for keywords / P921, for
both journals and articles), and by contributing to the development of
tools (e.g. a bot that goes through the CrossRef database every day
and creates Wikidata items for newly published papers).
- if they have ways to extract statements from their publication
corpus, it would be good if they would let us / ContentMine / StrepHit
etc. know, so we could discuss how to move this forward.
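The daily CrossRef bot mentioned above can be sketched in a few lines. This is an illustration only, under stated assumptions: the helper names are ours, duplicate checking and cursor paging are omitted, and QuickStatements output is just one possible ingestion path. The `from-index-date` filter is part of CrossRef's REST API, and P356 is Wikidata's DOI property:

```python
import urllib.parse

CROSSREF_API = "https://api.crossref.org/works"

def daily_works_url(date_str, rows=100):
    """URL for CrossRef works (re)indexed since date_str (YYYY-MM-DD),
    using CrossRef's from-index-date filter. Cursor paging omitted."""
    params = {"filter": f"from-index-date:{date_str}", "rows": str(rows)}
    return CROSSREF_API + "?" + urllib.parse.urlencode(params)

def work_to_quickstatements(work):
    """Turn one CrossRef work record into QuickStatements commands
    that would create a new Wikidata item with an English label and
    a DOI (P356) claim. A sketch, not a duplicate-checked bot."""
    title = work.get("title", [""])[0]
    doi = work["DOI"].upper()  # DOIs are upper-cased by convention on Wikidata
    return [
        "CREATE",
        f'LAST|Len|"{title}"',
        f'LAST|P356|"{doi}"',
    ]
```

A real bot would first query WDQS for existing items with the same DOI before emitting CREATE, to avoid duplicates.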
d.
On Wed, Nov 2, 2016 at 1:42 PM, Dario Taraborelli
<dtaraborelli(a)wikimedia.org> wrote:
I'm at the Crossref LIVE 16 event in London, where I just gave a
presentation on WikiCite and Wikidata targeted at scholarly publishers.
Besides Crossref and Datacite people, I talked to a bunch of folks
interested in collaborating on Wikidata integration, particularly from
PLOS, Hindawi and Springer Nature. I started an interesting discussion
with Andrew Smeall, who runs strategic projects at Hindawi, and I wanted
to open it up to everyone on the lists.
Andrew asked me if – aside from efforts like ContentMine and StrepHit –
there are any recommendations for publishers (especially OA publishers)
to mark up their contents and facilitate information extraction and
entity matching, or even push triples to Wikidata to be considered for
ingestion.
I don't think we have a recommended workflow for data providers for
facilitating triple suggestions to Wikidata, other than leveraging the
Primary Sources Tool. However, aligning keywords and terms with the
corresponding Wikidata items via ID mapping sounds like a good first
step. I pointed Andrew to Mix'n'Match as a handy way of mapping
identifiers, but if you have other ideas on how to best support two-way
integration of Wikidata with scholarly contents, please chime in.
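As a concrete sketch of that keyword-alignment step: given a publisher-maintained mapping from keyword strings to Wikidata QIDs (built, say, with Mix'n'Match), emitting "main subject" (P921) claims for an article item is mechanical. The output follows QuickStatements' pipe-separated syntax; the function name and example QIDs below are placeholders of ours:

```python
def main_subject_statements(article_qid, keyword_to_qid, keywords):
    """Emit QuickStatements rows adding main subject (P921) claims
    to an existing article item, for the keywords we can map to QIDs.
    Unmapped keywords are returned separately for manual review
    (e.g. via Mix'n'Match)."""
    mapped, unmapped = [], []
    for kw in keywords:
        qid = keyword_to_qid.get(kw.lower())
        if qid:
            mapped.append(f"{article_qid}|P921|{qid}")
        else:
            unmapped.append(kw)
    return mapped, unmapped
```

Keeping the unmapped remainder explicit is the point: it is exactly the queue a publisher would feed back into vocabulary-extension work.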
Dario
--
Dario Taraborelli Head of Research, Wikimedia Foundation
wikimediafoundation.org •
nitens.org • @readermeter
--
WikiCite 2016 – May 25-26, 2016, Berlin
Meta:
https://meta.wikimedia.org/wiki/WikiCite_2016
Twitter:
https://twitter.com/wikicite16
---
You received this message because you are subscribed to the Google Groups
"wikicite-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to wikicite-discuss+unsubscribe(a)wikimedia.org.
--
------------------------------
Andrew Smeall
Head of Strategic Projects
Hindawi Publishing Corporation
Kirkman House
12-14 Whitfield Street, 3rd Floor
London, W1T 2RF
United Kingdom
------------------------------
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata