Forwarding.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Gabriel Altay <gabriel.altay(a)gmail.com>
Date: Mon, Feb 3, 2020 at 6:57 PM
Subject: [Wikidata] New Wikimedia dataset for NLP research
To: <wikidata(a)lists.wikimedia.org>
Hello Wikidata folks,
I would like to bring your attention to an open source dataset I've
been developing called the Kensho Derived Wikimedia Dataset (KDWD).
It's a cleaned English subset of Wikipedia/Wikidata with 2.3B tokens,
5.3M pages, 51M nodes, and 120M edges. More details are available
here https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1…
best,
-Gabriel
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
Forwarding.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Violeta Ilik <ilik.violeta(a)gmail.com>
Date: Mon, Feb 3, 2020 at 1:01 PM
Subject: [Wikidata] Knowledge Graph Conference 2020 - Workshops and
Tutorials Announcement
To: Discussion list for the Wikidata project. <wikidata(a)lists.wikimedia.org>
Dear Wikidata community,
The Knowledge Graph Conference organizing team is pleased to announce
the workshops and tutorials that are part of the KGC 2020 Program. They
take place on May 4 and 5 at Butler Library, Columbia University
Libraries, in NYC.
Workshops are stand-alone sub events of the conference. They have
separate calls for papers and their own program and organizing
committee.
Tutorials are learning sessions that include both lecture-style and
hands-on components. Each tutorial runs for half a day unless otherwise
specified.
For more information about each workshop and tutorial please visit
this page: https://www.knowledgegraph.tech/the-knowledge-graph-conference-kgc/workshop…
Early Bird registration ends on February 15, 2020. To register please
visit this page:
https://www.knowledgegraph.tech/the-knowledge-graph-conference-kgc/register/
WORKSHOPS
KGC Workshop on Applied Knowledge Graph: Best industry/academic
practices, methods and challenges between representation and reasoning
Organizers:
Vivek Khetan, AI research specialist, Accenture Labs, SF
Colin Puri, R&D Principal - Accenture Labs
Lambert Hogenhout, Chief Analytics, Partnerships and Innovation, United Nations
Limit: 40 people
Date: May 4, 2020
Place: Room 203, Butler Library, Columbia University
Personal Health Knowledge Graphs (PHKG): Challenges and Opportunities
Organizers:
Ching-Hua Chen, PhD, Amar Das, MD PhD, Ying Ding, PhD, Deborah
McGuinness, PhD, Oshani Seneviratne, PhD, and Mohammed J Zaki, PhD
Limit: 40 people
Date: May 5, 2020
Place: Room 203, Butler Library, Columbia University
TUTORIALS
Virtualized Knowledge Graphs for Enterprise Applications
Presenter: Eric Little, PhD – CEO LeapAnalysis
Limit: 20 people
Date and time: May 4, 2020 8:30AM - 12:30PM
Place: Studio Butler, Butler Library, Columbia University
Data discovery on a (free) hybrid BI/Search/Knowledge graph platform:
the Siren Community Edition hands on tutorial
Presenter: Giovanni Tummarello, Ph.D
Limit: 20 people
Date and time: May 4, 2020 8:30AM - 12:30PM
Place: Room 523 Butler Library, Columbia University
Building a Knowledge Graph from schema.org annotations
Presenters: Elias Kärle, Umutcan Simsek, and Dieter Fensel (STI
Innsbruck, University of Innsbruck)
Limit: 25 people
Date and time: May 4, 2020 1:30PM - 5:30PM
Place: Room 523 Butler Library, Columbia University
Designing and Building Enterprise Knowledge Graphs from Relational Databases
Presenter: Juan Sequeda, DataWorld
Limit: 25 people
Date and time: May 5, 2020 8:30AM - 12:30PM
Place: Room 523 Butler Library, Columbia University
Rapid Knowledge Graph development with GraphQL and RDF databases
Presenter: Vassil Momtchev, Ontotext
Limit: 25 people
Date and time: May 5, 2020 1:30PM - 5:30PM
Place: Room 523 Butler Library, Columbia University
Introduction to Logic Knowledge Graphs, Succinct Data Structures and
Delta Encoding for Modern Databases, and the Web Object Query Language
Presenters: Dr. Gavin Mendel-Gleason and Cheukting Ho (DataChemist)
Limit: 20 people
Date and time: May 5, 2020 8:30AM - 12:30PM
Place: Room 306 Butler Library, Columbia University
Modeling Evolving Data in Graphs While Preserving Backward
Compatibility: The Power of RDF Quads
Presenters: Souripriya Das, Matthew Perry, and Eugene I. Chong (Oracle)
Limit: 20 people
Date and time: May 5, 2020 1:30PM - 5:30PM
Place: Room 306 Butler Library, Columbia University
Violeta Ilik
KGC 2020 Workshops & Tutorials Chair
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
Is there a way to attend remotely?
Travis
On Sun, Jan 26, 2020 at 8:46 AM Lucie Kaffee <lucie.kaffee(a)gmail.com> wrote:
Hello everyone!
We are organizing an event for Research and Wikimedia to exchange about
research done in the field and for researchers and Wikimedia community
members to work together on new ideas. The idea stems from the fact that
many researchers who reuse Wikipedia, Wikidata and their sister projects
are often not yet integrated with the community, which makes their work
much more difficult than necessary. At the same time, many research
projects are useful for the community but not yet integrated into
Wikipedia and co.
We want to change this and facilitate the exchange between researchers and
Wikimedia community members in an event, where we bring people interested
in similar topics together. If you are either doing research in the
Wikipedia space or are a community member of one of the Wikimedia projects,
please come by on the 8th of February. More details below.
Please spread the event invitation in your communities!
Best,
Lucie
https://www.eventbrite.com/e/research-and-wikimedia-tickets-90824421289
*Description*
We are organizing an event for Wikimedians and researchers to exchange!
Come along and learn more about research happening around Wikimedia and
what Wikimedians can teach you about the different Wikimedia projects!
A large part of the computer science research community is exploring
Wikipedia, Wikidata and their sister projects. In the fields of natural
language processing (NLP) as well as semantic web, Wikipedia and Wikidata
are often used as a fundamental part of the research world. At the same
time, the community of Wikidata and Wikipedia could make use of a variety
of tools developed by researchers. However, currently, the gap between
things explored in research and actual applications in Wikidata and
Wikipedia needs bridging. Therefore, we want to build a community of
Wikidata community members and researchers to exchange needs, existing
tools, open challenges and research questions, fostering an environment
where both communities can benefit from the exchange.
The ideal is to have all the different approaches and commonalities under
one umbrella to foster exchange and support of different research
communities and their approaches.
OpenSym and the WikiWorkshop already do this for the people submitting to
and attending computer science research conferences. But without direct
exchange with the on-wiki community, communication is lacking and silos
form.
*The Goal is*
to connect researchers and the Wikimedia community to enable an exchange
that could ultimately lead to research projects being implemented as
tools for Wikipedia – and, vice versa, to more research projects built on
community needs.
*We invite*
*Researchers*
Anyone who does or is planning to do research on or around Wikimedia
projects, such as Wikipedia, Wikidata and others.
*Wikimedians*
Anyone in the community who is interested in improving the research
happening around Wikimedia – you don’t need any experience in research.
Whether you are a Wikipedia editor, a Wikidata data magician, or anything
else in the Wikimedia projects, your feedback will be highly valuable.
*What we need from you*
We would ask all researchers to bring an A2/A3 poster about what they are
doing in Wikimedia that we can put up so that we can create an easy way to
exchange on different projects. If you don’t have a project yet, don’t
worry – just bring a poster with topics you find interesting, and you might
be able to meet other researchers already working in your field of
interest. (If you struggle with printing the poster beforehand, please
reach out to us a few days in advance.)
*Event*
We will spend a day exchanging on recent challenges around Wikimedia.
Besides the posters, we aim to form working groups for the afternoon to
work on topics of shared interest and possibly propose a project of common
interest.
--
Lucie-Aimée Kaffee
Forwarding.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
---------- Forwarded message ---------
From: Brooke Storm <bstorm(a)wikimedia.org>
Date: Fri, Jan 24, 2020 at 1:03 AM
Subject: [Wikitech-l] PyCon Financial Assistance and Development Sprints
Info
To: <wikitech-l(a)lists.wikimedia.org>, Foundation Optional <
foundation-optional(a)wikimedia.org>, <tech-all(a)wikimedia.org>
Hello Folks!
For the Python enthusiasts on these lists, I’m signal boosting this message
with info on PyCon dev sprints and financial assistance for the conference
from a former Wikimedia colleague.
I plan to attend PyCon this year and am also hoping to figure out setting
up a development sprint around some Wikimedia Cloud Services and Toolforge
code.
Brooke Storm
SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_
---------- Forwarded message ---------
<snip>
I wanted to mention - feel free to pass this on publicly and in personal
invitations - that PyCon North America, mid-April in Pennsylvania,
offers financial assistance to people who would like to attend:
https://us.pycon.org/2020/financial-assistance/
The deadline for requesting financial assistance is 31 January.
PyCon loves to cross-pollinate with other free and open source
movements, and I know there are many Python developers in Wikimedia
tech. If Wikimedians want to use the April 20-23 in-person sprints
https://us.pycon.org/2020/events/sprints/ (will be editable soon)
to work on Wikimedia-related Python tools together, that would be cool!
Best wishes.
--
Sumana Harihareswara
Changeset Consulting
https://changeset.nyc
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hello Researchers,
Contribution patterns in online communities follow a power-law
distribution, known as the 1% rule [1], as Wikipedia told me.
However, the distribution can be more or less steep: 50% of the edits
could be contributed by the top 2% of editors or by the top 0.002%, the
latter showing a much stronger imbalance.
I wonder if there are any estimates/rules-of-thumb of what imbalance is
problematic when seen from the perspective of community health.
I also wonder if there is research on how technology contributes to such
imbalances and how they might be mitigated – e.g. training,
user-friendliness, documentation…
(based on my assumption that a steep curve is less desirable, since the
power is more concentrated, the system more fragile and the redistribution
of power more constrained)
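A quick way to quantify the steepness is the share of edits contributed by the top x% of editors. A minimal sketch (synthetic data; the Zipf exponent is an illustrative assumption, not an empirical Wikipedia value):

```python
import numpy as np

def top_share(edit_counts, top_frac=0.01):
    """Fraction of all edits contributed by the top `top_frac` of editors."""
    counts = np.sort(np.asarray(edit_counts, dtype=float))[::-1]  # largest first
    k = max(1, int(np.ceil(top_frac * counts.size)))
    return counts[:k].sum() / counts.sum()

# Synthetic community whose per-editor edit counts follow a Zipf-like law
rng = np.random.default_rng(0)
edits = rng.zipf(2.0, size=10_000)

# Share of all edits made by the top 1% of editors
print(top_share(edits, 0.01))
```

Evaluating the same function at several `top_frac` values gives a Lorenz-curve-style profile for comparing the imbalance of different communities.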
Jan
[1] https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)
--
Jan Dittrich
UX Design/ Research
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
https://wikimedia.de
Unsere Vision ist eine Welt, in der alle Menschen am Wissen der Menschheit
teilhaben, es nutzen und mehren können. Helfen Sie uns dabei!
https://spenden.wikimedia.de
Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
Hi all,
We, the Research team at Wikimedia Foundation, have received some requests
over the past months for making ourselves more available to answer some of
the research questions that you as Wikimedia volunteers, affiliates' staff,
and researchers face in your projects and initiatives. Starting January
2020, we will experiment with monthly office hours organized jointly by our
team and the Analytics team where you can join us and direct your questions
to us. We will revisit this experiment in June 2020 to assess whether to
continue it or not.
The scope
We encourage you to attend the office hour if you have research-related
questions. These can be questions about our teams, our projects, or, more
importantly, questions about your projects or ideas that we can support
you with during the office hours. You can also ask us, for example, how
to use a specific dataset that is available to you to answer a question
you have. Note that the purpose of the office hours is to answer
your questions during the dedicated time of the office hour. Questions that
may require many hours of back-and-forth between our team and you are not
suited for this forum. For these bigger questions, however, we are happy to
brainstorm with you in the office hour and point you to some good
directions to explore further on your own (and maybe come back in the next
office hour and ask more questions).
Time and Location
We meet on the fourth Wednesday of every month, 17:00-18:00 (UTC), in the
#wikimedia-research IRC channel on freenode [1].
The first meeting will be on January 22.
Up-to-date information on mediawiki [2]
Archiving
If you miss the office hour, you can read the logs of it at [3].
Future announcements about these office hours will only go to the
following lists, so please make sure you're subscribed to them if you
would like to receive a ping:
* wiki-research-l mailing list [4]
* analytics mailing list [5]
* wikidata mailing list [6]
* the Research category in Space [7]
on behalf of Research and Analytics,
Martin
[1] irc://irc.freenode.net/wikimedia-research
[2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[3] https://wm-bot.wmflabs.org/logs/%23wikimedia-research/
[4] https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[5] https://lists.wikimedia.org/mailman/listinfo/analytics
[6] https://lists.wikimedia.org/mailman/listinfo/wikidata
[7] https://discuss-space.wmflabs.org/tags/research
--
Martin Gerlach
Research Scientist
Wikimedia Foundation
Hi everyone,
We’re preparing for the January 2020 research newsletter and looking for contributors. Please take a look at https://etherpad.wikimedia.org/p/WRN202001 and add your name next to any paper you are interested in covering. Our writing deadline is 25 January 23:59 UTC. If you can't make this deadline but would like to cover a particular paper in the subsequent issue, leave a note next to the paper's entry below. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
- ‘WP2Cochrane’, a tool linking Wikipedia to the Cochrane Library: Results of a bibliometric analysis evaluating article quality and importance
- Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking
- Individual and collaborative information behaviour of Wikipedians in the context of their involvement with Hebrew Wikipedia
- Keeping Community in the Loop: Understanding Wikipedia Stakeholder Values for Machine Learning-Based Systems
- Knowledge curation work in Wikidata WikiProject discussions
- Strangers in a seemingly open-to-all website: the gender bias in Wikipedia
- Understanding Wikipedia as a Resource for Opportunistic Learning of Computing Concepts
Masssly and Tilman Bayer
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
[2] WikiResearch (@WikiResearch) | Twitter
Hey Research Community,
TL;DR New dataset:
https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject…
More details:
I wanted to notify everyone that we have published a dataset of the
articles on English Wikipedia that have been tagged by WikiProjects [1]
through templates on their associated talk pages. We are not planning to
make this an ongoing release, but I have provided the script that I used to
generate it in the Figshare item so that others might update / adjust to
meet their needs.
As anyone who has done research on WikiProjects knows, it can be
complicated to determine which articles fall under a particular
WikiProject's purview. The motivation for generating this dataset was to
support our work
in developing topic models for Wikipedia (see [2] for an overview), but we
imagine that there are many other ways in which this dataset might be
useful:
* Previous work has examined how active WikiProjects are, based on edits
to their pages in the Wikipedia namespace. This dataset makes it much
easier to identify which WikiProjects are managing the most valuable
articles on
Wikipedia (in terms of quality or pageviews).
* Many topic-level analyses of Wikipedia rely on the category network.
Categories can be very messy and difficult to work with, but WikiProjects
represent an alternative that often is simpler and still quite rich. For
instance, this could be used for temporal analyses of article quality,
demand, or distribution by topic.
* While WikiProjects are English-only and therefore limited in their
utility to other languages, we also provide the Wikidata ID and sitelinks
-- i.e. titles for corresponding articles in other languages -- to allow
for multilingual analyses. This could be used to compare gaps in coverage
-- e.g., akin to past work that has used categories [3].
The main challenge, besides processing time, is how to (1) effectively
extract the WikiProject templates from talk pages and (2) consistently
link them to a canonical WikiProject name and topic. For example, the
canonical template for WikiProject Medicine is
https://en.wikipedia.org/wiki/Template:WikiProject_Medicine but another one
used is
https://en.wikipedia.org/w/index.php?title=Template:WPMED&redirect=no (and
there are 13 more). To capture articles tagged via any of these templates
and link them all to the same canonical WikiProject and, eventually, a
higher-level topic, we built a near-complete list of WikiProjects based on
the WikiProject Directory [4] and gathered all of their associated
templates. We purposefully excluded WikiProjects under the assistance /
maintenance category [5]. When parsing talk pages from the dump files, we
then check for any of these templates and list them under their canonical
name. As a backup, we also employ case-insensitive string matching with
"WP" and "WikiProject", which helps to guarantee that we did not miss any
WikiProjects but introduces a number of false positives as well. If you
wish to map the WikiProjects listed in the dataset to their higher-level
topics, the mapping is in the figshare item and code that allows you to do
that can be found here:
https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/ta…
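The canonicalization step described above can be sketched as follows. The alias table is illustrative: only the WikiProject Medicine / WPMED pair comes from this post, and the real pipeline builds its mapping from the WikiProject Directory and template redirects.

```python
import re

# Illustrative alias table mapping lowercased template names to canonical
# WikiProjects. Only the Medicine/WPMED pair is from the post; a real run
# would build this from the WikiProject Directory and template redirects.
ALIASES = {
    "wikiproject medicine": "WikiProject Medicine",
    "wpmed": "WikiProject Medicine",
}

# Capture a template's name: text after "{{" up to the first "|" or "}".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*[|}]")

def wikiprojects_on_talk_page(wikitext):
    """Return canonical WikiProject names found in talk-page wikitext."""
    found = set()
    for name in TEMPLATE_RE.findall(wikitext):
        key = name.lower()
        if key in ALIASES:
            found.add(ALIASES[key])
        # Backup heuristic from the post: case-insensitive prefix match,
        # which catches unknown variants but admits false positives.
        elif key.startswith("wikiproject") or key.startswith("wp"):
            found.add(name)
    return found

print(wikiprojects_on_talk_page("{{WPMED|class=B}} {{WikiProject Medicine}}"))
```

Both templates in the example resolve to the single canonical project, which is exactly the deduplication the dataset needs before mapping projects to higher-level topics.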
[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council
[2] https://dl.acm.org/doi/10.1145/3274290
[3]
https://meta.wikimedia.org/wiki/Research:Newsletter/2019/September#Wikipedi…
[4] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
[5]
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Wikip…
Best,
Isaac
--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
That's fascinating, John; thank you. I'm copying this to wiki-research-l and
Fabian Suchanek, who gave the first part of the Research Showcase last month.
What do you like for coding stories? https://quanteda.io/reference/dfm.html ?
Sentiment is hard because errors are often 180 degrees away from correct.
How do you both feel about Soru et al (June 2018) "Neural Machine Translation
for Query Construction and Composition"
https://www.researchgate.net/publication/326030040 ?
On Sat, Jan 11, 2020 at 3:46 PM John Urbanik <johnurbanik(a)gmail.com> wrote:
>
> Jim,
>
> I used to work as the chief data scientist at Collin's company.
>
> I'd suggest looking at things like relationships between the views / edits for sets of pages as well as aggregating large sets of page views for different pages in various ways. There isn't a lot of literature that is directly applicable, and I can't disclose the precise methods being used due to NDA.
>
> In general, much of the pageview data is Weibull- or GEV-distributed on top of being non-stationary, so I'd suggest looking into papers from the extreme value theory literature as well as literature around Hawkes/Queue-Hawkes processes. Most traditional ML and signal processing is not very effective without some pretty substantial pre-processing, and even then things are pretty messy, depending on what you're trying to predict; most variables are heteroskedastic w.r.t. pageviews, and there are a lot of real-world events that can cause false positives.
>
> Further, concept drift is pretty rapid in this space and structural breaks happen quite frequently, so the reliability of a given predictor can change extremely rapidly. Understanding how much training data to use for a given prediction problem is itself a super interesting problem, since there may be some horizon after which the predictor loses power, but decreasing the horizon too much means overfitting and loss of statistical significance.
>
> Good luck!
>
> John
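John's point about heavy-tailed, extreme-value-like pageview behavior can be probed with a standard tail-index diagnostic. The Hill estimator below is a generic sketch on synthetic Pareto data, not the (NDA'd) production method he alludes to:

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimate of the tail index from the k largest observations.

    Smaller tail indices mean heavier tails; values near or below 2 imply
    infinite variance, where naive ML/signal-processing methods struggle.
    """
    xs = np.sort(np.asarray(x, dtype=float))[::-1]  # descending order
    return 1.0 / np.mean(np.log(xs[:k] / xs[k]))

# Synthetic "pageview-like" series with a known Pareto tail index of 2.5
rng = np.random.default_rng(0)
views = 1.0 + rng.pareto(2.5, size=10_000)

est = hill_estimator(views, k=200)  # should recover a value close to 2.5
```

The choice of `k` (how far into the tail to look) mirrors John's training-horizon trade-off: too small and the estimate is noisy, too large and non-tail observations bias it.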