Wiki-research-l

wiki-research-l@lists.wikimedia.org

1 participants
3027 discussions

Fwd: [Wikidata] New Wikimedia dataset for NLP research
by Pine W 03 Feb '20

03 Feb '20

Forwarding.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

---------- Forwarded message ---------
From: Gabriel Altay <gabriel.altay(a)gmail.com>
Date: Mon, Feb 3, 2020 at 6:57 PM
Subject: [Wikidata] New Wikimedia dataset for NLP research
To: <wikidata(a)lists.wikimedia.org>

Hello Wikidata folks,

I would like to bring your attention to an open source dataset I've been developing called the Kensho Derived Wikimedia Dataset (KDWD). It's a cleaned English subset of Wikipedia/Wikidata with 2.3B tokens, 5.3M pages, 51M nodes, and 120M edges. More details are available here:
https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1…

best,
-Gabriel

_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Fwd: [Wikidata] Knowledge Graph Conference 2020 - Workshops and Tutorials Announcement
by Pine W 03 Feb '20

03 Feb '20

Forwarding.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

---------- Forwarded message ---------
From: Violeta Ilik <ilik.violeta(a)gmail.com>
Date: Mon, Feb 3, 2020 at 1:01 PM
Subject: [Wikidata] Knowledge Graph Conference 2020 - Workshops and Tutorials Announcement
To: Discussion list for the Wikidata project. <wikidata(a)lists.wikimedia.org>

Dear Wikidata community,

The Knowledge Graph Conference organizing team is pleased to announce the workshops and tutorials part of the KGC 2020 Program. They are taking place on May 4 and 5 in Butler Library, Columbia University Libraries in NYC.

Workshops are stand-alone sub events of the conference. They have separate calls for papers and their own program and organizing committee.

Tutorials are learning sessions including both lecture style and hands-on sessions. Each tutorial will be for half a day unless specified.

For more information about each workshop and tutorial please visit this page:
https://www.knowledgegraph.tech/the-knowledge-graph-conference-kgc/workshop…

Early Bird registration ends on February 15, 2020. To register please visit this page:
https://www.knowledgegraph.tech/the-knowledge-graph-conference-kgc/register/

WORKSHOPS

KGC Workshop on Applied Knowledge Graph: Best industry/academic practices, methods and challenges between representation and reasoning
Organizers: Vivek Khetan, AI research specialist, Accenture Labs, SF; Colin Puri, R&D Principal - Accenture Labs; Lambert Hogenhout, Chief Analytics, Partnerships and Innovation, United Nations
Limit: 40 people
Date: May 4, 2020
Place: Room 203, Butler Library, Columbia University

Personal Health Knowledge Graphs (PHKG): Challenges and Opportunities
Organizers: Ching-Hua Chen, PhD, Amar Das, MD PhD, Ying Ding, PhD, Deborah McGuinness, PhD, Oshani Seneviratne, PhD, and Mohammed J Zaki, PhD
Limit: 40 people
Date: May 5, 2020
Place: Room 203, Butler Library, Columbia University

TUTORIALS

Virtualized Knowledge Graphs for Enterprise Applications
Presenter: Eric Little, PhD – CEO LeapAnalysis
Limit: 20 people
Date and time: May 4, 2020 8:30AM - 12:30PM
Place: Studio Butler, Butler Library, Columbia University

Data discovery on a (free) hybrid BI/Search/Knowledge graph platform: the Siren Community Edition hands on tutorial
Presenter: Giovanni Tummarello, Ph.D
Limit: 20 people
Date and time: May 4, 2020 8:30AM - 12:30PM
Place: Room 523 Butler Library, Columbia University

Building a Knowledge Graph from schema.org annotations
Presenters: Elias Kärle, Umutcan Simsek, and Dieter Fensel (STI Innsbruck, University of Innsbruck)
Limit: 25 people
Date and time: May 4, 2020 1:30PM - 5:30PM
Place: Room 523 Butler Library, Columbia University

Designing and Building Enterprise Knowledge Graphs from Relational Databases
Presenter: Juan Sequeda, DataWorld
Limit: 25 people
Date and time: May 5, 2020 8:30AM - 12:30PM
Place: Room 523 Butler Library, Columbia University

Rapid Knowledge Graph development with GraphQL and RDF databases
Presenter: Vassil Momtchev, Ontotext
Limit: 25 people
Date and time: May 5, 2020 1:30PM - 5:30PM
Place: Room 523 Butler Library, Columbia University

Introduction to Logic Knowledge Graphs, Succinct Data Structures and Delta Encoding for Modern Databases, and the Web Object Query Language
Presenters: Dr. Gavin Mendel-Gleason and Cheukting Ho (DataChemist)
Limit: 20 people
Date and time: May 5, 2020 8:30AM - 12:30PM
Place: Room 306 Butler Library, Columbia University

Modeling Evolving Data in Graphs While Preserving Backward Compatibility: The Power of RDF Quads
Presenters: Souripriya Das, Matthew Perry, and Eugene I. Chong (Oracle)
Limit: 20 people
Date and time: May 5, 2020 1:30PM - 5:30PM
Place: Room 306 Butler Library, Columbia University

Violeta Ilik
KGC 2020 Workshops & Tutorials Chair

--
Violeta Ilik

_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Re: [Wiki-research-l] Research and Wikipedia/Wikimedia event in London, 08. February
by Travis Smith 27 Jan '20

27 Jan '20

Is there a way to attend remotely?

On Sun, Jan 26, 2020 at 8:46 AM Lucie Kaffee <lucie.kaffee(a)gmail.com> wrote:

> Hello everyone!
>
> We are organizing an event for Research and Wikimedia to exchange about
> research done in the field and for researchers and Wikimedia community
> members to work together on new ideas. The idea derived from the fact that
> many researchers reusing Wikipedia, Wikidata and their sister projects
> often are not yet integrated with the community. This makes their work a
> lot more difficult than necessary. At the same time, many research projects
> are useful for the community, but not yet integrated into Wikipedia and co.
> We want to change this and facilitate the exchange between researchers and
> Wikimedia community members in an event, where we bring people interested
> in similar topics together. If you are either doing research in the
> Wikipedia space or are a community member of one of the Wikimedia projects,
> please come by on the 8th of February. More details below.
> Please spread the event invitation in your communities!
>
> Best,
> Lucie
>
> https://www.eventbrite.com/e/research-and-wikimedia-tickets-90824421289
>
> *Description*
> We are organizing an event for Wikimedians and researchers to exchange!
> Come along and learn more about research happening around Wikimedia and
> what Wikimedians can teach you about the different Wikimedia projects!
> A large part of the computer science research community is exploring
> Wikipedia, Wikidata and their sister projects. In the fields of natural
> language processing (NLP) as well as semantic web, Wikipedia and Wikidata
> are often used as a fundamental part of the research world. At the same
> time, the community of Wikidata and Wikipedia could make use of a variety
> of tools developed by researchers. However, currently, the gap between
> things explored in research and actual applications in Wikidata and
> Wikipedia needs bridging. Therefore, we want to build a community of
> Wikidata community members and research to exchange needs, existing tools,
> open challenges and research question to foster an environment, where both
> communities can benefit from the exchange.
> The ideal is to have all the different approaches and commonalities under
> one umbrella to foster exchange and support of different research
> communities and their approaches.
> OpenSym and the WikiWorkshop are already doing that for the people
> submitting to and attending computer science research conferences. But
> without the exchange with the community, there is a lack of communication,
> creating silos of missing exchange.
>
> *The Goal is*
> to connect the researcher and the Wikimedia community to enable an exchange
> that could ultimately lead to the research projects being implemented as
> tools for Wikipedia. And vice-versa: More research projects build on
> community needs.
>
> *We invite*
> *Researchers*
> Anyone who does or is planning to do research on or around Wikimedia
> projects, such as Wikipedia, Wikidata and others.
> *Wikimedians*
> Anyone in the community, who is interested in improving the research
> happening around Wikimedia - you don’t need any experience in research.
> Wikipedia editor, Wikidata data magician, whatever you do in Wikimedia
> projects, your feedback will be highly valuable.
>
> *What we need from you*
> We would ask all researchers to bring an A2/A3 poster about what they are
> doing in Wikimedia that we can put up so that we can create an easy way to
> exchange on different projects. If you don’t have a project yet, don’t
> worry- just bring a poster with topics you find interesting, and you might
> be able to meet other researchers already working in your field of
> interest. (If you struggle with printing the poster beforehand, please
> reach out to us a few days in advance.)
>
> *Event*
> We will spend a day exchanging on recent challenges around Wikimedia.
> Besides the posters, we aim to form working groups for the afternoon to
> work on topics of shared interest and possibly propose a project of common
> interest.
>
> --
> Lucie-Aimée Kaffee
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>

Travis

Research and Wikipedia/Wikimedia event in London, 08. February
by Lucie Kaffee 26 Jan '20

26 Jan '20

Hello everyone!

We are organizing an event for Research and Wikimedia to exchange ideas about research done in the field and for researchers and Wikimedia community members to work together on new ideas. The idea stems from the fact that many researchers reusing Wikipedia, Wikidata and their sister projects are often not yet integrated with the community. This makes their work a lot more difficult than necessary. At the same time, many research projects are useful for the community, but not yet integrated into Wikipedia and co. We want to change this and facilitate the exchange between researchers and Wikimedia community members in an event where we bring people interested in similar topics together. If you are either doing research in the Wikipedia space or are a community member of one of the Wikimedia projects, please come by on the 8th of February. More details below.

Please spread the event invitation in your communities!

Best,
Lucie

https://www.eventbrite.com/e/research-and-wikimedia-tickets-90824421289

*Description*
We are organizing an event for Wikimedians and researchers to exchange ideas! Come along and learn more about research happening around Wikimedia and what Wikimedians can teach you about the different Wikimedia projects!

A large part of the computer science research community is exploring Wikipedia, Wikidata and their sister projects. In the fields of natural language processing (NLP) as well as semantic web, Wikipedia and Wikidata are often used as a fundamental part of the research world. At the same time, the community of Wikidata and Wikipedia could make use of a variety of tools developed by researchers. However, currently, the gap between things explored in research and actual applications in Wikidata and Wikipedia needs bridging. Therefore, we want to build a community of Wikidata community members and researchers to exchange needs, existing tools, open challenges and research questions, fostering an environment where both communities can benefit from the exchange.

The ideal is to have all the different approaches and commonalities under one umbrella to foster exchange and support of different research communities and their approaches.

OpenSym and the WikiWorkshop are already doing that for the people submitting to and attending computer science research conferences. But without the exchange with the community, there is a lack of communication, creating silos.

*The Goal is*
to connect researchers and the Wikimedia community to enable an exchange that could ultimately lead to research projects being implemented as tools for Wikipedia. And vice versa: more research projects that build on community needs.

*We invite*
*Researchers*
Anyone who does or is planning to do research on or around Wikimedia projects, such as Wikipedia, Wikidata and others.
*Wikimedians*
Anyone in the community who is interested in improving the research happening around Wikimedia - you don’t need any experience in research. Wikipedia editor, Wikidata data magician, whatever you do in Wikimedia projects, your feedback will be highly valuable.

*What we need from you*
We would ask all researchers to bring an A2/A3 poster about what they are doing in Wikimedia that we can put up, so that we can create an easy way to exchange ideas on different projects. If you don’t have a project yet, don’t worry, just bring a poster with topics you find interesting, and you might be able to meet other researchers already working in your field of interest. (If you struggle with printing the poster beforehand, please reach out to us a few days in advance.)

*Event*
We will spend a day exchanging ideas on recent challenges around Wikimedia. Besides the posters, we aim to form working groups for the afternoon to work on topics of shared interest and possibly propose a project of common interest.

--
Lucie-Aimée Kaffee

Fwd: [Wikitech-l] PyCon Financial Assistance and Development Sprints Info
by Pine W 25 Jan '20

25 Jan '20

Forwarding.

Pine
( https://meta.wikimedia.org/wiki/User:Pine )

---------- Forwarded message ---------
From: Brooke Storm <bstorm(a)wikimedia.org>
Date: Fri, Jan 24, 2020 at 1:03 AM
Subject: [Wikitech-l] PyCon Financial Assistance and Development Sprints Info
To: <wikitech-l(a)lists.wikimedia.org>, Foundation Optional <foundation-optional(a)wikimedia.org>, <tech-all(a)wikimedia.org>

Hello Folks!

For the Python enthusiasts on these lists, I’m signal boosting this message with info on PyCon dev sprints and financial assistance for the conference from a former Wikimedia colleague. I plan to attend PyCon this year and am also hoping to figure out setting up a development sprint around some Wikimedia Cloud Services and Toolforge code.

Brooke Storm
SRE
Wikimedia Cloud Services
bstorm(a)wikimedia.org
IRC: bstorm_

---------- Forwarded message ---------
<snip>

I wanted to mention - feel free to pass this on publicly and in personal invitations - that PyCon North America, mid-April in Pennsylvania, offers financial assistance to people who would like to attend:
https://us.pycon.org/2020/financial-assistance/
The deadline for requesting financial assistance is 31 January.

PyCon loves to cross-pollinate with other free and open source movements, and I know there are many Python developers in Wikimedia tech. If Wikimedians want to use the April 20-23 in-person sprints
https://us.pycon.org/2020/events/sprints/ (will be editable soon)
to work on Wikimedia-related Python tools together, that would be cool!

Best wishes.

--
Sumana Harihareswara
Changeset Consulting
https://changeset.nyc

_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Power law and contributions:
by Jan Dittrich 24 Jan '20

24 Jan '20

Hello Researchers,

Contribution patterns in online communities follow a power-law distribution, which is known as the 1% rule [1], as Wikipedia told me. However, the steepness of the distribution can vary: 50% of your edits could be contributed by 2% of contributors or by 0.002%, the latter showing a stronger imbalance.

I wonder if there are any estimates or rules of thumb for what level of imbalance is problematic when seen from the perspective of community health.

I also wonder if there is research on how technology contributes to such imbalances and how they might be mitigated, e.g. through training, user-friendliness, documentation… (This is based on my assumption that a steep curve is less desirable, since the power is more concentrated, the system more fragile and the redistribution of power more constrained.)

Jan

[1] https://en.wikipedia.org/wiki/1%25_rule_(Internet_culture)

--
Jan Dittrich
UX Design/ Research
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
https://wikimedia.de

Our vision is a world in which all people can share in, use and add to the knowledge of humanity. Help us get there!
https://spenden.wikimedia.de

Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
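
The "50% of edits by 2% or by 0.002%" comparison in the message above can be made concrete with a small calculation. The following is only a minimal sketch: the function name `share_for_half` and the two toy communities are hypothetical and do not come from any Wikimedia dataset. It sorts contributors by edit count and reports the smallest fraction of them needed to reach half of all edits.

```python
def share_for_half(edit_counts, target=0.5):
    """Return the smallest fraction of contributors whose combined
    edits reach `target` (default 50%) of all edits."""
    counts = sorted(edit_counts, reverse=True)  # most active contributors first
    total = sum(counts)
    if total == 0:
        raise ValueError("no edits to analyse")
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running >= target * total:
            return rank / len(counts)
    return 1.0


# Hypothetical toy communities (illustrative numbers only).
flat_community = [10] * 100            # everyone contributes equally
steep_community = [1000] + [1] * 99    # one contributor dominates

print(share_for_half(flat_community))   # half of the contributors are needed
print(share_for_half(steep_community))  # a single contributor is enough
```

A smaller return value means a steeper, more concentrated distribution. Whether any particular value is "problematic" for community health is exactly the open question raised in the message; the sketch only measures the imbalance.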

New Office hours for WMF/Research&Analytics starting in January 2020
by Martin Gerlach 22 Jan '20

22 Jan '20

Hi all,

We, the Research team at Wikimedia Foundation, have received some requests over the past months to make ourselves more available to answer some of the research questions that you as Wikimedia volunteers, affiliates' staff, and researchers face in your projects and initiatives. Starting January 2020, we will experiment with monthly office hours organized jointly by our team and the Analytics team where you can join us and direct your questions to us. We will revisit this experiment in June 2020 to assess whether to continue it or not.

The scope

We encourage you to attend the office hour if you have research-related questions. These can be questions about our teams, our projects, or more importantly questions about your projects or ideas that we can support you with during the office hours. You can also ask us questions about how to use a specific dataset available to you, to answer a question you have, or some other question.

Note that the purpose of the office hours is to answer your questions during the dedicated time of the office hour. Questions that may require many hours of back-and-forth between our team and you are not suited for this forum. For these bigger questions, however, we are happy to brainstorm with you in the office hour and point you to some good directions to explore further on your own (and maybe come back in the next office hour and ask more questions).

Time and Location

We meet on the 4th Wednesday of every month 17.00-18.00 (UTC) in the #wikimedia-research IRC channel on freenode [1]. The first meeting will be on January 22. Up-to-date information is on mediawiki [2].

Archiving

If you miss the office hour, you can read the logs of it at [3].

Future announcements about these office hours will only go to the following lists, so please make sure you're subscribed to them if you would like to receive a ping:
* wiki-research-l mailing list [4]
* analytics mailing list [5]
* wikidata mailing list [6]
* the Research category in Space [7]

on behalf of Research and Analytics,
Martin

[1] irc://irc.freenode.net/wikimedia-research
[2] https://www.mediawiki.org/wiki/Wikimedia_Research/Office_hours
[3] https://wm-bot.wmflabs.org/logs/%23wikimedia-research/
[4] https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
[5] https://lists.wikimedia.org/mailman/listinfo/analytics
[6] https://lists.wikimedia.org/mailman/listinfo/wikidata
[7] https://discuss-space.wmflabs.org/tags/research

--
Martin Gerlach
Research Scientist
Wikimedia Foundation
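
For anyone who wants to put the "4th Wednesday of every month, 17.00 UTC" schedule into a calendar, the date arithmetic can be sketched as follows. This is only an illustration using the Python standard library; the function name `office_hour_start` is hypothetical, and the page linked as [2] remains the authoritative source for the actual dates.

```python
from datetime import datetime, timezone

WEDNESDAY = 2  # datetime.weekday(): Monday is 0, so Wednesday is 2


def office_hour_start(year, month):
    """Return 17:00 UTC on the fourth Wednesday of the given month."""
    first_of_month = datetime(year, month, 1, 17, 0, tzinfo=timezone.utc)
    # Days from the 1st of the month to the first Wednesday (0..6).
    offset = (WEDNESDAY - first_of_month.weekday()) % 7
    # The fourth Wednesday falls three weeks after the first one.
    return first_of_month.replace(day=1 + offset + 21)


# The announcement gives January 22 as the first meeting.
print(office_hour_start(2020, 1).isoformat())
```

The fourth Wednesday always falls between the 22nd and the 28th, so the computed day is valid for every month.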

Upcoming Research Newsletter: New Papers Open For Review
by Mohammed Sadat Abdulai 20 Jan '20

20 Jan '20

Hi everyone,

We’re preparing for the January 2020 research newsletter and looking for contributors. Please take a look at https://etherpad.wikimedia.org/p/WRN202001 and add your name next to any paper you are interested in covering. Our writing deadline is 25 January 23:59 UTC. If you can't make this deadline but would like to cover a particular paper in the subsequent issue, leave a note next to the paper's entry below. As usual, short notes and one-paragraph reviews are most welcome.

Highlights from this month:

- ‘WP2Cochrane’, a tool linking Wikipedia to the Cochrane Library: Results of a bibliometric analysis evaluating article quality and importance
- Building Knowledge Graphs: Processing Infrastructure and Named Entity Linking
- Individual and collaborative information behaviour of Wikipedians in the context of their involvement with Hebrew Wikipedia
- Keeping Community in the Loop: Understanding Wikipedia Stakeholder Values for Machine Learning-Based Systems
- Knowledge curation work in Wikidata WikiProject discussions
- Strangers in a seemingly open-to-all website: the gender bias in Wikipedia
- Understanding Wikipedia as a Resource for Opportunistic Learning of Computing Concepts

Masssly and Tilman Bayer

[1] http://meta.wikimedia.org/wiki/Research:Newsletter
[2] WikiResearch (@WikiResearch) | Twitter

New dataset of articles tagged by WikiProjects
by Isaac Johnson 16 Jan '20

16 Jan '20

Hey Research Community,

TL;DR New dataset: https://figshare.com/articles/Wikipedia_Articles_and_Associated_WikiProject…

More details:

I wanted to notify everyone that we have published a dataset of the articles on English Wikipedia that have been tagged by WikiProjects [1] through templates on their associated talk pages. We are not planning to make this an ongoing release, but I have provided the script that I used to generate it in the Figshare item so that others might update / adjust it to meet their needs. As anyone who has done research on WikiProjects knows, it can be complicated to determine what articles fit under a particular WikiProject's purview.

The motivation for generating this dataset was to support our work in developing topic models for Wikipedia (see [2] for an overview), but we imagine that there are many other ways in which this dataset might be useful:

* Previous work has examined how active WikiProjects are based on edits to their pages in the Wikipedia namespace. This dataset makes it much easier to identify which WikiProjects are managing the most valuable articles on Wikipedia (in terms of quality or pageviews).
* Many topic-level analyses of Wikipedia rely on the category network. Categories can be very messy and difficult to work with, but WikiProjects represent an alternative that often is simpler and still quite rich. For instance, this could be used for temporal analyses of article quality, demand, or distribution by topic.
* While WikiProjects are English-only and therefore limited in their utility to other languages, we also provide the Wikidata ID and sitelinks -- i.e. titles for corresponding articles in other languages -- to allow for multilingual analyses. This could be used to compare gaps in coverage -- e.g., akin to past work that has used categories [3].

The main challenge, besides processing time, is how to 1) effectively extract the WikiProject templates from talk pages, and 2) consistently link them to a canonical WikiProject name and topic. For example, the canonical template for WikiProject Medicine is https://en.wikipedia.org/wiki/Template:WikiProject_Medicine but another one used is https://en.wikipedia.org/w/index.php?title=Template:WPMED&redirect=no (and there are 13 more). To capture articles tagged with any of these templates and link them all to the same canonical WikiProject and eventually a higher-level topic, we built a near-complete list of WikiProjects based on the WikiProject Directory [4] and gathered all of their associated templates. We purposefully excluded WikiProjects under the assistance / maintenance category [5]. Then, when parsing talk pages from the dump files, we check for any of these templates and list them under their canonical name. As a backup, we also employ case-insensitive string matching with "WP" and "WikiProject", which helps to guarantee that we did not miss any WikiProjects but introduces a number of false positives as well.

If you wish to map the WikiProjects listed in the dataset to their higher-level topics, the mapping is in the figshare item and code that allows you to do that can be found here: https://github.com/wikimedia/drafttopic/blob/master/drafttopic/utilities/ta…

[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council
[2] https://dl.acm.org/doi/10.1145/3274290
[3] https://meta.wikimedia.org/wiki/Research:Newsletter/2019/September#Wikipedi…
[4] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory
[5] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Wikip…

Best,
Isaac

--
Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
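
The two-stage matching described in the message above (a lookup of known template names, then a looser "WP" / "WikiProject" string match as a backup) might look roughly like the sketch below. This is not the script published with the dataset: the alias table, the regular expression, the prefix-based fallback and the function name `wikiprojects_on_talk_page` are all assumptions made for illustration. Only the two Medicine template names are taken from the message.

```python
import re

# Hypothetical alias table mapping lower-cased template names to a canonical
# WikiProject. The message mentions 13 further Medicine templates that are
# not listed here.
CANONICAL = {
    "wikiproject medicine": "WikiProject Medicine",
    "wpmed": "WikiProject Medicine",
}

# Capture the template name between "{{" and the first "|" or "}}".
TEMPLATE_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def wikiprojects_on_talk_page(wikitext):
    """Return (known, fallback): canonical names found via the alias table,
    and raw template names caught only by the looser prefix match."""
    known, fallback = set(), set()
    for match in TEMPLATE_RE.finditer(wikitext):
        # Treat underscores as spaces and collapse repeated whitespace.
        name = " ".join(match.group(1).replace("_", " ").split())
        key = name.lower()
        if key in CANONICAL:
            known.add(CANONICAL[key])
        elif key.startswith(("wp", "wikiproject")):
            # Assumed form of the case-insensitive backup match; as the
            # message notes, this kind of match produces false positives.
            fallback.add(name)
    return known, fallback


talk_page = "{{Talk header}}\n{{WPMED|class=B}}\n{{WP Biography|living=yes}}"
known, fallback = wikiprojects_on_talk_page(talk_page)
print(known)     # canonical names resolved through the alias table
print(fallback)  # candidates that still need manual review
```

Keeping the fallback matches in a separate set makes it possible to review the likely false positives without mixing them into the canonical results.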

Re: [Wiki-research-l] Availability of hourly pagecounts files
by James Salsman 12 Jan '20

12 Jan '20

That's fascinating, John; thank you. I'm copying this to wiki-research-l and Fabian Suchanek, who gave the first part of the Research Showcase last month.

What do you like for coding stories? https://quanteda.io/reference/dfm.html ? Sentiment is hard because errors are often 180 degrees away from correct.

How do you both feel about Soru et al (June 2018) "Neural Machine Translation for Query Construction and Composition" https://www.researchgate.net/publication/326030040 ?

On Sat, Jan 11, 2020 at 3:46 PM John Urbanik <johnurbanik(a)gmail.com> wrote:
>
> Jim,
>
> I used to work as the chief data scientist at Collin's company.
>
> I'd suggest looking at things like relationships between the views / edits for sets of pages as well as aggregating large sets of page views for different pages in various ways. There isn't a lot of literature that is directly applicable, and I can't disclose the precise methods being used due to NDA.
>
> In general, much of the pageview data is Weibull or GEV distributed on top of being non-stationary, so I'd suggest looking into papers from extreme value theory literature as well as literature around Hawkes/Queue-Hawkes processes. Most traditional ML and signal processing is not very effective without doing some pretty substantial pre-processing, and even then things are pretty messy, depending on what you're trying to predict; most variables are heteroskedastic w.r.t. pageviews and there are a lot of real world events that can cause false positives.
>
> Further, concept drift is pretty rapid in this space and structural breaks happen quite frequently, so the reliability of a given predictor can change extremely rapidly. Understanding how much training data to use for a given prediction problem is itself a super interesting problem since there may be some horizon after which the predictor loses power, but decreasing the horizon too much means overfitting and loss of statistical significance.
>
> Good luck!
>
> John
