Analytics

analytics@lists.wikimedia.org

1815 discussions

Scheduled downtime for Hadoop and analytics services tomorrow (2023/03/28) from 13:30 UTC

by Ben Tullis

Hello, Tomorrow the SRE team will be carrying out an upgrade of the switches in eqiad row B: (https://phabricator.wikimedia.org/T330165) at 14:00 UTC. The network outage to this row resulting from this work is expected to be around 30 minutes, all being well. In support of this work, the Data Engineering team will be putting the HDFS file system into safe mode at approximately 13:30 UTC tomorrow, which means that write operations to the cluster will be refused. Jobs sent to the YARN cluster will also be refused from around the same time, so please try to plan any work that you may have for the cluster to avoid this maintenance window. Some additional internal-facing services for analytics such as Hive, Superset, Presto, and the Druid-analytics cluster will also be largely unavailable for some periods while the switch upgrade takes place. The public-facing Analytics Query Service (AQS) will continue to function, albeit with a degraded response to some queries. However, Wikistats (stats.wikimedia.org) will be unavailable whilst the switch upgrade is in progress. Finally, two of the stats servers, stat1007 and stat1009, will be unavailable, so please save any work that you may have on these servers before the loss of connectivity. Please do reach out via any of the normal channels (email: analytics(a)lists.wikimedia.org , IRC: #wikimedia-analytics , Slack #data-engineering ) if you have any queries or concerns. Kind regards, Ben -- *Ben Tullis*(he/him) Senior Site Reliability Engineer Wikimedia Foundation <https://wikimediafoundation.org/>

1 year, 1 month

Wikimania 2023 and importance of Data Analytics

by bustrias＠gmail.com

Hi Folks, We encourage each and every one of you to create a program submission. You can submit an interactive workshop or panel, a lecture, a short lightning talk or a poster for our dedicated poster session. Submissions are catered to both onsite and online (live or pre-recorded) or a hybrid combination. We would love to see submissions from all over the world, and this year there is an 'Open Data' track for projects relating to Linked Open Data. The theme for this year's Wikimania is Diversity, Collaboration, Future. Topics that strengthen collaboration on Open Data, including Data Analytics, are topics we would like to see this year. Session submissions for Wikimania 2023 are open until 28 March. Visit the following links for further info: Wiki page: https://wikimania.wikimedia.org/wiki/2023:Program/Submissions Diff post: https://diff.wikimedia.org/2023/02/28/be-part-of-the-wikimania-2023-program/ Program Submission Form: https://pretalx.com/wm2023/cfp Kind regards, Butch Bustria Chair, Program Subcommittee Event lead, ESEAP Wikimania 2023 Core Organizing Team

1 year, 1 month

Wikigrowth updated

by fdansv＠gmail.com

Hi friends, Just a quick note that the Wikigrowth site has been updated to include wiki page creation data from 2021 and 2022. https://francisco.dance/wikigrowth/ Sorry for the two year hiatus. Any suggestions to improve the tool and make it more useful (or even merge it with a current site) are always welcome. Much love, Fran Dans

1 year, 1 month

[Wikimedia Research Showcase] March 15

by Emily Lescak

Hi all, The next Research Showcase, focused on Gender and Equity on Wikipedia, will be live-streamed Wednesday, March 15, at 9:30 AM PST / 16:30 UTC. Find your local time here <https://zonestamp.toolforge.org/1678897840>. YouTube stream: https://www.youtube.com/watch?v=lw4MzJgDIzo You can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase This month's presentations: Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia By *Jiao Sun, University of Southern California* Human activities can be seen as sequences of events, which are crucial to understanding societies. Disproportional event distribution for different demographic groups can manifest and amplify social stereotypes, and potentially jeopardize the ability of members in some groups to pursue certain goals. In this paper, we present the first event-centric study of gender biases in a Wikipedia corpus. To facilitate the study, we curate a corpus of career and personal life descriptions with demographic information consisting of 7,854 fragments from 10,412 celebrities. Then we detect events with a state-of-the-art event detection model, calibrate the results using strategically generated templates, and extract events that have asymmetric associations with genders. Our study discovers that the Wikipedia pages tend to intermingle personal life events with professional events for females but not for males, which calls for the awareness of the Wikipedia community to formalize guidelines and train the editors to mind the implicit biases that contributors carry. Our work also lays the foundation for future works on quantifying and discovering event biases at the corpus level. - Paper: Sun, J. & Peng, N. (2021). Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia. 
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Conference on Natural Language Processing, 350-360. <https://aclanthology.org/2021.acl-short.45.pdf> Twitter reacts to absence of women on Wikipedia: a mixed-methods analysis of #VisibleWikiWomen campaign By *Sneh Gupta, Guru Gobind Singh Indraprastha University* Digital gender divide (DGD) is visible in access, participation, representation, and biases against women embedded in Wikipedia, the largest digital reservoir of co-created content. This article examined the content of #VisibleWikiWomen, a global digital advocacy campaign aimed at encouraging inclusion of women voices in the global technology conversation and improving digital sustainability of feminist data on Wikipedia. In a mixed-methods study, Sentiment Analysis followed by a Feminist Critical Discourse Analysis of the campaign tweets reveals how digital gender divide manifested in the public response. An overwhelming majority of tweets expressed positive sentiment towards the objective of the campaign. An inductive reading of the coded tweets (n = 1067) generated five themes: Feminist Activism, Invisibility & Marginalization of Women, Technology for Women Empowerment, Gendered Knowledge Inequity, and Power Dynamics in the Digital Sphere. Twitter discourse presented many agitated digital users calling out the epistemic injustice on Wikipedia that goes beyond the invisibility of women. Their tweets reveal that they want an equal social platform inclusive of women of color and varied identities currently absent in the Wikipedia universe. Extracting ideas, values, and themes from new media campaigns holds unparalleled potential in the diffusion of interventions and messages on a larger scale. - Paper: Gupta, S., & Trehan, K. (2022). Twitter reacts to absence of women on Wikipedia: a mixed-methods analysis of #VisibleWikiWomen campaign. Media Asia, 49(2), 130-154. 
<https://www.researchgate.net/publication/356909618_Twitter_reacts_to_absenc…> Warm regards, Emily -- Emily Lescak (she / her) Senior Research Community Officer The Wikimedia Foundation

1 year, 1 month

The Data Lake will be read-only today at approximately 13:50 UTC

by Ben Tullis

Hello and apologies for the short notice. We are required to put HDFS into safe mode at approximately 13:50 UTC today, which means that the file system will be read-only. This might be for as little as 30 minutes, but the maintenance window we're working within is for up to 2 hours, so the actual period of read-only access will depend on the outcome of the eqiad row A switches upgrade (https://phabricator.wikimedia.org/T329073) by the Infrastructure Foundations team. We will be pausing ingestion to the Data Lake a little ahead of this time, so there will be a delay in dataset availability on HDFS, Cassandra, Druid, etc. Apologies for any inconvenience that this disruption to service will cause you. Please do let us know by reply to this list or in #wikimedia-analytics on IRC if you have any queries, or would like to follow along with our support of the maintenance work. Kind regards, Ben Tullis -- *Ben Tullis*(he/him) Senior Site Reliability Engineer Wikimedia Foundation <https://wikimediafoundation.org/>

1 year, 2 months

API Outages

by Joshua Haecker

Hi all, Just curious if there is a known cause for the multiple long delays we've had on the AQS API data being available this week? I know periodic delays are not uncommon but these seem beyond normal levels. Thanks! ~Josh

1 year, 2 months

Fwd: [Wiki-research-l] [events] Wiki Workshop 2023 Call for Papers

by Leila Zia

Hi all, Please see the call for papers for the 10th edition of Wiki Workshop below. The call is for extended abstracts (2 pages) of ongoing or completed work. The deadline is March 23. The submissions are non-archival which means you can submit work that is already published as well! :) Submit and join us in conversations about research on the Wikimedia projects. Best, Leila -- Leila Zia Head of Research Wikimedia Foundation ---------- Forwarded message --------- From: Martin Gerlach <mgerlach(a)wikimedia.org> Date: Mon, Feb 20, 2023 at 1:29 AM Subject: [Wiki-research-l] [events] Wiki Workshop 2023 Call for Papers To: <wiki-research-l(a)lists.wikimedia.org> Hi everyone, The call for papers for the 10th Wiki Workshop in 2023 is out: https://wikiworkshop.org/2023/#call Submit your 2-page abstracts by March 23 (all submissions are non-archival). The workshop will take place on May 11, 2023. For more information, see the workshop website [1]. If you have questions about the workshop, please let us know on this list or at wikiworkshop(a)googlegroups.com. Looking forward to seeing many of you in this year's edition. Best, Pablo Aragón, Wikimedia Foundation Martin Gerlach, Wikimedia Foundation Evelin Heidel, Wikimedistas de Uruguay Emily Lescak, Wikimedia Foundation Francesca Tripodi, University of North Carolina Bob West, EPFL Leila Zia, Wikimedia Foundation [1] https://wikiworkshop.org/2023/ — We invite contributions to the 10th edition (!) of Wiki Workshop, which will take place virtually on May 11, 2023 (tentatively 12:00-19:00 UTC). Wiki Workshop is the largest Wikimedia research event of the year, aimed at bringing together researchers who study all aspects of Wikimedia projects (including, but not limited to, Wikipedia, Wikidata, Wikimedia Commons, Wikisource, and Wiktionary) as well as Wikimedia developers, affiliate organizations, and volunteer editors. 
Co-organized by the Wikimedia Foundation’s Research team and members of the Wikimedia research community, the workshop facilitates a direct pathway for exchanging ideas between the organizations that serve Wikimedia projects and the researchers actively studying them. New this year: Building on the successful experiences of organizing Wiki Workshop in 2015 <https://wikiworkshop.org/2015/>, 2016 <https://wikiworkshop.org/2016/>, 2017 <https://wikiworkshop.org/2017/>, 2018 <https://wikiworkshop.org/2018/>, 2019 <https://wikiworkshop.org/2019/> , 2020 <https://wikiworkshop.org/2020/>, 2021 <https://wikiworkshop.org/2021/>, and 2022 <https://wikiworkshop.org/2022/> and based on feedback from authors and participants over the years, we are introducing a few updates to the research track of the workshop for 2023: - This 10th edition will take place as a standalone event (rather than in co-location with a conference, as in previous years). - We have changed the format of submissions and will only accept 2-page extended abstracts (following the successful IC2S2 model). - Submissions are non-archival, so we welcome ongoing, completed, and already published work. - We are excited to share that the authors of Wiki Workshop 2023 will have the opportunity to receive feedback, improve their work, and submit the extended version of their research paper to a special issue of the ACM Transactions on the Web, which will have a dedicated open call for papers later in 2023. 
Topics include, but are not limited to: - new technologies and initiatives to grow content, quality, equity, diversity, and participation across Wikimedia projects - use of bots, algorithms, and crowdsourcing strategies to curate, source, or verify content and structured data - bias in content and gaps of knowledge on Wikimedia projects - relation between Wikimedia projects and the broader (open) knowledge ecosystem - exploration of what constitutes a source and how/if the incorporation of other kinds of sources are possible (e.g., oral histories, video) - detection of low-quality, promotional, or fake content (misinformation or disinformation), as well as fake accounts (e.g., sock puppets) - questions related to community health (e.g., sentiment analysis, harassment detection, tools that could increase harmony) - motivations, engagement models, incentives, and needs of editors, readers, and/or developers of Wikimedia projects - innovative uses of Wikipedia and other Wikimedia projects for AI and NLP applications and vice versa - consensus-finding and conflict resolution on editorial issues - dynamics of content reuse across projects and the impact of policies and community norms on reuse - privacy, security, and trust - collaborative content creation - innovative uses of Wikimedia projects' content and consumption patterns as sensors for real-world events, culture, etc. - open-source research code, datasets, and tools to support research on Wikimedia contents and communities - connections between Wikimedia projects and the Semantic Web - strategies for how to incorporate Wikimedia projects into media literacy interventions This year’s Wiki Workshop solicits extended abstracts (PDF format, maximum 2 pages, including references). Submissions that exceed the 2-page limit will be automatically rejected. Authors may include 1 additional page with figures and/or tables (including captions) only. 
Initial submissions require names and affiliations of authors, 5 keywords, a title, abstract, and a main text outlining the contribution, methods, findings, and impact of the work, whichever is relevant. Submissions will be non-archival and as a result may have already been published, under review, or ongoing research. All submissions will be reviewed by multiple members of the Wiki Workshop Program Committee. The names of the authors will be revealed to the reviewers, whereas reviewers will remain anonymous to authors. Authors of accepted abstracts will be invited to present their research in a pre-recorded oral presentation with dedicated time for live Q&A on May 11, 2023. Accepted abstracts may be shared on the website prior to the event. The template for formatting the submission as well as the submission link to easychair will be made available by February 23. -- Martin Gerlach (he/him) | Senior Research Scientist | Wikimedia Foundation _______________________________________________ Wiki-research-l mailing list -- wiki-research-l(a)lists.wikimedia.org To unsubscribe send an email to wiki-research-l-leave(a)lists.wikimedia.org

1 year, 2 months

[Wikimedia Research Showcase] February 15 at 9:30AM PT, 17:30 UTC

by Emily Lescak

Hello everyone, The next Research Showcase will be livestreamed next Wednesday, February 15 at 9:30AM PT / 17:30 UTC. The theme is The Free Knowledge Ecosystem. YouTube stream: https://www.youtube.com/watch?v=8VJmR-3lTac We welcome you to join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase This month's presentations: The evolution of humanitarian mapping in OpenStreetMap (OSM) and how it affects map completeness and inequalities in OSM By *Benjamin Herfort, Heidelberg Institute for Geoinformation Technology* Mapping efforts of communities in OpenStreetMap (OSM) over the previous decade have created a unique global geographic database, which is accessible to all with no licensing costs. The collaborative maps of OSM have been used to support humanitarian efforts around the world as well as to fill important data gaps for implementing major development frameworks such as the Sustainable Development Goals (SDGs). Besides the well-examined Global North - Global South bias in OSM, the OSM data as of 2023 shows a much more spatially diverse spread pattern than previously considered, which was shaped by regional, socio-economic and demographic factors across several scales. Humanitarian mapping efforts of the previous decade have already made OSM more inclusive, contributing to diversify and expand the spatial footprint of the areas mapped. However, methods to quantify and account for the remaining biases in OSM’s coverage are needed so that researchers and practitioners will be able to draw the right conclusions, e.g. about progress towards the SDGs in cities. Dataset reuse: Toward translating principles to practice By *Laura Koesten, University of Vienna* The web provides access to millions of datasets. These data can have additional impact when used beyond the context for which they were originally created. 
But using a dataset beyond the context in which it originated remains challenging. Simply making data available does not mean it will be or can be easily used by others. At the same time, we have little empirical insight into what makes a dataset reusable and which of the existing guidelines and frameworks have an impact. In this talk, I will discuss our research on what makes data reusable in practice. This is informed by a synthesis of literature on the topic, our studies on how people evaluate and make sense of data, and a case study on datasets on GitHub. In the case study, we describe a corpus of more than 1.4 million data files from over 65,000 repositories. Building on reuse features from the literature, we use GitHub’s engagement metrics as proxies for dataset reuse and devise an initial model, using deep neural networks, to predict a dataset’s reusability. This demonstrates the practical gap between principles and actionable insights that might allow data publishers and tool designers to implement functionalities that facilitate reuse. We hope you can join us! Warm regards, Emily -- Emily Lescak (she / her) Senior Research Community Officer The Wikimedia Foundation

1 year, 2 months

Re: energy used to store

by Andrew Otto

Hi Willy, (Forwarding your question to the public analytics list for others who might know more.) > Do you have any data that shows how many times audio files were downloaded in 2022? I think your best bet is the Mediacounts dataset <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts>, which is available in a public API <https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests>. E.g., to get the number of audio downloads requested in 2022: https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-refer… However, it doesn't look like data transfer details are available in the Public API. The backing dataset in Hive does have a total_response_size field so you could probably get this info more specifically by querying for it in Hive. Good luck! On Wed, Feb 1, 2023 at 7:11 PM Willy Pao <wpao(a)wikimedia.org> wrote: > Hey Andrew - hope all is going well. I've been working on gathering some > data for Wikimedia's Annual Sustainability Report, and there was a question > that Deb sent over regarding the usage of Audio files. With Jaime's help > from Data Persistence SRE, we were able to figure out some of the numbers > around storage and energy consumption. There was one part I was hoping you > (or someone from your team) might be able to help with though. Do you have > any data that shows how many times audio files were downloaded in 2022? > Much appreciated in advance. > > Thanks, > Willy > > ---------- Forwarded message --------- > From: Deb Tankersley <dtankersley(a)wikimedia.org> > Date: Mon, Jan 30, 2023 at 1:41 PM > Subject: energy used to store > To: Willy Pao <wpao(a)wikimedia.org>, Erin Morris <emorris(a)wikimedia.org>, > Cassie Casares <ccasares(a)wikimedia.org> > > > Hey Willy! > > I got an interesting question (bolded below) from Wikimedia Sweden on the > energy that we use to store and serve audio files. 
Here's their full > comment / question: > > *"As part of my yearly planning for 2023, we are conducting a study >> regarding digitization of audio tapes, which climate footprints the various >> stages in the process generate and whether some of these can be made more >> energy efficient. We have limited the study to audio tapes, because it is a >> prioritized material category and a very data-intensive business, and >> because the limitation hopefully gives us relatively accurate numbers. >> Since we have been publishing digital audio originally from audio tapes on >> Wikimedia Commons for the past few years, I was wondering if there are any >> statistics related to energy consumption and carbon dioxide emissions >> available?* >> >> >> *What we would like to know is how much energy is required in the year >> 2022 to store our total amount of uploaded audio files (with the exception >> of Karl Tirén's phonograph recordings), how many times they have been >> downloaded and how large a total amount of data is involved. We suspect >> that downloading the high-resolution audio files is also relatively data >> intensive. As mentioned, the goal is not to stop this activity, or even >> reduce it without seeing how it looks and then investigating whether there >> are any links in the chain that can be tweaked to possibly reduce the >> climate impact. If numbers cannot be obtained, this is also valuable >> information."* >> > > > I'm not sure if we can narrow down this enough to get them a decent / > solid answer. What are your thoughts? > > > Thanks, > > > Deb > > -- > > deb tankersley (she/her) > > senior program manager, engineering > > Wikimedia Foundation > > > > >
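The suggestion in the reply above, querying the public Mediarequests API for audio requests in 2022, could be sketched roughly as follows. This is only an illustration and not a reconstruction of the truncated link in the message: the base path comes from that link, but the order of the path segments and the example values (`all-referers`, `audio`, `all-agents`, `monthly`, and the dates) are assumptions that should be checked against the Mediarequests documentation linked above. The helper name `mediarequests_aggregate_url` is hypothetical, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import quote

# Base path as it appears (before truncation) in the message above.
BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate"


def mediarequests_aggregate_url(referer, media_type, agent, granularity, start, end):
    """Build an aggregate mediarequests URL (hypothetical helper).

    The segment order used here is an assumption; verify it against the
    AQS Mediarequests documentation before relying on it.
    """
    segments = [referer, media_type, agent, granularity, start, end]
    # Percent-encode each segment so unusual values cannot break the path.
    return BASE_URL + "/" + "/".join(quote(str(s), safe="") for s in segments)


if __name__ == "__main__":
    # Example values are assumptions, chosen to mirror "audio downloads in 2022".
    print(
        mediarequests_aggregate_url(
            "all-referers", "audio", "all-agents", "monthly", "20220101", "20221231"
        )
    )
```

As the reply notes, this API reports request counts only; transfer sizes would still need the `total_response_size` field in the backing Hive dataset.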

1 year, 3 months

best programme to work with data

by Robert Garrigos

Hi, I just joined this list, thanks to Dan Andreescu, who let me know about it, and I have a question on processing clickstream data. I downloaded a file of last month's clickstream data (https://dumps.wikimedia.org/other/clickstream/2022-12/clickstream-eswiki-20…) and had problems opening and processing it. The only programme that could open it was OpenRefine. Other programmes (Numbers and LibreOffice) just couldn't cope with it. I can use OpenRefine to do some transformation and delete some rows I don't need, but even then, with some 1.5 million rows, I cannot open it with Numbers or LibreOffice to sum column 4. Which tools do you use to work with such big files? Thanks. -- ======================== Robert Garrigós i Castro https://garrigos.cat +34 620 91 87 01
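One possible approach to the question above is to stream the file line by line instead of loading it into a spreadsheet, so memory use stays small regardless of row count. The sketch below assumes the clickstream file is tab-separated text (optionally gzip-compressed) and that the fourth column holds integer counts; the function name `sum_column` and the file name in the usage note are hypothetical.

```python
import gzip
from pathlib import Path


def sum_column(path, column_index=3, delimiter="\t"):
    """Stream a delimited text file and return the integer sum of one column.

    column_index is zero-based, so the default 3 means "column 4".
    """
    path = Path(path)
    # Open gzip-compressed files transparently; treat anything else as plain text.
    opener = gzip.open if path.suffix == ".gz" else open
    total = 0
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split(delimiter)
            # Skip rows that are too short to contain the requested column.
            if len(fields) <= column_index:
                continue
            try:
                total += int(fields[column_index])
            except ValueError:
                # Skip rows whose value is not an integer (for example a header).
                continue
    return total
```

For example, `sum_column("clickstream.tsv.gz")` would return the total of column 4 for a local file with that (hypothetical) name. Only one line is held in memory at a time, so a file with 1.5 million rows should not be a problem.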

1 year, 3 months
