Hi,
I just joined this list thanks to Dan Andreescu, who let me know
about it, and I have a question about processing clickstream data.
I downloaded last month's clickstream data file
(https://dumps.wikimedia.org/other/clickstream/2022-12/clickstream-eswiki-20…)
and am having problems opening and processing it.
The only programme I could open it with was OpenRefine. Other programmes
(Numbers and LibreOffice) just couldn't cope with it.
I can use OpenRefine to do some transformations and delete some rows I
don't need, but even then, with some 1.5 million rows, I cannot open it
with Numbers or LibreOffice to sum column 4.
Which tools do you use to work with such big files?
Thanks.
--
========================
Robert Garrigós i Castro
https://garrigos.cat
+34 620 91 87 01
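For a file this size, one option is to skip the spreadsheet and stream the rows through a short script, so the whole file never has to fit in memory. The following is only a minimal Python sketch: it assumes the download is a gzip-compressed, tab-separated file whose fourth column holds integer counts. The function names and the file name in the usage comment are hypothetical, not taken from the message above.

```python
import csv
import gzip


def sum_column(lines, column_index=3, delimiter="\t"):
    """Sum the integer values found in one column of delimited text lines.

    column_index is zero-based, so 3 means "column 4".
    Rows that are too short or hold a non-integer value are skipped.
    """
    total = 0
    # QUOTE_NONE: treat quote characters in page titles as ordinary text.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= column_index:
            continue
        try:
            total += int(row[column_index])
        except ValueError:
            continue  # e.g. a header row
    return total


def sum_gzipped_tsv(path, column_index=3):
    """Stream a gzip-compressed TSV file from disk and sum one column."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return sum_column(handle, column_index)


# Usage (hypothetical file name, assumed to be in the current directory):
# print(sum_gzipped_tsv("clickstream.tsv.gz"))
```

Because the rows are read one at a time, the row count matters much less than it does in a spreadsheet; filtering rows before summing would be an extra `if` inside the loop.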
Hi:
This is my first time working with Wikimedia analytics, and I found a case
that looks weird to me: in some cases I can't get data from the API.
- en.wikivoyage.org
- Culturally significant landscapes in Jaén
- 2022121700
- API call: https://w.wiki/6DjC
I got the «The date(s) you used are valid, but we either do not have data
for those date(s)» message, which looks strange to me because the resource
exists, as can be checked for the previous day:
- 2022121600
- API call: https://w.wiki/6DjE
If there were no visits for 2022121700, I would have expected a successful
response with value=0.
Is this the expected behavior, or have I found a glitch? I found a few other
cases, so I prefer to ask here.
Thanks.
--
Ismael Olea
http://olea.org/diario/
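If a zero-valued series is what the consumer needs, one workaround is to fill the gaps on the client side. The following is a minimal Python sketch under two assumptions that this thread does not confirm: that the short links above point at the per-article endpoint of the Wikimedia Pageviews REST API, and that a day missing from the response means no recorded views. `build_url` and `fill_missing_days` are hypothetical helper names.

```python
from datetime import date, timedelta
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_url(project, article, start, end,
              access="all-access", agent="all-agents"):
    """Build a daily per-article pageviews URL (timestamps as YYYYMMDDHH)."""
    # Titles use underscores for spaces; everything else is percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/daily/{start}/{end}"


def fill_missing_days(items, start, end):
    """Map every day in [start, end] to its view count, defaulting to 0.

    items is the "items" list of a response; each entry is assumed to
    carry a "timestamp" (YYYYMMDDHH) and a "views" field.
    """
    views = {item["timestamp"][:8]: item["views"] for item in items}
    filled = {}
    day = start
    while day <= end:
        key = day.strftime("%Y%m%d")
        filled[key] = views.get(key, 0)  # assumption: missing means zero
        day += timedelta(days=1)
    return filled
```

Requesting a range that spans several days and then filling the gaps avoids issuing one request per day, where a single day without data would otherwise surface as an error response.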
Hi:
Do we have tools, metrics, or traces that track how the quality of
articles evolves over time, or something similar? I'm not sure whether the
ORES technology is appropriate for this.
--
Ismael Olea
http://olea.org/diario/
Hello everyone,
The next Research Showcase, focused on Editor Retention, will be
live-streamed Wednesday, January 18. Find your local time here
<https://zonestamp.toolforge.org/1674063059>.
YouTube stream: https://www.youtube.com/watch?v=gS8ELcVZ8Q4
You can join the conversation on IRC at #wikimedia-research. You can also
watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Vital Signs: Measuring Wikipedia Communities’ Health
By *Cristian Consonni, Eurecat - Centre Tecnològic de Catalunya, Barcelona*
Community health in
Wikipedia is a complex topic that has been at the center of discussion for
Wikipedia and the scientific community for years. Researchers observed that
the number of active editors for the largest Wikipedias started declining
after an initial phase of exponential growth. Some media outlets picked
this fact as a death announcement for the project, but the news of
Wikipedia's death turned out to be greatly exaggerated. However, it remains
true that researchers and community activists need to understand how to
measure community health and describe it more accurately. In this
presentation, we would like to go beyond the traditional metrics used to
describe the status of the community. We propose the creation of 6 sets of
language-independent indicators that we call "Vital Signs." We borrow the
analogy from the medical field, as these indicators represent a first step
in defining the health status of a community; they can constitute a
valuable reference point to foresee and prevent future risks. We present
our analysis for several Wikipedia language editions, showing that
communities renew their productive force even with stagnating absolute
numbers; we observe a general need for renewal in positions related to
particular functions or administratorship. We created a dashboard to
visualize all the indicators we have computed and hope that the communities
will find it helpful for improving their health.
- Paper: Community Vital Signs: Measuring Wikipedia Communities’
Sustainable Growth and Renewal
<https://meta.wikimedia.org/wiki/File:Community_Vital_Signs_Research_Paper_-…>
Learning to Predict the Departure Dynamics of Wikidata Editors
By *Guangyuan Piao, Maynooth University*
Wikidata as one of the largest open collaborative
knowledge bases has drawn much attention from researchers and practitioners
since its launch in 2012. As it is collaboratively developed and maintained
by a community of a great number of volunteer editors, understanding and
predicting the departure dynamics of those editors are crucial but have not
been studied extensively in previous works. In this paper, we investigate
the synergistic effect of two different types of features: statistical and
pattern-based ones with DeepFM as our classification model which has not
been explored in a similar context and problem for predicting whether a
Wikidata editor will stay or leave the platform. Our experimental results
show that using the two sets of features with DeepFM provides the best
performance regarding AUROC (0.9561) and F1 score (0.8843), and achieves
substantial improvement compared to using either of the sets of features
and over a wide range of baselines.
- Paper: Learning to Predict the Departure Dynamics of Wikidata Editors
<https://parklize.github.io/publications/ISWC2021.pdf>
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hi:
I'm completely new to analytics in Wikimedia.
We are working with a heritage institution in a GLAM project and they are
interested in access statistics for the resources they have released in
Wikimedia. I think I understand the pageviews concept and how to use it,
but as far as I can tell, it's not possible to get breakdowns such as
article pageviews per country. Is this correct?
If so, how could I get (or process) the information needed to
produce that data?
Also, I'm reading about the resulting format[1] but I can't find the
related logs.
Any suggestions? Thanks.
[1] https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
--
Ismael Olea
http://olea.org/diario/
Hi all,
The next Research Showcase will be live-streamed next Wednesday, December
14. Find your local time here <https://zonestamp.toolforge.org/1671039024>.
The title of the Showcase is, 'A year in review from the WMF Research team:
Tying our work to the research community.'
The Wikimedia Research community is key to tackling the many strategic
challenges of the Wikimedia movement. As we are ending the year, the
Research team will reflect on why working with the community is important
to us. We will share the initiatives, tools, and resources developed
throughout 2022 to bring the community together, facilitate researchers’
contributions to the Wikimedia projects, and encourage a diversity of
research questions.
YouTube stream: https://www.youtube.com/watch?v=a0ss9ckUlvQ
You can join the conversation on IRC at #wikimedia-research. You can also
watch our past Showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
Warm regards,
Emily
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hello everyone,
The next Research Showcase will be live-streamed Wednesday, November 16, at
9:30 AM PST/16:30 UTC. Find your local time here
<https://zonestamp.toolforge.org/1668619830>.
YouTube stream: https://www.youtube.com/watch?v=sFanZoHjUnY
Members of the Research team will collect questions on IRC at
#wikimedia-research and YouTube.
This month's theme is 'Libraries and Wikipedia Knowledge.'
In the first talk, Laurie Bridges (Oregon State University) and Michael
David Miller (McGill University) will co-present on Wikipedia and Academic
Libraries.
Abstract: In 2021 an open-access edited book, Wikipedia and Academic
Libraries: A Global Project <https://doi.org/10.3998/mpub.11778416>, was
published, featuring 20 chapters from over 50 authors. In this
presentation, Laurie Bridges, one of the co-editors, will discuss the
process for creating and publishing an OA-edited book. Michael David
Miller, one of the chapter authors, will discuss his chapter about
contributions to local Québécois LGBTQ+ content in Francophone Wikipedia.
The second talk will be on Ethical Considerations of Including Gender
Information in Open Knowledge Platforms, presented by Nerissa Lindsey (San
Diego State University).
Abstract: In recent years, galleries, libraries, archives, and museums
(GLAMs) have sought to leverage open knowledge platforms such as Wikidata
to highlight or provide more visibility for traditionally marginalized
groups and their work, collections, or contributions. Efforts like Art +
Feminism, local edit-a-thons, and, more recently, GLAM institution-led
projects have promoted open knowledge initiatives to a broader audience of
participants. One such open knowledge project, the Program for Cooperative
Cataloging (PCC) Wikidata Pilot, has brought together over seventy GLAM
organizations to contribute linked open data for individuals associated
with their institutions, collections, or archives. However, these projects
have brought up ethical concerns around including potentially sensitive
personal demographic information, such as gender identity, sexual
orientation, race, and ethnicity, in entries in an open knowledge base
about living persons. GLAM institutions are thus in a position of balancing
open access with ethical cataloging, which should include adhering to the
personal preferences of the individuals whose data is being shared. People
working in libraries and archives have been increasingly focusing their
energies on issues of diversity, equity, and inclusion in their descriptive
practices, including remediating legacy data and addressing biased
language. Moving this work into a more public sphere and scaling up in
volume creates potential risks to the individuals being described. While
adding demographic information on living people to open knowledge bases has
the potential to enhance, highlight, and celebrate diversity, it could also
potentially be used to the detriment of the subjects through surveillance
and targeting activities. In our research we investigated the changing role
of metadata and open knowledge in addressing, or not addressing, issues of
under- and misrepresentation, especially as they pertain to gender identity
as described in the sex or gender property in Wikidata. We reported our
findings from a survey investigating how organizations participating in
open knowledge projects are addressing ethical concerns around including
personal demographic information as part of their projects, including what,
if any, policies they have implemented and what implications these
activities may have for the living people being described.
You can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
We hope you can join us!
Warm regards,
Emily, on behalf of the WMF Research team
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation
Hello,
For an academic research project, I'd like to see which images are the most viewed through the "media viewer".
Do you know if it’s possible to get this information? I looked on the wikitech portal, but I found only the mediacounts dataset (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts), which is not what I’m looking for.
Thank you
Michele