Hi,
I just joined this list, thanks to Dan Andreescu, who let me know
about it, and I have a question about processing clickstream data.
I downloaded last month's clickstream data file
(https://dumps.wikimedia.org/other/clickstream/2022-12/clickstream-eswiki-20…)
and I am having trouble opening and processing it.
The only program that could open it was OpenRefine. Other programs
(Numbers and LibreOffice) simply couldn't cope with it.
I can use OpenRefine to do some transformations and delete rows I
don't need, but even then, with some 1.5 million rows, I cannot open the
file in Numbers or LibreOffice to sum column 4.
Which tools do you use to work with such big files?
Thanks.
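For a file that size, a streaming approach avoids loading everything into memory at once. As a minimal sketch, assuming the usual clickstream TSV layout (prev, curr, type, n, with the count in column 4; the function name is my own):

```python
import csv
import gzip

def sum_clicks(path):
    """Stream a clickstream TSV and sum the counts in column 4,
    without ever holding the whole file in memory."""
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 4:
                total += int(row[3])  # column 4 holds the click count
    return total
```

Because it reads one row at a time, this handles 1.5 million rows without the memory limits that spreadsheet applications run into.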
--
========================
Robert Garrigós i Castro
https://garrigos.cat
+34 620 91 87 01
Hi:
As this is my first time working with Wikimedia analytics, I have found
a case that seems weird to me. In some cases I can't get data from the API.
- en.wikivoyage.org
- Culturally significant landscapes in Jaén
- 2022121700
- API call: https://w.wiki/6DjC
I get the «The date(s) you used are valid, but we either do not have data
for those date(s)» message, which looks strange to me because the resource
exists, as can be checked:
- 2022121600
- API call: https://w.wiki/6DjE
If there were no visits on 2022121700, I would have expected a correct
response with value=0.
Is this the expected behavior, or have I found a glitch? I have found a few
other cases, so I prefer to ask here.
Thanks.
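For what it's worth, the per-article endpoint of the Wikimedia Pageviews REST API does answer with a 404 carrying that message when a day has no recorded data, so a common workaround is to treat 404 as zero views. A rough sketch (the function names are mine, and I'm assuming the w.wiki links above point at this endpoint):

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def build_url(project, article, day):
    """Build the per-article daily pageviews URL for one day (YYYYMMDD)."""
    title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return "/".join([API, project, "all-access", "all-agents",
                     title, "daily", day, day])

def daily_views(project, article, day):
    """Return the day's view count, mapping the API's 404
    ('no data for those dates') to zero instead of an error."""
    try:
        with urllib.request.urlopen(build_url(project, article, day)) as resp:
            data = json.load(resp)
        return sum(item["views"] for item in data["items"])
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return 0
        raise
```

Whether the 404 is intended behavior or a glitch is still worth an answer from the list, but this keeps scripts from failing on zero-traffic days.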
--
Ismael Olea
http://olea.org/diario/
Hi:
Do we have tools, metrics, or traces of how the quality of articles
evolves over time, or something along those lines? I'm not sure whether
the ORES technology is appropriate for this.
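In case it helps to experiment, ORES does expose an articlequality model for some wikis (e.g. enwiki) that predicts an assessment class for a given revision, so quality evolution can be traced by scoring revisions over time. A hedged sketch (helper names are mine, and not every wiki has this model):

```python
import json
import urllib.request

ORES = "https://ores.wikimedia.org/v3/scores"

def quality_url(wiki, rev_id):
    """URL for ORES's articlequality prediction for one revision."""
    return f"{ORES}/{wiki}/{rev_id}/articlequality"

def predicted_quality(wiki, rev_id):
    """Fetch the predicted assessment class (e.g. 'Stub', 'B', 'GA', 'FA')
    for a single revision; scoring a series of revisions of the same
    article gives a rough quality-over-time trace."""
    with urllib.request.urlopen(quality_url(wiki, rev_id)) as resp:
        data = json.load(resp)
    score = data[wiki]["scores"][str(rev_id)]["articlequality"]["score"]
    return score["prediction"]
```
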
--
Ismael Olea
http://olea.org/diario/
Hello everyone,
The next Research Showcase, focused on Editor Retention, will be
live-streamed Wednesday, January 18. Find your local time here
<https://zonestamp.toolforge.org/1674063059>.
YouTube stream: https://www.youtube.com/watch?v=gS8ELcVZ8Q4
You can join the conversation on IRC at #wikimedia-research. You can also
watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Vital Signs: Measuring Wikipedia Communities’ Health
By *Cristian Consonni, Eurecat - Centre Tecnològic de Catalunya, Barcelona*
Community health in Wikipedia is a complex topic that has been at the center
of discussion for Wikipedia and the scientific community for years.
Researchers observed that the number of active editors for the largest
Wikipedias started declining after an initial phase of exponential growth.
Some media outlets picked this fact as a death announcement for the project,
but the news of Wikipedia's death turned out to be greatly exaggerated.
However, it remains true that researchers and community activists need to
understand how to measure community health and describe it more accurately.
In this presentation, we would like to go beyond the traditional metrics
used to describe the status of the community. We propose the creation of 6
sets of language-independent indicators that we call "Vital Signs." We
borrow the analogy from the medical field, as these indicators represent a
first step in defining the health status of a community; they can constitute
a valuable reference point to foresee and prevent future risks. We present
our analysis for several Wikipedia language editions, showing that
communities renew their productive force even with stagnating absolute
numbers; we observe a general need for renewal in positions related to
particular functions or administratorship. We created a dashboard to
visualize all the indicators we have computed and hope that the communities
will find it helpful for improving their health.
- Paper: Community Vital Signs: Measuring Wikipedia Communities’
Sustainable Growth and Renewal
<https://meta.wikimedia.org/wiki/File:Community_Vital_Signs_Research_Paper_-…>
Learning to Predict the Departure Dynamics of Wikidata Editors
By *Guangyuan Piao, Maynooth University*
Wikidata, as one of the largest open collaborative knowledge bases, has
drawn much attention from researchers and practitioners since its launch in
2012. As it is collaboratively developed and maintained by a large community
of volunteer editors, understanding and predicting the departure dynamics of
those editors is crucial but has not been studied extensively in previous
works. In this paper, we investigate the synergistic effect of two different
types of features, statistical and pattern-based ones, with DeepFM as our
classification model, which has not been explored in a similar context, for
predicting whether a Wikidata editor will stay or leave the platform. Our
experimental results show that using the two sets of features with DeepFM
provides the best performance regarding AUROC (0.9561) and F1 score
(0.8843), and achieves a substantial improvement over using either set of
features alone and over a wide range of baselines.
- Paper: Learning to Predict the Departure Dynamics of Wikidata Editors
<https://parklize.github.io/publications/ISWC2021.pdf>
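For readers less familiar with the two evaluation metrics quoted in the abstract, AUROC and F1 can be computed from first principles in a few lines. This is purely a didactic sketch of the metric definitions, not the paper's code:

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(y_true, scores):
    """Probability that a random positive is ranked above a random
    negative (ties count half), which equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```
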
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation