Hi,
My name is Emily Chen and I'm a Computer Science Ph.D. student at the University of Southern California. I tried sending this email earlier before I had joined the mailing list, so apologies if this email was sent out twice! I'm currently conducting research on collective attention decay in Wikipedia articles that are more heavily cited by other Wikipedia articles within the Wikipedia ecosystem. This work builds upon the observations made in Candia et al.'s paper "The universal decay of collective memory and attention" (https://www.nature.com/articles/s41562-018-0474-5/), and I have been using the number of page views articles receive as a proxy for attention.
From what I can find, there is a maintained page view data set on dumps.wikimedia.org that spans 2011 to the present, and statistics that Domas Mituzas collected from 2007 to 2016. This data seems to capture the gradual decay in an individual article's page views, but doesn't capture the initial growth of an article's page views. Would you happen to know if there are article page view statistics from the earlier years of Wikipedia (2001-2007) or if there are any general page view statistics from that time frame? Or would you happen to know who I could contact for such a dataset? It would be really interesting to study the temporal page view dynamics over Wikipedia's lifespan alongside my current work in collective attention.
Thank you so much for your time!
Best, Emily Chen
Note that the definition of pageviews has changed several times over the years. Only the data from 2015 to present is strictly comparable. I'm sure some data analysts will chime in with more details. Good luck with your project!
On Jan 15, 2020, at 6:59 PM, Emily Chen echen920@usc.edu wrote:
Hi,
My name is Emily Chen and I'm a Computer Science Ph.D. student at the University of Southern California. I tried sending this email earlier before I had joined the mailing list, so apologies if this email was sent out twice! I'm currently conducting research on collective attention decay in Wikipedia articles that are more heavily cited by other Wikipedia articles within the Wikipedia ecosystem. This work builds upon the observations made in Candia et al.'s paper "The universal decay of collective memory and attention", and I have been using the number of page views articles receive as a proxy for attention.
From what I can find, there is a maintained page view data set on dumps.wikimedia.org that spans 2011 to the present, and statistics that Domas Mituzas collected from 2007 to 2016. This data seems to capture the gradual decay in an individual article's page views, but doesn't capture the initial growth of an article's page views. Would you happen to know if there are article page view statistics from the earlier years of Wikipedia (2001-2007) or if there are any general page view statistics from that time frame? Or would you happen to know who I could contact for such a dataset? It would be really interesting to study the temporal page view dynamics over Wikipedia's lifespan alongside my current work in collective attention.
Thank you so much for your time!
Best, Emily Chen
-- Emily Chen (echen920 [at] usc [dot] edu) Ph.D. Student | Computer Science Viterbi School of Engineering & Information Sciences Institute University of Southern California
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Emily, I believe the pagecount data was never collected in a structured way before 2007. See for example this discussion about some archive data that took some pains to uncover: https://phabricator.wikimedia.org/T232563
If edits per article would work as a proxy for attention, or in combination with views you can extrapolate somehow, we are in the process of vetting and releasing a simple full history of editing on all wikis: https://dumps.wikimedia.org/other/mediawiki_history/readme.html
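As a rough illustration of the edits-as-proxy idea, here is a minimal sketch that counts edits per article per year. It assumes the edit events have already been extracted into (page title, ISO timestamp) pairs; that input shape, the function name, and the sample rows are all hypothetical, and the actual column layout of the history dump is described in its readme rather than reproduced here.

```python
from collections import Counter
from datetime import datetime


def edits_per_article_per_year(events):
    """Count edits for each (page_title, year) pair.

    `events` is an iterable of (page_title, iso_timestamp) tuples, a
    hypothetical pre-extracted form of the edit history (an assumption,
    not the dump's real schema).
    """
    counts = Counter()
    for page_title, iso_timestamp in events:
        year = datetime.fromisoformat(iso_timestamp).year
        counts[(page_title, year)] += 1
    return counts


# Made-up sample events, for illustration only.
sample_events = [
    ("Example article", "2002-03-01T12:00:00"),
    ("Example article", "2002-07-15T08:30:00"),
    ("Example article", "2003-01-02T00:00:00"),
    ("Another article", "2002-05-05T10:00:00"),
]
yearly_edits = edits_per_article_per_year(sample_events)
```

The resulting per-year counts could then be lined up against page view series for the years where both exist, to judge how well edits track views before relying on them for the earlier period.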
On Thu, Jan 16, 2020 at 7:52 AM Ryan Kaldari rkaldari@wikimedia.org wrote:
Note that the definition of pageviews has changed several times over the years. Only the data from 2015 to present is strictly comparable. I'm sure some data analysts will chime in with more details. Good luck with your project!
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
Hi Ryan and Dan,
Thanks for such a prompt response to my email - I really appreciate it.
Ryan - while the definition of pageviews has changed over the years, do you think that these changes in definitions significantly impact the overall pageview trends we might observe?
Dan - thank you for the pointers! I definitely think that edits per article can give me insight into which articles were attracting more of editors' attention, which is especially telling for the period when Wikipedia content was initially being built up. Thank you for spending the time to dig up the archived traffic analysis report - I was actually thinking of using clickstream data to track how temporal shifts in referral origins might contribute to the attention that Wikipedia articles experience. Since I believe the collected clickstream data starts from 2015, these statistics give me a baseline for what these distributions looked like in 2011.
Thanks again for all of your help, time and feedback!
Best,
Emily
On Thu, Jan 16, 2020 at 9:37 AM Dan Andreescu dandreescu@wikimedia.org wrote:
Emily, I believe the pagecount data was never collected in a structured way before 2007. See for example this discussion about some archive data that took some pains to uncover: https://phabricator.wikimedia.org/T232563
If edits per article would work as a proxy for attention, or in combination with views you can extrapolate somehow, we are in the process of vetting and releasing a simple full history of editing on all wikis: https://dumps.wikimedia.org/other/mediawiki_history/readme.html
Analytics mailing list Analytics@lists.wikimedia.org
As for the pageview definition change, the major differences are spelled out here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews#Di...
Basically, the new dumps include mobile traffic, exclude bots to the best of our ability (lots of bots that do not self-identify are not yet detected; that's a work in progress), and do not have any manual patching. You'd have to do an analysis to see how the signal you're tracking carries from the old to the new data. I would guess it's possible at a high enough aggregation level, but for individual pages you may have too many anomalies (skewed towards mobile, lots of bots, etc.).
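One way to start that kind of analysis, as a minimal sketch: aggregate both series to monthly totals and look at the ratio of new to old counts in the months where both are available. The function names, the daily-dictionary input format, and the sample numbers are all assumptions made up for illustration; whether the two definitions overlap in time for a given page is something you'd have to verify against the data.

```python
from collections import defaultdict


def monthly_totals(daily_counts):
    """Sum daily view counts ({'YYYY-MM-DD': views}) into {'YYYY-MM': views}."""
    totals = defaultdict(int)
    for day, views in daily_counts.items():
        totals[day[:7]] += views  # 'YYYY-MM-DD'[:7] is 'YYYY-MM'
    return dict(totals)


def overlap_ratios(legacy_daily, new_daily):
    """For months present in both series, return new_total / legacy_total."""
    legacy = monthly_totals(legacy_daily)
    new = monthly_totals(new_daily)
    return {
        month: new[month] / legacy[month]
        for month in sorted(legacy.keys() & new.keys())
        if legacy[month] > 0  # skip months with no legacy views
    }


# Made-up counts and placeholder dates, for illustration only.
legacy_sample = {"2015-05-01": 100, "2015-05-02": 100, "2015-06-01": 50}
new_sample = {"2015-05-01": 150, "2015-05-02": 150, "2015-07-01": 80}
ratios = overlap_ratios(legacy_sample, new_sample)
```

If the ratio stays roughly stable from month to month, the aggregate signal probably carries over; large swings for a particular page would point to the kind of per-page anomalies mentioned above.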
On Fri, Jan 17, 2020 at 11:00 AM Emily Chen echen920@usc.edu wrote:
Hi Ryan and Dan,
Thanks for such a prompt response to my email - I really appreciate it.
Ryan - while the definition of pageviews has changed over the years, do you think that these changes in definitions significantly impact the overall pageview trends we might observe?
Dan - thank you for the pointers! I definitely think that edits per article can give me insight into which articles were attracting more of editors' attention, which is especially telling for the period when Wikipedia content was initially being built up. Thank you for spending the time to dig up the archived traffic analysis report - I was actually thinking of using clickstream data to track how temporal shifts in referral origins might contribute to the attention that Wikipedia articles experience. Since I believe the collected clickstream data starts from 2015, these statistics give me a baseline for what these distributions looked like in 2011.
Thanks again for all of your help, time and feedback!
Best,
Emily