Hi all,
Thanks to those of you who responded to the data release survey we sent out in October. The WMF Security team has developed a prioritization plan for releasing data in the coming year: https://meta.wikimedia.org/wiki/Differential_privacy/Proposed/DP_dataset_rel.... We invite you to leave questions or comments on the talk page.
Warm regards,
Emily Lescak, WMF Research team
Hal Triedman, WMF Security team
On Wed, Oct 26, 2022 at 2:50 PM Emily Lescak elescak@wikimedia.org wrote:
Hi all,
As part of our efforts to better serve the Wikimedia research community, we are happy to share that we are collaborating with the Security team at WMF to help prioritize the release of data that can be useful for your research. The Security team is working to privatize and publish more datasets so that researchers no longer need non-disclosure agreements to access them. You can learn more here: https://meta.wikimedia.org/wiki/Differential_privacy.
Over the next 12 months, the Security team plans to release 5 datasets:
country-language-pageview ongoing (end of 2022)
country-language-pageview historical (March 2023)
geo-aggregated grants data back to 2009 (Feb 2023)
geoeditors monthly (June 2023)
dataset informed by research community priorities identified in this survey (second half of 2023)
The released datasets need to meet certain privacy requirements:
They cannot include any natural language (e.g. specific search queries or deletion logs), so as to avoid the release of personally identifiable information;
They need to be sufficiently large (at least thousands of entries, preferably more) so as to reduce the relative impact of noise;
The data cannot be so sensitive that an individual user would be harmed by its disclosure (e.g. IP addresses, content containing personally identifying information).
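To illustrate why size matters for the second requirement, here is a minimal sketch assuming a Laplace mechanism, which is one common way of adding differentially private noise. This message does not state which mechanism or parameters the Security team actually uses, so the function names, the sensitivity of 1, and the epsilon value of 1.0 below are all hypothetical. The point is only that noise of a fixed scale matters less as the true count grows.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent Exp(rate=1/scale) draws
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    # Hypothetical helper: add Laplace noise with scale sensitivity / epsilon.
    # This is an assumption for illustration, not the team's pipeline.
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def expected_relative_error(true_count, epsilon, sensitivity=1.0):
    # The expected absolute value of Laplace(0, b) noise is b,
    # so the expected relative error is b / true_count.
    return (sensitivity / epsilon) / true_count


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so that repeated runs match
    for count in (10, 1_000, 100_000):
        released = noisy_count(count, epsilon=1.0, rng=rng)
        rel_err = expected_relative_error(count, epsilon=1.0)
        print(f"true={count} released={released:.1f} expected_rel_err={rel_err:.5f}")
```

Under these assumptions the expected relative error falls in proportion to the count, which is consistent with preferring datasets of at least thousands of entries.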
We invite you to complete a brief survey (https://docs.google.com/forms/d/e/1FAIpQLSe_LAt6V2Q1GUf3Z8lnt8uAOZnHTO5rNgFfufx_gDKk1znrlw/viewform?usp=sf_link) to help us identify and prioritize the types of datasets that you would find useful for your work. Results of this survey will inform the fifth dataset, scheduled to be released in the second half of 2023. This survey is conducted via a third-party service, which may subject it to additional terms. For more information on privacy and data handling, see the survey privacy statement: https://foundation.wikimedia.org/wiki/Legal:Data_Release_Priorities_Survey_P...
The survey will remain open until November 3, 2022. After that date, members of the Research and Security teams will review the responses and report on the suggestions received and how the work will proceed. If you prefer not to respond via the Google form, you can email your feedback to us or set up a time to discuss. You can also leave questions and comments on the Talk page: https://meta.wikimedia.org/wiki/Differential_privacy
Thanks for your help!
Emily Lescak, WMF Research team
Hal Triedman, WMF Security team
--
Emily Lescak (she / her)
Senior Research Community Officer
The Wikimedia Foundation