Hi Analytics People,
The Wikimedia Analytics Team is pleased to announce the release of the most complete dataset we have to date to analyze content and contributors metadata: Mediawiki History [1] [2].
Data is in TSV format and is usually released around the 3rd of each month; every new release contains the full history of metadata.
The dataset contains an enhanced [3] and historified [4] version of user, page and revision metadata, and serves as the basis for the Wikistats API on edits, users and pages [5] [6].
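As a rough illustration of working with TSV files like these, here is a minimal Python sketch. The three-column layout, the column names, the absence of a header row and the sample rows are all assumptions made up for this example; the actual column list and order are described in the readme [1].

```python
import csv
import io
from collections import Counter

# Hypothetical column layout, for illustration only. The real column
# list and order are documented in the dataset readme.
COLUMNS = ["wiki_db", "event_entity", "event_type"]


def count_events(lines, columns=COLUMNS):
    """Count rows per (event_entity, event_type) in a tab-separated stream.

    Assumes the stream has no header row and that fields are not quoted.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        record = dict(zip(columns, row))
        counts[(record["event_entity"], record["event_type"])] += 1
    return counts


# Made-up sample rows standing in for a downloaded dump file.
SAMPLE = (
    "examplewiki\trevision\tcreate\n"
    "examplewiki\tpage\tcreate\n"
    "examplewiki\trevision\tcreate\n"
)

if __name__ == "__main__":
    for key, n in sorted(count_events(io.StringIO(SAMPLE)).items()):
        print(key, n)
```

The same function accepts any iterable of text lines, so an opened (and, if needed, decompressed) dump file can be passed in place of the in-memory sample.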
We hope you will have as much fun playing with the data as we have building it, and we're eager to hear from you [7], whether for issues, ideas or usage of the data.
Analytically yours,
Hi Joseph,
Thanks for this announcement.
I am looking for license information regarding the dumps, and I'm not finding it in the pages that you linked at [1] or [2]. The license that applies to text on Wikimedia sites is often CC-BY-SA 3.0, and the WMF Terms of Use at https://foundation.wikimedia.org/wiki/Terms_of_Use do not appear to provide any exception for metadata. In the absence of a specific license, I think that the CC-BY-SA or other relevant licenses would apply to the metadata, and that the licensing information should be prominently included on relevant pages and in the dumps themselves.
What do you think?
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
On Mon, Feb 10, 2020 at 4:28 PM Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi Analytics People,
The Wikimedia Analytics Team is pleased to announce the release of the most complete dataset we have to date to analyze content and contributors metadata: Mediawiki History [1] [2].
Data is in TSV format and is usually released around the 3rd of each month; every new release contains the full history of metadata.
The dataset contains an enhanced [3] and historified [4] version of user, page and revision metadata, and serves as the basis for the Wikistats API on edits, users and pages [5] [6].
We hope you will have as much fun playing with the data as we have building it, and we're eager to hear from you [7], whether for issues, ideas or usage of the data.
Analytically yours,
-- Joseph Allemandou (joal) (he / him) Sr Data Engineer Wikimedia Foundation
[1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_hist...
[3] Many pre-computed fields are present in the dataset, from edit-counts by user and page to reverts and reverted information, as well as time between events.
[4] Historical usernames and page titles (as well as user groups and blocks), as accurate as possible, are available in addition to current values, and are provided in a denormalized way on every event of the dataset.
[5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
[6] https://wikimedia.org/api/rest_v1/
[7] https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20His...
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
I was thinking about the licensing issue some more. Apparently there was a relevant United States court case regarding metadata several years ago, but it's unclear to me from my brief web search whether that holding would apply to metadata from every nation. Also, I don't know if the underlying statutes have changed since that ruling. I think that WMF Legal should be consulted regarding the copyright status of the metadata. Also, I think that the licensing of metadata should be explicitly addressed in the Terms of Use or a similar document which is easily accessible to all contributors to Wikimedia sites.
Pine ( https://meta.wikimedia.org/wiki/User:Pine )
Hi Joseph and team,
summary: congratulations and some suggestions/requests.
I second and third Nate and Neil. Congratulations on meeting this milestone. This effort can empower the research community to spend less time on joining datasets and trying to resolve existing, known (to some) and complex issues with mediawiki history data and instead spend time doing the research. Nice! :)
I'm eager to see what the dataset(s) will be used for by others. On my end, I am looking forward to seeing more research on how Wiki(m|p)edia projects have evolved over almost two decades, now that this data is more readily available for study. What we learn from the Wikimedia projects and their evolution can also help in understanding the broader web ecosystem and its evolution (as the Web is only 30 years old now).
I have some requests if I may:
* Pine brings up a good point about licenses. It would be great to make that clear in the documentation page(s). There are many examples of this (which you know better than I do); just in case, I find the License section of https://iccl.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en informative.
* The other request I have is that you make the template for citing this data-set clear to the end-user in your documentation pages (including readme). You can do this in a few different ways:
** In the documentation pages, put a suggested citation link. For example (for bibtex):
    @misc{wmfanalytics2020mediawikihistory,
      title = {MediaWiki History},
      author = {nameoftheauthors},
      howpublished = "\url{https://dumps.wikimedia.org/other/mediawiki_history/}",
      note = {Accessed on date x},
      year = {2020}
    }
** Upload a paper about the work on arxiv.org. This way, your work gets a DOI that you can use in your documentation pages for folks to use for citation. Note that this step can be relatively light-weight. (no peer-review in this case and it's relatively quick.)
** Submit the paper to a conference. Some conferences have a data-set paper track where you publish about the dataset you release. Research is happy to support you with guidance if you need it and if you choose to go down this path. This takes some more time and in return it will give you a "peer-review" stamp and more experience in publishing if you like that.
Unless you like publishing your work in a peer-reviewed venue, I suggest one of the first two approaches.
* I'm not sure if you intend to make the dataset more discoverable through places such as https://datasetsearch.research.google.com/ . You may want to consider that.
Thanks, Leila
-- Leila Zia Head of Research Wikimedia Foundation
Regarding Licensing, there is already a ticket: https://phabricator.wikimedia.org/T244685
If you take a look at the bottom of Wikistats (https://stats.wikimedia.org/v2) you will see that the dedication is CC0. The data in both systems is the same but, of course, it can be made more explicit.
Thanks,
Nuria
Hello,
We have added a footer to dumps pages with the CC-0 note. Please see: https://dumps.wikimedia.org/other/analytics/
For other changes that you think are needed please do file a phab ticket.
Thanks,
Nuria