I want to check what effect MediaViewer had on file namespace edits. Aggregating the standard MediaWiki dumps over all wikis seems like a pain; is there a more convenient source for that data? Even better if it can be filtered by the editcount of the user at the time of the edit.
I looked at the Edit* EventLogging schemas, but those are either fairly recent or not used. Is there any other source where this information could be retrieved from?
thanks Gergő
places to get edits? Well....the revision table? I'm sort of confused as to what you're looking for, I guess, that the db wouldn't have.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
On Wed, Jan 7, 2015 at 6:26 PM, Oliver Keyes okeyes@wikimedia.org wrote:
places to get edits? Well....the revision table? I'm sort of confused as to what you're looking for, I guess, that the db wouldn't have.
There are a thousand or so wikis; it would be nice if there was a single table with all the edits. I guess I can generate a query with a thousand unions...
The harder problem is that it would be nice to group by editor activity levels. One of the concerns about MediaViewer was that it makes it harder for new editors to understand file pages and start editing them; so it would be a plausible hypothesis that the number of file edits by new editors would drop sharply after making MV default, but the total file edit count wouldn't be visibly affected because it would be dominated by power users who already know how to curate image metadata.
So I would like to look at something like the number of first edits per month, or the number of edits by editors who at the time had fewer than 10 edits... recovering that kind of data from the revision table seems extremely difficult.
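A minimal sketch of the measure described above, assuming the edits are already available as (user_id, timestamp, namespace) tuples with MediaWiki-style YYYYMMDDHHMMSS timestamps. The function name, the tuple layout, and the choice to count prior edits across all namespaces are illustrative assumptions, not an existing tool. Namespace 6 is MediaWiki's File namespace.

```python
from collections import Counter, defaultdict

FILE_NAMESPACE = 6  # MediaWiki's File: namespace


def new_editor_file_edits(edits, threshold=10, file_ns=FILE_NAMESPACE):
    """Count File-namespace edits per month made by users who, at the time
    of the edit, had fewer than `threshold` earlier edits (any namespace).

    `edits` is an iterable of (user_id, timestamp, namespace) tuples
    (assumed layout), with timestamps as YYYYMMDDHHMMSS strings.
    """
    prior_edits = defaultdict(int)  # running edit count per user
    per_month = Counter()
    # Process edits chronologically so the running count is "at the time".
    for user_id, timestamp, namespace in sorted(edits, key=lambda e: e[1]):
        if namespace == file_ns and prior_edits[user_id] < threshold:
            per_month[timestamp[:6]] += 1  # YYYYMM bucket
        prior_edits[user_id] += 1
    return dict(per_month)
```

With `threshold=1` the same function counts only edits that were a user's very first edit, which corresponds to the "first edits per month" variant.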
On 8 January 2015 at 02:31, Gergo Tisza gtisza@wikimedia.org wrote:
There are a thousand or so wikis; it would be nice if there was a single table with all the edits. I guess I can generate a query with a thousand unions...
We've talked about having a unified db; the problem is some tables exist on some wikis and not on others (and keeping it synced up would be a pain). The most we have is a machine with all of the dbs on it, and a table containing all the dbnames; it should be fairly trivial to write something in Python or similar to iterate through them.
Yeah, that is difficult. Aaron has, I believe, precomputed some things; Aaron?
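The per-wiki iteration suggested above could look roughly like this. It is a sketch only: `connect` is a hypothetical callable that returns a connection for a given dbname, and the toy `make_demo_wiki` helper uses SQLite with made-up rows purely so the example runs without access to the real databases. The query assumes the standard `revision` and `page` tables; with a MySQL driver the placeholder syntax and cursor handling would differ.

```python
import sqlite3

FILE_NAMESPACE = 6  # MediaWiki's File: namespace

# Monthly edit counts for one wiki, restricted to a single namespace.
MONTHLY_EDITS_SQL = """
    SELECT substr(rev_timestamp, 1, 6) AS month, COUNT(*) AS edits
    FROM revision
    JOIN page ON rev_page = page_id
    WHERE page_namespace = ?
    GROUP BY month
    ORDER BY month
"""


def file_edits_by_wiki(dbnames, connect):
    """Run the same query on every wiki; return {dbname: {month: edits}}."""
    results = {}
    for dbname in dbnames:
        conn = connect(dbname)  # hypothetical: one connection per wiki db
        try:
            rows = conn.execute(MONTHLY_EDITS_SQL, (FILE_NAMESPACE,)).fetchall()
        finally:
            conn.close()
        results[dbname] = dict(rows)
    return results


def make_demo_wiki(dbname):
    """Toy stand-in for a wiki database (in-memory SQLite, made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page (page_id INTEGER, page_namespace INTEGER)")
    conn.execute(
        "CREATE TABLE revision "
        "(rev_id INTEGER, rev_page INTEGER, rev_timestamp TEXT)"
    )
    conn.executemany("INSERT INTO page VALUES (?, ?)", [(1, 6), (2, 0)])
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?)",
        [
            (1, 1, "20140601120000"),
            (2, 1, "20140615120000"),
            (3, 2, "20140620120000"),
            (4, 1, "20140702120000"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(file_edits_by_wiki(["enwiki", "huwiki"], make_demo_wiki))
```

In practice the list of dbnames would come from the table of dbnames mentioned above rather than being hard-coded.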
On Thu, Jan 8, 2015 at 2:33 AM, Oliver Keyes okeyes@wikimedia.org wrote:
On 8 January 2015 at 02:31, Gergo Tisza gtisza@wikimedia.org wrote:
There are a thousand or so wikis; it would be nice if there was a single table with all the edits. I guess I can generate a query with a thousand unions...
We agree. And that's why we're building a data warehouse. We are currently going back and forth with Sean vetting a load process that creates exactly the "edit" table as you describe it. The nice thing about the schema we are putting together is that not only would you be able to see the namespace of the page at the time of query but also throughout the page's history (as it moves from draft to main, etc.)
IANAA (I am not an Aaron) but I'm happy to help with the query. I know of most of the stuff Aaron pre-computed as of a couple of months ago and this specific thing wasn't done. Gergo, if you could precisely spell out a few queries you'd like to do, I can translate to SQL and use the experience to inform our data warehouse work.
It turns out that I did do some pre-computing here. See db1047.eqiad.wmnet:staging.editor_month_by_namespace
[staging]> explain editor_month_by_namespace;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| wiki           | varchar(50)  | NO   | PRI |         |       |
| month          | varbinary(7) | NO   | PRI |         |       |
| user_id        | varchar(255) | NO   | PRI |         |       |
| page_namespace | int(11)      | NO   | PRI | 0       |       |
| archived       | int(11)      | YES  |     | NULL    |       |
| revisions      | int(11)      | YES  |     | NULL    |       |
| mmonth         | date         | YES  |     | NULL    |       |
| reverted       | int(11)      | YES  |     | NULL    |       |
+----------------+--------------+------+-----+---------+-------+
8 rows in set (0.01 sec)
As you'll notice, the table has a column for Wiki -- which means you can use it to do cross-wiki analysis.
mmonth and reverted were added by Leila, so she'll need to comment on that.
Otherwise:
- wiki - wikidb name (e.g. "enwiki")
- month - YYYYMM
- user_id - corresponds to user table
- page_namespace - namespace ID number
- archived - # of revisions to deleted pages
- revisions - # of all revisions (archived or not)
-Aaron
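As an illustration of the cross-wiki use described above, here is a hedged sketch that sums File-namespace revisions and counts distinct editors per wiki and month. It runs against any connection exposing a table with the columns listed; the function name is hypothetical, SQLite-style `?` placeholders are used so the example can be tried locally, and a real MySQL driver would need `%s` placeholders and may return the varbinary `month` column as bytes.

```python
import sqlite3

FILE_NAMESPACE = 6  # MediaWiki's File: namespace


def monthly_file_activity(conn: sqlite3.Connection, namespace=FILE_NAMESPACE):
    """Return {(wiki, month): (revisions, editors)} for one namespace.

    Assumes a table named editor_month_by_namespace with the columns
    wiki, month, user_id, page_namespace and revisions.
    """
    rows = conn.execute(
        """
        SELECT wiki, month, SUM(revisions), COUNT(DISTINCT user_id)
        FROM editor_month_by_namespace
        WHERE page_namespace = ?
        GROUP BY wiki, month
        ORDER BY wiki, month
        """,
        (namespace,),
    ).fetchall()
    return {(wiki, month): (revs, editors) for wiki, month, revs, editors in rows}
```

Summing the per-month values over all wikis (dropping `wiki` from the GROUP BY) would give a single cross-wiki series.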
On Thu, Jan 8, 2015 at 9:00 AM, Dan Andreescu dandreescu@wikimedia.org wrote:
IANAA (I am not an Aaron) but I'm happy to help with the query. I know of most of the stuff Aaron pre-computed as of a couple of months ago and this specific thing wasn't done. Gergo, if you could precisely spell out a few queries you'd like to do, I can translate to SQL and use the experience to inform our data warehouse work.
Gergo, this table has edits per namespace aggregated by month. In your original email, you asked for filtering by the user's edit count at the time of the edit. If that's what you need, this table can't help (but the way Aaron generated it can).
mmonth: last day of the month (month is in YYYYMM form)
reverted: total number of reverted revisions done by the user (or reverts by the user, I'm not 100% sure right now, but given your questions, you can safely ignore this column).
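Although the monthly table cannot give a user's edit count at the time of each edit, a coarser approximation of "first edits per month" might still be derived from it: the month in which each (wiki, user_id) pair first has File-namespace revisions. The sketch below is an assumption-laden illustration, not how the table was meant to be used; the function name is hypothetical, `user_id` values are treated as opaque strings, and SQLite-style placeholders are used so the example can be run locally.

```python
import sqlite3

FILE_NAMESPACE = 6  # MediaWiki's File: namespace


def new_file_editors_per_month(conn: sqlite3.Connection, namespace=FILE_NAMESPACE):
    """Return {month: number of (wiki, user) pairs whose first month with
    revisions in `namespace` is that month}.

    YYYYMM strings sort chronologically, so MIN(month) is the first month.
    """
    rows = conn.execute(
        """
        SELECT first_month, COUNT(*)
        FROM (
            SELECT wiki, user_id, MIN(month) AS first_month
            FROM editor_month_by_namespace
            WHERE page_namespace = ?
            GROUP BY wiki, user_id
        ) AS firsts
        GROUP BY first_month
        ORDER BY first_month
        """,
        (namespace,),
    ).fetchall()
    return dict(rows)
```

Note that this measures a user's first month of File-namespace activity on a wiki, which is not the same as their first edit overall.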
On Thu, Jan 8, 2015 at 10:30 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
It turns out that I did do some pre-computing here. See db1047.eqiad.wmnet:staging.editor_month_by_namespace
Gergo Tisza, 08/01/2015 02:52:
Even better if it can be filtered by the editcount of the user at the time of the edit.
Then you probably want something like https://stats.wikimedia.org/EN/TablesWikipediaHU.htm#editor_activity_levels but with File namespace disaggregated from "Other".
Nemo
On Wed, Jan 7, 2015 at 11:15 PM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Then you probably want something like https://stats.wikimedia.org/EN/TablesWikipediaHU.htm#editor_activity_levels but with File namespace disaggregated from "Other".
I was looking for the number of edits; that's the number of editors. Although that would be interesting too, if available.