It turns out that I did do some pre-computing here.
See
db1047.eqiad.wmnet:staging.editor_month_by_namespace
[staging]> explain editor_month_by_namespace;
+----------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| wiki | varchar(50) | NO | PRI | | |
| month | varbinary(7) | NO | PRI | | |
| user_id | varchar(255) | NO | PRI | | |
| page_namespace | int(11) | NO | PRI | 0 | |
| archived | int(11) | YES | | NULL | |
| revisions | int(11) | YES | | NULL | |
| mmonth | date | YES | | NULL | |
| reverted | int(11) | YES | | NULL | |
+----------------+--------------+------+-----+---------+-------+
8 rows in set (0.01 sec)
As you'll notice, the table has a column for Wiki -- which means you can
use it to do cross-wiki analysis.
mmonth and reverted were added by Leila, so she'll need to comment on
that.
Otherwise:
- wiki - wikidb name (e.g. "enwiki")
- month - YYYYMM
- user_id - corresponds to user table
- page_namespace - namespace ID number
- archived - # of revisions to deleted pages
- revisions - # of all revisions (archived or not)
-Aaron
On Thu, Jan 8, 2015 at 9:00 AM, Dan Andreescu <dandreescu(a)wikimedia.org>
wrote:
On Thu, Jan 8, 2015 at 2:33 AM, Oliver Keyes <okeyes(a)wikimedia.org>
wrote:
On 8 January 2015 at 02:31, Gergo Tisza
<gtisza(a)wikimedia.org> wrote:
On Wed, Jan 7, 2015 at 6:26 PM, Oliver Keyes
<okeyes(a)wikimedia.org>
wrote:
>
> places to get edits? Well....the revision table? I'm sort of confused
> as to what you're looking for, I guess, that the db wouldn't have.
There are a thousand or so wikis; it would be nice if there was a
single
table with all the edits. I guess I can generate
a query with a
thousand
> unions...
We agree. And that's why we're building a data warehouse. We are
currently going back and forth with Sean vetting a load process that
creates exactly the "edit" table as you describe it. The nice thing about
the schema we are putting together is that not only would you be able to
see the namespace of the page at the time of query but also throughout the
page's history (as it moves from draft to main, etc.)
The harder problem is that it would be nice to
group by editor activity
levels. One of the concerns about MediaViewer was that it makes harder
for
new editors to understand file pages and start
editing them; so it
would be
a plausible hypothesis that the number of file
edits by new editors
would
drop sharply after making MV default, but the
total file edit count
wouldn't
be visibly affected because it would be dominated
by power users who
already
know how to curate image metadata.
So I would like to look at something like the number of first edits per
month, or the number of edits by editors who at the time had less than
10
edits... recovering that kind of data from the
revision table seems
extremely difficult.
Yeah, that is difficult. Aaron has, I believe, precomputed some things;
Aaron?
IANAA (I am not an Aaron) but I'm happy to help with the query. I know
of most of the stuff Aaron pre-computed as of a couple of months ago and
this specific thing wasn't done. Gergo, if you could precisely spell out a
few queries you'd like to do, I can translate to SQL and use the experience
to inform our data warehouse work.
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org