The only reason I didn't break this down by namespace was because they queries would have taken an order of magnitude longer to join the revision and page tables.  The query I used didn't even need to read the revision or archive tables.  It only read an index on those tables.  That made it go pretty fast.  :)  I'd be interested in taking another pass if you guys don't mind dealing with a heavier server load.  

On Thu, Jun 12, 2014 at 7:05 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
Aaron – this is fantastic.
Two quick questions:

- was the decision not to break down the data by namespace (matching Erik Zachte’s master editor data dump) intentional?
- are we expecting to refresh the archived revision count field every month?

Dario


On Jun 12, 2014, at 2:33 PM, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:

+1

For example, the last time I sent a similar email to the list, it was for the wiki_info table.  One of the tasks I have is to break the code for generating that table out of the analysis project it lives in and make it a separate repo so that Oliver can send pull requests to fix issues and/or maintain his own managed table.

It would be great to work towards an architecture that allows us to keep these tables up-to-date without user-based cron jobs.  

-Aaron

On Thu, Jun 12, 2014 at 4:24 PM, Dan Andreescu <dandreescu@wikimedia.org> wrote:
This is great.  I'd like to go on record saying that this is leaning towards a data warehouse kind of approach - basically pre-aggregating useful datasets.  So we might want to do this in a more organized way down the line.


On Thu, Jun 12, 2014 at 2:57 PM, Oliver Keyes <okeyes@wikimedia.org> wrote:
This is fricking awesome!


On 12 June 2014 10:58, Aaron Halfaker <ahalfaker@wikimedia.org> wrote:
I created a new table on analytics-store.eqiad.wmnet.  It contains the monthly edit counts for all wikis.  See a brief overview below.    

Note that the "revisions" column contains a count of all revisions -- archived or not.  The "archived" column contains a count of archived revisions.   So revisions - archived == non-archived revisions. 

analytics-store.eqiad.wmnet [staging]> explain editor_month;
+-------------------+----------------+------+-----+---------+-------+
| Field             | Type           | Null | Key | Default | Extra |
+-------------------+----------------+------+-----+---------+-------+
| wiki              | varbinary(50)  | NO   | PRI |         |       |
| month             | varbinary(7)   | NO   | PRI |         |       |
| user_id           | int(11)        | NO   | PRI | 0       |       |
| user_name         | varbinary(191) | YES  |     | NULL    |       |
| user_registration | varbinary(14)  | YES  |     | NULL    |       |
| archived          | int(11)        | YES  |     | NULL    |       |
| revisions         | int(11)        | YES  |     | NULL    |       |
+-------------------+----------------+------+-----+---------+-------+
7 rows in set (0.01 sec)

analytics-store.eqiad.wmnet [staging]> select * from editor_month limit 3;
+--------+---------+---------+------------+-------------------+----------+-----------+
| wiki   | month   | user_id | user_name  | user_registration | archived | revisions |
+--------+---------+---------+------------+-------------------+----------+-----------+
| enwiki | 2001-01 |      34 | WojPob     | 20010129110725    |        0 |        13 |
| enwiki | 2001-01 |      99 | RoseParks  | 20010121021221    |        0 |         7 |
| enwiki | 2001-01 |     479 | JimboWales | 20010123223416    |        0 |        13 |
+--------+---------+---------+------------+-------------------+----------+-----------+
3 rows in set (0.03 sec)

Feedback is welcome.   One of the next things, I'd like to do is remove the "-" from the month column as it ruins comparison with MW timestamps. 

-Aaron

_______________________________________________
wmfresearch mailing list
wmfresearch@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wmfresearch




--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics



_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics