Hi :)
The following table is easily the largest in eventlogging and growing fastest:
114G UniversalLanguageSelector-tofu_7629564
Is there a plan for purging old data from this one? I realize it's mostly new data; just wondering if growth will be unbounded.
Why does it have an odd name "-tofu"? Is it intended?
There is a duplicate table called UniversalLanguageSelecTor-tofu_7629564 -- note the uppercase T -- with a single row. Is that needed?
The next biggest are:
67G PageContentSaveComplete_5588433.ibd
61G MediaViewer_8572637.ibd
57G MediaViewer_8245578.ibd
33G MobileWebClickTracking_5929948.ibd
BR Sean
--- DBA @ WMF
The odd name is frustrating to me too :/. I'd be interested to see whether we need the MV tables (or the really old data in them); as I understand it, those are aggregated for public consumption fairly regularly.
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
I have the feeling there’s no need to keep 114 GB of raw client-side instrumentation data for tofu detection. Copying Amir, Gilles and Jon, who are the respective owners of the schemas in Sean’s list.
On Jul 2, 2014, at 7:44 PM, Oliver Keyes okeyes@wikimedia.org wrote:
The odd name is frustrating to me too :/. I'd be interested to see whether we need the MV tables (or the really old data in them); as I understand it, those are aggregated for public consumption fairly regularly.
-- Oliver Keyes, Research Analyst, Wikimedia Foundation
Just to be clear:
I'm interested in identifying the expected growth bounds rather than limiting tables arbitrarily.
If someone knows X months or years of data is required for certain tables, feel free to speak up and Ops will ensure necessary storage capacity is planned in time.
On Thu, Jul 3, 2014 at 2:57 PM, Dario Taraborelli <dtaraborelli@wikimedia.org> wrote:
I have the feeling there’s no need to keep 114 GB of raw client-side instrumentation data for tofu detection. Copying Amir, Gilles and Jon, who are the respective owners of the schemas in Sean’s list.
The tofu logging is stopping - I already committed the code to stop it and it goes on the usual deployment train. I plan to run some analysis on it this week, and after that it can be discarded. I'll need a bit of help from Nuria with the analysis.
-- Amir Elisha Aharoni ። אָמִיר אֱלִישָׁע אַהֲרוֹנִי Language Engineering ። הַנְדָּסָה לְשׁוֹנִית Wikimedia Foundation ። קֶרֶן וִיקִימֶדְיָה
Hi Sean,
On Thu, Jul 03, 2014 at 12:21:34PM +1000, Sean Pringle wrote:
The following table is easily the largest in eventlogging and growing fastest:
114G UniversalLanguageSelector-tofu_7629564
Thanks for the heads-up!
We are aware of UniversalLanguageSelector-tofu producing too much data since 2014-06-25 ([1], [2]), and Nuria is on it.
As I could not find a corresponding bug, I created one to track the issue at: https://bugzilla.wikimedia.org/show_bug.cgi?id=67463
Is there a plan for purging old data from this one?
Just to make expectations explicit: Since in a different part of this thread you are asking more for expected growth bounds, I assume that the table can stay at that size until discussion with Language about the way forward produced concrete next steps, and you do not expect us to prune data right away.
There is a duplicate table called UniversalLanguageSelecTor-tofu_7629564 -- note the uppercase T -- with a single row. Is that needed?
I noted that too when looking at the issue last week, but decided against calling it out, since it's just a single small table. I expect we see these artifacts from time to time. Do they get in the way somehow, or is it ok to just keep them around?
Thanks, Christian
[1] http://lists.wikimedia.org/pipermail/analytics/2014-June/002260.html
[2] Search for “tofu” in http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-analytics/20140625.txt
(cc-ing Amir)
[sean] Is there a plan for purging old data from this one?
[christian] Just to make expectations explicit: Since in a different part of this thread you are asking more for expected growth bounds, I assume that the table can stay at that size until discussion with Language about the way forward produced concrete next steps, and you do not expect us to prune data right away.
So we are all on the same page: the table has a lot of data because the i18n team was not aware logging was happening until we notified them of that fact. As Amir mentioned, the bug that prompted the logging has been fixed. As Dario said, we definitely do not need that much data. I confirmed last week that we only need 2 weeks of data to analyze; the data is just a short "survey" of what our users have available when it comes to fonts. So, yes, we could delete a bunch of the data, and I believe Amir was about to request us to do so.
Since I have no permission to create tables, could we create a temporary table that holds the last two weeks of data? We could use that for our analysis and get rid of the other table once the bugfix is in production and logging has stopped.
[sean] I'm interested in identifying the expected growth bounds rather than limiting tables arbitrarily.
This is definitely an item in our court: we need to determine those bounds and throttle when they are exceeded. We do not have any throttling when it comes to record creation. We detect the higher throughput of data, but that's about it.
I have created a backlog item to this effect: https://bugzilla.wikimedia.org/show_bug.cgi?id=67470
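As a rough illustration of the kind of per-schema throttling described above, here is a minimal sketch. It is not EventLogging code: the class name, the fixed-window strategy, and the limits are all assumptions made for the example.

```python
import time
from collections import defaultdict


class SchemaThrottle:
    """Fixed-window event counter per schema (hypothetical, illustrative only)."""

    def __init__(self, max_events_per_window, window_seconds=60.0, clock=time.monotonic):
        self.max_events = max_events_per_window
        self.window = window_seconds
        self.clock = clock
        # schema name -> [window_start, events_seen_in_window]
        self._state = defaultdict(lambda: [None, 0])

    def allow(self, schema):
        """Return True if an event for `schema` may be stored right now."""
        now = self.clock()
        state = self._state[schema]
        if state[0] is None or now - state[0] >= self.window:
            # Start a fresh window for this schema.
            state[0], state[1] = now, 0
        if state[1] >= self.max_events:
            return False  # over the bound: the caller would drop or sample the event
        state[1] += 1
        return True
```

A consumer would call `allow(schema)` before inserting each record and skip the insert when it returns False. A fixed window is the simplest option; a token bucket would smooth bursts at window boundaries.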
Hi Nuria,
On Thu, Jul 03, 2014 at 12:36:56PM +0200, Nuria Ruiz wrote:
Since I have no permission to create tables, could we create a temporary table that holds the last two weeks of data?
Judging by the data you showed in connection with other tasks, I assume you have the credentials for the "research" user on dbstore1002. (If you don't have them, let us find a way to get you those.)
On dbstore1002, the "research" user can both read tables from the log database and also create tables on the staging database.
That should allow you to pull off what you want directly without depending on Sean.
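A minimal sketch of what that could look like, building the statement the "research" user might run. The `log` and `staging` database names come from this thread; the destination table name and the assumption that rows carry a `timestamp` column in MediaWiki's YYYYMMDDHHMMSS format are illustrative and should be checked against the actual schema.

```python
from datetime import datetime, timedelta


def quote_identifier(name):
    """Backtick-quote a MySQL identifier; required here because of the '-' in the table name."""
    return "`" + name.replace("`", "``") + "`"


def build_snapshot_sql(source_table, dest_table, now, days=14):
    """Return a CREATE TABLE ... AS SELECT statement copying the last `days` days."""
    # Assumption: rows carry a `timestamp` column holding a
    # YYYYMMDDHHMMSS string, so string comparison orders by time.
    cutoff = (now - timedelta(days=days)).strftime("%Y%m%d%H%M%S")
    return (
        "CREATE TABLE staging.{dest} AS "
        "SELECT * FROM log.{src} "
        "WHERE timestamp >= '{cutoff}';"
    ).format(
        dest=quote_identifier(dest_table),
        src=quote_identifier(source_table),
        cutoff=cutoff,
    )


sql = build_snapshot_sql(
    "UniversalLanguageSelector-tofu_7629564",
    "tofu_last_two_weeks",  # hypothetical destination table name
    now=datetime(2014, 7, 3, 12, 0, 0),
)
print(sql)
```

The backtick quoting matters: without it, MySQL would parse the hyphen in the table name as a minus sign.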
Have fun, Christian
On Thu, Jul 3, 2014 at 8:36 PM, Nuria Ruiz nuria@wikimedia.org wrote:
I confirmed last week that we only need 2 weeks of data to analyze; the data is just a short "survey" of what our users have available when it comes to fonts. So, yes, we could delete a bunch of the data, and I believe Amir was about to request us to do so.
I'd say don't worry about it. Since collection is already ceasing, use the table as it is, then drop it in one go.
This is definitely an item in our court: we need to determine those bounds and throttle when they are exceeded. We do not have any throttling when it comes to record creation. We detect the higher throughput of data, but that's about it.
I have created a backlog item to this effect: https://bugzilla.wikimedia.org/show_bug.cgi?id=67470
Thanks!
On Thu, Jul 3, 2014 at 7:02 PM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Is there a plan for purging old data from this one?
Just to make expectations explicit: Since in a different part of this thread you are asking more for expected growth bounds, I assume that the table can stay at that size until discussion with Language about the way forward produced concrete next steps, and you do not expect us to prune data right away.
Certainly the table can remain. We're not in danger of hitting any storage limit in the short term.
There is a duplicate table called UniversalLanguageSelecTor-tofu_7629564 -- note the uppercase T -- with a single row. Is that needed?
I noted that too when looking at the issue last week, but decided against calling it out, since it's just a single small table. I expect we see these artifacts from time to time. Do they get in the way somehow, or is it ok to just keep them around?
Fine to keep them around. This one just looked odd when combined with the first table's size, as though something generally odd was happening :-)
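For anyone who wants to spot such case-variant duplicates without eyeballing a listing, here is a minimal sketch. The function name is hypothetical, and the table list is just the names mentioned in this thread; in practice the names would come from the database's own table listing.

```python
from collections import defaultdict


def case_variant_groups(table_names):
    """Group table names that differ only by letter case."""
    groups = defaultdict(list)
    for name in table_names:
        groups[name.lower()].append(name)
    # Keep only groups with more than one distinct spelling.
    return [sorted(set(names)) for names in groups.values() if len(set(names)) > 1]


tables = [
    "UniversalLanguageSelector-tofu_7629564",
    "UniversalLanguageSelecTor-tofu_7629564",
    "MediaViewer_8572637",
]
print(case_variant_groups(tables))
```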