Aha, so wmf_raw.webrequest is expected to have duplicates? Okay! That could do it :). I'll re-run across wmf.webrequest; thanks Christian for the spot, and Andrew for having thought 3 stages ahead as usual :D
On 23 February 2015 at 09:35, Andrew Otto aotto@wikimedia.org wrote:
We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times but we know it's a problem that needs solving.
Duplicates are already cleaned up, in the refined table. There should never be any duplicates in the wmf.webrequest table.
https://gerrit.wikimedia.org/r/#/c/177522/
Seeing as this was merged on Jan 26, it is possible that it had not yet been deployed on Jan 27, when Oliver noticed duplicates.
We should be calculating a per-host arithmetic series over the sequence numbers when data is loaded.
Please see the wmf_raw.webrequest_sequence_stats tables, for hourly partition statistics, including duplicates and losses.
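For anyone who wants to see the arithmetic behind those statistics, here is a minimal sketch in Python. It assumes each record carries a hostname and a per-host sequence number that increases by one per request and does not reset within the partition. The function name and the output field names are made up for illustration and are not the actual columns of the sequence stats table.

```python
from collections import defaultdict


def sequence_stats(records):
    """Compute per-host duplicate and loss counts for one partition.

    `records` is an iterable of (hostname, sequence) pairs.
    All names here are hypothetical and chosen for illustration.
    """
    by_host = defaultdict(list)
    for hostname, sequence in records:
        by_host[hostname].append(sequence)

    stats = {}
    for hostname, seqs in by_host.items():
        # With consecutive sequence numbers, the number of requests we
        # should have seen is the length of the series min..max.
        expected = max(seqs) - min(seqs) + 1
        actual = len(seqs)
        distinct = len(set(seqs))
        stats[hostname] = {
            "count_actual": actual,
            "count_expected": expected,
            # Rows beyond the first copy of each sequence number.
            "count_duplicate": actual - distinct,
            # Sequence numbers in min..max that never showed up.
            "count_lost": expected - distinct,
        }
    return stats


if __name__ == "__main__":
    sample = [("cp1", 1), ("cp1", 2), ("cp1", 2), ("cp1", 5),
              ("cp2", 10), ("cp2", 11)]
    for host, host_stats in sorted(sequence_stats(sample).items()):
        print(host, host_stats)
```

Note that a host restart that resets the sequence counter inside a partition would break the assumption above and would need separate handling.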
-Ao
On Feb 23, 2015, at 09:01, Dan Andreescu dandreescu@wikimedia.org wrote:
We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times but we know it's a problem that needs solving.
On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner christian@quelltextlich.at wrote:
Hi Oliver,
On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
And, an additional point: I don't understand why, if dupes are the problem, the Hive query was not hit as badly by this as the equivalent UDF.
just shooting in the dark, since you did not provide your query, but if you by accident had been querying the wmf_raw.webrequest table (database name ending in “_raw”) instead of wmf.webrequest (no “_raw” in the database name), the difference you described would be plausible (and given the patching of GHOST, they'd even be expected).
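As a toy illustration of why the two tables can give different counts, here is a small Python sketch. It assumes that a request is identified by its (hostname, sequence) pair, which is an assumption made for this example and not a statement about how the refined table is actually built. The row layout and function name are hypothetical.

```python
def count_requests(rows, deduplicate):
    """Count requests in `rows`, optionally collapsing duplicates.

    Each row is a dict with hypothetical "hostname" and "sequence" keys.
    """
    if not deduplicate:
        # What a plain COUNT over a table containing duplicates would see.
        return len(rows)
    # Keep one row per (hostname, sequence) pair, as a deduplicated
    # table would under the assumption stated above.
    return len({(row["hostname"], row["sequence"]) for row in rows})


if __name__ == "__main__":
    raw_rows = [
        {"hostname": "cp1", "sequence": 1},
        {"hostname": "cp1", "sequence": 2},
        {"hostname": "cp1", "sequence": 2},  # duplicate delivery
        {"hostname": "cp2", "sequence": 7},
    ]
    print(count_requests(raw_rows, deduplicate=False))
    print(count_requests(raw_rows, deduplicate=True))
```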
Have fun, Christian
--
---- quelltextlich e.U. ---- \ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3
4293 Gutau, Austria
Email: christian@quelltextlich.at
Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics