Duplicates are already cleaned up in the refined table. There should never be any duplicates in the wmf.webrequest table.

> We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times, but we know it's a problem that needs solving.

Seeing as this was merged on Jan 26, it is possible that it had not yet been deployed on Jan 27, when Oliver noticed duplicates.

We should be calculating a per-host arithmetic series over the sequence numbers when data is loaded.

Please see the wmf_raw.webrequest_sequence_stats tables for hourly partition statistics, including duplicates and losses.

-Ao

On Feb 23, 2015, at 09:01, Dan Andreescu <dandreescu@wikimedia.org> wrote:
> We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times, but we know it's a problem that needs solving.

On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi Oliver,
On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> And, an additional point; I don't understand why, if dupes is the
> problem, the Hive query was not hit as badly by this as the equivalent
> UDF.
just shooting in the dark, since you did not provide your query, but
if you by accident had been querying the
wmf_raw.webrequest
(database name ending in “_raw”) table instead of
wmf.webrequest
(no “_raw” in the database name), the difference you described would
be plausible (and given the patching of GHOST, they'd even be
expected).
Have fun,
Christian
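
For readers following the thread, here is a minimal sketch of one reading of the per-host sequence check mentioned above. It assumes each host numbers its requests consecutively, so a gap-free, duplicate-free batch for one host has exactly max - min + 1 rows. The function name, the field names, and the host names (cp1001, cp1002) are hypothetical and are not taken from the actual wmf_raw.webrequest_sequence_stats schema or from the loading code.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class SequenceStats(NamedTuple):
    # Hypothetical field names, chosen for illustration only.
    count_actual: int     # rows seen for this host
    count_expected: int   # max - min + 1, the length of a gap-free run
    count_duplicate: int  # rows whose sequence number was already seen
    count_missing: int    # sequence numbers in [min, max] that never appeared


def sequence_stats(rows: Iterable[tuple[str, int]]) -> dict[str, SequenceStats]:
    """Summarise (hostname, sequence) pairs per host."""
    per_host: dict[str, list[int]] = defaultdict(list)
    for hostname, sequence in rows:
        per_host[hostname].append(sequence)

    stats: dict[str, SequenceStats] = {}
    for hostname, seqs in per_host.items():
        distinct = set(seqs)
        expected = max(seqs) - min(seqs) + 1
        stats[hostname] = SequenceStats(
            count_actual=len(seqs),
            count_expected=expected,
            count_duplicate=len(seqs) - len(distinct),
            count_missing=expected - len(distinct),
        )
    return stats


def deduplicate(rows: Iterable[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep the first occurrence of each (hostname, sequence) pair."""
    seen: set[tuple[str, int]] = set()
    unique: list[tuple[str, int]] = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


# Made-up sample: cp1001 repeats sequence 11 and never delivers 12.
sample = [
    ("cp1001", 10), ("cp1001", 11), ("cp1001", 11), ("cp1001", 13),
    ("cp1002", 5), ("cp1002", 6),
]
for host, host_stats in sorted(sequence_stats(sample).items()):
    print(host, host_stats)
```

In this made-up sample, cp1001 shows one duplicate and one missing sequence number, while cp1002 is clean. Running deduplicate on the sample before computing the statistics brings the duplicate count to zero, which is the kind of difference one would expect between a raw table and a refined one.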
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics