We should address automatic duplicate cleaning very soon, as Christian warned a while ago.  He manually cleaned up duplicates a few times but we know it's a problem that needs solving.
Duplicates are already cleaned up in the refined table.  There should never be any duplicates in the wmf.webrequest table.

https://gerrit.wikimedia.org/r/#/c/177522/

Since this was merged on Jan 26, it is possible that it had not yet been deployed on Jan 27, when Oliver noticed duplicates.

We should be calculating a per-host arithmetic series over the sequence numbers
when data is loaded.
Please see the wmf_raw.webrequest_sequence_stats table for hourly partition statistics, including duplicates and losses.
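
For anyone unfamiliar with the idea, here is a rough sketch of what a per-host check over sequence numbers can look like. It is only an illustration, not how wmf_raw.webrequest_sequence_stats is actually computed: the function name, the (hostname, sequence) input shape, and the output field names are all assumptions made for the example.

```python
from collections import defaultdict


def sequence_stats(records):
    """Per-host duplicate/loss counts for one partition.

    `records` is an iterable of (hostname, sequence) pairs. Both the
    input shape and the output keys are hypothetical, chosen for this
    sketch only.
    """
    seqs_by_host = defaultdict(list)
    for hostname, sequence in records:
        seqs_by_host[hostname].append(sequence)

    stats = {}
    for hostname, seqs in seqs_by_host.items():
        distinct = set(seqs)
        # If nothing was lost, a host's sequence numbers form a run
        # min, min+1, ..., max, so we expect max - min + 1 of them.
        expected = max(seqs) - min(seqs) + 1
        stats[hostname] = {
            "expected": expected,
            "actual": len(seqs),
            # Rows beyond the first occurrence of each sequence number.
            "duplicates": len(seqs) - len(distinct),
            # Sequence numbers inside [min, max] that never showed up.
            "missing": expected - len(distinct),
        }
    return stats
```

Note that a check like this cannot see losses at the very start or end of a partition, and it assumes sequence numbers do not reset within the partition (for example after a host restart).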

-Ao




On Feb 23, 2015, at 09:01, Dan Andreescu <dandreescu@wikimedia.org> wrote:

We should address automatic duplicate cleaning very soon, as Christian warned a while ago.  He manually cleaned up duplicates a few times but we know it's a problem that needs solving.

On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi Oliver,

On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> And, an additional point; I don't understand why, if dupes is the
> problem, the Hive query was not hit as badly by this as the equivalent
> UDF.

just shooting in the dark, since you did not provide your query, but
if you by accident had been querying the

  wmf_raw.webrequest

(database name ending in “_raw”) table instead of

  wmf.webrequest

(no “_raw” in the database name), the difference you described would
be plausible (and given the patching of GHOST, they'd even be
expected).


Have fun,
Christian



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christian@quelltextlich.at
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics

