We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He has cleaned up duplicates manually a few times, but we know it's a problem that needs solving.

On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi Oliver,

On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> And, an additional point: I don't understand why, if dupes are the
> problem, the Hive query was not hit as badly by this as the equivalent
> UDF.

Just shooting in the dark, since you did not provide your query, but
if you had accidentally been querying the

  wmf_raw.webrequest

(database name ending in “_raw”) table instead of

  wmf.webrequest

(no “_raw” in the database name), the difference you described would
be plausible (and given the patching of GHOST, they'd even be
expected).


Have fun,
Christian



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
                           Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3     Email:  christian@quelltextlich.at
4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
                             Fax:            +43 7946 / 20 5 81
                             Homepage: http://quelltextlich.at/
---------------------------------------------------------------

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics