We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times but we know it's a problem that needs solving.
We should be calculating a per-host arithmetic series over the sequence numbers
when data is loaded.
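
A rough sketch of what such a load-time check could look like, in Python. This is an illustration only, not the actual pipeline code: the function name `sequence_stats`, the `(hostname, sequence_number)` record shape, and the host names in the example are all assumptions. It assumes each host emits consecutive integer sequence numbers, so the numbers seen in a load window should form the arithmetic series min, min+1, ..., max.

```python
from collections import defaultdict


def sequence_stats(records):
    """Summarise sequence-number health per host.

    records: iterable of (hostname, sequence_number) pairs (assumed shape).
    """
    seqs_by_host = defaultdict(list)
    for host, seq in records:
        seqs_by_host[host].append(seq)

    stats = {}
    for host, seqs in seqs_by_host.items():
        lo, hi = min(seqs), max(seqs)
        # Number of terms in the series lo, lo+1, ..., hi.
        expected_count = hi - lo + 1
        # Closed-form sum of that arithmetic series (always an integer).
        expected_sum = expected_count * (lo + hi) // 2
        distinct = len(set(seqs))
        stats[host] = {
            "expected_count": expected_count,
            "actual_count": len(seqs),
            "duplicates": len(seqs) - distinct,
            "missing": expected_count - distinct,
            "sum_matches": sum(seqs) == expected_sum,
        }
    return stats


if __name__ == "__main__":
    # Hypothetical data: one host with a duplicate and a gap, one clean host.
    sample = [
        ("cp1001", 10), ("cp1001", 11), ("cp1001", 11), ("cp1001", 13),
        ("cp1002", 5), ("cp1002", 6), ("cp1002", 7),
    ]
    for host, s in sorted(sequence_stats(sample).items()):
        print(host, s)
```

The sum comparison is cheap, but several duplicates and gaps can cancel each other out, so the duplicate and missing counts are the more reliable signal.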

On Feb 23, 2015, at 09:01, Dan Andreescu <dandreescu@wikimedia.org> wrote:
> We should address automatic duplicate cleaning very soon, as Christian warned a while ago. He manually cleaned up duplicates a few times but we know it's a problem that needs solving.

On Mon, Feb 23, 2015 at 6:22 AM, Christian Aistleitner <christian@quelltextlich.at> wrote:
Hi Oliver,

On Sun, Feb 22, 2015 at 06:46:37PM -0500, Oliver Keyes wrote:
> And, an additional point; I don't understand why, if dupes is the
> problem, the Hive query was not hit as badly by this as the equivalent
> UDF.
just shooting in the dark, since you did not provide your query, but
if you had accidentally been querying the
wmf_raw.webrequest
(database name ending in “_raw”) table instead of
wmf.webrequest
(no “_raw” in the database name), the difference you described would
be plausible (and given the patching of GHOST, it would even be
expected).
Have fun,
Christian
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: christian@quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics