Thanks Joseph. Am I correct in saying that the counts in pageviews are just the aggregated counts for decoded page titles from pagecounts-all-sites?
Bo
On Tue, Mar 1, 2016 at 1:39 PM, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi! pagecounts are regenerated but shouldn't be impacted by the encoding issue, since page_title is not decoded :) The files I expect to have changed are the new version of pageviews: http://dumps.wikimedia.org/other/pageviews/2016/2016-02/ Joseph
On Tue, Mar 1, 2016 at 9:52 PM, Bo Han bo.ning.han@gmail.com wrote:
Thanks!
Diffing the newly-uploaded files for 20160223-160000 and 20160223-170000 against the previously-uploaded ones shows that their contents are identical. Were the original pagecounts files at http://dumps.wikimedia.org/other/pagecounts-all-sites/2016/2016-02/ not corrupted, then? I assume the backfill refers to other data?
Bo
On Tue, Mar 1, 2016 at 11:26 AM, Andrew Otto otto@wikimedia.org wrote:
https://phabricator.wikimedia.org/T128295
On Tue, Mar 1, 2016 at 2:15 PM, Bo Han bo.ning.han@gmail.com wrote:
Hi,
Would you mind linking the bug fix here? I couldn't find it on phabricator.
Thanks, Bo
On Tue, Mar 1, 2016 at 7:24 AM, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hey Oliver, It depends on what data you've used: if page_title or other encoding-sensitive data (I can't think of any other, but ...) is part of it, then yes, you should!
On Tue, Mar 1, 2016 at 3:27 PM, Oliver Keyes okeyes@wikimedia.org wrote:
Hey Joseph,
Thanks for letting us know. So we should delete and backfill last week's data, for our regularly scheduled scripts?
On 1 March 2016 at 08:26, Joseph Allemandou jallemandou@wikimedia.org wrote:
Hi,
TL;DR: Please don't use hive / spark / hadoop before next week.
Last week the Analytics Team performed an upgrade to the Hadoop Cluster. It went reasonably well, except that many of the Hadoop processes were launched with a special option to NOT use utf-8 as the default encoding. This issue caused trouble particularly in page title extraction and was detected last Sunday (many kudos to the people having filed bugs on the Analytics API about encoding :)
We found the bug and fixed it yesterday, and backfill starts today, with the cluster recomputing every dataset from 2016-02-23 onward. This means you shouldn't query last week's data during this week: first because it is incorrect, and second because you'll curse the cluster for being too slow :)
We are sorry for the inconvenience. Don't hesitate to contact us if you have any questions.
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal
Engineering mailing list Engineering@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/engineering
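The failure mode Joseph describes can be reproduced in miniature: percent-decoding a page title yields the correct bytes, but interpreting those bytes with a non-UTF-8 default charset garbles any non-ASCII title. A minimal Python sketch of that effect (the title "Zürich" and the Latin-1 fallback are illustrative assumptions, not details from the thread):

```python
# Sketch of the encoding bug: the percent-decoded bytes are fine, but
# decoding them with a non-UTF-8 default charset produces mojibake.
from urllib.parse import unquote_to_bytes

encoded_title = "Z%C3%BCrich"          # percent-encoded UTF-8 for "Zürich"
raw = unquote_to_bytes(encoded_title)  # b"Z\xc3\xbcrich" -- byte-level correct

correct = raw.decode("utf-8")    # "Zürich"
garbled = raw.decode("latin-1")  # "ZÃ¼rich" -- what a wrong default yields

print(correct, garbled)
```

The same title round-trips or garbles purely depending on which charset the process uses as its default, which is why only encoding-sensitive fields like page_title were affected.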
-- Oliver Keyes Count Logula Wikimedia Foundation
-- Joseph Allemandou Data Engineer @ Wikimedia Foundation IRC: joal
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics