Maybe something already exists in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck (there's a rough sketch of a query below).

In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like the edit history load (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading), and adding a table is very easy (https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505). We're also looking at replicating the whole database on a more frequent schedule, but we have to do some groundwork first to support incremental updates (see Apache Iceberg if you're interested).
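To get you started on the JSON side, here's a rough, untested sketch of the kind of query I mean, run from a PySpark session on the cluster. I'm assuming the Hadoop copy keeps the MediaWiki column names (pp_page, pp_propname, pp_value) plus the usual snapshot/wiki_db partitions; the property name, snapshot, and JSON path are placeholders you'd swap out for whatever you're actually after.

```python
# Rough sketch, untested: parse a JSON-valued page prop with Hive's get_json_object,
# run through Spark SQL (which reads the same Hive tables on the Analytics cluster).
# Column names mirror MediaWiki's page_props schema; the property name and the
# JSON path '$.some_field' are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

props = spark.sql("""
    SELECT
        pp_page,
        get_json_object(pp_value, '$.some_field') AS some_field
    FROM wmf_raw.mediawiki_page_props
    WHERE snapshot    = '2021-01'          -- monthly snapshot partition
      AND wiki_db     = 'enwiki'
      AND pp_propname = 'your_property'    -- whichever prop stores the JSON blob
""")

props.show(10, truncate=False)
```

If you'd rather stay in plain Hive, the inner SELECT should work as-is in beeline, and json_tuple is handy when you need several fields out of the same blob.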