Dan Andreescu <dandreescu@wikimedia.org> wrote:
> Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck. In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history, and it's very easy to add a table. We're looking at just replicating the whole db on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
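For reference, a minimal PySpark sketch of the kind of query Dan is describing. The column names (pp_page, pp_propname, pp_value) are assumed from the upstream MediaWiki page_props schema, and wiki_db/snapshot from the usual wmf_raw sqoop layout; the property name and JSON field are placeholders, so check DESCRIBE wmf_raw.mediawiki_page_props before trusting any of it:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("page-props-poke")
        .enableHiveSupport()   # needed to see the wmf_raw Hive databases
        .getOrCreate()
    )

    # Column names are assumed from the upstream MediaWiki page_props schema;
    # wiki_db and snapshot are the usual wmf_raw partition columns. Verify with
    # spark.sql("DESCRIBE wmf_raw.mediawiki_page_props") first.
    props = spark.sql("""
        SELECT
            pp_page,
            pp_propname,
            -- get_json_object is the built-in Hive/Spark JSON helper:
            -- it pulls one field out of a JSON string by path.
            get_json_object(pp_value, '$.some_field') AS some_field
        FROM wmf_raw.mediawiki_page_props
        WHERE wiki_db = 'enwiki'
          AND snapshot = '2021-06'            -- pick the latest monthly snapshot
          AND pp_propname = 'some_property'   -- hypothetical property name
        LIMIT 10
    """)

    props.show(truncate=False)

The same SELECT works as plain HiveQL in beeline, since get_json_object is available in both Hive and Spark SQL.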
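And on the incremental-update groundwork: as I understand the Iceberg mention, the relevant feature is row-level MERGE INTO, which would let a job apply only the rows that changed instead of re-sqooping full snapshots each month. A rough sketch under made-up names (the catalog, target table, and staging view are hypothetical, not an actual WMF pipeline):

    from pyspark.sql import SparkSession

    # 'my_catalog' and the staging view are hypothetical; this also assumes the
    # Iceberg Spark runtime jar is on the classpath.
    spark = (
        SparkSession.builder
        .appName("iceberg-merge-sketch")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.type", "hive")
        .getOrCreate()
    )

    # MERGE INTO is Iceberg's row-level upsert: each run only has to carry the
    # rows that changed since the last run, instead of a full monthly reload.
    spark.sql("""
        MERGE INTO my_catalog.db.page_props AS target
        USING updates_since_last_run AS source   -- hypothetical staging view of changed rows
        ON  target.wiki_db = source.wiki_db
        AND target.pp_page = source.pp_page
        AND target.pp_propname = source.pp_propname
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)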
Yes, I like that and all of the other wmf_raw goodies! I'll follow up off-thread about accessing the parser cache DBs (they're defined in site.pp and db-eqiad.php, but I don't think refinery.util currently picks them up, since they're not in any .dblist files).