Dan Andreescu dandreescu@wikimedia.org wrote:
Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck.

In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history (https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading), and it's very easy to add a table (https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505). We're looking at just replicating the whole DB on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
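For example, something along these lines should work for pulling a field out of a JSON-valued property with Hive's built-in get_json_object(). This is only a sketch of the shape of the query: the pp_page/pp_propname/pp_value columns come from MediaWiki's page_props schema, while the snapshot/wiki_db partition names, the snapshot value, and the 'jsonconfig' property are placeholders I haven't verified against the table.

    -- Sketch: extract one field from a JSON-valued page property.
    -- Partition names/values and the property name are assumptions.
    SELECT
      pp_page,
      get_json_object(pp_value, '$.description') AS json_description
    FROM wmf_raw.mediawiki_page_props
    WHERE snapshot = '2023-01'        -- monthly snapshot partition (assumed)
      AND wiki_db = 'enwiki'          -- per-wiki partition (assumed)
      AND pp_propname = 'jsonconfig'  -- hypothetical JSON-valued property
    LIMIT 10;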
Yes, I like that and all of the other wmf_raw goodies! I'll follow up off-thread about accessing the parser cache DBs (they're defined in site.pp and db-eqiad.php, but I don't think refinery.util currently picks them up, since they're not in the .dblist files).