Dan Andreescu <dandreescu@wikimedia.org> wrote:
> Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck. In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history, and it's very easy to add a table. We're looking at just replicating the whole db on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
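For reference, a minimal PySpark sketch of the kind of query Dan is describing. The column names (pp_page, pp_propname, pp_value) are assumed from the upstream MediaWiki page_props schema, and wiki_db/snapshot from the usual wmf_raw sqoop layout; the property name and JSON field are placeholders, so check DESCRIBE wmf_raw.mediawiki_page_props before trusting any of it:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("page-props-poke")
        .enableHiveSupport()   # needed to see the wmf_raw Hive databases
        .getOrCreate()
    )

    # Column names are assumed from the upstream MediaWiki page_props schema;
    # wiki_db and snapshot are the usual wmf_raw partition columns. Verify with
    # spark.sql("DESCRIBE wmf_raw.mediawiki_page_props") first.
    props = spark.sql("""
        SELECT
            pp_page,
            pp_propname,
            -- get_json_object is the built-in Hive/Spark JSON helper:
            -- it pulls one field out of a JSON string by path.
            get_json_object(pp_value, '$.some_field') AS some_field
        FROM wmf_raw.mediawiki_page_props
        WHERE wiki_db = 'enwiki'
          AND snapshot = '2021-06'            -- pick the latest monthly snapshot
          AND pp_propname = 'some_property'   -- hypothetical property name
        LIMIT 10
    """)

    props.show(truncate=False)

The same SELECT works as plain HiveQL in beeline, since get_json_object is available in both Hive and Spark SQL.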
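And on the incremental-update groundwork: as I understand the Iceberg mention, the relevant feature is row-level MERGE INTO, which would let a job apply only the rows that changed instead of re-sqooping full snapshots each month. A rough sketch under made-up names (the catalog, target table, and staging view are hypothetical, not an actual WMF pipeline):

    from pyspark.sql import SparkSession

    # 'my_catalog' and the staging view are hypothetical; this also assumes the
    # Iceberg Spark runtime jar is on the classpath.
    spark = (
        SparkSession.builder
        .appName("iceberg-merge-sketch")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.type", "hive")
        .getOrCreate()
    )

    # MERGE INTO is Iceberg's row-level upsert: each run only has to carry the
    # rows that changed since the last run, instead of a full monthly reload.
    spark.sql("""
        MERGE INTO my_catalog.db.page_props AS target
        USING updates_since_last_run AS source   -- hypothetical staging view of changed rows
        ON  target.wiki_db = source.wiki_db
        AND target.pp_page = source.pp_page
        AND target.pp_propname = source.pp_propname
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)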
Yes, I like that and all of the other wmf_raw goodies! I'll follow up off-thread about accessing the parser cache DBs (they're defined in site.pp and db-eqiad.php, but I don't think refinery.util currently picks them up, since they're not in any .dblist files).