The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck.
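Something like this is roughly where I'd start (untested sketch — I'm assuming the table mirrors MediaWiki's page_props columns (pp_page, pp_propname, pp_value) and has the usual snapshot/wiki_db partitions; run DESCRIBE on it first and swap in whatever property name and JSON path you actually care about):

    SELECT
      pp_page,
      -- Hive built-in for pulling a field out of a JSON string;
      -- '$.some_field' is just a placeholder path
      get_json_object(pp_value, '$.some_field') AS some_field
    FROM wmf_raw.mediawiki_page_props
    WHERE snapshot = '2024-01'       -- monthly snapshot partition (assumed name/format)
      AND wiki_db = 'enwiki'         -- per-wiki partition (assumed)
      AND pp_propname = 'some_prop'  -- the property whose value holds JSON
    LIMIT 10;

json_tuple() is the other built-in worth knowing about if you need several fields out of the same JSON blob at once.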
In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history, and it's very easy to add a table. We're looking at just replicating the whole db on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).