Maybe something exists already in Hadoop

The page properties table (wmf_raw.mediawiki_page_props) is already loaded into Hadoop on a monthly basis. I haven't played with it much, but Hive has built-in JSON-parsing functions, so give it a shot and let me know if you get stuck. More generally, data from the databases can be imported into Hadoop with Sqoop. We already do this for large pipelines like edit history, and it's very easy to add a table. We're also looking at replicating the whole database on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
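
To make the JSON-parsing suggestion concrete, here is a rough sketch in Python that just builds a HiveQL query string using Hive's get_json_object function. The pp_page, pp_propname and pp_value columns come from the standard MediaWiki page_props schema. The snapshot and wiki_db partition columns, the "2024-01" snapshot value, the property name "example_prop" and the JSON path "$.example_key" are all assumptions or placeholders, so check the actual table definition (DESCRIBE wmf_raw.mediawiki_page_props) before relying on any of them.

```python
def hive_quote(value: str) -> str:
    """Wrap a string in single quotes for HiveQL, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_page_props_query(snapshot: str, wiki_db: str, prop_name: str, json_path: str) -> str:
    """Build a HiveQL query that extracts one JSON field from pp_value.

    Assumes the table is partitioned by `snapshot` and `wiki_db`;
    this is an assumption, not something confirmed above.
    """
    return (
        "SELECT pp_page,\n"
        f"       get_json_object(pp_value, {hive_quote(json_path)}) AS extracted_value\n"
        "FROM wmf_raw.mediawiki_page_props\n"
        f"WHERE snapshot = {hive_quote(snapshot)}\n"
        f"  AND wiki_db = {hive_quote(wiki_db)}\n"
        f"  AND pp_propname = {hive_quote(prop_name)}"
    )


# Placeholder arguments: swap in a real snapshot, wiki, property name and JSON path.
query = build_page_props_query("2024-01", "enwiki", "example_prop", "$.example_key")
print(query)
```

The printed query can be pasted into a Hive session or passed to whatever client you normally use. get_json_object returns NULL when the value is not valid JSON or the path does not match, so rows with plain-string property values simply come back with a NULL extracted_value.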