That's fascinating, John; thank you. I'm copying this to wiki-research-l and Fabian Suchanek, who gave the first part of the Research Showcase last month.
What do you like for coding stories? quanteda's dfm (https://quanteda.io/reference/dfm.html)? Sentiment is hard because errors are often 180 degrees away from correct.
How do you both feel about Soru et al. (June 2018), "Neural Machine Translation for Query Construction and Composition" (https://www.researchgate.net/publication/326030040)?
On Sat, Jan 11, 2020 at 3:46 PM John Urbanik <johnurbanik@gmail.com> wrote:
Jim,
I used to work as the chief data scientist at Collin's company.
I'd suggest looking at relationships between the views and edits for sets of pages, as well as aggregating large sets of pageviews in different ways. There isn't a lot of literature that is directly applicable, and I can't disclose the precise methods being used due to NDA.
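As a rough starting point (and not the methods I'm referring to above), something like the sketch below pulls daily pageviews for a small set of pages from the public Wikimedia Pageviews API and looks at how the series co-move; the article list and date range are arbitrary placeholders.

# Sketch: daily pageviews for a few related articles from the public
# Wikimedia Pageviews API, then pairwise correlation and a simple aggregate.
import requests
import pandas as pd

API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
HEADERS = {"User-Agent": "pageview-analysis-sketch/0.1 (research use)"}

def daily_views(article, start="20190101", end="20191231", project="en.wikipedia"):
    url = f"{API}/{project}/all-access/user/{article}/daily/{start}/{end}"
    items = requests.get(url, headers=HEADERS, timeout=30).json()["items"]
    s = pd.Series({pd.to_datetime(i["timestamp"][:8]): i["views"] for i in items})
    return s.rename(article)

articles = ["Influenza", "Vaccine", "Epidemiology"]  # placeholder set of pages
views = pd.concat([daily_views(a) for a in articles], axis=1)

print(views.corr(method="spearman"))   # pairwise co-movement of the set
print(views.sum(axis=1).describe())    # one way to aggregate the set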
In general, much of the pageview data is Weibull- or GEV-distributed on top of being non-stationary, so I'd suggest looking into the extreme value theory literature as well as work on Hawkes and queue-Hawkes processes. Most traditional ML and signal-processing methods are not very effective without some pretty substantial pre-processing, and even then things are messy, depending on what you're trying to predict; most variables are heteroskedastic with respect to pageviews, and a lot of real-world events can cause false positives.
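To make the extreme-value point concrete, here is a minimal illustration with scipy: fit a GEV to weekly block maxima of a single page's daily views and compare against a Weibull fit of the raw series. It reuses the daily_views helper from the sketch above, and the article is again just a placeholder.

# Sketch: extreme-value character of a daily pageview series.
from scipy import stats

series = daily_views("Influenza")                 # helper from the sketch above
weekly_max = series.resample("W").max().dropna()  # weekly block maxima

# Generalized Extreme Value fit to the block maxima.
shape, loc, scale = stats.genextreme.fit(weekly_max)
print("GEV shape/loc/scale:", shape, loc, scale)

# For comparison, a Weibull fit to the raw daily values.
c, _, scale_w = stats.weibull_min.fit(series, floc=0)
print("Weibull shape/scale:", c, scale_w)

# Rough goodness-of-fit check on the GEV.
print(stats.kstest(weekly_max, "genextreme", args=(shape, loc, scale)))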
Further, concept drift is rapid in this space and structural breaks happen quite frequently, so the reliability of a given predictor can change extremely quickly. Understanding how much training data to use for a given prediction problem is itself a super interesting problem, since there may be some horizon after which the predictor loses power, but decreasing the horizon too much means overfitting and loss of statistical significance.
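You can get a feel for that horizon trade-off with something as simple as the sketch below: score a naive one-step-ahead predictor over several training-window lengths and see where the out-of-sample error bottoms out. The window grid and the linear-trend predictor are illustrative choices only, not anything we actually used.

# Sketch: mean one-step-ahead error of a linear-trend predictor as a
# function of training-window length, on log pageviews.
import numpy as np

def horizon_error(y, window):
    """Mean absolute one-step-ahead error using only the last `window` points."""
    errors = []
    for t in range(window, len(y)):
        hist = y[t - window:t]
        slope, intercept = np.polyfit(np.arange(window), hist, 1)  # trend on the window
        pred = intercept + slope * window                          # extrapolate one step
        errors.append(abs(pred - y[t]))
    return float(np.mean(errors))

y = np.log1p(daily_views("Influenza").to_numpy())  # helper from the first sketch
for w in (7, 14, 30, 60, 120):
    print(w, round(horizon_error(y, w), 4))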
Good luck!
John