That's fascinating, John; thank you. I'm copying this to wiki-research-l and
Fabian Suchanek, who gave the first part of the Research Showcase last month.
What do you like for coding stories?
https://quanteda.io/reference/dfm.html ?
Sentiment is hard because errors are often 180 degrees away from correct.
How do you both feel about Soru et al. (June 2018), "Neural Machine Translation
for Query Construction and Composition"
https://www.researchgate.net/publication/326030040 ?
On Sat, Jan 11, 2020 at 3:46 PM John Urbanik <johnurbanik(a)gmail.com> wrote:
Jim,
I used to work as the chief data scientist at Collin's company.
I'd suggest looking at things like relationships between the views / edits for sets
of pages as well as aggregating large sets of page views for different pages in various
ways. There isn't a lot of literature that is directly applicable, and I can't
disclose the precise methods being used due to NDA.
In general, much of the pageview data is Weibull or GEV distributed on top of being
non-stationary, so I'd suggest looking into papers from the extreme value theory
literature as well as the literature around Hawkes/Queue-Hawkes processes. Most traditional ML
and signal processing is not very effective without substantial
pre-processing, and even then things are messy, depending on what you're trying
to predict; most variables are heteroskedastic w.r.t. pageviews, and there are a lot of
real-world events that can cause false positives.
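For anyone unfamiliar with Hawkes processes, here is a minimal sketch of simulating a univariate one with an exponential kernel by Ogata-style thinning. It is only an illustration of the process family mentioned above, not the method used in practice; the function name, the parameter values, and the parametrization lambda(t) = mu + alpha * sum(exp(-beta * (t - t_i))) are all assumptions made for the example.

```python
import math
import random


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on [0, horizon) for a self-exciting process.

    Assumed intensity: mu + alpha * sum(exp(-beta * (t - t_i))) over past events.
    alpha / beta is the branching ratio and must be below 1 for stability.
    """
    if mu <= 0 or alpha < 0 or beta <= 0 or alpha >= beta:
        raise ValueError("need mu > 0, 0 <= alpha < beta")
    rng = random.Random(seed)
    events = []
    t = 0.0
    excitation = 0.0  # self-exciting part of the intensity at the current time t
    while True:
        # The intensity only decays until the next event, so its current
        # value is a valid upper bound for the thinning step.
        bound = mu + excitation
        wait = rng.expovariate(bound)
        if t + wait >= horizon:
            break
        t += wait
        excitation *= math.exp(-beta * wait)  # decay over the waiting time
        # Accept the candidate with probability intensity(t) / bound.
        if rng.random() * bound <= mu + excitation:
            events.append(t)
            excitation += alpha  # each accepted event bumps the intensity
    return events


# Hypothetical parameters: baseline 0.5 events per unit time, branching ratio 2/3.
events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=100.0, seed=1)
print(len(events), "events; first few:", [round(e, 2) for e in events[:5]])
```

The bursts this produces (one event raising the chance of more events soon after) are the qualitative behavior that makes this family a candidate for traffic spikes; fitting it to real pageview data is a separate and harder problem.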
Further, concept drift is rapid in this space and structural breaks happen
frequently, so the reliability of a given predictor can change very quickly.
Understanding how much training data to use for a given prediction problem is itself a
very interesting problem, since there may be some horizon after which the predictor loses
power, but decreasing the horizon too much means overfitting and loss of statistical
significance.
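One rough way to explore the training-window question is a walk-forward comparison: score the same simple predictor with several window lengths on identical evaluation points and see which window holds up. The sketch below assumes a synthetic series with a single level shift and a moving-average forecaster; the data, the function names, and the candidate windows are all made up for illustration and say nothing about real pageview behavior.

```python
import random
from statistics import mean


def walk_forward_mae(series, window, start):
    """Mean absolute error of a one-step-ahead moving-average forecast.

    At each t >= start, the forecast is the mean of the previous `window` points.
    """
    errors = []
    for t in range(start, len(series)):
        forecast = mean(series[t - window:t])
        errors.append(abs(series[t] - forecast))
    return mean(errors)


def compare_windows(series, windows):
    """Score every candidate window on the same evaluation points."""
    start = max(windows)  # the first index where all windows have enough history
    return {w: walk_forward_mae(series, w, start) for w in windows}


def make_series_with_break(n_before=100, n_after=100, seed=0):
    """Synthetic data: noisy level 10, then an abrupt shift to noisy level 50."""
    rng = random.Random(seed)
    before = [10 + rng.gauss(0, 2) for _ in range(n_before)]
    after = [50 + rng.gauss(0, 2) for _ in range(n_after)]
    return before + after


series = make_series_with_break()
scores = compare_windows(series, [5, 20, 150])
best = min(scores, key=scores.get)
print(scores, "best window:", best)
```

Here the 150-point window still averages in pre-break data and does much worse than the short windows, while a very short window pays for its responsiveness with noisier forecasts. With real data the break points are unknown, so the comparison would need to be repeated as new data arrives.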
Good luck!
John