Hey,
This is the 26th and 27th weekly update from the revision scoring team that
we have sent to this mailing list. We forgot to send last week's update!
Last week, we were featured in Research's quarterly review. In the last 3
months, we achieved our goals to expand the ORES extension to 6 wikis (we
made it to 8!) and to release datasets of article quality predictions. The
minutes from the quarterly review are not yet online, but once they are,
you'll be able to see them at [1].
Maintenance and robustness:
- We discussed and decided on a set of strategies for handling
goodfaith/naive DOS attacks on ORES[2]
- We fixed an i18n issue in Wiki Labels[3]
- We updated the article quality models (wikiclass/wp10) to use
revscoring 1.3.0[4]
- We investigated and solved a memory leak in our pre-caching utility[5]
- We configured celery to send its logs to a place where we can read
them for easier debugging[6]
- We deployed a set of schema changes to constrain the ORES Review Tools
database appropriately[7]
- Also worth noting is that the services cluster (SCB) has been
expanded[8]. ORES has now doubled in capacity.
Datasets:
- We discussed how to make the historical article quality dataset
available via quarry[9]. Regrettably, it seems that we'll not be able to do
that for at least a couple of months.
New development:
- We've implemented embedding of machine-readable scores in a JS
variable on-wiki[10]. This will make it easier for tool developers to
experiment with new ways of displaying Special:RecentChanges. It's also a
necessary precondition for adding color-based signaling of ORES'
confidence about an edit (see the sketch below).
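For tool developers who'd rather not scrape an on-wiki variable, the same
machine-readable scores can also be fetched from the ORES service directly.
A minimal sketch in Python (the endpoint shape, model name, and revision ID
here are illustrative assumptions; check the ORES service documentation for
the current URL format):

import requests

# Illustrative sketch: fetch a "damaging" score for one English
# Wikipedia revision from the public ORES service.  The URL version
# prefix and parameter names are assumptions based on the public docs.
response = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={"models": "damaging", "revids": 123456},
)
response.raise_for_status()
print(response.json())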
1.
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_metrics_and_activities…
2. https://phabricator.wikimedia.org/T148347 -- [Discuss] DOS attacks on
ORES. What to do?
3. https://phabricator.wikimedia.org/T139587 -- Revision not found error
unformatted and not localized
4. https://phabricator.wikimedia.org/T147201 -- Update wikiclass for
revscoring 1.3.0
5. https://phabricator.wikimedia.org/T146500 -- Investigate memory leak in
precached
6. https://phabricator.wikimedia.org/T147898 -- Send celery logs to
/srv/log/ores instead of /var/lib/daemon.log
7. https://phabricator.wikimedia.org/T147734 -- Review and deploy 309825
8. https://phabricator.wikimedia.org/T147903 -- Expand SCB cluster
9. https://phabricator.wikimedia.org/T146718 -- [Discuss] Hosting the
monthly article quality dataset on labsDB
10. https://phabricator.wikimedia.org/T143611 -- Embed machine readable
ores scores as data on pages where ORES scores things
Sincerely,
Aaron from the Revision Scoring team
Hello!
The Wikimedia Developer Summit
<https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit> is the annual
meeting to push the evolution of MediaWiki and other technologies
supporting the Wikimedia movement. The next edition will be held in San
Francisco on January 9-11, 2017.
We welcome all Wikimedia technical contributors, third party developers,
and users of MediaWiki and the Wikimedia APIs. We specifically want to
increase the participation of volunteer developers and other contributors
dealing with extensions, apps, tools, bots, gadgets, and templates.
Important deadlines:
- Monday, October 24: This is the last day to request travel
sponsorship. Applying takes less than five minutes.
- Monday, October 31: This is the last day to propose an activity. Bring
the topics you care about!
Subscribe to weekly updates:
https://www.mediawiki.org/wiki/Topic:Td5wfd70vptn8eu4
Please feel free to forward this email to anyone who might be interested in
attending!
Thanks,
Srishti
--
Srishti Sethi
ssethi(a)wikimedia.org
Hey,
It seems there is some sort of back pressure on the ORES service right now,
causing it to send out timeout and overload errors, which made icinga
scream in #wikimedia-ai several times today. If you're running the
requests, please slow down a little.
We increased the capacity for now [1] (thanks to Andrew Bogott). That
brought everything back to normal. Sorry for any inconvenience.
[1] https://gerrit.wikimedia.org/r/#/c/316271/
Best
Hey,
This is the 24th and 25th weekly update from the revision scoring team that
we have sent to this mailing list. We skipped a week due to travel and
other work.
Maintenance and robustness:
- We improved the performance of RecentChanges filtering in the ORES
extension[1]
- We built and ran a maintenance script to clean up duplicate cached
data for the ORES extension[2,3]
- We updated the editquality models for the new version of revscoring
(1.3.0)[4] and made some upstream changes to json2tsv to make that easier[5]
- We quieted down some of our error reporting so that our logs take up
less space[6]
Datasets:
- We generated a dataset that uses the "wp10" prediction model to assess
article quality in monthly intervals for English, French, and Russian
Wikipedia[7]. This should enable new research into the quality dynamics of
these wikis.
- We generated a dataset of vandalism, spam, and attack page creations
for building a new "draft quality" model[8]
Communication:
- We presented on transparent/open AI development practices around ORES
at the Association of Internet Researchers[9]
New development:
- We've made substantial progress towards adding ORES data to
MediaWiki's api.php endpoints with rcshow=oresreview[10] and
rvprop=ores[11] (see the sketch below)
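Once those patches land, clients should be able to ask api.php for ORES
data directly. A rough sketch of what a recent-changes query might look
like (the rcshow=oresreview parameter name comes from the task below; the
exact request and response shape may change before deployment):

import requests

# Sketch: list recent changes that ORES flags for review, using the
# proposed rcshow=oresreview filter.  Assumes the patches in [10] are
# deployed; parameter names may change.
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "recentchanges",
        "rcshow": "oresreview",
        "rclimit": 10,
        "format": "json",
    },
)
resp.raise_for_status()
for rc in resp.json()["query"]["recentchanges"]:
    print(rc["title"])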
1. https://phabricator.wikimedia.org/T146111 -- hidenondamaging=1 query is
extremely slow on enwiki
2. https://phabricator.wikimedia.org/T145356 -- Ensure ORES data violating
constraints do not affect production
3. https://phabricator.wikimedia.org/T145503 -- Build a maintenance script
to clean up duplicate data
4. https://phabricator.wikimedia.org/T146410 -- Update editquality for
revscoring 1.3.0
5. https://phabricator.wikimedia.org/T146939 -- Add type decoding support
to tsv2json
6. https://phabricator.wikimedia.org/T146680 -- Quiet result.get Warning in
tasks
7. https://phabricator.wikimedia.org/T145655 -- Generate monthly article
quality dataset
8. https://phabricator.wikimedia.org/T135644 -- Generate spam and vandalism
new page creation dataset
9. https://phabricator.wikimedia.org/T147706 -- Present about ORES
transparency at AoIR
10. https://phabricator.wikimedia.org/T143616 -- Introduce
rcshow=oresreview and similar ones
11. https://phabricator.wikimedia.org/T143614 -- Introduce ORES rvprop
Sincerely,
Aaron from the Revision Scoring team
Hi all,
I'm trying to do text processing on a subset of Wikipedia articles (about
300k) to calculate tf-idf scores for those articles and look for certain
word occurrences.
I'm using the Wikipedia dump to extract the subset of articles. With a
simple script I can scan the dump and extract articles, but they're in
wikitext syntax.
I'd like to know whether the noise added by wikitext syntax would be
significant. Should I parse the articles down to bare text, or is there a
way to ignore the wikitext syntax while processing? Please note that full
parsing looks like a much harder job for my use case, as I need only a
subset of articles and I'm unable to find a utility which returns only the
text content of a chosen set of articles from the dump.
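For concreteness, the kind of pipeline I have in mind looks roughly like
this. I'm assuming the mwparserfromhell library for stripping markup
(note: strip_code() drops templates rather than expanding them, so some
rendered text will be missing) and scikit-learn for the tf-idf step:

import mwparserfromhell
from sklearn.feature_extraction.text import TfidfVectorizer

def to_plain_text(wikitext):
    # Strip wikitext markup, leaving approximate plain text.
    # Templates are removed, not expanded.
    return mwparserfromhell.parse(wikitext).strip_code()

# articles: an iterable of raw wikitext strings pulled from the dump
articles = ["'''Example''' is a [[term]] with a {{citation needed}} tag."]
plain_texts = [to_plain_text(a) for a in articles]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(plain_texts)
print(tfidf_matrix.shape)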
--
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE,
IIT Patna
I made a suggestion [1] in the ongoing discussion about the Wikimedia
Developer Summit [2] in January that AI should be a major topic. I am not
exactly an expert on it, but my impression is that the Wikimedia movement
is largely failing to notice the beginnings of a huge shift in user
expectations towards smarter tools and interfaces. While there is some
attention to the topic (as the existence of this list proves), I don't
think it is proportional to its importance, and the summit might be a good
chance to raise awareness.
Input from people who, unlike me, actually know what they are talking about
would be very welcome on the wiki page :)
[1] https://www.mediawiki.org/wiki/Topic:Tcfsas6exo2gd3ug
[2] https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit
Also:
1. If someone is paid to do captioning and/or categorization work, such as
by a GLAM institution or a Wikimedia affiliate with a budget that supports
this kind of work, then integrating this research into Wikimedia workflows
could significantly increase that person's cost-effectiveness.
2. If volunteers are uploading large quantities of photos, this may make
captioning and categorization much less time-consuming, and therefore
volunteers may be more likely to do substantial captioning and
categorization work instead of doing the minimum amount necessary.
Pine
On Wed, Sep 28, 2016 at 12:19 AM, Jan Dittrich <jan.dittrich(a)wikimedia.de>
wrote:
> I find it interesting what impact this could have on volunteers' sense of
> achievement if captions are autogenerated or suggested and then possibly
> affirmed or corrected.
> On one hand, one could assume a decreased sense of ownership;
> on the other hand, it might be easier to comment/correct than to
> write from scratch, and feel much more efficient.
>
> Jan
>
>
> 2016-09-27 23:08 GMT+02:00 Dario Taraborelli <dtaraborelli(a)wikimedia.org>:
>
>> I forwarded this internally at WMF a few days ago. Clearly
>> – before thinking of building workflows for human contributors to generate
>> captions or rich descriptors of media files in Commons – we should look at
>> what's available in terms of off-the-shelf machine learning services and
>> libraries.
>>
>> #1 rule of sane citizen science/crowdsourcing projects: don't ask humans
>> to perform tedious tasks machines are pretty good at; get humans to
>> curate the inputs and outputs of machines instead.
>>
>> D
>>
>> On Mon, Sep 26, 2016 at 5:55 PM, Pine W <wiki.pine(a)gmail.com> wrote:
>>
>>> Perhaps of interest: "...We’re making the latest version of our image
>>> captioning system available as an open source model in TensorFlow."
>>> https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
>>>
>>> Pine
>>>
>>
>>
>> --
>>
>> *Dario Taraborelli *Head of Research, Wikimedia Foundation
>> wikimediafoundation.org • nitens.org • @readermeter
>> <http://twitter.com/readermeter>
>>
>
>
> --
> Jan Dittrich
> UX Design/ User Research
>
> Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
> Phone: +49 (0)30 219 158 26-0
> http://wikimedia.de
>
> Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment.
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
> Registered in the register of associations of the Amtsgericht
> Berlin-Charlottenburg under number 23855 B. Recognized as charitable by
> the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
>
I've been looking at some recent work that used Probabilistic Context-free
Grammars[1,2] to detect vandalism in Wikipedia. I wanted to send a quick
message to share some progress.
I've built a Python library[3] that implements a really simple PCFG
training and scoring strategy and written a quick demo of how it can work.
In the following demo, I show how we can build a probabilistic grammar
using the I'm a Little Teapot song[4]. Note how sentences that are not
characteristic of the song score lower (the scores are log-scaled).
>>> # Imports assumed from the kasami library[3]; check the repo for paths
>>> from kasami import TreeScorer
>>> from kasami.tree_generators import bllip_parse
>>>
>>> sentences = [
...     "I am a little teapot",
...     "Here is my handle",
...     "Here is my spout",
...     "When I get all steamed up I just shout tip me over and pour me out",
...     "I am a very special pot",
...     "It is true",
...     "Here is an example of what I can do",
...     "I can turn my handle into a spout",
...     "Tip me over and pour me out"]
>>>
>>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>>
>>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
-9.392661928770137
>>> teapot_grammar.score(bllip_parse("It is my handle"))
-10.296301543090733
>>> teapot_grammar.score(bllip_parse("I am a spout"))
-10.40166205874856
>>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
-12.96352974967269
>>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
-19.424997926026403
This work is inspired by work that Arthur Tilley did on our team last
year[5]. The 'kasami' library represents a narrow slice of Arthur's work.
Next, I'm working on building out revscoring to implement some features
that use this scoring strategy on sentences modified in an edit. I'm hoping
that this type of feature engineering will allow us to catch edits that
make articles more/less notable. I'm also targeting spammy language and
insults.
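For concreteness, here's a rough sketch of the kind of feature I have in
mind. The function is hypothetical, not actual revscoring API; the idea is
just to score the sentences an edit introduces and let unusually low
grammar scores stand out as a signal:

def min_added_sentence_score(grammar, added_sentences, parse):
    # Hypothetical feature sketch -- not the actual revscoring API.
    # `grammar` is a trained TreeScorer, `parse` is a parser function
    # like bllip_parse above, and `added_sentences` are the sentences
    # the edit introduced.  Returns the lowest log-score among them.
    scores = [grammar.score(parse(s)) for s in added_sentences]
    return min(scores) if scores else 0.0

Used with the teapot grammar above, the gibberish sentence from the demo
would surface with its -19.4 score.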
1. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
2. http://pub.cs.sunysb.edu/~rob/papers/acl11_vandal.pdf
3. https://github.com/halfak/kasami
4. https://en.wikipedia.org/wiki/I%27m_a_Little_Teapot
5. https://github.com/aetilley/pcfg
-Aaron