Hey,
These are the 26th and 27th weekly updates that the Revision Scoring team
has sent to this mailing list. We forgot to send last week's update!
Last week, we were featured in Research's quarterly review. In the last 3
months, we achieved our goals to expand the ORES extension to 6 wikis (we
made it to 8!) and to release datasets of article quality predictions. The
minutes from the quarterly review are not yet online, but once they are,
you'll be able to see them at [1].
Maintenance and robustness:
- We discussed and decided on a set of strategies for handling
good-faith/naive DOS attacks on ORES[2]
- We fixed an i18n issue in Wiki Labels[3]
- We updated the article quality models (wikiclass/wp10) to use
revscoring 1.3.0[4]
- We investigated and solved a memory leak in our pre-caching utility[5]
- We configured celery to send its logs to a place where we can read
them for easier debugging[6]
- We deployed a set of schema changes to constrain the ORES Review Tools
database appropriately[7]
- Also worth noting is that the services cluster (SCB) has been
expanded[8]. ORES has now doubled in capacity
Datasets
- We discussed how to make the historical article quality dataset
available via Quarry[9]. Regretfully, it seems that we won't be able to do
that for at least a couple of months.
New development
- We've implemented embedding of machine-readable scores in a JS
variable on-wiki[10]. This will make it easier for tool developers to
experiment with new ways of displaying Special:RecentChanges. It's also a
necessary precondition for adding color-based signaling of ORES'
confidence about an edit.
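As a sketch of what machine-readable scores look like to a consumer, here is how one might pull a "damaging" probability out of an ORES-style response. The exact JSON shape below is an assumption modeled on the public ORES scoring API's output, not necessarily the format of the on-wiki JS variable:

```python
# Illustrative only: this response shape is assumed from the public ORES
# scoring service, not from the embedded on-wiki variable itself.
sample_response = {
    "enwiki": {
        "scores": {
            "12345": {
                "damaging": {
                    "score": {
                        "prediction": False,
                        "probability": {"false": 0.92, "true": 0.08},
                    }
                }
            }
        }
    }
}


def damaging_probability(response, wiki, rev_id):
    """Extract the probability that a revision is damaging."""
    score = response[wiki]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"]


print(damaging_probability(sample_response, "enwiki", 12345))  # 0.08
```

A gadget could read the same kind of structure from the embedded variable and color rows in Special:RecentChanges by that probability.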
1.
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_metrics_and_activities…
2. https://phabricator.wikimedia.org/T148347 -- [Discuss] DOS attacks on
ORES. What to do?
3. https://phabricator.wikimedia.org/T139587 -- Revision not found error
unformatted and not localized
4. https://phabricator.wikimedia.org/T147201 -- Update wikiclass for
revscoring 1.3.0
5. https://phabricator.wikimedia.org/T146500 -- Investigate memory leak in
precached
6. https://phabricator.wikimedia.org/T147898 -- Send celery logs to
/srv/log/ores instead of /var/lib/daemon.log
7. https://phabricator.wikimedia.org/T147734 -- Review and deploy 309825
8. https://phabricator.wikimedia.org/T147903 -- Expand SCB cluster
9. https://phabricator.wikimedia.org/T146718 -- [Discuss] Hosting the
monthly article quality dataset on labsDB
10. https://phabricator.wikimedia.org/T143611 -- Embed machine readable
ores scores as data on pages where ORES scores things
Sincerely,
Aaron from the Revision Scoring team
Hello!
The Wikimedia Developer Summit
<https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit> is the annual
meeting to push the evolution of MediaWiki and other technologies
supporting the Wikimedia movement. The next edition will be held in San
Francisco on January 9-11, 2017.
We welcome all Wikimedia technical contributors, third party developers,
and users of MediaWiki and the Wikimedia APIs. We specifically want to
increase the participation of volunteer developers and other contributors
dealing with extensions, apps, tools, bots, gadgets, and templates.
Important deadlines:
- Monday, October 24: This is the last day to request travel
sponsorship. Applying takes less than five minutes.
- Monday, October 31: This is the last day to propose an activity. Bring
the topics you care about!
Subscribe to weekly updates:
https://www.mediawiki.org/wiki/Topic:Td5wfd70vptn8eu4
Please feel free to forward this email to anyone who might be interested in
attending!
Thanks,
Srishti
--
Srishti Sethi
ssethi(a)wikimedia.org
Hey,
It seems there is some sort of back pressure on the ORES service right now,
causing it to send out timeout and overload errors, which made icinga scream
at #wikimedia-ai several times today. If you're running heavy requests,
please slow down a little.
We increased the capacity for now [1] (thanks to Andrew Bogott). That
brought everything back to normal. Sorry for any inconvenience.
[1] https://gerrit.wikimedia.org/r/#/c/316271/
Best
Hey,
These are the 24th and 25th weekly updates that the Revision Scoring team
has sent to this mailing list. We skipped a week due to travel and other
work.
Maintenance and robustness:
- We improved the performance of RecentChanges filtering in the ORES
extension[1]
- We built and ran a maintenance script to clean up duplicate cached
data for the ORES extension[2,3]
- We updated the editquality models for the new version of revscoring
(1.3.0)[4] and made some upstream changes to tsv2json to make that easier[5]
- We quieted down some of our error reporting so that our logs take up
less space[6]
Datasets:
- We generated a dataset that uses the "wp10" prediction model to assess
article quality in monthly intervals for English, French, and Russian
Wikipedia[7]. This should enable new research into the quality dynamics of
these wikis.
- We generated a dataset of vandalism, spam, and attack page creations
for building a new "draft quality" model[8]
Communication:
- We presented on transparent/open AI development practices around ORES
at the Association of Internet Researchers conference[9]
New development:
- We've made substantial progress towards adding ORES data to
MediaWiki's api.php endpoints with rcshow=oresreview[10] and rvprop=ores[11]
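Once that work lands, a query against api.php might look something like the sketch below. Note that rcshow=oresreview is still in development (T143616), so the final parameter names and values are an assumption:

```python
from urllib.parse import urlencode

# Hypothetical query: rcshow=oresreview is not yet deployed, so this only
# illustrates the intended shape of the request.
params = {
    "action": "query",
    "list": "recentchanges",
    "rcshow": "oresreview",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```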
1. https://phabricator.wikimedia.org/T146111 -- hidenondamaging=1 query is
extremely slow on enwiki
2. https://phabricator.wikimedia.org/T145356 -- Ensure ORES data violating
constraints do not affect production
3. https://phabricator.wikimedia.org/T145503 -- Build a maintenance script
to clean up duplicate data
4. https://phabricator.wikimedia.org/T146410 -- Update editquality for
revscoring 1.3.0
5. https://phabricator.wikimedia.org/T146939 -- Add type decoding support
to tsv2json
6. https://phabricator.wikimedia.org/T146680 -- Quiet result.get Warning in
tasks
7. https://phabricator.wikimedia.org/T145655 -- Generate monthly article
quality dataset
8. https://phabricator.wikimedia.org/T135644 -- Generate spam and vandalism
new page creation dataset
9. https://phabricator.wikimedia.org/T147706 -- Present about ORES
transparency at AoIR
10. https://phabricator.wikimedia.org/T143616 -- Introduce
rcshow=oresreview and similar ones
11. https://phabricator.wikimedia.org/T143614 -- Introduce ORES rvprop
Sincerely,
Aaron from the Revision Scoring team
Hi all,
I'm trying to do text processing on a subset of Wikipedia articles (about
300k) to calculate tf-idf scores in those articles and look for certain
word occurrences.
I'm using the Wikipedia dump to extract the subset of articles. Through a
simple script I can scrape the dump and extract articles but they're in
Wikitext syntax.
I'd like to know whether the noise added by wikitext syntax would be
significant. Should I parse the articles down to bare text, or is there a
way to ignore the wikitext syntax while processing?
Please note that parsing looks like a much harder job for my use case, as I
need only a subset of articles and I'm unable to find a utility that
returns only the text content of a chosen set of articles from the dump.
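One lightweight option, before reaching for a full parser, is to strip the most common wikitext markup with regular expressions; dedicated libraries such as mwparserfromhell can do a more thorough job. The helper below is a rough illustrative sketch, not a complete wikitext parser (it won't handle nested templates or tables):

```python
import re


def strip_wikitext(text):
    """Roughly strip common wikitext markup, leaving bare text."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # non-nested {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]] -> label
    text = re.sub(r"'{2,}", "", text)  # ''italic'' / '''bold''' quote runs
    text = re.sub(r"<[^>]+>", "", text)  # HTML-style tags like <ref>
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)  # == headings ==
    return text


sample = "'''Python''' is a [[programming language|language]] {{citation needed}}."
print(strip_wikitext(sample))
```

For tf-idf over 300k articles, even this crude stripping should remove most of the markup noise; the leftover tokens from rare constructs will have low document frequency and contribute little to the scores.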
--
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE,
IIT Patna