Hey,
These are the 26th and 27th weekly updates that the Revision Scoring team
has sent to this mailing list. We forgot to send last week's update!
Last week, we were featured in Research's quarterly review. In the last 3
months, we achieved our goals to expand the ORES extension to 6 wikis (we
made it to 8!) and to release datasets of article quality predictions. The
minutes from the quarterly review are not yet online, but once they are,
you'll be able to see them at [1].
Maintenance and robustness:
- We discussed and decided on a set of strategies for handling
good-faith/naive DOS attacks on ORES[2]
- We fixed an i18n issue in Wiki Labels[3]
- We updated the article quality models (wikiclass/wp10) to use
revscoring 1.3.0[4]
- We investigated and solved a memory leak in our pre-caching utility[5]
- We configured celery to send its logs to a place where we can read
them for easier debugging[6]
- We deployed a set of schema changes to constrain the ORES Review Tools
database appropriately[7]
- Also worth noting is that the services cluster (SCB) has been
expanded[8]. ORES has now doubled in capacity
Datasets
- We discussed how to make the historical article quality dataset
available via Quarry[9]. Regretfully, it seems that we won't be able to do
that for at least a couple of months.
New development
- We've implemented embedding of machine-readable scores in a JS
variable on-wiki[10]. This will make it easier for tool developers to
experiment with new ways of displaying Special:RecentChanges. It's also a
necessary precondition for adding color-based signaling of ORES'
confidence about an edit.
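As a sketch of what machine-readable scores look like to a consumer, here is how one might pull a "damaging" probability out of an ORES-style response. The exact JSON shape below is an assumption modeled on the public ORES scoring API's output, not necessarily the format of the on-wiki JS variable:

```python
# Illustrative only: this response shape is assumed from the public ORES
# scoring service, not from the embedded on-wiki variable itself.
sample_response = {
    "enwiki": {
        "scores": {
            "12345": {
                "damaging": {
                    "score": {
                        "prediction": False,
                        "probability": {"false": 0.92, "true": 0.08},
                    }
                }
            }
        }
    }
}


def damaging_probability(response, wiki, rev_id):
    """Extract the probability that a revision is damaging."""
    score = response[wiki]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"]


print(damaging_probability(sample_response, "enwiki", 12345))  # 0.08
```

A gadget could read the same kind of structure from the embedded variable and color rows in Special:RecentChanges by that probability.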
1.
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_metrics_and_activities…
2. https://phabricator.wikimedia.org/T148347 -- [Discuss] DOS attacks on
ORES. What to do?
3. https://phabricator.wikimedia.org/T139587 -- Revision not found error
unformatted and not localized
4. https://phabricator.wikimedia.org/T147201 -- Update wikiclass for
revscoring 1.3.0
5. https://phabricator.wikimedia.org/T146500 -- Investigate memory leak in
precached
6. https://phabricator.wikimedia.org/T147898 -- Send celery logs to
/srv/log/ores instead of /var/lib/daemon.log
7. https://phabricator.wikimedia.org/T147734 -- Review and deploy 309825
8. https://phabricator.wikimedia.org/T147903 -- Expand SCB cluster
9. https://phabricator.wikimedia.org/T146718 -- [Discuss] Hosting the
monthly article quality dataset on labsDB
10. https://phabricator.wikimedia.org/T143611 -- Embed machine readable
ores scores as data on pages where ORES scores things
Sincerely,
Aaron from the Revision Scoring team
Hello!
The Wikimedia Developer Summit
<https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit> is the annual
meeting to push the evolution of MediaWiki and other technologies
supporting the Wikimedia movement. The next edition will be held in San
Francisco on January 9-11, 2017.
We welcome all Wikimedia technical contributors, third party developers,
and users of MediaWiki and the Wikimedia APIs. We specifically want to
increase the participation of volunteer developers and other contributors
dealing with extensions, apps, tools, bots, gadgets, and templates.
Important deadlines:
- Monday, October 24: This is the last day to request travel
sponsorship. Applying takes less than five minutes.
- Monday, October 31: This is the last day to propose an activity. Bring
the topics you care about!
Subscribe to weekly updates:
https://www.mediawiki.org/wiki/Topic:Td5wfd70vptn8eu4
Please feel free to forward this email to anyone who might be interested in
attending!
Thanks,
Srishti
--
Srishti Sethi
ssethi(a)wikimedia.org
Hey,
It seems there is some sort of back pressure on the ORES service right now,
causing it to send out timeout and overload errors, which made icinga scream
at #wikimedia-ai several times today. If you're running heavy requests,
please slow down a little.
We increased the capacity for now [1] (thanks to Andrew Bogott). That
brought everything back to normal. Sorry for any inconvenience.
[1] https://gerrit.wikimedia.org/r/#/c/316271/
Best
Hey,
These are the 24th and 25th weekly updates that the Revision Scoring team
has sent to this mailing list. We skipped a week due to travel and other
work.
Maintenance and robustness:
- We improved the performance of RecentChanges filtering in the ORES
extension[1]
- We built and ran a maintenance script to clean up duplicate cached
data for the ORES extension[2,3]
- We updated the editquality models for the new version of revscoring
(1.3.0)[4] and made some upstream changes to tsv2json to make that easier[5]
- We quieted down some of our error reporting so that our logs take up
less space[6]
Datasets:
- We generated a dataset that uses the "wp10" prediction model to assess
article quality in monthly intervals for English, French, and Russian
Wikipedia[7]. This should enable new research into the quality dynamics of
these wikis.
- We generated a dataset of vandalism, spam, and attack page creations
for building a new "draft quality" model[8]
Communication:
- We presented on transparent/open AI development practices around ORES
at the Association of Internet Researchers conference[9]
New development:
- We've made substantial progress towards adding ORES data to
MediaWiki's api.php endpoints with rcshow=oresreview[10] and rvprop=ores[11]
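Once that work lands, a query against api.php might look something like the sketch below. Note that rcshow=oresreview is still in development (T143616), so the final parameter names and values are an assumption:

```python
from urllib.parse import urlencode

# Hypothetical query: rcshow=oresreview is not yet deployed, so this only
# illustrates the intended shape of the request.
params = {
    "action": "query",
    "list": "recentchanges",
    "rcshow": "oresreview",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)
```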
1. https://phabricator.wikimedia.org/T146111 -- hidenondamaging=1 query is
extremely slow on enwiki
2. https://phabricator.wikimedia.org/T145356 -- Ensure ORES data violating
constraints do not affect production
3. https://phabricator.wikimedia.org/T145503 -- Build a maintenance script
to clean up duplicate data
4. https://phabricator.wikimedia.org/T146410 -- Update editquality for
revscoring 1.3.0
5. https://phabricator.wikimedia.org/T146939 -- Add type decoding support
to tsv2json
6. https://phabricator.wikimedia.org/T146680 -- Quiet result.get Warning in
tasks
7. https://phabricator.wikimedia.org/T145655 -- Generate monthly article
quality dataset
8. https://phabricator.wikimedia.org/T135644 -- Generate spam and vandalism
new page creation dataset
9. https://phabricator.wikimedia.org/T147706 -- Present about ORES
transparency at AoIR
10. https://phabricator.wikimedia.org/T143616 -- Introduce
rcshow=oresreview and similar ones
11. https://phabricator.wikimedia.org/T143614 -- Introduce ORES rvprop
Sincerely,
Aaron from the Revision Scoring team
Hi all,
I'm trying to do text processing on a subset of Wikipedia articles (about
300k) to calculate tf-idf scores in those articles and look for certain
word occurrences.
I'm using the Wikipedia dump to extract the subset of articles. Through a
simple script I can scrape the dump and extract articles but they're in
Wikitext syntax.
I'd like to know whether the noise added by wikitext syntax would be
significant. Should I parse the articles down to bare text, or is there a
way to ignore the wikitext syntax while processing?
Please note that parsing looks like a much harder job for my use case, as I
need only a subset of articles and I'm unable to find a utility that
returns only the text content of a chosen set of articles from the dump.
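One lightweight option, before reaching for a full parser, is to strip the most common wikitext markup with regular expressions; dedicated libraries such as mwparserfromhell can do a more thorough job. The helper below is a rough illustrative sketch, not a complete wikitext parser (it won't handle nested templates or tables):

```python
import re


def strip_wikitext(text):
    """Roughly strip common wikitext markup, leaving bare text."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)  # non-nested {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # [[link|label]] -> label
    text = re.sub(r"'{2,}", "", text)  # ''italic'' / '''bold''' quote runs
    text = re.sub(r"<[^>]+>", "", text)  # HTML-style tags like <ref>
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)  # == headings ==
    return text


sample = "'''Python''' is a [[programming language|language]] {{citation needed}}."
print(strip_wikitext(sample))
```

For tf-idf over 300k articles, even this crude stripping should remove most of the markup noise; the leftover tokens from rare constructs will have low document frequency and contribute little to the scores.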
--
-Thanks and Regards,
Sumit Asthana,
B.Tech final year,
Dept. of CSE,
IIT Patna