Discovery December 2015

discovery@lists.wikimedia.org

20 participants
26 discussions

by Stas Malyshev

Hi! Since we've talked about maybe using TextCat-based algorithms, I've made an implementation of TextCat as a PHP class/utility, which may be useful: https://github.com/smalyshev/textcat Please feel free to comment. It is based on what I found at http://odur.let.rug.nl/~vannoord/TextCat/ which is pretty old, so we may want to patch it up, but I think it works as a starting point (provided we want to pursue this route). I'll work on improving the loading latency (converting the LM format to PHP) and making it into a real composer module. Maybe also add some tests. Improvement suggestions welcome, of course. -- Stas Malyshev smalyshev(a)wikimedia.org

8 years, 3 months

Etiquette: Consumption of API wikipedia from backend (full-text search)

by Luigi Assom

Hello,

I am writing about the limits of use and the etiquette to comply with when consuming the full-text search API *server side*.

I am building a site for visualization and knowledge discovery of Wikipedias. It will be a personally funded project (at least initially!), for public use: investing more in indexing with Elasticsearch would be beyond my means and also beyond the scope of my project, whose focus is on visualization and discovery. I also think there is no need to reinvent the wheel :)

I want to figure out the best setup for usability and request rates for the full-text search API, complying with your policy. Would you please take a minute to read below?

***

Currently my setup uses my own db: for full-text search I use Elasticsearch at a very basic level. I then use the Wikipedia API to decorate my data, *client-side (AJAX)*.

Despite being slower than what I have now, the Wikipedia full-text API is much more useful for a user. It offers results on complex queries that I cannot provide, because I am indexing only articles' titles.

I would like to include full-text search against the Wikimedia API from the server side. I want to ensure that I can meet the policy of the Wikimedia Foundation if I make concurrent requests on behalf of users.

- *Is there any limit to the number of requests I can make from a web domain?*

I would like to use the wikitool python library. The query I need to run will use a *search generator* over the article namespace only:

action=query&generator=search&gsrnamespace=0&gsrsearch='my query'&gsrlimit=20

I tested it from my laptop and found it quite slow; for example, it took:

~1.2 s for querying 'DNA'
~1.6 s for 'terroristi attacks'
~1.7 s for 'biology technology'

and I am currently on a very fast wifi network.

- *How would it be possible to improve performance?*
- *Is it possible to apply for a desired rate of requests?*

I also read that it is good etiquette to specify contact details in the request *headers*, in case you need to communicate with the domain. It is not clear to me what I should do.

- *Could you please indicate how to do it with an example in python (here using the flask framework)?*

Thank you very much for your help,
Luigi
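
As an illustration of the header question in the message above, here is a minimal sketch of building the search request with an identifying User-Agent header, using only the Python standard library. The application name, site URL and contact address are hypothetical placeholders, `format=json` is added for convenience, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical identification string: replace with your own site and contact address.
USER_AGENT = "ExampleWikiViz/0.1 (https://example.org/; contact@example.org)"


def build_search_request(query, limit=20):
    """Build (but do not send) a full-text search request for the article namespace."""
    params = {
        "action": "query",
        "generator": "search",
        "gsrnamespace": 0,
        "gsrsearch": query,
        "gsrlimit": limit,
        "format": "json",
    }
    # The User-Agent header tells the operators who to contact about this traffic.
    return Request(f"{API_URL}?{urlencode(params)}", headers={"User-Agent": USER_AGENT})


request = build_search_request("DNA")
print(request.full_url)
print(request.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen` would send it; a Flask view could call the same function once per user query.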

8 years, 4 months

Cybozu / ES Plugin language detection update

by Trey Jones

Hey Everyone,

David figured out how the Cybozu ES language detection plugin works in more detail, and figured out how to limit languages and how to retrain the models. The results are big improvements that bring performance more in line with the results we're seeing from TextCat.

Initial results are below, for queries with spaces appended before and after (which improved performance on the old models; I'll verify that's still the case). Below are the summary stats for all the old language models, the old models limited to "useful" languages, and new models, retrained on the (admittedly messy) query data used for TextCat training. The evaluation set is the manually tagged enwiki sample. The full details will be posted on this page shortly: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_E…

All languages, old models

f0.5    f1      f2      recall  prec    total  hits  misses
54.4%   47.4%   41.9%   39.0%   60.4%   775    302   198

Limited languages, old models (en,es,zh-cn,zh-tw,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)

f0.5    f1      f2      recall  prec    total  hits  misses
75.6%   71.0%   67.0%   64.5%   79.0%   775    500   133

Retrained languages (en,es,zh,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)

f0.5    f1      f2      recall  prec    total  hits  misses
81.8%   79.2%   76.9%   75.4%   83.5%   775    584   115

David suggests that this means we should go with TextCat, since it's easier to integrate, and I agree. However, this test was pretty quick and easy to run, so if we improve the training data, we can easily rebuild the models and test again.

Overall, it's clear that limiting languages to the "useful" ones for a given wiki makes sense, and training on query data rather than generic language data helps, too!

-Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
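
For readers who want to check the summary figures in the message above, here is a small sketch that recomputes them from the raw counts. It assumes that "total" is the number of tagged queries that should have been identified, "hits" are correct identifications and "misses" are incorrect ones; that reading is not spelled out in the message, but it reproduces the reported percentages. The function names are made up for this illustration.

```python
def f_beta(precision, recall, beta):
    """F-measure: beta < 1 favours precision, beta > 1 favours recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def summarize(total, hits, misses):
    """Derive recall, precision and F-scores from raw counts.

    Assumption: `total` is the number of queries that should be identified,
    `hits` are correct identifications and `misses` are incorrect ones.
    """
    recall = hits / total
    precision = hits / (hits + misses)
    scores = {"recall": recall, "prec": precision}
    for name, beta in (("f0.5", 0.5), ("f1", 1.0), ("f2", 2.0)):
        scores[name] = f_beta(precision, recall, beta)
    return scores


for label, counts in (
    ("all languages, old models", (775, 302, 198)),
    ("limited languages, old models", (775, 500, 133)),
    ("retrained languages", (775, 584, 115)),
):
    row = summarize(*counts)
    print(label, {name: f"{value:.1%}" for name, value in row.items()})
```

F0.5 weights precision more heavily than recall, which fits a setting where a wrong language guess is costlier than no guess.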

8 years, 4 months

Portal A/B test announcements

by Oliver Keyes

Hey all, Some announcements around the A/B tests we have run (and will be running) on the Wikipedia portal. A couple of weeks ago we launched an initial test to identify if a more prominent search box would improve the rate at which users clicked through from the portal to our various projects (at the moment, only a third of users do so). This was our first A/B test as a team, and so we expected various process flaws to show up when we implemented it. The test was essentially a wash due to one unfortunate and unfortunately fatal flaw :(. The implementation of the logging did pick up events from both test groups, so our initial lightweight data-checking passed, but it did /not/ pick up *search* events specifically - the one population we absolutely wanted to track. As a result the test group did not contain the events we needed it to. This was a failure of both the implementation and the QA process at the analysis end; there is egg on everyone's face. In the next iteration we will be conducting far deeper checks, both on the code and on the output. We have already expanded our documentation around testing and implementing A/B tests, at https://meta.wikimedia.org/wiki/Discovery/Testing#Guidelines To avoid running behind schedule, and to ensure we are producing the most user-optimised version of the portal we can, we will be launching a *multi-variable* A/B test - technically an A/B/C test - on 4 January. This will test the existing version of the portal, a version with the more prominent search box, and a version with a more prominent search box and friendly search results (containing pictures and brief summaries, as they do on mobile), against each other. This test should run for a week, and be non-invasive - only 0.03% of users should even notice anything has happened. When it's done, we're very hopeful that we'll have a clear winner and can make the experience of using the portal better for the remaining 99%, too :). 
For the Portal team, -- Oliver Keyes Count Logula Wikimedia Foundation

8 years, 4 months

Expansion of A/B testing documentation

by Oliver Keyes

Hey y'all,

I'm happy to say that I've just finished a major rewrite of the page that documents our A/B tests on meta (https://meta.wikimedia.org/wiki/Discovery/Testing). In particular:

1. The documentation of what needs to be done to run an A/B test has been expanded, and summarised in an easily-readable tabular form.
2. Test metadata has been expanded to document whether the test was a failure (null hypothesis not eliminated) or success (null hypothesis eliminated), and what project it was for (now that we're A/B testing outside Cirrus).
3. Documentation of the tests themselves has been expanded, with the language switching tests now fully documented.

Regards,
-- Oliver Keyes Count Logula Wikimedia Foundation

8 years, 4 months

Fwd: New Beta Feature: completion suggester

by Dan Garry

Cross-posting from wikitech-l. Please discuss this there. :-)

Dan

---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 17 December 2015 at 17:09
Subject: New Beta Feature: completion suggester
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>

Hey all,

In the continued quest to make the search bar a better tool, the Wikimedia Foundation's Discovery Department <https://www.mediawiki.org/wiki/Wikimedia_Discovery> has put a completion suggester into Beta Features.

The tool functions with search-as-you-type, with a small tolerance for typos and spacing in finding results. Possible matches are then displayed as you type in a drop down menu, hopefully eliminating the need to perform a fulltext search with landing page and all. You can read more details at mediawiki.org <https://www.mediawiki.org/wiki/Extension:CirrusSearch/CompletionSuggester> and use the talk page for now for feedback.

The tool is now available and will only be enabled for the article namespace for now, and will progress into full production at some point hopefully in early 2016, depending on feedback. It's going to be important to get feedback from regular contributors who use search to make sure that any of the basic feature requests for searching the main space can at least be addressed while in Beta Features.

Thanks!
Dan

-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation

8 years, 4 months

Re: [discovery] Language Detection Improvements!

by Trey Jones

Hi everyone,

I've mostly finished up my write up on using TextCat for language detection, here: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_w…

Highlights include:

- Using query data for training gives better results than using general text (i.e., Wikipedia articles) even though the query data is pretty messy.
- French and English are too similar, esp. since no one uses the diacritics in French. Blame the Normans <https://en.wikipedia.org/wiki/Norman_conquest_of_England> (for the similarity, at least).
- We need to expand to other wikis (this focused on enwiki).
- We might be able to do better, still, with better training data.

Thanks for the feedback so far!

-Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Tue, Dec 8, 2015 at 10:22 AM, Trey Jones <tjones(a)wikimedia.org> wrote:

> Hey everyone,
>
> I originally started this thread on the internal mailing list, but I should've shared it on the public list. There have been a few replies, so I'll try to summarize everything so far and we can continue discussion here. For those following the old thread, new stuff starts at "*New Stuff!*" below. Sorry for the mixup.
>
> I wanted to share some good news on the language detection front. I've been messing around with TextCat, an n-gram-based language ID tool that's been around for a good while. (The version I've been using is in Really Old Perl, but as David pointed out, it's been ported to other languages (see below) and it would be easy for us to port to Python or PHP.)
>
> I trained fresh models on 59 languages based on queries (not random non-query text, but fairly messy data), and made some improvements to the code (making it Unicode-aware, making it run much faster, and improving the "unknown n-gram" penalty logic) and ran lots of tests with different model sizes. I got decent improvements over baseline TextCat.
>
> After doing some analysis on my English Wikipedia sample, I limited the languages to those that are really good (Thai pretty much only looks like Thai) and really necessary (English, of course, and Spanish). I dropped languages that performed poorly and wouldn't help much anyway (Igbo doesn't come up a lot on enwiki), and dropped some nice-to-have languages that performed too poorly (French, alas). I got down to 15 relevant languages for enwiki: Arabic, Bulgarian, Bengali, Greek, English, Spanish, Farsi, Hindi, Japanese, Korean, Portuguese, Russian, Tamil, Thai, and Chinese.
>
> The improvement over the ES plugin baseline (with spaces) is phenomenal. Recall doubled, and precision went up by a third. F0.5 is my preferred measure, but all these are waaaay better:
>
>          f0.5    f1      f2      recall  prec
> ES-sp    54.4%   47.4%   41.9%   39.0%   60.4%
> ES-thr   69.5%   51.6%   41.1%   36.1%   90.3%
> TC-lim   83.3%   83.3%   83.3%   83.4%   83.2%
>
> • ES-sp is the ES-plugin with spaces added at the beginning and ending of each string.
>
> • ES-thr is the ES-plugin (with spaces) with optimized thresholds for each language (more fine grained than choosing languages, and over trained because the optimization data and test data is the same, and brittle because the sample for many languages is very small). Also, this method targets precision, and in theory could improve recall a tiny bit, but in practice took it down a bit.
>
> • TC-lim is TextCat, limiting the languages to the relevant 15 (instead of all 59).
>
> I still have some other things to try (equalizing training set sizes, manually selected training sets, trying wiki-text training sets), and a more formal write up is coming, so this is just a preview for now.
>
> Generally, I'm just really glad to get some improved results!
>
> If we want to try an A/B test with this, we'll also need to turn it into something ES can use. Stas noted that it doesn't have to be hosted by ES - since we have a separate call to language detection (which goes to ES now, but can be anything), it can be implemented separately if we wanted to.
>
> David found several other implementations in Java and PHP:
>
>> There are many Textcat implementations:
>> 2 JAVA: http://textcat.sourceforge.net/ and http://www.jedi.be/pages/JTextCat/
>> 1 PHP extension: https://pecl.php.net/package/TextCat
>> and certainly others
>
>> I'm wondering why the PHP implementation is done via a C extension...
>> Maybe the initialisation steps (parsing ngrams stat files) are not suited for PHP scripts?
>
> David also pointed out that TextCat uses somewhat the same technique used by Cybozu (ES plugin), so maybe we could try to train it with the same dataset used for TextCat and generate custom profiles. We could reuse the same PHP code in Cirrus. There's a question as to whether it will be possible to implement the new "unknown n-gram" penalty logic. And unfortunately the es-plugin does not support loading multiple profiles, making it hard to run A/B tests. So we will have to write some code anyways...
>
> And David noted that solutions based on scripts will suffer from loading the ngram files for each request, and asked about overhead for the current models.
>
> Erik noted that if we are going to rewrite it in something, a PHP library would make the most sense for reuse in our application (being PHP and all). Otherwise, we could tie the existing Perl together with a Perl based HTTP server and set it up somewhere in the cluster, perhaps the services misc cluster would make sense. The content translation team was also interested in the work we are doing with language detection and may find a microservice more accessible than a PHP library.
>
> As Erik pointed out, the proof is in the pudding, so we'll have to see how this works out when we start sending live user data to it.
>
> *New Stuff!*
>
> • Stas pointed out that it's surprising that Bulgarian is on the short list because it's so similar to Russian. Actually Bulgarian, Spanish, and Portuguese aren't great (40%-55% F0.5 score) but they weren't obviously actively causing problems, like French and Igbo were.
>
> • I should have gone looking for newer implementations of TextCat, like David did. It is pretty simple code. But that also means that using and modifying another implementation or porting our own should be easy. The unknown n-gram penalty fix was pretty small: using the model size instead of the incoming sample size as the penalty. (more detail on that with my write up.)
>
> • I'm not 100% sure whether using actual queries as training helped (I thought it would, which is why I did it), but the training data is still pretty messy, so depending on the exact training method, retraining Cybozu could be doable. The current ES plugin was a black box to me (I didn't even know it was Cybozu). Anyone know where the code lives, or want to volunteer to figure out how to retrain it? (Or, additionally, turning off the not-so-useful models within it for testing on enwiki.)
>
> • I modified TextCat to load all the models at once and hold them in memory (the previous implementation was noted in the code to be terrible for line-by-line processing because it would load each model individually for each query processed; it was the simplest hack possible to make it run in line-by-line mode). The overhead isn't much. 3.2MB for 59 models with 5000 n-grams. Models seem to be around 70KB, but range from 45K to 80K. The 15 languages I used are less than 1MB. Right now, a model with 3000 n-grams seems to be the best, so the overhead would go down by ~40% (not all n-grams are the same size, so it might be more). In short, not too much overhead.
>
> • I think it would make sense to set this up as something that can keep the models in memory. I don't know enough about our PHP architecture to know if you can init a plugin and then keep it in memory for the duration. Seems plausible though. A service of some sort (doesn't have to be Perl-based) would also work. We need to think through the architectural bits.
>
> More questions and comments are welcome! I'm going to spend a few more days trying out other variations on models and training data now that I have an evaluation pipeline set up.
>
> There's also some work involved in using this on other wikis. The best results come from using the detectors that are most useful for the given wiki. For example, there really is no point in detecting Igbo on enwiki because Igbo queries are very rare, and Igbo incorrectly grabs a percentage of English queries, and it's extra overhead to use it. On the Igbo wiki (which made it onto my radar for getting 40K+ queries in a day) an Igbo model would obviously be useful. (A quick check just now also uncovered the biggest problem: many if not most queries to the Igbo wiki are in English. Cleaner training data will definitely be needed for some languages.)
>
> -Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
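
The n-gram approach discussed in the thread above can be sketched roughly as follows. This is a minimal illustration of rank-order ("out-of-place") n-gram matching with a fixed penalty for unknown n-grams and an optional restriction to a list of useful languages; it is not the code from the thread. The function names, the toy training strings and the default model size of 3000 are assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=5, size=3000):
    """Map the `size` most frequent character n-grams of `text` to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of padding
                    counts[gram] += 1
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(query_profile, model, penalty):
    """Sum of rank differences; n-grams missing from the model cost `penalty`."""
    return sum(
        abs(rank - model[gram]) if gram in model else penalty
        for gram, rank in query_profile.items()
    )


def detect(query, models, languages=None, size=3000):
    """Return the candidate language whose model is closest to the query."""
    candidates = languages or list(models)
    profile = ngram_profile(query, size=size)
    # Assumption: the penalty is the fixed model size, not the query profile size.
    return min(candidates, key=lambda lang: out_of_place(profile, models[lang], size))


# Toy training data, invented for this example; real models need far more text.
TRAINING = {
    "en": "the cat and the dog and the bird sat in the garden",
    "ru": "кошка и собака и птица сидели в саду",
}
# Models are built once and kept in memory rather than reloaded per query.
MODELS = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(detect("the dog", MODELS))
print(detect("собака", MODELS))
print(detect("the dog", MODELS, languages=["ru"]))
```

Building `MODELS` once at start-up mirrors the point made in the thread about holding models in memory, and the `languages` argument corresponds to limiting detection to the languages that are useful for a given wiki.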

8 years, 4 months

Completion suggester lightning talk

by Dan Garry

Hey all, I just signed myself up <https://www.mediawiki.org/w/index.php?title=Lightning_Talks&diff=1964399&ol…> to give a lightning talk about the completion suggester next week. I thought I'd let you all know. :-) Thanks, Dan -- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation

8 years, 4 months

Fwd: [Bigdata-developers] code freeze for 2.0

by Stas Malyshev

FYI below is the planned schedule for Blazegraph 2.0 release. I plan to test with RC as soon as it is published. It is supposed to fix some buggy SPARQL queries that were reported recently.

-------- Forwarded Message --------
Subject: [Bigdata-developers] code freeze for 2.0
Date: Mon, 14 Dec 2015 16:07:06 -0500
From: Bryan Thompson <bryan(a)systap.com>
To: Bigdata-developers(a)lists.sourceforge.net <Bigdata-developers(a)lists.sourceforge.net>

2.0 is now in a code freeze for benchmarking and performance regression testing. The only remaining tickets against 2.0 have to do with deployers, documentation, etc. Our plans are for a candidate release this year with an official 2.0 release in mid-January, 2016.

Thanks,
Bryan

8 years, 4 months

Elasticsearch dashboards

by Erik Bernhardson

I've gone through and updated some of the dashboards. Most specifically, I've renamed the primary dashboard from discovery to elasticsearch. If you're looking for the dashboard, look for elasticsearch. The primary dashboard lives here now: https://grafana.wikimedia.org/dashboard/db/elasticsearch

8 years, 4 months
