Since we've talked about maybe using TextCat-based algorithms, I've made
an implementation of TextCat as a PHP class/utility, which may be useful:
Please feel free to comment. It's based on what I found at
http://odur.let.rug.nl/~vannoord/TextCat/ which is pretty old, so we may
want to patch it up, but it works as a starting point I think (provided
we'd want to pursue this route).
I'll work on improving the loading latency (converting LM format to PHP)
and making it into a real composer module. Maybe also add some tests.
Improvement suggestions welcome of course.
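In case it's useful for reviewers who haven't looked at the algorithm itself, here is a
minimal, illustrative Python sketch of the Cavnar & Trenkle "out-of-place" n-gram ranking
that TextCat is built on (not the PHP class itself; the .lm file layout and the 400-n-gram
cutoff are assumptions based on van Noord's distribution):

    import re
    from collections import Counter

    def ngram_profile(text, max_n=5, top=400):
        """Rank the most frequent 1..max_n character n-grams, most frequent first."""
        text = re.sub(r'\s+', '_', '_' + text.strip() + '_')
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}

    def load_lm(path, top=400):
        """Load a TextCat-style .lm model: one 'ngram count' pair per line, most frequent first (assumed layout)."""
        profile = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue
                profile[line.split()[0]] = len(profile)
                if len(profile) >= top:
                    break
        return profile

    def out_of_place(sample, model):
        """Sum of rank differences; n-grams missing from the model get the maximum penalty (the model size)."""
        penalty = len(model)
        return sum(abs(rank - model.get(gram, penalty)) for gram, rank in sample.items())

    def detect(text, models):
        """Return the language whose model profile is closest to the text."""
        sample = ngram_profile(text)
        return min(models, key=lambda lang: out_of_place(sample, models[lang]))

    # e.g. models = {'en': load_lm('en.lm'), 'nl': load_lm('nl.lm')}; detect('the quick brown fox', models)

The unknown n-gram penalty here uses the model size, which is the tweak described elsewhere
in this thread; the original code penalized by the size of the incoming sample instead.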
I am writing to ask about the limits of use and the etiquette to comply with when
consuming the full-text search API *server side*.
I am building a site for visualization and knowledge discovery of Wikipedia articles.
It will be a personally funded project (at least initially!), for public
use: investing more in indexing under Elasticsearch would be beyond my
means and also beyond the scope of my project - the focus is on
visualization and discovery. And I also think there is no need to reinvent
the wheel :)
I want to figure out the best setup for usability and request rate for the
full-text search API, complying with your policy.
Would you please take a minute to read below?
Currently my setup makes use of my own database: for full-text search I use
Elasticsearch at a very basic level.
I then use the Wikipedia API to decorate my data, *client-side (AJAX)*.
Though slower than what I have now, the Wikipedia full-text API is much more
useful for a user: it offers results on complex queries that I cannot provide,
since I am indexing only articles' titles.
I would like to include full-text search against the Wikimedia API from the
server side. I want to ensure that I comply with the Wikimedia Foundation's
policy if I make concurrent requests on behalf of users.
- *Are there any limits to the number of requests I can make from a web server?*
I would like to use the wikitools Python library.
The query I need to run will use a *search generator* over articles.
I tested it from my laptop, and I found it quite slow; as example, it took:
~1.2 seconds for querying 'DNA'
~1.6 s for 'terroristi attacks'
~1.7s for 'biology technology'
and I am currently on a very fast wifi network.
*How would it be possible to improve performance?*
- *Is it possible to apply for a desired rate of requests?*
I also read that it is good etiquette to specify contact information in the
request *headers*, in case you need to communicate with the requesting domain.
It is not clear to me what I should do.
- *Could you please show how to do this with an example in Python
(here using the Flask framework)?*
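For reference, a minimal sketch of what that could look like server side, using the
requests library for the outgoing API calls (to be called from a Flask view); the
User-Agent string, contact details, and query parameters are placeholders, not official
policy text:

    import requests

    API_URL = 'https://en.wikipedia.org/w/api.php'

    # Identify your tool and include a way to contact you (URL and/or email).
    HEADERS = {
        'User-Agent': 'MyVisualizationSite/0.1 (https://example.org/about; admin@example.org)'
    }

    # Reusing one session keeps connections alive across requests.
    session = requests.Session()
    session.headers.update(HEADERS)

    def fulltext_search(query, limit=10):
        """Run a full-text search via the MediaWiki API and return matching page titles."""
        params = {
            'action': 'query',
            'list': 'search',
            'srsearch': query,
            'srlimit': limit,
            'format': 'json',
        }
        response = session.get(API_URL, params=params, timeout=10)
        response.raise_for_status()
        return [hit['title'] for hit in response.json()['query']['search']]

    # e.g. fulltext_search('DNA')

Reusing a single session (connection keep-alive) and requesting only the fields you need
should also shave some time off the latencies mentioned above, though most of the cost
will still be network round trips.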
Thank you very much for your help,
David figured out in more detail how the Cybozu ES language detection plugin
works, including how to limit languages and how to retrain the models.
The results are big improvements that bring performance more in line with
the results we're seeing from TextCat.
Initial results are below, for queries with spaces appended before and
after (which improved performance on the old models—I'll verify that's
still the case).
Below are the summary stats for all the old language models, the old models
limited to "useful" languages, and new models, retrained on the (admittedly
messy) query data used for TextCat training. The evaluation set is the
manually tagged enwiki sample.
The full details will be posted on this page shortly:
All languages, old models
f0.5 f1 f2 recall prec total hits misses
54.4% 47.4% 41.9% 39.0% 60.4% 775 302 198
Limited languages, old models
f0.5 f1 f2 recall prec total hits misses
75.6% 71.0% 67.0% 64.5% 79.0% 775 500 133
Retrained languages (en,es,zh,pt,ar,ru,fa,ko,bn,bg,hi,el,ta,th)
f0.5 f1 f2 recall prec total hits misses
81.8% 79.2% 76.9% 75.4% 83.5% 775 584 115
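As a quick sanity check, the f-scores above are the usual weighted harmonic means of
precision and recall; here's a short sketch reproducing the first row (up to rounding of
the reported precision/recall):

    def f_score(precision, recall, beta):
        """Weighted harmonic mean of precision and recall (beta > 1 weights recall more)."""
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    p, r = 0.604, 0.390  # "All languages, old models" row
    for beta in (0.5, 1.0, 2.0):
        print(beta, round(100 * f_score(p, r, beta), 1))  # -> 54.4, 47.4, 42.0 (41.9 before p and r were rounded)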
David suggests that this means we should go with TextCat, since it's easier
to integrate, and I agree. However, this test was pretty quick and easy to
run, so if we improve the training data, we can easily rebuild the models
and test again.
Overall, it's clear that limiting languages to the "useful" ones for a
given wiki makes sense, and training on query data rather than generic
language data helps, too!
Software Engineer, Discovery
Some announcements around the A/B tests we have run (and will be
running) on the Wikipedia portal.
A couple of weeks ago we launched an initial test to identify if a
more prominent search box would improve the rate at which users
clicked through from the portal to our various projects (at the
moment, only a third of users do so). This was our first A/B test as a
team, and so we expected various process flaws to show up when we ran it.
The test was essentially a wash due to one unfortunate and
unfortunately fatal flaw :(. The implementation of the logging did
pick up events from both test groups, so our initial lightweight
data-checking passed, but it did /not/ pick up *search* events
specifically - the one population we absolutely wanted to track. As a
result the test group did not contain the events we needed it to. This
was a failure of both the implementation and the QA process at the
analysis end; there is egg on everyone's face. In the next iteration
we will be conducting far deeper checks, both on the code and on the
output. We have already expanded our documentation around testing and
implementing A/B tests, at
To avoid running behind schedule, and to ensure we are producing the
most user-optimised version of the portal we can, we will be launching
a *multi-variable* A/B test - technically an A/B/C test - on 4
January. This will test the existing version of the portal, a version
with the more prominent search box, and a version with a more
prominent search box and friendly search results (containing pictures
and brief summaries, as they do on mobile), against each other.
This test should run for a week, and be non-invasive - only 0.03% of
users should even notice anything has happened. When it's done, we're
very hopeful that we'll have a clear winner and can make the
experience of using the portal better for the remaining 99.97%, too :).
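To make the sampling concrete, here's a toy sketch (not the actual portal/EventLogging
code; the bucket names and hashing scheme are illustrative) of how a 0.03% sample could
be split evenly across the three variants:

    import hashlib

    SAMPLING_RATE = 0.0003  # 0.03% of sessions enter the test
    BUCKETS = ['control', 'prominent-search', 'prominent-search-with-results']

    def assign_bucket(session_id):
        """Deterministically decide whether a session is sampled and, if so, which variant it sees."""
        digest = hashlib.sha256(session_id.encode('utf-8')).hexdigest()
        value = int(digest, 16) / float(16 ** len(digest))  # uniform in [0, 1)
        if value >= SAMPLING_RATE:
            return None  # not in the test; sees the current portal
        slot = int(value / SAMPLING_RATE * len(BUCKETS))
        return BUCKETS[min(slot, len(BUCKETS) - 1)]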
For the Portal team,
I'm happy to say that I've just finished a major rewrite of the page
that documents our A/B tests on meta
(https://meta.wikimedia.org/wiki/Discovery/Testing). In particular:
1. The documentation of what needs to be done to run an A/B test has
been expanded, and summarised in an easily-readable tabular form.
2. Test metadata has been expanded to document whether the test was a
failure (null hypothesis not eliminated) or success (null hypothesis
eliminated), and what project it was for (now that we're A/B testing more than one project).
3. Documentation of the tests themselves has been expanded, with the
language switching tests now fully documented.
Cross-posting from wikitech-l. Please discuss this there. :-)
---------- Forwarded message ----------
From: Dan Garry <dgarry(a)wikimedia.org>
Date: 17 December 2015 at 17:09
Subject: New Beta Feature: completion suggester
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
In the continued quest to make the search bar a better tool, the Wikimedia
Foundation's Discovery Department
<https://www.mediawiki.org/wiki/Wikimedia_Discovery> has put a completion
suggester into Beta Features. The tool functions with search-as-you-type,
with a small tolerance for typos and spacing in finding results. Possible
matches are then displayed as you type in a drop down menu, hopefully
eliminating the need to perform a fulltext search with landing page and
all. You can read more details at mediawiki.org
and use the talk page for now for feedback.
The tool is now available, but only enabled for the article
namespace for now; it will progress into full production at some point,
hopefully in early 2016, depending on feedback. It's going to be important
to get feedback from regular contributors who use search to make sure that
any of the basic feature requests for searching the main space can at least
be addressed while in Beta Features.
Lead Product Manager, Discovery
I've mostly finished up my write-up on using TextCat for language detection. A few highlights:
- Using query data for training gives better results than using general
text (i.e., Wikipedia articles) even though the query data is pretty messy.
- French and English are too similar, esp. since no one uses the
diacritics in French. Blame the Normans
<https://en.wikipedia.org/wiki/Norman_conquest_of_England> (for the
similarity, at least).
- We need to expand to other wikis (this focused on enwiki).
- We might be able to do better, still, with better training data.
Thanks for the feedback so far!
Software Engineer, Discovery
On Tue, Dec 8, 2015 at 10:22 AM, Trey Jones <tjones(a)wikimedia.org> wrote:
> Hey everyone,
> I originally started this thread on the internal mailing list, but I
> should've shared it on the public list. There have been a few replies, so
> I'll try to summarize everything so far and we can continue discussion
> here. For those following the old thread, new stuff starts at "*New
> Stuff!*" below. Sorry for the mixup.
> I wanted to share some good news on the language detection front. I've
> been messing around with TextCat, an n-gram-based language ID tool that's
> been around for a good while. (The version I've been using is in Really Old
> Perl, but as David pointed out, it's been ported to other languages (see
> below) and it would be easy for us to port to Python or PHP.)
> I trained fresh models on 59 languages based on queries (not random
> non-query text, but fairly messy data), and made some improvements to the
> code (making it Unicode-aware, making it run much faster, and improving the
> "unknown n-gram" penalty logic) and ran lots of tests with different model
> sizes. I got decent improvements over baseline TextCat.
> After doing some analysis on my English Wikipedia sample, I limited the
> languages to those that are really good (Thai pretty much only looks like
> Thai) and really necessary (English, of course, and Spanish). I dropped
> languages that performed poorly and wouldn't help much anyway (Igbo doesn't
> come up a lot on enwiki), and dropped some nice-to-have languages that
> performed too poorly (French, alas). I got down to 15 relevant languages
> for enwiki: Arabic, Bulgarian, Bengali, Greek, English, Spanish, Farsi,
> Hindi, Japanese, Korean, Portuguese, Russian, Tamil, Thai, and Chinese.
> The improvement over the ES plugin baseline (with spaces) is phenomenal.
> Recall doubled, and precision went up by a third. F0.5 is my preferred
> measure, but all these are waaaay better:
> f0.5 f1 f2 recall prec
> ES-sp 54.4% 47.4% 41.9% 39.0% 60.4%
> ES-thr 69.5% 51.6% 41.1% 36.1% 90.3%
> TC-lim 83.3% 83.3% 83.3% 83.4% 83.2%
> • ES-sp is the ES-plugin with spaces added at the beginning and ending of
> each string.
> • ES-thr is the ES-plugin (with spaces) with optimized thresholds for each
> language (more fine grained than choosing languages, and over trained
> because the optimization data and test data are the same, and brittle
> because the sample for many languages is very small). Also, this method
> targets precision, and in theory could improve recall a tiny bit, but in
> practice took it down a bit.
> • TC-lim is TextCat, limiting the languages to the relevant 15 (instead of
> all 59).
> I still have some other things to try (equalizing training set sizes,
> manually selected training sets, trying wiki-text training sets), and a
> more formal write up is coming, so this is just a preview for now.
> Generally, I'm just really glad to get some improved results!
> If we want to try an A/B test with this, we'll also need to turn it into
> something ES can use. Stas noted that it doesn't have to be hosted by ES -
> since we have a separate call to language detection (which goes to ES now,
> but can be anything), it can be implemented separately if we wanted to.
> David found several other implementations in Java and PHP:
> There are many Textcat implementations:
>> 2 JAVA: http://textcat.sourceforge.net/ and
>> 1 PHP extension: https://pecl.php.net/package/TextCat
>> and certainly others
> I'm wondering why the PHP implementation is done via a C extension...
>> Maybe the initialisation steps (parsing ngrams stat files) are not suited
>> for PHP scripts?
> David also pointed out that TextCat uses somewhat the same technique used
> by Cybozu (ES plugin), so maybe we could try to train it with the same
> dataset used for TextCat and generate custom profiles. We could reuse the
> same PHP code in Cirrus. There's a question as to whether it will be possible
> to implement the new "unknown n-gram" penalty logic. And unfortunately the
> es-plugin does not support loading multiple profiles making it hard to run
> A/B tests. So we will have to write some code anyways...
> And David noted that solutions based on scripts will suffer from loading
> the ngram files for each request, and asked about the overhead for the current models.
> Erik noted that if we are going to rewrite it in something, a PHP library
> would make the most sense for reuse in our application (being PHP and all).
> Otherwise, we could tie the existing Perl together with a Perl based HTTP
> server and set it up somewhere in the cluster, perhaps the services misc
> cluster would make sense. The content translation team was also interested
> in the work we are doing with language detection and may find a
> microservice more accessible than a PHP library.
> As Erik pointed out, the proof is in the pudding, so we'll have to see how
> this works out when we start sending live user data to it.
> *New Stuff!*
> • Stas pointed out that it's surprising that Bulgarian is on the short
> list because it's so similar to Russian. Actually Bulgarian, Spanish, and
> Portuguese aren't great (40%-55% F0.5 score) but they weren't obviously
> actively causing problems, like French and Igbo were.
> • I should have gone looking for newer implementations of TextCat, like
> David did. It is pretty simple code. But that also means that using and
> modifying another implementation or porting our own should be easy. The
> unknown n-gram penalty fix was pretty small—using the model size instead of
> the incoming sample size as the penalty. (More detail on that in my write-up.)
> • I'm not 100% sure whether using actual queries as training helped (I
> thought it would, which is why I did it), but the training data is still
> pretty messy, so depending on the exact training method, retraining Cybozu
> could be doable. The current ES plugin was a black box to me—I didn't even
> know it was Cybozu. Anyone know where the code lives, or want to volunteer
> to figure out how to retrain it? (Or, additionally, turning off the
> not-so-useful models within it for testing on enwiki.)
> • I modified TextCat to load all the models at once and hold them in
> memory (the previous implementation was noted in the code to be terrible
> for line-by-line processing because it would load each model individually
> for each query processed—it was the simplest hack possible to make it run
> in line-by-line mode). The overhead isn't much. 3.2MB for 59 models with
> 5000 n-grams. Models seem to be around 70KB, but range from 45K to 80K. The
> 15 languages I used are less than 1MB. Right now, a model with 3000 n-grams
> seems to be the best, so the overhead would go down by ~40% (not all
> n-grams are the same size, so it might be more). In short, not too much overhead.
> • I think it would make sense to set this up as something that can keep
> the models in memory. I don't know enough about our PHP architecture to
> know if you can init a plugin and then keep it in memory for the duration.
> Seems plausible though. A service of some sort (doesn't have to be
> Perl-based) would also work. We need to think through the architectural implications.
> More questions and comments are welcome! I'm going to spend a few more
> days trying out other variations on models and training data now that I
> have an evaluation pipeline set up.
> There's also some work involved in using this on other wikis. The best
> results come from using the detectors that are most useful for the given
> wiki. For example, there really is no point in detecting Igbo on enwiki
> because Igbo queries are very rare, and Igbo incorrectly grabs a percentage
> of English queries, and it's extra overhead to use it. On the Igbo
> wiki—which made it onto my radar for getting 40K+ queries in a day—an Igbo
> model would obviously be useful. (A quick check just now also uncovered the
> biggest problem—many if not most queries to the Igbo wiki are in English.
> Cleaner training data will definitely be needed for some languages.)
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
FYI below is the planned schedule for Blazegraph 2.0 release.
I plan to test with the RC as soon as it is published. It is supposed to fix
some buggy SPARQL queries that were reported recently.
-------- Forwarded Message --------
Subject: [Bigdata-developers] code freeze for 2.0
Date: Mon, 14 Dec 2015 16:07:06 -0500
From: Bryan Thompson <bryan(a)systap.com>
2.0 is now in a code freeze for benchmarking and performance regression
testing. The only remaining tickets against 2.0 have to do with
deployers, documentation, etc.
Our plans are for a candidate release this year with an official 2.0
release in mid-January, 2016.
I've gone through and updated some of the dashboards. Most notably, I've
renamed the primary dashboard from discovery to elasticsearch. If
you're looking for the dashboard, look for elasticsearch.
The primary dashboard lives here now: