Here is a good write-up that breaks it down:
http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of...
Given the recent threads about exploring BM25, I thought this was a good introduction to the difference between the two.
Cheers.
--tomasz
This was a really helpful read for me! Thanks for sending. :-)
Dan
On 14 March 2016 at 16:21, Tomasz Finc tfinc@wikimedia.org wrote:
Here is a good write up that breaks it down
http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of...
Given the recent threads about exploring BM25, i thought this was a good introduction to the difference between the two.
Cheers.
--tomasz
discovery mailing list discovery@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/discovery
Yep, nice reading!
On Fri, Mar 18, 2016 at 12:28 AM, Dan Garry dgarry@wikimedia.org wrote:
This was a really helpful read for me! Thanks for sending. :-)
Dan
-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation
That is a nice summary.
A couple of thoughts to add for those who aren't information-retrieval nerds (this helped solidify TF and IDF in my head >20 years ago):
* I like to think of TF (whatever its formulation) as how *characteristic* a term is for a document. If "dog" occurs only once, then this article isn't really about "dog".
* Similarly, IDF is how *distinctive* a term is for a document. "the" shows up a lot in English text—so it is characteristic—but it shows up in every text, so it isn't distinctive.
So, in a corpus of English documents, "the" is characteristic (high TF) but not distinctive (low IDF) for any given document. OTOH, in a corpus that's 99% Swahili documents, and a small handful of English documents, "the" is both characteristic (because the docs are in English) and distinctive (because most other docs are not).
Thus, TF is about a given document, and IDF is about the corpus it is part of.
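The "the" example can be checked with a few lines of Python. This is only a minimal sketch: it uses raw counts for TF and the textbook `log(N / df)` for IDF, not the exact formulas Lucene or BM25 use, and the documents (including the placeholder Swahili tokens) are made up for illustration.

```python
import math

def term_frequency(term, doc_tokens):
    """Raw count of `term` in one tokenized document (depends on the document only)."""
    return doc_tokens.count(term)

def inverse_document_frequency(term, corpus):
    """Textbook IDF, log(N / df); depends on the whole corpus, not one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

# Illustrative documents, not taken from any real corpus.
english_doc = "the dog chased the cat across the yard".split()

english_corpus = [
    english_doc,
    "the cat sat on the mat".split(),
    "the bird sang in the tree".split(),
]

# Placeholder tokens standing in for Swahili text; 99 of 100 documents.
swahili_doc = "mbwa na paka".split()
mixed_corpus = [english_doc] + [swahili_doc] * 99

for name, corpus in (("english", english_corpus), ("mixed", mixed_corpus)):
    for term in ("the", "dog"):
        tf = term_frequency(term, english_doc)
        idf = inverse_document_frequency(term, corpus)
        print(f"{name:8} {term:4} tf={tf} idf={idf:.2f} tf*idf={tf * idf:.2f}")
```

In the all-English corpus "the" has the highest TF but an IDF of 0, because it appears in every document. In the mixed corpus the same TF is paired with an IDF of log(100), about 4.61, so "the" becomes distinctive as well as characteristic.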
* Those cool graphs in that blog post? Those were obviously done with Desmos (https://www.desmos.com), a powerful free HTML5 graphing calculator. We use it to look at scoring components to understand how different formulas behave, and how modifying various parameters affects them. I use it so much, I just wanted to share it.
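For anyone who would rather tabulate a scoring component than graph it, here is a rough Python sketch of the same kind of exploration. It compares the square-root TF dampening of Lucene's classic similarity with the BM25 term-frequency component. The defaults `k1=1.2` and `b=0.75` are the commonly used ones; the document lengths are arbitrary illustrative values, and none of this is taken from the blog post's own graphs.

```python
import math

def classic_tf(freq):
    """Classic Lucene TF-IDF dampens term frequency with a square root."""
    return math.sqrt(freq)

def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    """BM25 term-frequency component; approaches k1 + 1 as freq grows."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + norm)

# Tabulate both curves; change k1, b, or doc_len to see how the shape moves.
for freq in (1, 2, 5, 10, 50):
    print(f"freq={freq:>2}  classic={classic_tf(freq):.2f}  bm25={bm25_tf(freq):.2f}")
```

The square-root curve keeps growing with every extra occurrence, while the BM25 component levels off below `k1 + 1`, so later occurrences of a term add less and less to the score.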
—Trey
Trey Jones Software Engineer, Discovery Wikimedia Foundation
On Fri, Mar 18, 2016 at 7:36 AM, Guillaume Lederrey glederrey@wikimedia.org wrote:
Yep, nice reading!