Still confused about BM25 vs TF*IDF for search relevance? - Discovery - lists.wikimedia.org


Tomasz Finc

14 Mar 2016

11:21 p.m.

Here is a good write-up that breaks it down:

http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-o…

Given the recent threads about exploring BM25, I thought this was a good introduction to the difference between the two.

Cheers.

--tomasz


Dan Garry

17 Mar

11:28 p.m.

This was a really helpful read for me! Thanks for sending. :-)

Dan

-- Dan Garry Lead Product Manager, Discovery Wikimedia Foundation


Guillaume Lederrey

18 Mar

11:36 a.m.

Yep, nice reading!

-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation


Trey Jones

4:18 p.m.

That is a nice summary. A couple of thoughts to add for those who aren't information-retrieval nerds (this helped solidify TF and IDF in my head >20 years ago):

* I like to think of TF (whatever its formulation) as how *characteristic* a term is for a document. If "dog" occurs only once, then the article isn't really about "dog".

* Similarly, IDF is how *distinctive* a term is for a document. "the" shows up a lot in English text, so it is characteristic, but it shows up in every text, so it isn't distinctive. So, in a corpus of English documents, "the" is characteristic (high TF) but not distinctive (low IDF) for any given document. OTOH, in a corpus that's 99% Swahili documents plus a small handful of English documents, "the" is both characteristic (because the docs are in English) and distinctive (because most other docs are not). Thus, TF is about a given document, and IDF is about the corpus it is part of.

* Those cool graphs in that blog post? Those were obviously done with Desmos ( https://www.desmos.com ), a powerful free HTML5 graphing calculator. We use it to look at scoring components to understand how different formulas behave, and how modifying various parameters affects them. I use it so much, I just wanted to share it.

Trey

Trey Jones Software Engineer, Discovery Wikimedia Foundation

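The "characteristic vs. distinctive" point in Trey's message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpora, the helper names `tf` and `idf`, and the specific formulas (relative frequency for TF, unsmoothed log(N/df) for IDF) are choices made here, not something taken from the blog post or from any search engine's actual scoring.

```python
import math

def tf(term, doc):
    """Illustrative TF: relative frequency of `term` in a tokenized document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Illustrative IDF: log(N / df), where df is the number of documents
    containing `term`. Assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

english_doc = "the dog chased the cat and the cat ran up the tree".split()

# Corpus A (made up): all English, so every document contains "the".
english_corpus = [
    english_doc,
    "the bird sat on the fence".split(),
    "the rain fell on the roof".split(),
]

# Corpus B (made up): 99 Swahili placeholder documents plus one English one.
mostly_swahili_corpus = [english_doc] + [["habari", "ya", "asubuhi"]] * 99

# TF depends only on the document, so it is the same in both corpora.
print("TF  'the':", tf("the", english_doc))
print("TF  'dog':", tf("dog", english_doc))

# IDF depends only on the corpus the document sits in.
print("IDF 'the' in English corpus:       ", idf("the", english_corpus))
print("IDF 'the' in mostly-Swahili corpus:", idf("the", mostly_swahili_corpus))
```

In the all-English corpus the IDF of "the" is zero, because every document contains it, so it contributes nothing to a TF*IDF product no matter how high its TF is. In the mostly-Swahili corpus the same document has the same TF for "the", but the IDF is large, which matches the "characteristic and distinctive" case.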


discovery@lists.wikimedia.org

