Re: [discovery] Still confused about BM25 vs TF*IDF for search relevance?

18 Mar 2016

That is a nice summary.

A couple of thoughts to add for those who aren't information-retrieval
nerds (this helped solidify TF and IDF in my head >20 years ago):

* I like to think of TF (whatever its formulation) as how *characteristic* a
term is for a document. If "dog" occurs only once, then this article isn't
really about "dog".

* Similarly, IDF is how *distinctive* a term is for a document. "the" shows
up a lot in English text—so it is characteristic—but it shows up in every
text, so it isn't distinctive.

So, in a corpus of English documents, "the" is characteristic (high TF) but
not distinctive (low IDF) for any given document. OTOH, in a corpus that's
99% Swahili documents, and a small handful of English documents, "the" is
both characteristic (because the docs are in English) and distinctive
(because most other docs are not).

Thus, TF is about a given document, and IDF is about the corpus it is part
of.
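
To make the "the" example concrete, here is a minimal Python sketch. It
assumes raw counts for TF and the plain textbook log(N / df) for IDF, which
is only one of many formulations. The function names and the tiny corpora
are made up for illustration, and the Swahili sentence is just a stand-in.

```python
import math

def term_frequency(term, doc):
    """Raw TF: how many times `term` occurs in one tokenized document."""
    return doc.count(term)

def inverse_document_frequency(term, corpus):
    """Textbook IDF, log(N / df); returns 0.0 if no document has the term."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

# Hypothetical toy data, not taken from any real index.
english_doc = "the dog chased the cat across the yard".split()
english_corpus = [
    english_doc,
    "the cat sat on the mat".split(),
    "the bird sang in the tree".split(),
]
# Mostly-Swahili corpus: 99 stand-in Swahili docs plus the one English doc.
mixed_corpus = [english_doc] + ["mbwa alimfukuza paka".split()] * 99

tf_the = term_frequency("the", english_doc)
idf_english = inverse_document_frequency("the", english_corpus)
idf_mixed = inverse_document_frequency("the", mixed_corpus)

print("TF of 'the' in the document:", tf_the)
print("IDF of 'the' in the English corpus:", idf_english)
print("IDF of 'the' in the mostly-Swahili corpus:", round(idf_mixed, 3))
```

TF is the same in both cases because the document hasn't changed. Only IDF
moves: it is zero when every document contains "the", and about 4.6 when
just one document in a hundred does.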

* Those cool graphs in that blog post? Those were obviously done with
Desmos ( https://www.desmos.com ), a powerful free HTML5 graphing
calculator. We use it to look at scoring components to understand how
different formulas behave, and how modifying various parameters affects
them. I use it so much, I just wanted to share it.
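
If you'd rather tabulate than graph, the same kind of comparison can be
sketched in a few lines of Python. This assumes the classic Lucene-style TF
component, sqrt(tf), and the standard BM25 TF component with the usual
defaults k1 = 1.2 and b = 0.75. The function names and the document lengths
are illustrative assumptions, not anything from our configuration.

```python
import math

def classic_tf(tf):
    """Classic Lucene-style TF component: square root of the raw count."""
    return math.sqrt(tf)

def bm25_tf(tf, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    """BM25 TF component; it saturates toward k1 + 1 as tf grows."""
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Tabulate both components for an average-length document.
for tf in (1, 2, 5, 10, 50, 100):
    print(f"tf={tf:>3}  classic={classic_tf(tf):6.3f}  bm25={bm25_tf(tf):6.3f}")
```

The classic component keeps growing with every extra occurrence, while the
BM25 component flattens out below k1 + 1. Changing k1, b, or doc_len in the
call shows how each parameter bends the curve.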

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Mar 18, 2016 at 7:36 AM, Guillaume Lederrey <glederrey(a)wikimedia.org...> wrote:
> Yep, nice reading!
>
> On Fri, Mar 18, 2016 at 12:28 AM, Dan Garry <dgarry(a)wikimedia.org...> wrote:
> > This was a really helpful read for me! Thanks for sending. :-)
> >
> > Dan
> >
> > On 14 March 2016 at 16:21, Tomasz Finc <tfinc(a)wikimedia.org...> wrote:
> >>
> >> Here is a good write up that breaks it down
> >>
> >> http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-o…
> >>
> >> Given the recent threads about exploring BM25, I thought this was a
> >> good introduction to the difference between the two.
> >>
> >> Cheers.
> >>
> >> --tomasz
> >>
> >
>
