Cheers
Johannes
On Wed, 11 Dec 2013 08:33:22 -0600
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
On 11 Dec 2013, at 8:17 AM, Aaron Halfaker
<ahalfaker(a)wikimedia.org> wrote:
This is
in the interest of reproducibility--it's very easy with large datasets to miss some
trivial detail and then things run amok
Yes. It would be nice if we didn't have to manage such details.
I'm sorry--I'm a bit confused. I understand you are from Wikipedia--is the intended
meaning of the above phrase that you guys at Wikipedia are not interested in generating
and distributing such a graph?
The ids
inside the SQL tables are artifacts of Wikipedia's construction phases--we should hide
them.
I don't think that hiding IDs that could be used to, say, look up content or compare
against user contrib histories is a good idea.
Maybe "hiding" is not the correct word. I would also distribute an array of SQL
ids indexed by node numbers to access the SQL data if necessary--the point is that when
you work with the graph a contiguous ID space is essential. Otherwise, for example, all
vector norms in the computation of spectral rankings are altered by the existence of
numerous isolated nodes.
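To make the idea concrete, here is a minimal sketch of such a renumbering, assuming the graph is available as a list of (source, target) pairs of SQL ids. The function name, the variable names and the toy edge list are all hypothetical and not taken from any actual dump. It builds the array of SQL ids indexed by node number, and shows how the entry of a uniform, L1-normalized vector changes when the gaps in the SQL id space are counted as isolated nodes.

```python
def build_contiguous_ids(edges):
    """Renumber sparse SQL ids to contiguous node numbers 0..n-1.

    `edges` is an iterable of (source_sql_id, target_sql_id) pairs
    (a hypothetical input format chosen for this illustration).
    """
    # Array of SQL ids indexed by node number: sql_ids[node] -> SQL id.
    sql_ids = sorted({sql_id for edge in edges for sql_id in edge})
    # Inverse mapping: SQL id -> node number.
    node_of = {sql_id: node for node, sql_id in enumerate(sql_ids)}
    # The same arcs, expressed in the contiguous ID space.
    renumbered = [(node_of[s], node_of[t]) for s, t in edges]
    return sql_ids, node_of, renumbered


# Toy data: three pages whose SQL ids are far apart.
edges = [(12, 907), (907, 3041), (3041, 12)]
sql_ids, node_of, graph = build_contiguous_ids(edges)

# Number of nodes in the contiguous space versus the size of a vector
# indexed directly by SQL id (every unused id acts as an isolated node).
n_contiguous = len(sql_ids)
n_sparse = max(sql_ids) + 1

# Entry of a uniform vector with L1 norm 1 in each ID space.
uniform_contiguous = 1.0 / n_contiguous
uniform_sparse = 1.0 / n_sparse
```

With this layout, `sql_ids[node]` recovers the SQL id for lookups against the tables, so nothing is lost by working with node numbers in the computation itself.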
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics