On 11 Dec 2013, at 8:17 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
This is in the
interest of reproducibility--it's very easy with large dataset to miss some trivial
detail and then things go amok
Yes. It would be nice if we didn't have to manage such details.
I'm sorry--I'm a bit confused. I understand you are from Wikipedia--the intended
meaning of the above phrase is that you guys at Wikipedia are not interested in generating
and distributing such a graph?
The ids inside
the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.
I don't think that hiding IDs that could be used to, say, look up content or compare
against user contrib histories is a good idea.
Maybe "hiding" is not the correct word. I would also distribute an array of SQL
ids indexed by node numbers to access the SQL data if necessary--the point is that when
you work with the graph a contiguous ID space is essentially. Otherwise, for example, all
vector norms in the computation of spectral rankings are altered by the existence of
numerous isolated nodes.
Ciao,
seba