Re: [Analytics] Distributing an official graph

11 Dec 2013

On 11 Dec 2013, at 8:17 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

...
   This is in the
interest of reproducibility--it's very easy with a large dataset to miss some trivial
detail and then things go amok
 Yes.  It would be nice if we didn't have to manage such details.

I'm sorry--I'm a bit confused. I understand you are from Wikipedia--do you mean
that you guys at Wikipedia are not interested in generating and distributing
such a graph?

...
   The ids inside
the SQL tables are artifacts of Wikipedia's construction phases--we should hide them.

 I don't think that hiding IDs that could be used to, say, look up content or compare
against user contrib histories is a good idea.

Maybe "hiding" is not the correct word. I would also distribute an array of SQL
ids indexed by node numbers, so that the SQL data can be accessed if necessary--the
point is that when you work with the graph, a contiguous ID space is essential.
Otherwise, for example, all vector norms in the computation of spectral rankings are
altered by the existence of numerous isolated nodes.
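
To make the point concrete, here is a minimal sketch of the renumbering I have in mind. Everything in it is an assumption for illustration: the function names, the three made-up SQL ids, and the choice of sorting ids to assign node numbers are not part of any existing dataset or tool. It builds the array of SQL ids indexed by node number, relabels the arcs, and compares how much of a uniform, l1-normalized vector lands on real nodes with and without the renumbering.

```python
def build_id_mapping(sql_ids):
    """Return (node_to_sql, sql_to_node) for a contiguous node numbering.

    node_to_sql is the array of SQL ids indexed by node number;
    sql_to_node is the inverse lookup.
    """
    node_to_sql = sorted(set(sql_ids))
    sql_to_node = {sql_id: node for node, sql_id in enumerate(node_to_sql)}
    return node_to_sql, sql_to_node


def relabel_edges(edges, sql_to_node):
    """Rewrite (source, target) pairs from SQL ids to node numbers."""
    return [(sql_to_node[src], sql_to_node[dst]) for src, dst in edges]


def mass_on_real_nodes(num_slots, real_slots):
    """l1 mass that a uniform vector over num_slots entries puts on real nodes."""
    uniform = [1.0 / num_slots] * num_slots
    return sum(uniform[i] for i in real_slots)


# Hypothetical SQL ids with large gaps (not taken from any real dump).
sql_edges = [(12, 25), (25, 3000), (3000, 12)]
sql_ids = {i for edge in sql_edges for i in edge}

node_to_sql, sql_to_node = build_id_mapping(sql_ids)
edges = relabel_edges(sql_edges, sql_to_node)

# Using raw SQL ids as indices creates max_id + 1 slots, almost all of them
# isolated phantom nodes that absorb most of the vector's mass.
sparse_mass = mass_on_real_nodes(max(sql_ids) + 1, sql_ids)

# With contiguous node numbers every slot is a real node.
dense_mass = mass_on_real_nodes(len(node_to_sql), range(len(node_to_sql)))
```

With the raw ids, the three real nodes hold only 3/3001 of the uniform vector's mass; after renumbering they hold all of it, and node_to_sql[k] still gives the SQL id of node k whenever you need to go back to the tables.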

Ciao,

					seba
