Re: [Analytics] Distributing an official graph

11 Dec 2013

Hi all,

you may be interested in
https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph

Cheers
Johannes

On Wed, 11 Dec 2013 08:33:22 -0600
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:

...
  On 11 Dec 2013, at 8:17 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:

   This is in the interest of reproducibility--it's very easy with large datasets to miss
   some trivial detail and then things go amok.

  Yes.  It would be nice if we didn't have to manage such details.

 I'm sorry--I'm a bit confused. I understand you are from Wikipedia--is the intended
 meaning of the above phrase that you guys at Wikipedia are not interested in generating
 and distributing such a graph?

   The ids inside the SQL tables are artifacts of Wikipedia's construction phases--we
   should hide them.

  I don't think that hiding IDs that could be used to, say, look up content or compare
  against user contrib histories is a good idea.

 Maybe "hiding" is not the correct word. I would also distribute an array of SQL ids
 indexed by node numbers to access the SQL data if necessary--the point is that when you
 work with the graph a contiguous ID space is essential. Otherwise, for example, all
 vector norms in the computation of spectral rankings are altered by the existence of
 numerous isolated nodes.

 Ciao,

 					seba

 _______________________________________________
 Analytics mailing list
 Analytics(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics 
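
The remapping discussed in the quoted message (contiguous node numbers plus an array of SQL ids indexed by node number) could look roughly like the sketch below. It is only an illustration under assumptions: the function name `compact_ids`, the edge list, and the id values are made up and do not come from any Wikipedia dump or tool. The last lines show one way unused ids can distort a norm: if raw SQL ids index the vector, every unused id behaves like an isolated node and takes a share of the uniform start vector.

```python
import math


def compact_ids(edges):
    """Map sparse SQL ids to contiguous node numbers 0..n-1.

    Returns (node_edges, sql_id_of_node), where sql_id_of_node[k] is the
    SQL id of node k, so the SQL data can still be looked up when needed.
    """
    sql_id_of_node = sorted({pid for edge in edges for pid in edge})
    node_of_sql_id = {pid: k for k, pid in enumerate(sql_id_of_node)}
    node_edges = [(node_of_sql_id[s], node_of_sql_id[t]) for s, t in edges]
    return node_edges, sql_id_of_node


# Hypothetical edge list keyed by sparse SQL ids (made-up values).
edges = [(12, 907), (907, 5000), (5000, 12)]
node_edges, sql_id_of_node = compact_ids(edges)

# Uniform start vector (entries sum to 1), as used to start a power iteration.
n_compact = len(sql_id_of_node)      # 3 real nodes
n_sparse = max(sql_id_of_node) + 1   # 5001 slots if raw ids index the vector

# Euclidean norm of the uniform vector in each id space.
l2_compact = math.sqrt(n_compact * (1 / n_compact) ** 2)
l2_sparse = math.sqrt(n_sparse * (1 / n_sparse) ** 2)

print(node_edges)       # edges over contiguous node numbers
print(sql_id_of_node)   # node number -> SQL id
print(l2_compact, l2_sparse)
```

With the compact numbering the vector has exactly one entry per real node, and `sql_id_of_node` gives the way back to the SQL tables, so no identifier is actually hidden.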
