Hi all,
I agree with Sebastiano, it would be really useful to have a 'plain'
version of the graph.
It would also be nice to have a plain version of Wikipedia, with all
the articles organized in fields
(title, categories, links, images, etc.). I recently wrote code [1]
to convert the dump to JSON, so that
each article can be easily loaded into an object without having to
parse the fields by hand.
I think that having this kind of data set provided directly by
Wikipedia would really improve the quality of
research and applications. Of course, I would be happy to help if needed ;)
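
As a rough illustration of the "one object per article" idea above, here is a
minimal Python sketch. It assumes one JSON record per line, and the field names
(title, categories, links, images) are just the ones listed in this message;
they are assumptions for illustration and not necessarily what json-wikipedia
actually emits.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Article:
    # Field names are hypothetical; they mirror the fields named in the email.
    title: str
    categories: list = field(default_factory=list)
    links: list = field(default_factory=list)
    images: list = field(default_factory=list)


def parse_article(line):
    """Turn one JSON record (one article) into an Article object."""
    record = json.loads(line)
    return Article(
        title=record["title"],
        categories=record.get("categories", []),
        links=record.get("links", []),
        images=record.get("images", []),
    )


# Made-up sample record, not taken from a real dump.
sample = '{"title": "Pisa", "categories": ["Cities in Tuscany"], "links": ["Italy"], "images": []}'
article = parse_article(sample)
print(article.title, article.links)
```

With records in this shape, a consumer reads the file line by line and never
touches wiki markup.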
Cheers,
Diego
[1]
https://github.com/diegoceccarelli/json-wikipedia
On Wed, Dec 11, 2013 at 3:44 PM, Johannes Kroll
<johannes.kroll(a)wikimedia.de> wrote:
Hi all,
you may be interested in
https://wikitech.wikimedia.org/wiki/Nova_Resource:Catgraph
Cheers
Johannes
On Wed, 11 Dec 2013 08:33:22 -0600
Sebastiano Vigna <vigna(a)di.unimi.it> wrote:
> On 11 Dec 2013, at 8:17 AM, Aaron Halfaker <ahalfaker(a)wikimedia.org> wrote:
>
> > > This is in the interest of reproducibility--it's very easy with a large
> > > dataset to miss some trivial detail, and then things go awry.
> > Yes. It would be nice if we didn't have to manage such details.
I'm sorry--I'm a bit confused. I understand you are from Wikipedia--do you mean
that you guys at Wikipedia are not interested in generating and distributing
such a graph?
> > > The ids inside the SQL tables are artifacts of Wikipedia's construction
> > > phases--we should hide them.
> > I don't think that hiding IDs that could be used to, say, look up content or compare
> > against user contrib histories is a good idea.
Maybe "hiding" is not the correct word. I would also distribute an array of SQL
ids indexed by node number, so that the SQL data can still be accessed if
necessary--the point is that when you work with the graph, a contiguous ID
space is essential. Otherwise, for example, all vector norms in the computation
of spectral rankings are altered by the existence of numerous isolated nodes.
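
The renumbering described above could look roughly like the following Python
sketch. The function name, the edge-list input format, and the sample ids are
all assumptions made for illustration; this is not how any existing dump is
produced.

```python
def build_contiguous_ids(edges):
    """Map sparse SQL page ids to node numbers 0..n-1.

    Returns (node_edges, sql_ids), where sql_ids[node] is the original
    SQL id, i.e. the "array of SQL ids indexed by node number".
    """
    # Collect every id that appears as a source or a target, in sorted order.
    sql_ids = sorted({pid for edge in edges for pid in edge})
    # Inverse lookup: SQL id -> contiguous node number.
    node_of = {pid: node for node, pid in enumerate(sql_ids)}
    node_edges = [(node_of[src], node_of[dst]) for src, dst in edges]
    return node_edges, sql_ids


# Hypothetical link list using sparse SQL ids (made-up values).
edges = [(12, 9048), (9048, 305), (305, 12)]
node_edges, sql_ids = build_contiguous_ids(edges)

print(node_edges)  # [(0, 2), (2, 1), (1, 0)]
print(sql_ids)     # [12, 305, 9048]
```

Here a rank vector needs only 3 entries. Indexed directly by SQL id it would
need 9049, and every unused id would behave like an isolated node. The
original id of node k is still available as sql_ids[k].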
Ciao,
seba
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
--
Computers are useless. They can only give you answers.
(Pablo Picasso)
_______________
Diego Ceccarelli
High Performance Computing Laboratory
Information Science and Technologies Institute (ISTI)
Italian National Research Council (CNR)
Via Moruzzi, 1
56124 - Pisa - Italy
Phone: +39 050 315 2984
Fax: +39 050 315 2040
________________________________________