A dump is indeed your best bet, especially as the alternative is spidering all 2,400,000 articles.
In response to whether anything similar has been done before, you might want to have a look at the six-degrees project (http://toolserver.org/~river/pages/projects/six-degrees), which was a replication-based implementation of something similar to what you're suggesting. It's since been taken down, but there's a similar tool at http://www.netsoc.tcd.ie/~mu/wiki/ ("find shortest paths") which queries the last database dump for the shortest path between two articles.
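For the curious, the shortest-path query those tools answer is essentially a breadth-first search over the page-link graph. A minimal sketch, using a made-up toy graph rather than real dump data:

```python
# Breadth-first search for the shortest chain of links between two
# articles. The `links` dict here is a hypothetical toy graph; a real
# implementation would build it from the pagelinks dump.
from collections import deque

def shortest_path(links, start, goal):
    """links: dict mapping article title -> set of linked titles.
    Returns the shortest path as a list of titles, or None."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt not in parent:
                parent[nxt] = page
                if nxt == goal:
                    # Walk the parent pointers back to the start.
                    path = [goal]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                queue.append(nxt)
    return None  # goal unreachable from start

# Toy example (made-up links):
links = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": {"E"},
}
print(shortest_path(links, "A", "E"))  # a 4-article path, e.g. ['A', 'B', 'D', 'E']
```

The real graph has millions of nodes, so the six-degrees tool kept the link table in a database rather than in memory, but the search itself is the same idea.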
- H
Andrew Gray wrote (Wed 04/06/2008 11:00):
2008/6/4 Sylvan Arevalo khakiducks@gmail.com:
Oh and if anyone has suggestions on the best way to make the database of hyperlinks that reference each other (spidering all of wikipedia, or is there a better way to do it?)
Spidering is bad!
(It's both time-consuming for you and very annoying for us)
You can get the dataset you're looking for via dumps.wikimedia.org; you want the enwiki pagelinks.sql.gz file, I believe. Not entirely sure what you'd do with it after that, but it ought to have the data you're looking for in a suitably stripped-down form.
The dump you probably want in this case is the "Wiki page-to-page link records."
You can download that file from here (it's near the bottom of the page, and is 1.7 GB): http://dumps.wikimedia.org/enwiki/20080524/
Hopefully, you will then be able to make your project work without having to spider the entire database.
This dump is slightly out of date, having been made on 24 May this year, but it's not too bad.
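For what it's worth, that dump is just a gzipped series of SQL INSERT statements, so you can pull the link tuples out without loading it into MySQL at all. A rough sketch, assuming the 2008-era (pl_from, pl_namespace, pl_title) column layout; check the CREATE TABLE statement at the top of the actual dump:

```python
# Stream (pl_from, pl_title) pairs out of a pagelinks SQL dump without
# importing it into a database. The three-column layout is an assumption
# based on the pagelinks schema of the time.
import gzip
import re

# Matches one (pl_from, pl_namespace, 'pl_title') tuple; the title is a
# single-quoted SQL string with backslash escapes.
TUPLE_RE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)")

def iter_pagelinks(path):
    """Yield (source_page_id, target_title) for main-namespace links."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue  # skip comments and schema lines
            for pl_from, ns, title in TUPLE_RE.findall(line):
                if ns == "0":  # namespace 0 = articles
                    yield int(pl_from), title.replace("\\'", "'")
```

This only unescapes quotes, not every SQL escape sequence, and pl_from is a page ID rather than a title (you'd join against the page dump to resolve those), but it's enough to get an edge list to experiment with.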
Regards, Stwalkerster
WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l