Hi, we are trying to extract all URLs from the wiki articles in our MediaWiki installation. We have tried grep, Perl and sed on MySQL dumps, but it is very difficult to get only the URLs, without stray text or comments before or after them.
Does anyone know of a better way to achieve this?
Thanks, Andi
On Thu, Mar 26, 2009 at 12:54 PM, Andreas Rindler <mediawiki@jenandi.com> wrote:
> Hi, we are trying to extract all URLs in wiki articles from our Mediawiki installation. We have tried Grep, Perl and Sed on mysql dumps, but it is very difficult to get the URLs only, without some garbage/text/comments before or after them.
> Does anyone know of a better way to achieve this?
SELECT * FROM externallinks;
on your wiki database.
Bryan
Andreas Rindler wrote:
> Hi, we are trying to extract all URLs in wiki articles from our Mediawiki installation. We have tried Grep, Perl and Sed on mysql dumps, but it is very difficult to get the URLs only, without some garbage/text/comments before or after them.
> Does anyone know of a better way to achieve this?
Use the externallinks table; it has all this data. If the externallinks table is empty or incomplete, you can rebuild it with a maintenance script (I don't remember the name offhand).
Roan Kattouw (Catrope)
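
For anyone who wants only the URLs rather than every column, the externallinks suggestion above could be scripted roughly as follows. This is a minimal sketch, not a tested recipe for any particular installation: it assumes the URL is stored in a column named el_to (as in the schema of this era; newer MediaWiki releases have changed this table, so check your own schema first), and the function name fetch_external_urls and the table parameter are made up for illustration. The stand-in SQLite database exists only so the example runs on its own; in practice you would pass a connection from your MySQL driver and, if your wiki uses a table prefix, the prefixed table name.

```python
import sqlite3


def fetch_external_urls(conn, table="externallinks"):
    """Return the distinct URLs stored in the externallinks table, sorted.

    `conn` is any DB-API connection. The column name el_to is an
    assumption about the schema; verify it against your own database.
    """
    # Table names cannot be bound as query parameters, so accept only
    # plain identifiers (letters, digits, underscores).
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")

    cur = conn.cursor()
    cur.execute(f"SELECT DISTINCT el_to FROM {table} ORDER BY el_to")

    urls = []
    for (url,) in cur.fetchall():
        # Some drivers return BLOB columns as bytes; decode them to text.
        if isinstance(url, bytes):
            url = url.decode("utf-8", errors="replace")
        urls.append(url)
    return urls


if __name__ == "__main__":
    # Stand-in database so the sketch runs without a MediaWiki installation.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE externallinks (el_from INTEGER, el_to TEXT)")
    conn.executemany(
        "INSERT INTO externallinks VALUES (?, ?)",
        [
            (1, "http://example.org/a"),
            (2, "http://example.org/a"),  # same link used on a second page
            (2, "http://example.com/b"),
        ],
    )
    for url in fetch_external_urls(conn):
        print(url)
```

Using DISTINCT means a URL linked from several pages is listed once; drop it and also select el_from if you need to know which page each link appears on.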
wikitech-l@lists.wikimedia.org