Hi,
I have been working on the toolserver for two weeks. In my home directory I have extracted the English and German dumps, so that I can scan them for geocoordinates. At the moment I have a Perl script for this work. The results are a KML file for Google Earth and a CSV for many other services.
Now I see that most of you use the MySQL database with all its nice tables. I have no experience with MySQL, but I know SQL.
Question 1: Can I use Perl to get some data from the database? (I have only been working with Perl and CGI for three weeks, so I need a small example.)
Question 2: Can I run an SQL query from Perl? For example: I want the categories of the page "xy" in DE.
Question 3: When I scan all articles in the dump (XML file) for coordinates, my Perl script needs 45 minutes for EN and 15 minutes for DE. If, as I hope, I use the MySQL database in the future, I think this process will take too long for the database, so that all other services would have a problem. Is this right? Or is the database powerful enough for this full-text search?
Question 4: After the full-text search I want to put the results into the database. For this I also need a Perl example. Please help me.
Question 5: If I cannot do the full-text search on the MySQL database, I think it would be very helpful to create one directory with all dumps (XML files) for all users, so that every user can use these dumps. At the moment I have the current dumps (DE+EN) in my home directory, but I am sure that other users also have dumps. What do you think about this?
Thanks, Stefan (sk)
Stefan Kühn wrote:
I have been working on the toolserver for two weeks. In my home directory I have extracted the English and German dumps, so that I can scan them for geocoordinates. At the moment I have a Perl script for this work. The results are a KML file for Google Earth and a CSV for many other services.
Now I see that most of you use the MySQL database with all its nice tables. I have no experience with MySQL, but I know SQL.
Question 1: Can I use Perl to get some data from the database? (I have only been working with Perl and CGI for three weeks, so I need a small example.)
Yes, you can have full access to the replicated database from Perl. You can use the DBI module, see http://dbi.perl.org/ (alternatively, searching for "perl DBI example" brings up a number of usable tutorials).
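A minimal sketch of such a connection (the host name "sql" and the database name are placeholders here -- check the toolserver documentation for the real ones; credentials are read from ~/.my.cnf rather than hard-coded):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Connect to the replicated database. "sql" and "dewiki_p" are
# assumptions for this example; your ~/.my.cnf supplies user/password.
my $dbh = DBI->connect(
    "DBI:mysql:database=dewiki_p;host=sql;"
      . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
    undef, undef,
    { RaiseError => 1 }
);

# A simple first query: count the pages in the main namespace.
my ($count) = $dbh->selectrow_array(
    "SELECT COUNT(*) FROM page WHERE page_namespace = 0"
);
print "dewiki_p has $count pages in the main namespace\n";

$dbh->disconnect;
```

With RaiseError set, any failed query dies with a message instead of failing silently, which saves a lot of debugging as a beginner.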
Question 2: Can I run an SQL query from Perl? For example: I want the categories of the page "xy" in DE.
Yes, once you understand the secrets of the MediaWiki database structure ;) Seriously, it's not that complicated. I'll give you an example that reads all categories of the "Schweden" article in the German Wikipedia:
  USE dewiki_p;
  SELECT cl_to
  FROM categorylinks
  INNER JOIN page ON page.page_id = categorylinks.cl_from
  WHERE page.page_title = "Schweden";
Like all titles in the database, the result is in UTF-8, and underscores ("_") have to be replaced by spaces.
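From Perl, the same query can be sketched with a bound placeholder for the title (assuming $dbh is a DBI handle already connected to dewiki_p, as in the answer to question 1):

```perl
# Fetch all category names of one article; the "?" placeholder
# handles quoting of the title safely.
my $sth = $dbh->prepare(
    "SELECT cl_to
     FROM categorylinks
     INNER JOIN page ON page.page_id = categorylinks.cl_from
     WHERE page.page_title = ?"
);
$sth->execute("Schweden");

while ( my ($cat) = $sth->fetchrow_array ) {
    $cat =~ tr/_/ /;    # underscores back to spaces for display
    print "$cat\n";
}
```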
Question 3: When I scan all articles in the dump (XML file) for coordinates, my Perl script needs 45 minutes for EN and 15 minutes for DE. If, as I hope, I use the MySQL database in the future, I think this process will take too long for the database, so that all other services would have a problem. Is this right? Or is the database powerful enough for this full-text search?
I can't say much about that, but the full texts are not in the replicated database at the moment. If they were, I believe MySQL would be faster than scanning an XML dump, and whether your Perl script parses XML or reads from MySQL shouldn't matter too much either. You should just let the admins know when you run a long query, so they don't kill it. (Someone correct me if I'm wrong.)
Question 4: After the full-text search I want to put the results into the database. For this I also need a Perl example. Please help me.
I think that's already covered by question 1. Just issue the INSERT statements you need to write the data.
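As a sketch (the table and column names here are made up for the example -- use whatever schema you create for your results in your own user database):

```perl
# Hypothetical results table: coords(page_title, lat, lon).
# Prepare once, execute per row; placeholders keep quoting safe
# even for titles with special characters.
my $ins = $dbh->prepare(
    "INSERT INTO u_stefan.coords (page_title, lat, lon)
     VALUES (?, ?, ?)"
);

foreach my $hit (@found_coordinates) {
    $ins->execute( $hit->{title}, $hit->{lat}, $hit->{lon} );
}
```

@found_coordinates stands in for whatever data structure your scanning script produces.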
Question 5: If I cannot do the full-text search on the MySQL database, I think it would be very helpful to create one directory with all dumps (XML files) for all users, so that every user can use these dumps. At the moment I have the current dumps (DE+EN) in my home directory, but I am sure that other users also have dumps. What do you think about this?
Did you miss http://mail.wikipedia.org/pipermail/toolserver-l/2006-September/000535.html?
I hope I provided a few useful hints. Once you know how to generally access the database from Perl (or whatever), understanding the DB structure isn't that difficult, I think.
Gruß, Peter 'da Pete' Schlömer
toolserver-l@lists.wikimedia.org