Hi all, for a toolserver project I want to read all Wikipedia (pwiki_de) articles and parse them for geoinformation. After some trouble I have now fixed nearly all bugs, but I still have some problems opening the articles.
I open the articles with the MediaWiki functions in the following way: $title = Title::newFromID($page_id); $art = new Article($title); $text = $art->getContent(true);
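Spelled out with minimal error handling it looks roughly like this (whether a failed fetch comes back as false or as an empty string is an assumption here and may depend on the MediaWiki version):

    $title = Title::newFromID($page_id);
    if (is_null($title)) {
        die("no page with id $page_id\n");
    }
    $art  = new Article($title);
    $text = $art->getContent(true);
    if ($text === false || $text === '') {
        // nothing readable came back - see the compression/external storage discussion below
    }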
For some articles this works quite well, but for others it doesn't return any text. I think there's a problem with the compression in the database (in a local environment with a Wikipedia dump it works), but I couldn't find a workaround. Any suggestions?
Thanks Leo
Some text is stored compressed in the databases, and some is on external storage, a feature of MediaWiki which Wikimedia sites use; this is not available at the present time.
Rob Church
Rob
this is not available at the present time.
Does this mean Wikimedia sites (like de) use compression but don't offer decompression? As a user I've never seen compressed text, so this is weird... we really need some kind of workaround.
-- Stefan K.
Well, on Wikimedia sites it'll be handled within the software, won't it? Old versions which are compressed will be decompressed for viewing and for calculating diffs. The same is true for external storage.
On the toolserver, we have the problem that it's a different setup; we don't have access to the external storage grid (that I know of and at the moment; Kate has ways and means). Compressed stuff should still be okay to view, but I bet there's less of it.
Rob Church
Thanks Rob. I think I have a problem with the compressed text; I can't open it. With a previous version of MediaWiki I got a warning from PHP; now that is turned off, but it still doesn't work. I can't say exactly how many articles are affected, but it's a lot of them. By the way, I will only use the current revision, no old stuff.
On 3/22/06, Leo Büttiker leo.buettiker@hsr.ch wrote:
[...]
I can't speak for dewiki, but the vast majority of the new edits in enwiki are just going straight to external storage... making toolserver not very useful.
What are the flags on the revision you are trying to read? And what is the raw text output of the text column? If it's just 'cluster' something-or-other, then your problem is external storage.
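A query along these lines shows what is actually stored for a page's current revision (standard page/revision/text schema, the same join as in the hack below; 'Berlin' is just an example title):

    -- show the storage flags and the first bytes of the text row
    SELECT p.page_title, t.old_flags, LEFT(t.old_text, 40)
    FROM page p, revision r, text t
    WHERE p.page_title = 'Berlin' AND p.page_namespace = 0
      AND p.page_latest = r.rev_id AND r.rev_text_id = t.old_id;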
External storage is not a good solution for the toolserver.
I use a little bloody hack to get all the content:
    ini_set('user_agent', 'TOOLSERVER');
    // current replication lag, used below to decide whether to trust the local copy
    $replag = file_get_contents("http://tools.wikimedia.de/~interiot/cgi-bin/replag?raw");

    // fetch text and flags of the current revision from the replicated database
    // ($fields[1] is the language prefix, e.g. 'de'; $fields[4] the page title)
    $result1 = mysql_query("select t.old_text,t.old_flags"
        ." from ".$fields[1]."wiki_p.text as t,"
        .$fields[1]."wiki_p.page as p,"
        .$fields[1]."wiki_p.revision as r"
        ." where p.page_title='".utf8_encode($fields[4])."'"
        ." and p.page_namespace=0 and p.page_is_redirect=0"
        ." and p.page_latest=r.rev_id and r.rev_text_id=t.old_id", $db);
    if ($result1 == false) die("failed");
    $fields1 = mysql_fetch_row($result1);

    // fall back to the live site if the text is external, a serialized object,
    // missing, or the replication lag is too high; otherwise just gunzip it
    if (preg_match('/external/', $fields1[1]) or preg_match('/object/', $fields1[1])
        or strlen($fields1[1]) == 0 or $replag >= 300) {
        $text1 = file_get_contents("http://".$fields[1].".wikipedia.org/w/index.php?title=".$fields[...");
    } else {
        $text1 = gzinflate($fields1[0]);
    }
OK, that works... but it is not very fast... and it produces load on the wiki servers.
But I can't find a better solution.
Greetings, Flacus/Flabot
Hi all
To solve the external storage problem, I have written http://tools.wikimedia.de/~daniel/foo/WikiProxy.php. It's not fast, as it also pulls the pages via HTTP from the real servers, but it does caching for dewiki, enwiki and commonswiki (I can add more on demand, but I have to ask Kate to tweak grants for each new cache).
The more people use this, the better it works :)
I have also fiddled with setting up an interface for WikiProxy that bypasses the web server. If anyone wants to experiment with a shaky FIFO or plain TCP interface, tell me, I'll have another look at it.
HTH -- Daniel
Hi Daniel, hi all. Great tool! Looks like I have to rewrite my app again. I also have some questions:
*Do you only look for remote articles when they are not available on the toolserver? I ask this because I will read all articles in the German Wikipedia (at least the first time I run my script), and that will bring a big performance problem.
*Can you give me a PHP interface to the WikiProxy? Something like an include file "WikiProxy.inc" with a function that returns the article as a string? That would be great, because then I wouldn't need any slow HTTP, TCP or whatever connection.
Thanks Leo
On 23/03/06, Leo Büttiker leo.buettiker@hsr.ch wrote:
I ask this because I will read all articles in the German Wikipedia (at least the first time I run my script), and that will bring a big performance problem.
Correct, it will. Please don't do that.
Rob Church
"Don't do that"?! What should I do instead of this? Not using the toolserver?
When I started working on the toolserver I hopped to find all articles (at least in the current version of them) on the database to make a usfull extract of the geodata for the community, but when not all articles are in the database there's no possibility for do that. Suggestions are always welcom.
On 3/22/06, FlaBot flabot@googlemail.com wrote:
OK, that works... but it is not very fast... and it produces load on the wiki servers.
But I can't find a better solution.
Greetings, Flacus/Flabot
I tried the same approach with my IRC bot, which reads the text of recent changes and analyzes the changes as they come in... it reads multiple revisions of the text to find out what's going on. Back when we had text it caused very few problems (well, except for triggering a MySQL bug, which was fixed)... but pulling over HTTP it caused problems rather quickly.
Hi Daniel !
Great tool. OK, I will switch from my script to yours, but first I have a few questions.
1) we need a place to accounte things like for tool.
2) At the moment I can access the PHP via http://tools.wikimedia.de/~daniel/foo/WikiProxy.php?wiki=de&title=Haus. That works with de and en, but if I use fr, for example, it doesn't work. OK, you wrote that only de/en will be cached, but perhaps you can make the tool work for non-cached languages?
3) What about the cache expiry time?
4) Perhaps we can have a small space from the toolserver to tmp more request ?
Hi Flacus, hi Leo
- we need a place to accounte things like for tool.
err... come again?
- At the moment I can access the PHP via http://tools.wikimedia.de/~daniel/foo/WikiProxy.php?wiki=de&title=Haus. That works with de and en, but if I use fr, for example, it doesn't work.
Hm? "Haus" does not exist in the fr wiki, so you get a 404 (with no visible text, for consistency with action=raw). Looking for "Berlin" works, for example: http://tools.wikimedia.de/~daniel/foo/WikiProxy.php?wiki=fr&title=Berlin
OK, you wrote that only de/en will be cached, but perhaps you can make the tool work for non-cached languages?
WikiProxy works for all wikis. If there is no cache table, it will simply pass the text through.
Btw.: the values for the wiki parameter can be full domain names. Short names like "de", "fr", etc. work for Wikipedias; for other wikis, use the full domain name, like "pl.wikinews.org".
- What about the cache expiry time?
The cache does not expire, the text is kept indefinitely, separate for each revision.
- Perhaps we can have a small space from the toolserver to tmp more request?
Uh, what?
*Do you only look for remote articles when they are not available on the toolserver?
Yes. It first looks into the text table - if the text is not there (i.e. it has the EXTERNAL flag), it looks into the cache. If it's not in the cache, it pulls it via HTTP and puts it into the cache.
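Schematically, the lookup order is something like this (the helper names are made up for illustration, not WikiProxy's actual code):

    // Illustrative only - the helper functions are hypothetical.
    function get_text($wiki, $rev_id) {
        $row = load_text_row($wiki, $rev_id);           // from the replicated text table
        if (strpos($row['old_flags'], 'external') === false) {
            return decode_text($row);                   // plain or gzipped, locally readable
        }
        $text = cache_get($wiki, $rev_id);              // per-revision cache table
        if ($text === false) {
            $text = fetch_via_http($wiki, $rev_id);     // action=raw from the live site
            cache_put($wiki, $rev_id, $text);
        }
        return $text;
    }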
I ask this because I will read all articles in the German Wikipedia (at least the first time I run my script), and that will bring a big performance problem.
If you need to process a *lot* of articles, use an XML dump. Your database will not be up to the minute anyway. If you need to track live updates, consider using the Atom feed for the RC page - it's possible to extract the diff from that, but it's a bit messy. I have code for that somewhere, though.
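For the bulk case, a minimal sketch of streaming a pages-articles dump in PHP might look like this (assumes the XMLReader extension with readString(), i.e. a reasonably recent PHP 5; element names are those of the standard export format, and the file name is just an example):

    $reader = new XMLReader();
    $reader->open('dewiki-pages-articles.xml');
    $title = '';
    while ($reader->read()) {
        if ($reader->nodeType != XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name == 'title') {
            $title = $reader->readString();       // remember the current page title
        } elseif ($reader->name == 'text') {
            $text = $reader->readString();        // wikitext of the revision
            // ... scan $text for coordinate templates here ...
        }
    }
    $reader->close();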
*Can you give me a PHP interface to the WikiProxy? Something like an include file "WikiProxy.inc" with a function that returns the article as a string? That would be great, because then I wouldn't need any slow HTTP, TCP or whatever connection.
Accessing the cache directly would require you to have read- and write access to the cache tables - this is messy administration-wise. As I said, I also thought about bypassing the HTTP interface for the proxy... I started to write a daemon mode for the proxy, so it can be contacted using fifos or plain TCP - it works more or less, but there's no client interface for this yet. I could start to write one, but don't hold your breath... in any case, I'm not sure how much faster that would actually be.
Regards, Daniel
Hello,
Daniel Kinzler daniel@brightbyte.de schrieb am Thu, 23 Mar 2006 13:56:44 +0100:
http://tools.wikimedia.de/~daniel/foo/WikiProxy.php?wiki=fr&title=Berlin
WikiProxy works for all wikis. If there is no cache table, it will simply pass the text through.
Of course it's okay to use the contents of the Wikipedia databases for your tools, but it is not intended that every article can be fully accessed from outside. The wiki source is a borderline case, but this isn't good either. The problem is not technical; the problem is that Wikimedia Deutschland would become a "Diensteanbieter" (service provider) for all the contents of Wikipedia... I think some German people are just waiting to sue Wikimedia Deutschland over some content ;).
Sincerely, Christian Thiele
Am Donnerstag, den 23.03.2006, 15:03 +0100 schrieb Christian Thiele:
[...]
That's right. I would prefer a simple password (one for all toolserver devs) which must be inserted into the GET string to get the text. That way normal outside users are kept out, and toolserver devs get in easily.
Sincerely, DaB.
Hi,
DaB wp@daniel.baur4.info schrieb am Thu, 23 Mar 2006 15:13:48 +0100:
That's right. I would prefer a simple password (one for all toolserver devs) which must be inserted into the GET string to get the text. That way normal outside users are kept out, and toolserver devs get in easily.
The best thing would be if everyone used the script itself... I don't know which language it is written in, but access should be possible from every other language; maybe it could be published as a PHP class to access the database... Remember: please publish all your code under a free licence and make it accessible for everyone ;).
Sincerely Christian Thiele
Hi all
Ok, I'll lock down access to the wikitext - might take a few days, though. I plan to do this as follows:
* have a whitelist of trusted IPs (e.g. 127.0.0.1, etc.)
* have a list of tokens (passwords) for access from other IPs.
Alternatively, this could be done with .htaccess. But I think I'll put it into the PHP code.
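A rough sketch of such a check in PHP (the variable names and the way tokens are stored are assumptions, not the final implementation):

    $whitelist = array('127.0.0.1');            // trusted IPs, e.g. local toolserver access
    $tokens    = array('example-token');        // tokens handed out to external users
    $ip    = $_SERVER['REMOTE_ADDR'];
    $token = isset($_GET['token']) ? $_GET['token'] : '';
    if (!in_array($ip, $whitelist) && !in_array($token, $tokens)) {
        header('HTTP/1.0 403 Forbidden');
        exit('access denied');
    }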
The best thing would be if everyone used the script itself... I don't know which language it is written in, but access should be possible from every other language; maybe it could be published as a PHP class to access the database...
The problem with this is DB access rights: anyone using the cache DB directly needs write access to the table. This is a mess to maintain. While it's the nicest way from the developer's point of view, it's not so good from an admin's point of view.
The easiest solution is to have some type of daemon that can be accessed via a simple interface. HTTP is the simplest to implement, but as I explained in my previous posts, I have also experimented with other methods.
Remember: please publish all your code under a free licence and make it accessible for everyone ;).
A pretty old version of my stuff is available at http://tools.wikimedia.de/~daniel/downloads/, but the cache stuff is not in there yet. I have to tweak my bundling scheme a bit more, it's too complicated...
-- Daniel
Dear all,
On March 23, 2006 3:33 PM Daniel Kinzler wrote:
Ok, I'll lock down access to the wikitext - might take a few days, though. I plan to do this as follows:
And on 3/23/06, Christian Thiele APPER@apper.de wrote:
the best thing is, if everyone uses the script itself... I don't know in
Wait a moment: we invested quite some effort to extract useful information (georeferenced articles) out of dewiki in order to serve other services. We have been advised, and until now were convinced, that the toolserver is the best platform to do this.
Then we got problems with a large amount of unreadable (compressed) articles... And now you close down access even for dewiki toolserver users because of non-technical reasons? Please tell me, that I'm wrong!
-- Stefan
On 3/23/06, Stefan F. Keller sfkeller@hsr.ch wrote:
Then we got problems with a large amount of unreadable (compressed) articles... And now you close down access even for dewiki toolserver users because of non-technical reasons? Please tell me, that I'm wrong!
The compression really isn't a problem, if the text access is ever restored I will help you write code to read the compressed data.. it is easy. Yes, you can't access it directly in the database, but you'd need to parse the content anyways.
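For the record, the decompression itself is only a flag check plus gzinflate(), along these lines (a sketch based on the standard old_flags values; the 'object' case for serialized blobs is left out):

    // $old_text and $old_flags come straight from the text table
    function read_old_text($old_text, $old_flags) {
        $flags = explode(',', $old_flags);
        if (in_array('external', $flags)) {
            return false;                    // stored on the external cluster, not locally readable
        }
        if (in_array('gzip', $flags)) {
            return gzinflate($old_text);     // 'gzip' means deflate-compressed data
        }
        return $old_text;                    // plain text
    }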
For a geospatial project on the toolserver I'd expect that you'd run into bigger problems with things like the lack of useful indexing for spatial objects in InnoDB tables.
(I had a postgresql install with the tiger data in my home directory, but I removed it after the loss of text access killed all my spatial wiki plans. :( )
The gzip inside MySQL is a problem... you can't regexp inside the gzipped MySQL fields.
If MySQL gets the data from the wiki's masters/slaves, why must the data be stored in the same form? Why can't the last version be uncompressed?
Is the problem CPU? Disk space? Can MySQL do this? Has no one modified the server to behave that way?
Wasn't the idea of the server to give developers access to a live, uncompressed version of the live wiki?
I am a gynaecologist, not a MySQL/PHP/whatever guru... but perhaps my questions can help to find answers to the problems.
But the first step of solving a problem has already been done... we talk together, we exchange information.
Greetings from Germany
Flacus
On 3/23/06, FlaBot flabot@googlemail.com wrote:
The gzip inside MySQL is a problem... you can't regexp inside the gzipped MySQL fields.
The regexp isn't going to use an index anyway, so the cost of doing this in your application is fairly low (just the extra database round trip). This stuff isn't magic. If you regexp against article text, the DB is going to be forced to read every eligible article... in fact it might even be stupid enough to apply the regexp before other more useful constraints.
Yes there is a little extra cost to send the data to the application for filtering (and potentially aggregation), but it's really not major.
Yes, it's an inconvenience... but from what I can tell most of the toolserver users implement most of their logic in their applications. (Typical mysql practice I guess)...
If MySQL gets the data from the wiki's masters/slaves, why must the data be stored in the same form? Why can't the last version be uncompressed?
Because we're using MySQL replication. Were we not using MySQL replication, you'd hear me whining to replace MySQL 5 with another database system that doesn't completely suck for ad hoc queries, like PGSQL.
I believe that mysql now supports user defined functions, so it wouldn't be too hard to create a function so you could do something like:
select id from table where php_decompress(text) REGEXP 'whatever';
Is the problem CPU? Disk space? Can MySQL do this? Has no one modified the server to behave that way?
Wasn't the idea of the server to give developers access to a live, uncompressed version of the live wiki?
In all honesty, if you're not able to handle decompressing the content, I have serious questions about your ability to do something useful with the resource... no insult intended. It's just really not that hard.
I am a gynaecologist, not a MySQL/PHP/whatever guru... but perhaps my questions can help to find answers to the problems.
But the first step of solving a problem has already been done... we talk together, we exchange information.
There is only so much that can be done without getting down and dirty with the technical bits and bytes. At some point in the future someone may create a system to help less technical users create the sort of reports and tools that can be created on toolserver, but we do not have that today.
It is not an easy problem... On our larger wikis like de and en, the database is big enough that if you don't understand things like the computational order of your query and the limitations of index use in MySQL (only one index per table is used to constrain the rows recalled), you will often just build queries which never complete in a useful amount of time.
On March 23, 2006 10:09 PM Gregory Maxwell wrote:
The compression really isn't a problem, if the text access is ever restored I will help you write code to read the compressed data... it is easy.
That sounds good! Thank you for helping out (soon?).
Yes, you can't access it directly in the database, but you'd need to parse the content anyways.
If we had the means to use stored procedures, we would be happy to get access that way.
Please understand our situation: Leo - and I, as a humble sponsoring mentor - are taking a step-by-step approach to extracting georeferenced content. We don't need real-time access (and no history), but we really do need to parse *all* available articles near real time at least *once* (Step 1; takes about 6 hours). After that, we only need those articles which have changed, say in a nightly run (Step 2; takes a couple of minutes).
We started experimenting on local dumps, then moved to the toolserver because dumps are really outdated (currently dewiki is still 2005-Dec-11!). If dumps were available, say, on a daily or weekly basis, that would be an option.
For a geospatial project on toolserver I'd expect that you'd run into bigger problems with things like the lack of useful indexing for spatial objects in Innodb tables.
Right: if we never get access to all articles in the near future, we'll never run into problems postprocessing them ;-> (just joking...) The data maintained in Step 2 doesn't need to be spatially indexed yet (it's enough, e.g., for KML export) as long as no geospatial queries are needed.
(I had a postgresql install with the tiger data in my home directory, but I removed it after the loss of text access killed all my spatial wiki plans. :( )
Geospatial data types and related queries are supported in MySQL with SPATIAL indexes on MyISAM tables, or else we use our own PostgreSQL types and PostGIS.
--Stefan
I think you may be looking in the wrong place; the dumps moved sometime in the past few months. dewiki dumps from March are linked from: http://download.wikimedia.org/dewiki/20060320/
You can check http://download.wikimedia.org for periodic updates.
-B.
On March 24, 2006 2:57 AM Christopher Beland wrote:
I think you may be looking in the wrong place; the dumps moved sometime in the past few months. dewiki dumps from March are linked from: http://download.wikimedia.org/dewiki/20060320/
You can check http://download.wikimedia.org for periodic updates.
Thank you for the kind hint. I'm new here, but I assume you are all aware of what it means for us to have to read in the dump instead of having programmatic access to dewiki? It means setting up a rather non-repeatable process (unstable pathnames, MediaWiki versions, etc.) which has a processing time of more than 30 hours once it gets running...
On March 23, 2006 10:09 PM Gregory Maxwell wrote:
The compression really isn't a problem, if the text access is ever restored I will help you write code to read the compressed data... it is easy.
Who is responsible for restoring text access? When is this estimated to happen?
-- Stefan
On 3/24/06, Stefan F. Keller sfkeller@hsr.ch wrote:
Thank you for the kind hint. I'm new here, but I assume you are all aware of what it means for us to have to read in the dump instead of having programmatic access to dewiki? It means setting up a rather non-repeatable process (unstable pathnames, MediaWiki versions, etc.) which has a processing time of more than 30 hours once it gets running...
You write software which scans the dump directly. Don't import it into MediaWiki. If you're looking for particular data, this works fairly well.
On March 23, 2006 10:09 PM Gregory Maxwell wrote:
The compression really isn't a problem, if the text access is ever restored I will help you write code to read the compressed data... it is easy.
Who is responsible for restoring text access? When is this estimated to happen?
Your guess is as good as mine: I asked (posted to both this list and the developers list) a month ago and didn't even receive a reply.
If someone is in charge here at all, then they are asleep at the wheel.
I've yet to even hear an update on why it's gone; the most recent excuse I've seen thrown out is that MySQL can't replicate from multiple masters... but that should be a non-issue: we could run multiple instances of MySQL on separate ports.
Clearly, given I was under the impression Kate was in charge, someone needs to step forward and state who *is*.
As has been pointed out several times before, part of the text access problem is that Wikimedia now stores text in external storage clusters, a feature of MediaWiki written for that purpose. And Zedler can't get at them.
A sensible, practical suggestion is needed, and soon. And administrators to answer the queries would be helpful; after all, we could all have it pegged down wrong.
Rob Church
On 3/25/06, Rob Church robchur@gmail.com wrote:
Clearly, given I was under the impression Kate was in charge, someone needs to step forward and state who *is*.
As has been pointed out several times before, part of the text access problem is that Wikimedia now stores text in external storage clusters, a feature of MediaWiki written for that purpose. And Zedler can't get at them.
A sensible, practical suggestion is needed, and soon. And administrators to answer the queries would be helpful; after all, we could all have it pegged down wrong.
At first we lost it because of a disk failure. I'd gone dark for a few weeks and when I came back the cause had been changed to moving the text to external DBs.
I've asked, but never received answers, as to why we can't replicate these external stores as well.
I'm pretty much to the point where I've abandoned my toolserver account, and when it expires April first I won't be asking for it to be extended ... I don't even know who I'd ask now.
Because of the unresponsiveness on these matters I have, in the past, inquired about receiving an OAI feed... an advantage of using a live update feed rather than replication is that I could run an analysis database on PostgreSQL or Oracle, both of which are far better suited to this use than MySQL. I wouldn't have any problem making a system with more resources than toolserver available for public use... But my request there has also been ignored.
I'm more than a little tired of wasting my time trying to contribute to a project where getting even the most basic assistance from the powers that be is effectively impossible. I've never asked anyone to do anything, except the most minimal set of tasks that I lack the authority to perform myself. ...Sigh...
zh-yue.wikipedia.org was created recently... is it available for replication? If so, could its data start to be replicated, and it be added to the toolserver.wiki table, etc? If it's not available for replication, could we get a quick response saying so?
Thanks, -Interiot
Gregory,
I'm new on the toolserver and on this list. Just wanted to drop a sign of life...
On Sunday, March 26, 2006 1:39 AM Gregory Maxwell wrote:
At first we lost it because of a disk failure. I'd gone dark for a few weeks and when I came back the cause had been changed to moving the text to external DBs.
I've asked, but never received answers, as to why we can't replicate these external stores as well.
I'm pretty much to the point where I've abandoned my toolserver account, and when it expires April first I won't be asking for it to be extended ... I don't even know who I'd ask now.
I've had some email contact with all the German responders to this thread (i.e. Daniel, Jakob, Christian). They told me the admins are there but quite busy, especially superadmin Kate.
And Daniel Baur (DaB) and Daniel Kinzler (Duesentrieb) are working on a solution (WikiProxy?). They have been really helpful but they didn't tell me when it's expected to be completed earliest :-<
[...] I wouldn't have any problem making a system with more resources than toolserver available for public use... But my request there has also been ignored.
That's my dilemma too.
I'm more than a little tired of wasting my time trying to contribute to a project where getting even the most basic assistance from the powers that be is effectively impossible. I've never asked anyone to do anything, except the most minimal set of tasks that I lack the authority to perform myself. ...Sigh...
That nobody - especially the superadmin - is able to drop a line here contrasts with the sponsorship money de.Wikipedia obviously gets. It seems to be part of the gap between claim and reality at Wikipedia.
I'm giving the colleagues here _one_ week's time to solve the technical problem of mirroring. To my humble point of view this is still, in the end, an organisational problem...
-- Stefan
Hello Stefan
And Daniel Baur (DaB) and Daniel Kinzler (Duesentrieb) are working on a solution (WikiProxy?). They have been really helpful but they didn't tell me when it's expected to be completed earliest :-<
There are two problems: replication of article text is broken, because it now uses "external storage". WikiProxy has existed for quite some time now and resolves this problem, though not very efficiently. It's good enough for fetching a couple of hundred pages, but should not be used to process *everything* - for now, we are stuck with XML dumps for that.
The second problem is that *all* replication is broken for databases at the Seoul cluster (esp. the Japanese and Chinese wikis). DaB is working on a solution, which he told me will go into a first round of testing in the next few days.
[...] I wouldn't have any problem making a system with more resources than toolserver available for public use... But my request there has also been ignored.
That's my dilemma too.
Uh, let me get that straight... you (GMaxwell) could donate a box? That would be great! Please talk to DaB about this, he's our contact to the e.V. My idea would be to have separate boxes for a) public, web-based tools, and b) another one for running massive queries on.
That nobody - especially the superadmin - is able to drop a line here contrasts with the sponsorship money de.Wikipedia obviously gets. It seems to be part of the gap between claim and reality at Wikipedia.
I agree that this is frustrating, and I too have the impression that no one really feels responsible for resolving the problems that exist. I hope we can improve communication in the future.
I'm giving the colleagues here _one_ week's time to solve the technical problem of mirroring. To my humble point of view this is still, in the end, an organisational problem...
Uh, what?! I'm sympathetic to your complaints in general, but WTF?! You are *giving* us one week? Or what? You know, there's no *right* to be able to use the toolserver, or to have it available at all. If you don't like it, help to fix it, or go away.
Btw: both major issues I described above are based on technical problems. Yes, they can be overcome, but it's not simple. It takes time and effort, which someone will have to donate. How about you?
Regards, Daniel
On 3/28/06, Daniel Kinzler daniel@brightbyte.de wrote:
[...] I wouldn't have any problem making a system with more resources than toolserver available for public use... But my request there has also been ignored.
That's my dilemma too.
Uh, let me get that straight... you (GMaxwell) could donate a box? That would be great! Please talk to DaB about this, he's our contact to the e.V. My idea would be to have separate boxes for a) public, web-based tools, and b) another one for running massive queries on.
It wouldn't at all be a problem but I don't want to just end up with more of what we currently have... where simple requests just go unanswered and usability falls through the floor.
I like the idea of two toolserver systems... perhaps we should call them toolserver lab and toolserver production. Toolserver lab would be for massive queries and more development stuff, while the web interfaces (which tend to be replag-sensitive) run on the front-facing box.
This is worth more discussion I think.
Gregory Maxwell wrote:
[...]
Sounds like a great idea to me.
-- Stephanie S. Encrypted Email Preferred OpenPGP key ID: C9774A04 http://tinyurl.com/8wxzb
On March 28, 2006 3:20 PM Daniel Kinzler wrote:
[...] would be great! Please talk to DaB about this, he's our contact to the e.V. My idea would be to have separate boxes for a) public, web-based tools, and b) another one for running massive queries on.
Did you say web-based tools like Kvaleberg's (still external) service? And massive queries like our coordinate parser? That sounds interesting!
[...]
I'm giving the colleagues here _one_ week's time to solve the technical problem of mirroring. To my humble point of view this is still, in the end, an organisational problem...
Uh, what?! I'm sympathetic to your complaints in general, but WTF?! You are *giving* us one week? Or what? You know, there's no *right* to be able to use the toolserver, or to have it available at all. If you don't like it, help to fix it, or go away.
OK; that answer came fast... Sorry, I should have written "I'm giving _me_ one week's time..." simply because the team project has to come to an end (unless the e.V. itself sponsors programmer teams :->).
Btw: both major issues I described above are based on technical problems. Yes, they can be overcome, but it's not simple. It takes time and effort, which someone will have to donate. How about you?
Believe me, what we invested in cash in this Wikipoint-db is worth several boxes - and don't even mention the time. Our know-how lies in geoinformation processing, so I thought that is where our contribution would be most efficient.
I'm aware that this is a technical problem - you actually made me curious about replication - but to me it still also seems to be an organisational issue, because there aren't more admins to help you two out. If Wikipedia wants to become mature, and if I were the e.V., I would put cracks like you on the payroll.
-- Stefan
On 3/28/06, Daniel Kinzler daniel@brightbyte.de wrote:
Uh, what?! I'm sympathetic to your complaints in general, but WTF?! You are *giving* us one week? Or what? You know, there's no *right* to be able to use the toolserver, or to have it available at all. If you don't like it, help to fix it, or go away. Btw: both major issues I described above are based on technical problems. Yes, they can be overcome, but it's not simple. It takes time and effort, which someone will have to donate. How about you?
I didn't respond to this bit at first, but I've reconsidered.
The above is a load of bullshit.
What 'technical issue' prevented anyone from responding *at all* to my question to the list about text access on March 6th (http://mail.wikipedia.org/mailman/private/toolserver-l/2006-March/000143.htm...) almost a month ago? Or the numerous times I've inquired on IRC before then?
Yes, there are things that need to be done, but we have competent and interested folks offering to help and they are ignored. So don't call this a technical issue.
You are correct when you say "There's no *right* to be able to use the toolserver", but you must understand why some of us may be a little irate when we've invested so much time working on things only to have our work effectively sabotaged because no one cares enough to even reply to questions on status or offers to help.
Had the wikimedia developers not ignored my request for an OAI link some months ago, I would already be offering an alternative service *myself*.
So don't call it a lack of help or a technical issue, because it simply is not. This is purely an organizational problem: We are failing to empower the people who are able and willing to do the work.
Hello Gregory
Just to avoid any misunderstanding: I agree with pretty much all you said. There *are* massive organizational problems, and not getting a response to a request is very frustrating - it has happened to me too. My comment you responded to was aimed at Stefan's "ultimatum" - which was a misunderstanding, as it turns out.
On the other hand, there are in fact technical issues. For instance, as far as I know, the OAI interface would have to be extended to be usable for full text replication (not sure about the details, but I vaguely remember Kate telling me about it). To Mediawiki developers, the primary goal is to support Wikipedia, not the toolserver. It's often annoying, but that is their priority. "Official" requests from the e.V. might help, though.
In any case: we should try to resolve the organizational problems, and work on ways to get around the technical issues. I know, it's frustrating to be stalled because "da man" is currently busy doing something else, but that's how it is with volunteer projects. Flying into a rage will not help. Taking load off the people we want to do things might.
Regards, Daniel
On 3/28/06, Daniel Kinzler daniel@brightbyte.de wrote:
Hello Gregory
Just to avoid any misunderstanding: I agree with pretty much all you said. There *are* massive organizational problems, and not getting a response to a request is very frustrating - it has happened to me too. My comment you responded to was aimed at Stefan's "ultimatum" - which was a misunderstanding, as it turns out.
Ah, understood. I thought you were making more of a general complaint.
On the other hand, there are in fact technical issues. For instance, as far as I know, the OAI interface would have to be extended to be usable for full text replication (not sure about the details, but I vaguely remember Kate telling me about it). To Mediawiki developers, the primary goal is to support Wikipedia, not the toolserver. It's often annoying, but that is their priority. "Official" requests from the e.V. might help, though.
OAI is expressly for text replication, but it doesn't get old revisions.
It would be easy enough to start a database with an old dump, use OAI to feed in new changes, and then cook up a little glue to sync up the missed revisions.
Since we're using mysql replication there is no need for OAI, although I'd inquired about it in the past because I've had an interest in running a live analysis database on something other than MySQL since mysql has fairly poor performance for such uses. (Although mysql 5 did improve things a fair bit.)
In any case: we should try to resolve the organizational problems, and work on ways to get around the technical issues. I know, it's frustrating to be stalled because "da man" is currently busy doing something else, but that's how it is with volunteer projects. Flying into a rage will not help. Taking load off the people we want to do things might.
Sounds good to me, though I hope no one has actually been in a rage over it... :) I know that I'm personally at the point where I won't be wasting any more time until I feel confident that my work will not be lost when toolserver has a fault and offers to fix it are ignored.
On 3/23/06, Christian Thiele APPER@apper.de wrote:
The best thing would be if everyone used the script itself... I don't know which language it is written in, but access should be possible from every other language; maybe it could be published as a PHP class to access the database... Remember: please publish all your code under a free licence and make it accessible for everyone ;).
A PHP class would be a long way from useful to everyone. :)
This is easy enough to code anyway, but please don't assume all toolserver users are willing to use PHP.
(If we ever get real text access back: I've implemented Python code to deal with the quasi-proprietary PHP compression used for article text... the same dance would apply to all other non-PHP languages.)
Hi Daniel !
Great. I will stop running the code I posted yesterday and will use file_get_contents (yourtool.php) for the IWLC.
You wrote that you cache every revision. I only need the LAST version, so how should I modify the URL? Is http://tools.wikimedia.de/~daniel/foo/WikiProxy.php?wiki=de&title=Haus the right style to get a (perhaps cached) last version?
Dirk
Hi Dirk, Hi Stefan
FlaBot wrote:
is http://tools.wikimedia.de/~daniel/foo/WikiProxy.php?wiki=de&title=Haus the right style to get a (perhaps cached) last version?
Yes. If you do not request a specific revision, you will get the latest (according to the toolserver DB). If the toolserver is lagged and you request the latest version, you may not see the very newest update, but what you get will be consistent with the toolserver DB.
Stefan wrote:
Wait a moment: we invested quite some effort to extract useful information (georeferenced articles) out of dewiki in order to serve other services. We have been advised, and until now were convinced, that the toolserver is the best platform to do this.
It is. But there are technical difficulties with providing full text in real time. Currently, the only *efficient* way to look at a lot of articles is to use a dump.
Then we got problems with a large amount of unreadable (compressed) articles...
Compression is not a big problem, external storage is. Most of the new versions are not available on the toolserver at all.
And now you close down access even for dewiki toolserver users because of non-technical reasons? Please tell me, that I'm wrong!
You are wrong.
The idea is to not provide wiki *content* to the *public* from the toolserver (which WikiProxy currently does). The toolserver is run by the German e.V., and the concern is that they might become liable for the content if it's served from the toolserver. I don't really believe that this could be a problem (there's a special clause for caches and proxies in German law), but they run the thing, and I can comply with their wishes without much pain.
When I "lock down" my WikiProxy thing, nothing will change for any script running on the toolserver, whether it uses WikiProxy or not; Only if you want to use WikiProxy from the outside, you would need to get a token.
-- Daniel
On Thu Mar 23 11:41:49 UTC 2006 Rob Church wrote:
I ask this because I will read all articles in german wikipedia (at least at the first time I run my script) and that will bring a big performance problem.
Correct, it will. Please don't do that.
On March 23, 2006 6:55 PM Daniel wrote:
And now you close down access even for dewiki toolserver users because of non-technical reasons? Please tell me, that I'm wrong!
You are wrong.
The idea is to not provide wiki *content* to the *public* from the toolserver (which WikiProxy currently does). The toolserver is run by
[...]
Got it.
When I "lock down" my WikiProxy thing, nothing will change for any script running on the toolserver, whether it uses WikiProxy or not; Only if you want to use WikiProxy from the outside, you would need to get a token.
a) Are we now allowed - from a toolserver account - to iterate over german and english articles (first all, then only the new ones) - or not?
b) Is there a technical solution (in PHP, WikiProxy?) to solve our problem trying to access all pages - even those residing in external storage?
c) In order to mirror those pages on toolserver can perhaps Kate or Brion come to rescue?
-- Stefan
Hi all
a) Are we now allowed - from a toolserver account - to iterate over german and english articles (first all, then only the new ones) - or not?
In theory yes, in practice no. As this would currently mean pulling every single article via HTTP, it is discouraged because it creates a lot of load. Doing it through WikiProxy would mean that each revision is only loaded once, which makes this a bit better. But it's still slow (more than 1 second per article).
Currently, the best way to bulk-process article text is to read from an XML dump. You can adapt the existing importers to fit your purpose; code is available in PHP, Java and C#, I believe.
b) Is there a technical solution (in PHP, WikiProxy?) to solve our problem trying to access all pages - even those residing in external storage?
WikiProxy solves the problem of accessing external storage, for any page you want. It does not solve it very efficiently, so it should not be used to access *all* pages in a run.
c) In order to mirror those pages on toolserver can perhaps Kate or Brion come to rescue?
Again, in theory, yes. In practice, both are quite busy, maybe we should try asking someone else (like, I don't know... JeLuF, perhaps?). I imagine this would involve setting up a second mysql server instance, and replication for that. There are probably some other tricky things to take care of. Perhaps we should officially request technical help with this from the e.V. I have already talked to elian about it.
On a slightly related note: we still do not get updates for any data on the Asian cluster (the databases we have are stuck in October). Apparently, it would be possible to resolve this, but it's tricky. The *real* solution would be to have multi-master replication, which (I am told) is expected to be supported by MySQL 5.2.
Regards, -- Daniel
Hi Daniel, thanks for your answer - you wrote:
a) Are we now allowed - from a toolserver account - to iterate over german and english articles (first all, then only the new ones) - or not?
In theory yes, in practice no. As this would currently mean pulling every single article via HTTP, it is discouraged because it creates a lot of load. Doing it through WikiProxy would mean that each revision is only loaded once, which makes this a bit better. But it's still slow (more than 1 second per article).
Where can I find the token? I'd like to process some 10 to 1000 articles based on the templates that are used in them.
Currently, the best way to bulk-process article text is to read from an XML dump. You can adapt the existing importers to fit your purpose; code is available in PHP, Java and C#, I believe.
Well, I think this means that Stefan's team has to recode a lot. Pulling the titles and texts out of the XML dump is easy, but you only get a new dump every 1 or 2 months. On the other hand XML is more robust, while the database structure will change with every MediaWiki version - for instance, I was not aware of the external text before.
b) Is there a technical solution (in PHP, WikiProxy?) to solve our problem trying to access all pages - even those residing in external storage?
WikiProxy solves the problem of accessing external storage, for any page you want. It does not solve it very efficiently, so it should not be used to access *all* pages in a run.
c) In order to mirror those pages on toolserver can perhaps Kate or Brion come to rescue?
Again, in theory, yes. In practice, both are quite busy, maybe we should try asking someone else (like, I don't know... JeLuF, perhaps?). I imagine this would involve setting up a second mysql server instance, and replication for that. There are probably some other tricky things to take care of. Perhaps we should officially request technical help with this from the e.V. I have already talked to elian about it.
What do you mean? The e.V. can support with money and fame but it's pretty unexperienced in setting up mysql servers ;-)
On a slightly related note: we still do not get updates for any data on the Asian cluster (the databases we have are stuck in October). Apparently, it would be possible to resolve this, but it's tricky. The *real* solution would be to have multi-master replication, which (I am told) is expected to be supported by MySQL 5.2.
Sounds like no definite solution before MySQL 5.2.
Greetings, Jakob
Hello again
Where can I find the token? I'd like to process some 10 to 1000 articles based on the templates that are used in them.
Right now, no token is necessary to use WikiProxy - the "lock" will become active when I update my tools next time. Then, you can get an access token by asking me :)
Note that you will *never* need a token to access WikiProxy locally from the toolserver. The IPs are whitelisted.
Well, I think this means that Stefan's team has to recode a lot. Pulling the titles and texts out of the XML dump is easy, but you only get a new dump every 1 or 2 months. On the other hand XML is more robust, while the database structure will change with every MediaWiki version - for instance, I was not aware of the external text before.
For the analysis of large volumes of text, doing it "live" isn't really an option anyway, I think. And being able to handle XML dumps is a good idea anyway :)
Nonetheless, having the full text available on the toolserver *would* be nice. Btw: if that ever happens, I will switch WikiProxy to use the data on the toolserver, so you can keep using it.
What do you mean? The e.V. can support with money and fame but it's pretty unexperienced in setting up mysql servers ;-)
But perhaps they can help coordinate this. All we can do is whine on this mailing list... which has led to nothing so far.
On a slightly related note: we still do not get updates for any data on the Asian cluster (the databases we have are stuck in October).
Sounds like no definite solution before MySQL 5.2.
DaB is trying to get an external replication agent working. I hope this gets finished soon. Maybe someone familiar with the internals of MySQL could give him a hand?
-- Daniel
On Saturday, March 25, 2006 1:48 PM Daniel wrote:
What do you mean? The e.V. can support with money and fame but it's pretty unexperienced in setting up mysql servers ;-)
[...]
Sounds like no definite solution before MySQL 5.2.
To me this sounds like it's time to migrate to PGCluster... :-o
-- Stefan
On Saturday, March 25, 2006 1:48 PM Daniel wrote:
Right now, no token is necessary to use WikiProxy - the "lock" will become active when I update my tools next time. Then, you can get an access token by asking me :)
Note that you will *never* need a token to access WikiProxy locally from the toolserver. The IPs are whitelisted.
Thank you and Jakob!
Well, I think this means that Stefan's team has to recode a lot. Pulling the titles and texts out of the XML dump is easy, but you only get a new dump every 1 or 2 months. On the other hand XML is more robust, while the
[...]
For the analysis of large volumes of texts doing it "live" isn't really an option anyway, I think. And being able to handle XML dumps is a good idea anyway :)
If you assume that one has to repeat this process regularly, you are right. In our case we only need to run it _once_ (OK, I admit: twice, because of some testing). After that we try to visit only those articles which have changed in, say, the last one or several days. We would even bear the lowest priority while our process runs.
On the other hand: a dump could serve for this first 'full access', but only if it's a recent one... (because we'll then try to iterate only over the delta since the timestamp of the dump).
And: pulling text out of the XML dump is really not that easy; it needs lots of additional code (dumps from several tables, re-indexing, etc.) compared to online access. And as far as I'm aware it's not repeatable so far: either the path/filename of the most recent dewiki dump needs to be constant, or the online request for the XML dump should tell us its timestamp.
-- Stefan
Currently, the best way to bulk-process article text is to read from an XML dump. You can adapt the existing importers to fit your purpose; code is available in PHP, Java and C#, I believe.
Well, I think this means that Stefan's team has to recode a lot. Pulling the titles and texts out of the XML dump is easy, but you only get a new dump every 1 or 2 months. On the other hand XML is more robust, while the database structure will change with every MediaWiki version - for instance, I was not aware of the external text before.
XML dumps should be handled by the wiki. Not only the monthly dumps, but also Special:Export, which uses the same format. Queries done through it are supposed to be better for server load, as it only needs one request to get many articles.
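As a sketch, fetching the current text of several pages in one request could look like this (the 'pages' and 'curonly' parameter names are the ones used by the Special:Export form; worth double-checking on the target wiki, and the titles are just examples):

    $titles = array('Berlin', 'Hamburg', 'Bremen');
    $post = http_build_query(array('pages' => implode("\n", $titles), 'curonly' => 1));
    $ctx = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $post,
    )));
    // one request returns one XML document containing all requested pages
    $xml = file_get_contents('http://de.wikipedia.org/wiki/Special:Export', false, $ctx);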
Well, you'd also need some kind of guessing about which articles will be queried next in order to optimize it. Or you could fetch the requested article plus the next X pages in the DB that need an HTTP query.
Leo, you should also look in that direction, as it is easier for the programmer to know the total number of articles to be queried than to rely on the fetching layer to guess at optimizations.
Maybe you could have another parameter on the WikiProxy for the articles I want too, to make the WikiProxy aware of them? The most accurate way would be to have the layer act asynchronously, so it would accept a query and not actually perform it over HTTP unless a) a parameter 'notwait' is set, b) the query queue is X long, or c) it is Y seconds old (a wait timeout). Then it resolves all the queries at the same time. However, it makes the client part more difficult, as client programs tend to use an ask-process-ask-process loop.
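Roughly, such a batching layer could look like this (everything here is hypothetical - 'notwait' is the parameter suggested above, and fetch_many() stands in for one bulk request, e.g. via Special:Export):

    class BatchedFetcher {
        var $queue  = array();
        var $maxLen = 20;      // "X": flush when this many titles are queued
        var $maxAge = 5;       // "Y": flush when the oldest queued title is this old (seconds)
        var $oldest = null;

        function request($title, $notwait = false) {
            $this->queue[] = $title;
            if ($this->oldest === null) {
                $this->oldest = time();
            }
            if ($notwait || count($this->queue) >= $this->maxLen
                || time() - $this->oldest >= $this->maxAge) {
                return $this->flush();           // resolve everything queued so far
            }
            return null;                         // text arrives with a later flush
        }

        function flush() {
            $titles = $this->queue;
            $this->queue  = array();
            $this->oldest = null;
            return fetch_many($titles);          // hypothetical bulk fetch helper
        }
    }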