Hello, all. I'm from ru-wiki, one of the active members of the Connectivity project, and I was very concerned to learn that Golem can no longer run because of the new limitation on the Toolserver. Golem's data is a key part of the Connectivity project, which works on improving Wikipedia's quality. The project is active mostly in the Russian and Ukrainian editions of WP, but Golem also collects very useful information for every other language except English, which is still too large to analyse.

The project's code is being improved continuously. For example, two years ago, when ruwiki had about 250k articles and the project had only a few tools, analysing ruwiki took about 2 hours; now, with 500k articles and several times as many connectivity tools, the analysis takes about 1 hour 40 minutes. The improvement could go faster: there is only one programmer in the project at the moment, Mashiah, and anybody who wants to help him improve the code is free to join. Our project welcomes any help from programmers.

We have noticed that the number of isolated articles is directly related to authors' awareness that their articles lack incoming links. At certain periods, for example during the Toolserver problems in February 2009, we were unable to obtain timely data. During such periods the number of isolated articles usually grows, and the growth only gradually turns into decline once Golem starts working again. In other words, any idle period for Golem leads to a deterioration in article quality.

The code will be improved in any case, sooner or later, but we want to try every way of keeping Golem running during the optimization process. Can a hardware upgrade resolve this problem? And if so, could you please estimate the models and cost of the required equipment? Please help us to help Wikipedia.
to obtain timely data. During such periods the number of isolated articles usually grows, and the growth only gradually turns into decline once Golem starts working again. In other words, any idle period for Golem
Just out of curiosity. How exactly does Golem help here? It is quite a leap from detecting isolated article clusters to actually linking them. Can you please describe the process by which Golem helps authors?
On Wed, Mar 31, 2010 at 7:36 PM, Daniel Schwen lists@schwen.de wrote:
to obtain timely data. During such periods the number of isolated articles usually grows, and the growth only gradually turns into decline once Golem starts working again. In other words, any idle period for Golem
Just out of curiosity. How exactly does Golem help here? It is quite a leap from detecting isolated article clusters to actually linking them. Can you please describe the process by which Golem helps authors?
Besides identifying problems, Golem also helps to solve them. The site ([[tools:~lvova]]) allows viewing the list of isolated articles in nearly every possible way: sorted by name, category, cluster type, or author. The hints for each isolated article provide suggestions as to where it could be linked from: the term may turn up in a simple search result, or a link that should lead to this article may at present erroneously point to a disambiguation page. Interwiki data also shows which articles in other language editions link to the corresponding article in their language, and, finally, a list of articles to create (or translate) is provided that would supply the long-sought link.
In order to easily inform the authors of isolated and dead-end articles about the problem and point them to a site that can help, we add the corresponding templates to articles in the Russian and Ukrainian Wikipedias. The "isolated article" template forms a link to a site page containing suggestions for linking that particular article.
On 10-03-31 11:36 AM, Daniel Schwen wrote:
It is quite a leap from detecting isolated article clusters to actually linking them.
It is also quite a leap to call linking these articles an important kind of quality. It is obviously useful, but the Wikipedias will not die without it - let's try not to be so hyperbolic. The sky is not falling.
- Mike
On Wed, Mar 31, 2010 at 9:13 AM, Анастасия Львова stasielvova@gmail.com wrote:
A major difference between isolated articles and the rest of articles with quality notes is that fixing the isolated requires editing other articles...
That is not an issue at all! You just extract the data from the edited articles and use it to update your graph, which you can keep in a permanent non-memory table.
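Roughly what I have in mind, as an untested sketch: keep a permanent pair table, say mylinks(pid1, pid2), in your user database (the table name and the cutoff timestamp below are made up; page, pagelinks and recentchanges are the replicated tables), and refresh only the pages edited since the last run:

-- drop the stale outgoing links of recently edited articles
DELETE FROM mylinks
 WHERE pid1 IN (SELECT rc_cur_id
                  FROM ruwiki_p.recentchanges
                 WHERE rc_namespace = 0
                   AND rc_timestamp > '20100330000000');

-- re-insert their current outgoing links
INSERT /* SLOW_OK */ INTO mylinks (pid1, pid2)
SELECT DISTINCT p1.page_id, p2.page_id
  FROM ruwiki_p.recentchanges rc
  JOIN ruwiki_p.page p1      ON p1.page_id = rc.rc_cur_id
                            AND p1.page_namespace = 0
  JOIN ruwiki_p.pagelinks pl ON pl.pl_from = p1.page_id
                            AND pl.pl_namespace = 0
  JOIN ruwiki_p.page p2      ON p2.page_title = pl.pl_title
                            AND p2.page_namespace = 0
 WHERE rc.rc_namespace = 0
   AND rc.rc_timestamp > '20100330000000';

Page deletions, moves and red links turning blue would still need a bit of extra handling, but the bulk of the work becomes proportional to the number of edits rather than to the size of the wiki.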
On Wed, Mar 31, 2010 at 10:06 AM, Mike.lifeguard mike.lifeguard@gmail.com wrote:
It is also quite a leap to call linking these articles an important kind of quality. It is obviously useful, but the Wikipedias will not die
Ok, I guess such posts by tool developers here always have to be taken with a grain of salt. Of course in my world _my_ tools are the most important ones for Wikipedia too ;-).
DS> Ok, I guess such posts by tool developers here always have to be taken
DS> with a grain of salt. Of course in my world _my_ tools are the most
DS> important ones for Wikipedia too ;-).
Hi, everybody! I'm sorry to be a newbie in Toolserver matters, but it seems there is some misunderstanding of the purpose of [[Wikipedia:WikiProject Orphanage]] and of Golem as the heart of that project.
The main purpose of the Orphanage project is to draw a wikipedian's attention to the isolation of the article he is working on. In that sense it seems incorrect to regard Golem as somebody's personal programming exercise or anything like that.
I cannot speak about the programming side because I am not a developer, but I do want to speak about Golem as a powerful tool for improving the Wikipedias. In the Ukrainian Wikipedia we have used it for a long time and value its work highly. Thanks to the Orphanage WikiProject we improved a lot of articles, not only providing them with internal links but also adding interconnecting information to the articles that were tagged by the Golem-controlled bot.
Properly speaking, I ask you to look for a suitable way of keeping the Golem tool available for this important Wikipedia purpose. Maybe it would make sense to run Golem only for a limited number of wikis (including uk-wiki) if that would lighten the load on the server's resources.
It is also quite a leap to call linking these articles an important kind of quality. It is obviously useful, but the Wikipedias will not die without it - let's try not to be so hyperbolic. The sky is not falling.
No, Wikipedia will not die, of course. But some active users may stop working on it. That is sad, isn't it?
mashiah
Alex Rave:
The improvement could go faster: there is only one programmer in the project at the moment, Mashiah, and anybody who wants to help him improve the code is free to join.
Could you (or someone) describe exactly what Golem does, and provide an example of its output format?
Can a hardware upgrade resolve this problem? And if so, could you please estimate the models and cost of the required equipment?
The cost to upgrade each database server by 4GB would be EUR 2'500 list, or EUR 440 per server.
- river.
Ok, I just had a really brief look at the output, but it strikes me as suboptimal to scan the entire database periodically rather than performing incremental updates, by just pulling the subset of articles that were edited since the last update and adjusting the linkage accordingly.
On Wed, Mar 31, 2010 at 7:57 PM, Daniel Schwen lists@schwen.de wrote:
Ok, I just had a really brief look at the output, but it strikes me as suboptimal to scan the entire database periodically rather than performing incremental updates, by just pulling the subset of articles that were edited since the last update and adjusting the linkage accordingly.
A major difference between isolated articles and the rest of articles with quality notes is that fixing the isolated requires editing other articles...
...This past weekend, at Konferencja Wikimedia Polska 2010, I gave a presentation. Here it is: http://medeyko.com/lvova/Doklad/
I would be happy if somebody were able to create a prototype of such an algorithm. I have not managed to, at least not so far.
mashiah
River Tarnell:
Could you (or someone) describe exactly what Golem does, and provide an example of its output format?
Okay, so based on lvova's presentation, it seems like it does this:
* Build a graph of Wikipedia articles in the main namespace, with wikilinks as edges. Since some pages are not reachable from other pages, this is actually N disconnected graphs.
* Remove all edges which refer to disambiguation pages, date pages, or lists.
* Remove the graph which contains the main page.
* Produce a list of all remaining graphs.
Is that roughly correct?
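If so, the last step could in principle even stay in SQL as iterative label propagation over a pair table links(pid1, pid2) of namespace-0 links (all names below are hypothetical, and I have no idea whether this is how Golem actually does it):

-- every article starts in its own component
CREATE TABLE comp (pid INT PRIMARY KEY, label INT, KEY (label));
INSERT INTO comp
SELECT page_id, page_id FROM ruwiki_p.page WHERE page_namespace = 0;

-- one propagation pass; links is assumed to hold each pair in both directions
CREATE TEMPORARY TABLE pass (pid INT PRIMARY KEY, label INT);
INSERT INTO pass
SELECT l.pid2, MIN(c.label)
  FROM links l JOIN comp c ON c.pid = l.pid1
 GROUP BY l.pid2;
UPDATE comp c JOIN pass p ON p.pid = c.pid
   SET c.label = LEAST(c.label, p.label);
DROP TEMPORARY TABLE pass;

-- repeat the pass until it changes no rows, then:
SELECT COUNT(DISTINCT label) FROM comp;   -- number of disconnected subgraphs

The number of passes grows with the longest path in the graph, though, so doing it outside the database is probably much faster anyway.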
- river.
- Build a graph of Wikipedia articles in the main namespace, with
wikilinks as edges. Since some pages are not reachable from other pages, this is actually N disconnected graphs.
- Remove all edges which refer to disambiguation pages, date pages, or
lists
- Remove the graph which contains the main page
- Produce a list of all remaining graphs.
Is that roughly correct?
It is a roughly correct description of one of Golem's processing stages.
It performs this operation for both the main namespace and the category tree; the latter is done to get information about cycles in the category tree.
It also monitors a number of categories of isolated articles of various types and outputs files containing the new isolates for each type. Those files are then used with AWB to add templates to the articles (which places them in the categories of isolated articles).
It also generates lists of pages containing links to disambiguation pages, and a list of the most-linked disambiguation pages.
For each isolated article it tries to find linking suggestions of three types:
1. if the isolated article is linked from a disambiguation page, and that disambiguation page is linked from another article, it suggests checking whether the link should go directly to the isolated article;
2. if the isolated article has an interwiki link, its interwiki partner is linked from another article in that language, and the linking article has a backward interwiki link to the mother language, it suggests improving the existing article in the mother language;
3. if, in the chain above, the linking article has no backward interwiki link to the mother language, it suggests translating it and adding the link.
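As a rough illustration, suggestions of the first type could be searched for with a query along these lines (isolated(pid) and disambig(pid) are hypothetical helper tables; page and pagelinks are the replicated ones, with ruwiki taken as the example):

-- articles B linking a disambiguation page D which in turn links an isolated
-- article A: the link B -> D may really have been meant to go B -> A
SELECT b.page_title  AS linking_article,
       dp.page_title AS via_disambig,
       a.page_title  AS isolated_article
  FROM isolated i
  JOIN ruwiki_p.page a        ON a.page_id = i.pid
  JOIN ruwiki_p.pagelinks pl1 ON pl1.pl_namespace = 0
                             AND pl1.pl_title = a.page_title      -- D -> A
  JOIN disambig d             ON d.pid = pl1.pl_from
  JOIN ruwiki_p.page dp       ON dp.page_id = d.pid
  JOIN ruwiki_p.pagelinks pl2 ON pl2.pl_namespace = 0
                             AND pl2.pl_title = dp.page_title     -- B -> D
  JOIN ruwiki_p.page b        ON b.page_id = pl2.pl_from
                             AND b.page_namespace = 0;
-- (filtering out articles B that are themselves disambiguations is omitted)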
All the suggestions are available on the Toolserver web page, and the templates mentioned above provide access from the article in the wiki to its suggestion list on the Toolserver.
It also creates a list of users by the number of isolated articles they created (on the web page).
It also creates a list of isolated articles by creation date (old isolates have most probably lost their creator's attention).
Depending on whether additional configuration exists, it also makes it possible to see which templates link to disambiguation pages.
mashiah
On Wed, Mar 31, 2010 at 9:27 PM, Mashiah Davidson mashiah.davidson@gmail.com wrote:
- Build a graph of Wikipedia articles in the main namespace, with
wikilinks as edges. Since some pages are not reachable from other pages, this is actually N disconnected graphs.
- Remove all edges which refer to disambiguation pages, date pages, or
lists
- Remove the graph which contains the main page
- Produce a list of all remaining graphs.
Is that roughly correct?
It is a roughly correct description of one of Golem's processing stages.
Well, I just did the experiment for German Wikipedia, using page_id pairs in a temporary, non-memory table. In my user database (on the same server as dewiki_p):
mysql> create temporary table delinks ( pid1 INTEGER , pid2 INTEGER ) ENGINE=InnoDB ;
mysql> INSERT /* SLOW_OK */ INTO delinks ( pid1,pid2 ) select p1.page_id AS pid1,p2.page_id AS pid2 from dewiki_p.page AS p1,dewiki_p.page AS p2,dewiki_p.pagelinks WHERE pl_title=p2.page_title and p2.page_namespace=0 and pl_namespace=0 and p1.page_id=pl_from and p1.page_namespace=0 ;
Query OK, 34964160 rows affected (32 min 59.29 sec)
Records: 34964160  Duplicates: 0  Warnings: 0
So, 35 million link pairs between namespace-0 pages, created in 33 minutes (~1 million links per minute). That's not too bad for our #2 wikipedia, and seems perfectly manageable.
Depending on your usage, now add indices and spices :-)
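For example, something like this (just a sketch, and the orphan check below ignores all of Golem's exclusion rules):

ALTER TABLE delinks ADD INDEX (pid1), ADD INDEX (pid2);

-- namespace-0 non-redirect pages that nobody links to = orphan candidates
SELECT p.page_id, p.page_title
  FROM dewiki_p.page p
  LEFT JOIN delinks d ON d.pid2 = p.page_id
 WHERE p.page_namespace = 0
   AND p.page_is_redirect = 0
   AND d.pid2 IS NULL;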
Magnus
So, 35 million link pairs between namespace-0 pages, created in 33 minutes (~1 million links per minute). That's not too bad for our #2 wikipedia, and seems perfectly manageable.
It is indeed not bad. Golem performed this operation in 14 minutes, plus some time for caching the pages table. On the other hand, that pages table cache is then reused for other purposes, such as throwing out redirects.
mashiah
Mashiah Davidson:
- Build a graph of Wikipedia articles in the main namespace, with
wikilinks as edges. Since some pages are not reachable from other pages, this is actually N disconnected graphs.
- Remove all edges which refer to disambiguation pages, date pages, or
lists
- Remove the graph which contains the main page
- Produce a list of all remaining graphs.
Is that roughly correct?
It is a roughly correct description of one of Golem's processing stages.
Okay, so for this part, my implementation can load all links from ruwiki, and analyse them into 528 disconnected subgraphs (most of which contain only a single isolated page) in 85 seconds. In total it uses about 200MB RAM on the system it runs on, and no MySQL tables.
Does this seem reasonable? The vast majority of the runtime is loading the data; the actual processing only takes about 10 seconds, so adding additional analysis should not increase the runtime significantly.
- river.
% time ./judah -c defs/ruwiki
NOTE: Loading configuration from defs/ruwiki
NOTE: Using ruwiki_p on ruwiki-p.db.toolserver.org
NOTE: Connected to database.
Running...
NOTE: Estimated page count: 921950 (memory = 7.03MB)
Fetching pages: 1167075 (actual memory used = 8.90MB), skipped 0, list=0, year=0, date=0, disambig=0
Sorting pages...
NOTE: Estimated link count: 39461689 (memory = 301.07MB)
Fetching links: 24552488 (actual memory used = 187.32MB), skipped 2512244 links to invalid pages
Finding all subgraphs...
NOTE: Identified 529 distinct subgraphs
./judah -c defs/ruwiki  70.99s user 1.60s system 85% cpu 1:25.12 total
Okay, so for this part, my implementation can load all links from ruwiki, and analyse them into 528 disconnected subgraphs (most of which contain only a single isolated page) in 85 seconds. In total it uses about 200MB RAM on the system it runs on, and no MySQL tables.
As far as I understand, there are many more than 528 disconnected subgraphs in ruwiki. Their number can be estimated with the help of this page: http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B...
Each orphan belongs to a distinct disconnected subgraph, and there are 18592 orphans there. Let's forget for a while about longer chains like _1_1 (an orphan linking another article) and just look for a lower bound on the number of subgraphs.
There are also 930 articles in isolated pairs (_2), which adds another 465 subgraphs to our lower bound, and so on. In total there should be no fewer than 19 000 (18592 + 465 = 19057 already).
Similar data for dewiki can be seen from http://toolserver.org/~mashiah/isolated/de.log:
21567 orphans + 954/2 pairs + etc. gives us no fewer than 22 000 distinct subgraphs.
At the moment I am not sure the difference is caused only by differing rules about which links are or are not taken into account, because the gap between the results looks too large.
On the other hand, if the problem can indeed be resolved in such a small amount of time, it seems great.
mashiah
Mashiah Davidson:
At the moment I am not sure the difference is caused only by differing rules about which links are or are not taken into account, because the gap between the results looks too large.
There are no rules for this at the moment; it will only find purely isolated clusters. Nonetheless, it already performs a traversal of the entire page tree, so removing some edges should not have a large effect on performance (it may even become faster).
Are the rules for detecting links which should be excluded documented anywhere?
- river.
Are the rules for detecting links which should be excluded documented anywhere?
The rules I use are a bit different for the various languages and depend on a per-language configuration, which is set up in a way similar to how disambiguation template names are defined at MediaWiki:Disambiguationspage.
The rules for connectivity analysis are simple, so I can just list them:
0. all redirects are thrown out, so that redirect pages are not present in the vertex set, and all links through them are added to the edge set;
1. disambiguation pages are excluded from the article set (everything marked by a template linked from MediaWiki:Disambiguationspage);
2. if a configuration exists (at the moment it does only for ru and uk), some other pages can be excluded;
3. all links from/to excluded pages are also excluded from the edge set;
4. links from chronological articles (this set is currently empty for all wikis except ru and uk) are excluded from the edge set;
5. if an article is transcluded by another article (which happens sometimes), it is treated as linked instead.
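As a rough sketch of rules 1 and 3 (the edge table links(pid1, pid2) is hypothetical; page, pagelinks and templatelinks are the replicated tables, with ruwiki as the example):

-- rule 1: pages marked by any template linked from [[MediaWiki:Disambiguationspage]]
CREATE TABLE disambig (pid INT PRIMARY KEY);
INSERT IGNORE INTO disambig
SELECT tl.tl_from
  FROM ruwiki_p.page mw
  JOIN ruwiki_p.pagelinks dl     ON dl.pl_from = mw.page_id
                                AND dl.pl_namespace = 10          -- the listed templates
  JOIN ruwiki_p.templatelinks tl ON tl.tl_namespace = 10
                                AND tl.tl_title = dl.pl_title
 WHERE mw.page_namespace = 8
   AND mw.page_title = 'Disambiguationspage';

-- rule 3: drop every edge that starts or ends on an excluded page
DELETE FROM links WHERE pid1 IN (SELECT pid FROM disambig);
DELETE FROM links WHERE pid2 IN (SELECT pid FROM disambig);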
If you are interested in the other configuration settings, they are described in Russian and English here: http://ru.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B...
mashiah
Fetching links: 24552488 (actual memory used = 187.32MB),
I've just watched a process that transfers text data from a MEMORY table on sql-s3 to my home directory, which is served by willow. 1 MB was stored to disk in 2.5 minutes, and 187 * 2.5 minutes is more than 7 hours. Does that mean the data load takes more than 7 hours for you, or is my result just caused by the slow interface of the mysql client called from a bash script?
mashiah
Mashiah Davidson:
I've just watched a process that transfers text data from a MEMORY table on sql-s3 to my home directory, which is served by willow.
1 MB was stored to disk in 2.5 minutes, and 187 * 2.5 minutes is more than 7 hours.
Does that mean the data load takes more than 7 hours for you
Loading the entire pagelinks table from ruwiki took roughly 60 seconds. (I didn't measure that step separately, but the entire run was 1:25, most of which was loading the data.)
or is my result just caused by the slow interface of the mysql client called from a bash script?
I'd have to see a more detailed description of how you tested it to answer that, but 1MB in 2.5 minutes seems far too slow.
For raw disk writes to /home:
% time mkfile 100m test
mkfile 100m test  0.00s user 0.28s system 12% cpu 2.155 total
i.e. 100MB written in 2.1 seconds, or 47MB/sec.
- river.
I'd have to see a more detailed description of how you tested it to answer that, but 1MB in 2.5 minutes seems far too slow.
First, a query on the SQL server outputs a row with a filename to an outer handler on stdout, and the handler switches itself into data-collection mode.
The SQL server then selects about a MB of data from a MEMORY table to stdout; the handler collects it until the end of the transmission and then writes the collected rows to the file chosen in the first step.
mashiah
Mashiah Davidson:
Could you (or someone) describe exactly what Golem does, and provide an example of its output format?
I think I can do this. Let's say next weekend.
Okay. Well, in the meantime, I find this an interesting problem, so I have started a new implementation using the algorithm I described earlier. I will only implement the backend (data generation), not the UI.
If that algorithm is not accurate, it shouldn't be hard to modify the program once the real algorithm is revealed.
- river.