Non transcluded Page: from ns:0

Philippe Elie

28 Apr 2016

12:42 p.m.

Hi,

I added a new tool, https://tools.wmflabs.org/phetools/not_transcluded/, which lists every Index: that contains corrected or validated pages not transcluded from the main namespace; see the README.txt.

-- phe

Andrea Zanni

28 Apr

12:50 p.m.

Wow, this is fantastic Phe. It's really useful for running the "Match & split" when it's needed.

Andrea


Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l

Alex Brollo

1:55 p.m.

Very interesting.

Do you have any suggestions for finding the list of non-transcluded pages? I can imagine having a bot fetch the HTML of the ns0 main page and all of its subpages related to an Index page, then parsing it to get the list of existing page links. Is there a simpler strategy?

Alex


Philippe Elie

2:47 p.m.


If you have access to the database, the simplest way is to use the code of this tool, https://github.com/phil-el/phetools/blob/master/statistics/not_transcluded.p..., as the function not_transcluded() is nearly what you need. I'll probably show the list of non-transcluded pages in a future version, but this tool builds that list for every Index: on a wiki and the query takes a few minutes, so it's not handy for checking the transclusion status of a single index.

To get that list for only one index, it'll be easier to use the API: 1) get all links on the Index: page, filtered to the Page: namespace; 2) use the embeddedin API to get all transclusions from ns:0. The result of 1) minus the result of 2) is what you are looking for. You can do 1) in one request, and you can probably get the proofread status in the same request, since you are probably only interested in yellow or green pages that are not transcluded. 2) is perhaps possible in only one request; I don't remember. A tool like that would complement mine and could be very useful. I may provide a simpler API on toollabs to do that.
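
A minimal sketch of the two API steps above, in Python. It only builds the request parameters and computes the difference; the HTTP calls and result continuation are left out. The helper names are hypothetical, and the Page: namespace number is passed in because it differs between wikis. The parameters used (prop=links with plnamespace, list=embeddedin with eititle and einamespace) are standard MediaWiki API parameters, but this is not the code of the tool itself.

```python
def links_query_params(index_title, page_ns):
    """Step 1: links on the Index: page, limited to the Page: namespace."""
    return {
        "action": "query",
        "format": "json",
        "prop": "links",
        "titles": index_title,
        "plnamespace": page_ns,   # assumption: caller knows the wiki's Page: ns number
        "pllimit": "max",
    }


def embeddedin_query_params(page_title):
    """Step 2: is this Page: transcluded from ns:0? One hit is enough."""
    return {
        "action": "query",
        "format": "json",
        "list": "embeddedin",
        "eititle": page_title,
        "einamespace": 0,
        "eilimit": 1,
    }


def not_transcluded(index_links, transcluded):
    """Result of step 1 minus result of step 2, keeping the Index order."""
    transcluded = set(transcluded)
    return [title for title in index_links if title not in transcluded]
```

For example, with links ["Page:A.djvu/1", "Page:A.djvu/2"] from step 1 and only "Page:A.djvu/2" found by step 2, not_transcluded() returns ["Page:A.djvu/1"].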

-- phe

Andrea Zanni

3:39 p.m.

Thanks Phe also for the pointer at your GitHub page, I'll try to post issues directly there if needed :-)

Your tool and a bit of fiddling with transclusions got me thinking: sometimes works are really complex. You have multiple Indexes representing multiple texts, and at times you also have other versions/editions of the same work. This creates a mess, because all the relationships between Indexes and ns0 pages are made by humans, and it's not always easy to understand the "structure".

So, my question is: is it possible to "draw" some sort of graph/network between Indexes and the pages that are transcluded from them?

Maybe with a visual representation it would be easier to tame the chaos :-)

Aubrey


Mardetanha

7:48 p.m.

I don't see fawikisource on the page. Could you also update it for the Farsi Wikisource? Thanks in advance.

Mardetanha


Philippe Elie

10:03 p.m.


fa is included, but empty HTML files are not created. I'll change it to create the file even if no index meets the criteria.

-- Phe

Mardetanha

10:13 p.m.

Thanks

Mardetanha


billinghurst

29 Apr

11:14 a.m.

Note that we have the long-standing tool "checker" on toollabs, which generates the transclusion listing per work:

https://tools.wmflabs.org/checker

eg. https://tools.wmflabs.org/checker/?db=enwikisource_p&title=Index:Dream_d...

and enWS has been using it on its Index: ns pages for ages.

Regards, Billinghurst


Andrea Zanni

1:23 p.m.

Thanks for reminding us; it.source uses it too on the Index ns. But Phe's tool is different: it gives you the list of all the "not transcluded" books, so you don't have to check every book by hand to know.

Ideally, the tools should be merged into one, so an editor can check every work directly. Used together, the tools are pretty powerful.

Aubrey


Andrea Zanni

1:38 p.m.

It seems there is a problem. I thought a simple patch like this could work:

line 47:

    def format_html_line(domain, bookname, count):
        if domain == 'old':
            domain = 'mul'
        #bookname = unicode(bookname, 'utf-8')
        fmt = '<li><a href="//%s.wikisource.org/wiki/Index:%s">%s</a> %d <a href="//%s.wikisource.org/wiki/Index:%s">check pages</a></li>'
        # fmt has six placeholders, so pass exactly six values.
        result = fmt % (domain, urllib.quote(bookname), bookname, count,
                        domain, urllib.quote(bookname))
        return result

But the problem is that the transclusion checker wants the exact name of the page, so the translations of "Index" in the various languages give an error. This should be changed on the "checker" side.

Aubrey


Philippe Elie

3:19 p.m.


checker predates the canonical namespace names for Index and Page. I have the needed names on my side, so I added a link to checker for each index listed.

-- phe

Zdzislaw

3:08 p.m.

hello Phe,

The tool is very useful, but the generated links point "nowhere" (404) if the Pages and the Index have different names (common on pl.ws). See https://tools.wmflabs.org/phetools/not_transcluded/pl.html and Poezye_cz._2_(Antoni_Lange).djvu

where the pages Strona:Poezye_cz._2_(Antoni_Lange).djvu/001... belong to Indeks:Poezye_cz._2_(Antoni Lange) https://pl.wikisource.org/wiki/Indeks%3APoezye_cz._2_%28Antoni_Lange%29

So the tool would have to link to the correct name of the index (also for Checker).

regards,


Philippe Elie

3:48 p.m.


Yes, such indexes are not handled correctly; this is stated in the README.txt. Currently the index name is deduced from the page by stripping the "/*" part of the Page: name, which is very cheap. But this tool is already slow enough that I need to cache the result and generate these HTML pages only once per day. I'm unsure how to handle that sort of index without doing one request per non-transcluded Page: to get the Index: name. Perhaps I could check whether the index exists and, if not, retry after removing the extension?
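
A rough sketch of that idea in Python: strip the "/*" part first, and only if that index does not exist retry without the file extension. The function names, the extension list, and the exists() callback (which would wrap a database or API lookup) are all assumptions for illustration, not the tool's actual code.

```python
def index_name_from_page(page_title):
    """Cheap deduction: drop the trailing "/<page number>" part."""
    return page_title.rsplit("/", 1)[0]


def candidate_index_names(page_title, extensions=(".djvu", ".pdf")):
    """Index names to try, in order: as deduced, then without the extension."""
    base = index_name_from_page(page_title)
    candidates = [base]
    for ext in extensions:
        if base.lower().endswith(ext):
            candidates.append(base[:-len(ext)])
            break
    return candidates


def resolve_index(page_title, exists):
    """Return the first candidate for which exists(name) is true, else None."""
    for name in candidate_index_names(page_title):
        if exists(name):
            return name
    return None
```

With this approach the extra lookup is only paid for indexes whose cheap name fails, so the common case stays as fast as before.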

-- Phe

Jayanta Nath

3:55 p.m.

I am just checking bnws, here: https://tools.wmflabs.org/phetools/not_transcluded/bn.html. All "check pages" links give the error:

" Internal Server Error

The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application."


Philippe Elie

5:12 p.m.


I asked the maintainer of the checker tool to take a look at that.

-- phe

Zdzislaw

4:26 p.m.

Sorry, I had not read the README.txt carefully :))) There is no need to check whether the index exists and remove the extension; I can do it manually :))

thanks,


participants (7)

Alex Brollo
Andrea Zanni
billinghurst
Jayanta Nath
Mardetanha
Philippe Elie
Zdzislaw