Hello Andrea,
> I guess it could be very useful to use for importing those data into Wikidata.
Even if we remove the OAI-PMH API, we could still extract data from the Index: page serialization. It's a bit more difficult, but not much more (and definitely far less difficult than the entity-matching problem).
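As a minimal sketch of that fallback, here is how the template fields of an Index: page serialization could be parsed from its wikitext. The template name and field names below are illustrative only; the actual ProofreadPage index template fields vary per wiki.

```python
import re

# Sample wikitext of an Index: page (shape assumed from ProofreadPage's
# index template; real template and field names vary per wiki).
index_wikitext = """{{:MediaWiki:Proofreadpage_index_template
|Title=[[The Example Book]]
|Author=[[Author:Jane Doe|Jane Doe]]
|Year=1901
|Publisher=Example Press
}}"""

def parse_index_fields(wikitext):
    """Extract |field=value pairs from an Index: page serialization."""
    fields = {}
    for match in re.finditer(r"^\|(\w+)\s*=\s*(.*)$", wikitext, re.MULTILINE):
        fields[match.group(1)] = match.group(2).strip()
    return fields

fields = parse_index_fields(index_wikitext)
print(fields["Title"])  # still a raw wikitext link, to be resolved later
```

The values come back as raw wikitext (e.g. `[[The Example Book]]`), so link targets would still need a second resolution step before any Wikidata import.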
> The problem with that API is that it works only on Index pages, which are only a fraction of the "book" entities on Wikisource. Index pages are not linked in a structured way with their ns0 pages, and this is a problem for us.
It's possible to retrieve the ns0 pages that use a given Index: page via the <pages> tag: you just have to retrieve the list of transclusions of the Index: page, as if it were a regular template.
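For reference, a sketch of that transclusion lookup against the MediaWiki API, using `prop=transcludedin` (the standard way to list pages transcluding a given title). The French Wikisource endpoint and the Index: page title are placeholders.

```python
from urllib.parse import urlencode

def transclusion_query(index_title,
                       endpoint="https://fr.wikisource.org/w/api.php"):
    """Build the API URL listing ns0 pages that transclude an Index: page,
    exactly as if it were a regular template."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "transcludedin",
        "titles": index_title,
        "tinamespace": 0,   # restrict to main-namespace (ns0) pages
        "tilimit": "max",
    }
    return endpoint + "?" + urlencode(params)

url = transclusion_query("Index:Example.djvu")
print(url)
```

Fetching that URL returns, for each matched Index: page, the list of ns0 titles that transclude it, which is exactly the Index-to-work mapping discussed above.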
> Ideally, we would know when an Index page has only one ns0 page, and we would use the same set of data to create an entity (or more) in Wikidata.
Yes. What we could do is check whether the "Title" field of the Index page has only one link to a ns0 page and, if so, consider that the "one" ns0 page. Another possibility, when the header feature of the <pages> tag is used, is to retrieve the pages that use the automatic summary feature and, if there is only one, consider it the "one".
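That heuristic could be sketched roughly as follows, assuming we already have the list of ns0 titles linked from the "Title" field (or, as a fallback, the pages using the automatic summary feature); the helper and its inputs are hypothetical.

```python
def pick_root_page(ns0_links):
    """Return the single ns0 page an Index: page maps to, if unambiguous.

    ns0_links: main-namespace titles found in the index's "Title" field,
    or, as a fallback, the pages using the automatic summary feature
    of the <pages> tag.
    """
    if len(ns0_links) == 1:
        return ns0_links[0]   # the "one" ns0 page
    return None               # ambiguous: leave it for manual matching

print(pick_root_page(["The Example Book"]))
```

Ambiguous cases (multi-volume works, several editions sharing one scan) would be left out rather than guessed, which matches the cautious approach needed for a Wikidata import.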
> and I don't know if that uses your API.
I believe he doesn't, but we should definitely ask him whether it would be useful for his use case.
Thomas
> On 31 Dec 2016, at 12:58, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:
>
> Hi Thomas.
>
> I used, one year ago, the API: I downloaded the data from the Index pages, and I think that it would be good to have it while we still don't have Wikidata.
> I guess it could be very useful to use for importing those data into Wikidata.
>
> The problem with that API is that it works only on Index pages, which are only a fraction of the "book" entities on Wikisource. Index pages are not linked in a structured way with their ns0 pages, and this is a problem for us.
>
> Ideally, we would know when an Index page has only one ns0 page, and we would use the same set of data to create an entity (or more) in Wikidata.
>
> I know that Sam is trying to develop a similar tool:
>
> https://tools.wmflabs.org/ws-search/
> and I don't know if that uses your API.
>
> Aubrey
>
> On Fri, Dec 30, 2016 at 6:15 PM, Thomas PT <thomaspt(a)hotmail.fr> wrote:
> I definitely used the pageviews API. So I understand now why the count was 0. Sorry for the false info and thank you for your correction.
>
> But my proposal still stands as I do not know any actual user of the API.
>
> Thomas
>
> > On 30 Dec 2016, at 18:11, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
> >
> > Sorry for the double message.
> >
> > Thomas PT, 30/12/2016 17:31:
> >> According to the Wikimedia PageView statistic tool
> >
> > Did you literally use https://tools.wmflabs.org/pageviews , or have you asked for real request data? The pageviews API doesn't count requests to the OAI-PMH endpoint at all, because they have "content-type: text/xml" while text/html is required: https://meta.wikimedia.org/wiki/Research:Page_view#Definition
> >
> > Only people with access to https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#wmf.webrequest can extract data on how much it's used.
> >
> > Nemo
> >
> > _______________________________________________
> > Wikisource-l mailing list
> > Wikisource-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikisource-l