What do we see as the next components for Wikisource?
What are our major hurdles for system development?
If we were offered development help, where do people think that we should be making use of that help? Is it incremental fixes, transactional changes, or are we wanting transformational changes: completely new features and new opportunities?
Regards, Billinghurst
Not sure, but I think improving what we already have is more of a priority (ePub/PDF export, perhaps on the fly; the Book/Page namespaces; better OCR; Wikidata integration?).
Plus, you can find some useful information at http://meta.wikimedia.org/wiki/Wikisource_Community_User_Group/Wikisource_su... ; it's a bit old but still relevant.
Cdlt, ~nicolas
In thinking further about this, I think one of our major hurdles in getting casual transcription is the formatting and template aspects. So is the migration to VisualEditor one of our major progression points?
Regards, Billinghurst
Yes it is, once again look at
http://meta.wikimedia.org/wiki/Wikisource_Community_User_Group/Wikisource_su... ;)
Cdlt, ~nicolas
On 11/23/2014 02:55 AM, Wiki Billinghurst wrote:
What do we see as the next components for Wikisource?
What are our major hurdles for system development?
If we were offered development help, where do people think that we should be making use of that help? Is it incremental fixes, transactional changes, or are we wanting transformational changes: completely new features and new opportunities?
Ten years ago, Wikipedia was already an established success, and we started to branch out into projects like Wikisource, Wikinews and whatnot. That was also when Google Book Search started, and when the Internet Archive got its current practices for book scanning (with the "Scribe" scanning stations) in place. Ten years earlier, in the mid-90s, the first large-scale book scanning projects appeared. In the two decades 1990-2010, several books were published on the future of digital libraries. But what has happened in the last decade? What is new, really? Has anything changed in Google Book Search or the Internet Archive in the five years 2010-2014? Yes, more books have been digitized, but are they presented or used differently?
I think a lot more can be done, e.g. algorithmic improvement of OCR engines. Wikisource hasn't looked into that, neither has the Internet Archive, and nobody knows much about what Google does internally. This isn't necessarily "wiki", so it's not clear that it's a task for WMF and its projects. Another thing could be "gamification" of proofreading or mark-up / categorization / analysis of scanned books.
As for new kinds of content, the digitization of entire newspapers is still a new area, where the National Library of Australia was a pioneer some years ago, but what has happened since then? Potentially, it could become a cross-over between Wikisource and Wikinews, where each event can be found on the same day in many different newspapers. How to link them together? The problem: if we get scanned images + OCR text of 10 different newspapers, 10 years, 10 pages each day, that is 365 × 10 × 10 × 10 = 365,000 large pages to proofread before we can do any serious analysis. How do we proofread so many pages in any reasonable time? We don't have enough volunteers for that.
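To make the scale concrete, here is a back-of-the-envelope sketch in Python; the volunteer figures are pure assumptions, only there to show the order of magnitude.

    # How long would proofreading this much material take?
    newspapers = 10
    years = 10
    pages_per_issue = 10
    total_pages = newspapers * years * 365 * pages_per_issue   # 365,000 pages

    volunteers = 100                  # assumed number of active proofreaders
    pages_per_volunteer_per_day = 5   # assumed sustained pace
    days_needed = total_pages / (volunteers * pages_per_volunteer_per_day)
    print(total_pages, "pages, about", round(days_needed), "days")   # 365000 pages, about 730 days

Even with a hundred steady proofreaders, that is roughly two years of work for just ten newspapers.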
Please keep up this good discussion :-) We have the Wikisource contest on it.source right now, so this mail is not going to be as long and detailed as I hoped.
I agree with Vigneron that the Survey report is a good start: having written it myself, I'm well aware that it's not perfect, and that neither the questions nor the methodology were bulletproof. Nonetheless, we tried hard, and many of the results are good and trustworthy.
I personally agree that VisualEditor integration with the Proofread Page extension is much needed: if you think about it, Wikisource is the right place for VE. We could enormously simplify the life of new proofreaders, as formatting on Wikisource is ten times more difficult than on Wikipedia. I'm sure it's one of the best things to do right now.
At the same time, I agree with Lars (who always has great insights) that we still need to make the big leap in digital libraries. For me, one of the things Wikisource offers that nobody else does is *hypertextuality*, and connections and integration with other projects such as Wikidata (hopefully) and Wikipedia. I agree with him that algorithmic learning on Wikisource is an amazing idea: just think about having a Tesseract instance for every Wikisource, where Tesseract learns from every page the community proofreads... In a few years, we could even think about telling our Tesseract to distinguish 12th-century Italian from 19th-century Italian... We could have amazing open-source OCR engines to give to the world.
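Just to sketch what "learning from every page the community proofreads" could mean in practice: something has to pair the page scans with the validated text as ground truth. A purely illustrative Python sketch follows; the input file, its structure and the tesstrain-style *.gt.txt convention are assumptions, and real LSTM training would still need line-level images rather than whole pages.

    import json
    import os

    # Hypothetical input: a dump of proofread pages, e.g.
    # [{"image": "scans/page_0042.png", "text": "the validated text ..."}, ...]
    with open("proofread_pages.json", encoding="utf-8") as f:
        pages = json.load(f)

    os.makedirs("ground_truth", exist_ok=True)
    for page in pages:
        base = os.path.splitext(os.path.basename(page["image"]))[0]
        # tesstrain pairs each training image with a .gt.txt file holding its
        # true text; these page-level pairs would still need to be split into
        # line images before feeding an actual training run.
        gt_path = os.path.join("ground_truth", base + ".gt.txt")
        with open(gt_path, "w", encoding="utf-8") as out:
            out.write(page["text"].strip() + "\n")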
Another great accomplishment could be *giving back proofread OCR* to GLAMs: think about libraries (or the Internet Archive!) giving us ancient texts, and us giving them back a perfect DjVu or PDF with the mapped text inside... I'm sure we could have many GLAMs coming to us then :-) Right now we can give them back almost nothing, apart from our HTML pages.
Aubrey
VE integration is important and could be very useful, but I'm not sure it's really that urgent for the Wikisources. In short: is VE really a priority? On a Wikisource page there is far less formatting than in a Wikipedia article (but touché: the little formatting there is on Wikisource can be a pain in the a**). VE still has some glitches and structural problems (my favourite: did you ever try to put a ref with a template inside?). Should we wait before adapting it to Wikisource, or should we start right now, knowing it's a long way off…?
A tool like the one Gallica (the website of the National Library of France) is testing seems more useful to me. You can try it here: https://ozalid.orange-labs.fr/ozviewer/
There's probably also something worth exploring in a tool like http://tools.wmflabs.org/dicompte/ (it compares the Wikisource and Wiktionary dumps and lists the words that appear in Wikisource but have no definition on Wiktionary), but working in real time and integrated into the edit interface.
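The core of such a tool is just a set difference; a rough Python sketch, where the two word-list files are assumptions (plain lists extracted beforehand from the respective dumps):

    def load_words(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    wikisource_words = load_words("wikisource_words.txt")
    wiktionary_entries = load_words("wiktionary_titles.txt")

    missing = sorted(wikisource_words - wiktionary_entries)
    print(len(missing), "words with no Wiktionary definition")
    for word in missing[:50]:
        print(word)

Doing the same thing in real time, inside the edit interface, is mostly a matter of keeping the Wiktionary title set in memory (or behind an API) and checking the words of the page being edited against it.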
Cdlt, ~nicolas
On 24 November 2014 at 13:51, Andrea Zanni zanni.andrea84@gmail.com wrote:
Another great accomplishment could be *giving back proofread OCR* to GLAMs: think about libraries (or the Internet Archive!) giving us ancient texts, and us giving them back a perfect DjVu or PDF with the mapped text inside... I'm sure we could have many GLAMs coming to us then :-) Right now we can give them back almost nothing, apart from our HTML pages.
This is exactly the kind of suggestion I have been looking for. Many cultural institutions are developing their own crowdsourced transcription projects. I think Wikisource can be a much more robust platform than these one-off projects, with a better-developed community, aggregating the transcription of texts from many institutions in a single place with a proven process.
At NARA, along with our own transcription program, we are also developing a writable API for submitting transcriptions to it, because we recognize that third-party platforms like Wikisource might be the best place for the actual transcribing to take place. As long as we can ingest that data back into our own dataset, that is.
How would I do that now? Wikisource pages are not structured data (though Wikimedia Commons image metadata will soon be!), so there is not a clear way to use the Wikisource API to extract just the relevant transcribed text on the page as a field. And on top of that, any text you do extract this way will be full of templates and other code that has no meaning outside of the context of Wikisource. I don't see a way to easily extract just the plain text that is meaningful and relevant (along with other fielded data, like what page or text it belongs to).
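To illustrate: the closest thing today is pulling the raw wikitext of a page through the standard MediaWiki API and cleaning it up yourself. A minimal Python sketch (the page title is a placeholder):

    import requests

    API = "https://en.wikisource.org/w/api.php"
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": "Page:Some_book.djvu/12",   # placeholder title
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    wikitext = page["revisions"][0]["*"]
    # What comes back is raw wikitext: the transcription is interleaved with
    # templates, running headers, hyphenation markers and so on; there is no
    # field that is "just the transcribed text".
    print(wikitext[:500])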
Wikisource as a "structured" repository is what we ask from the dawn of time :-) The problem, as usual, is that if things are left to volunteer developers thing will go slooooowly. I do think this is fundamental: an ideal Wikisource would ingest and understand many times metadata standards, and would give them back as well.
As for the Wikimedia API, I wrote this awful script: https://github.com/Aubreymcfato/ws_scraper Please come and make it better :-D
It just scrapes the data from the HTML (it is localized to it.source, but a quick glance at the HTML source of your own Wikisource should help you adapt it, especially if you use microformats) and puts it into a CSV. If you take the HTML you can also get the formatted text. (I also dream of a Wikisource that understands Markdown, but that's too far off :-)
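In spirit, it does something like the following; this is a simplified illustration, not the actual code in the repository, and the URL, the selectors and the CSV columns are assumptions that will differ per wiki.

    import csv

    import requests
    from bs4 import BeautifulSoup

    url = "https://it.wikisource.org/wiki/Qualche_testo"   # placeholder title
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Standard MediaWiki page skeleton: the heading and the content container.
    title = soup.find("h1", id="firstHeading").get_text(strip=True)
    content = soup.find("div", id="mw-content-text")
    text = content.get_text("\n", strip=True) if content else ""

    with open("pages.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "text"])
        writer.writerow([title, text])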
Aubrey
Awesome! I'll definitely give it a whirl.
You have a good point, though. One of the differences between Wikisource and most other platforms is that it is actually richly formatted. It's kind of a shame to strip all that formatting information out when extracting the transcriptions. (Though many destinations wouldn't know what to do with formatted text anyway.)
On Tue, Nov 25, 2014 at 6:34 PM, Dominic McDevitt-Parks mcdevitd@gmail.com wrote:
You have a good point, though. One of the differences between Wikisource and most other platforms is that it is actually richly formatted. It's kind of a shame to strip all that formatting information out when extracting the transcriptions. (Though many destinations wouldn't know what to do with formatted text anyway.)
I think this is a crucial issue. Many projects do give you the possibility to download a .txt, which is ok for digital preservation, but I challenge anyone to actually read a book in plain txt. :-) I believe accessibility means having good ebooks, accessible and readable on numerous devices. IMHO, what Tpt has done with his EPUB tool is remarkable: a nice, quick tool for generating fairly well formatted ebooks, allowing readers to actually read them on a Kindle or a Kobo (or a tablet). It also works both with Index pages and ns0 books. The problem is that the output is not perfectly formatted, of course, and the tool is not integrated within MediaWiki. When I put the link to the ebook converter directly in the Header template, the stats skyrocketed: in a few weeks we had thousands of downloads (see http://wsexport.wmflabs.org/tool/stat.php).
So, readability is one big issue. We are here to be read.
Structured formats are good for export, integration with different libraries, standardisation, and so on. That is fundamental, I think, for the development of the whole project. If we convince the WMF to put in some permanent staff time, many things could be achieved :-)
Aubrey
PS: the script is bad, I warn you, but that's what I've come up with so far. I hope to improve it in the next few weeks. If you can make it better, please do :-)
Lars is right, (too) little has changed in several years; so let me say that my opinion has not changed since 2009, when I wrote "Make Wikisource scale" (which I dared to link from https://meta.wikimedia.org/wiki/Role_of_Wikisource#footer ). The one and only question worth asking is: can Wikisource, as a concept, proofread a million books and involve half a million volunteers? Because IMHO it must.

When I think of this, I agree that OCR is the main issue. But it's not necessarily the one which worries me most, because tesseract is something living outside the wiki which can be improved even if the wiki has design issues. If we try really hard, we may face unsolvable integration problems in the OCR<->DjVU<->Wikisource food chain; but so far the issue is rather that we never tried seriously.[1]

What worries me most is something else: all the effort we spend making perfectly faithful layouts with fragile templates, which are worth NOTHING outside our wiki; all the effort we spend organising books scattered across pages, to form a structure that not even MediaWiki knows about,[2] let alone an ePub exporter,[3] an OAI-PMH handler[4] or a third-party user. I don't care if VisualEditor can make those templates easier to use; I care about things like making Proofread Page understand METS,[5] or perhaps making sure that what we're doing can end up in a DocBook.[6]

We might discover that these things only require small adjustments, or that they don't matter that much. Or we might discover that one of the tools linked by Vigneron (which I haven't managed to try yet) requires a fundamental shift. Either way, we need to reason about it to be confident we're on the right track, and/or maybe pioneer some new way of working in one subdomain. However, in 5 years I've yet to find ONE person that says, yes Nemo, you're right, Wikisource should be 10 or 50 times as big as Wikipedia, let's plan for that. Probably I'm wrong. :)
Nemo
[1] https://www.mediawiki.org/wiki/CAPTCHA
[2] Will it ever? https://meta.wikimedia.org/wiki/Book_management
[3] Despite the recently-trashed work by PediaPress, and all Tpt's awesomeness with WSexport.
[4] Though, https://www.mediawiki.org/wiki/Extension:Proofread_Page#OAI-PMH
[5] https://lists.wikimedia.org/pipermail/wikisource-l/2014-September/002081.htm...
[6] https://phabricator.wikimedia.org/T63047#679332
However, in 5 years I've yet to find ONE person that says, yes
Nemo, you're right, Wikisource should be 10 or 50 times as big as Wikipedia, let's plan for that. Probably I'm wrong. :)
+1. Count me in! It will be hard, and I'm afraid we might lose some good users in the process if we don't do things the right way, but the Wikisources should and need to be bigger (and easier to use and edit, more compatible and standards-compliant, and so on).
The real question is: how could and should we do it right? My 2 cents: maybe we could do another, more technical survey, writing down some proposals, drawing some mock-ups and asking the wikisorcerers.
Cdlt, ~nicolas
Hi, I strongly agree with everything, Nemo. I also remember hearing Sj, in an official Board Q&A, say explicitly that he foresaw Wikisource as bigger than Wikipedia!
But Wikisource is out of the strategy, we know that. We ask, keep asking and will continue to ask, but we are still outside the development and strategic planning. I don't have a solution for that which is different from what I (and others, of course) have done in these years: build a community, build consensus, focus energies and minds on important issues, gain credibility and interest.
For me, just having a skilled Wikisorceror like Billinghurst on board is a great step forward: at Wikimania we had several talks about Wikisource, and at the WS meeting there were almost 20 of us. Things are moving, but we really need to work together.
Aubrey
On 11/24/2014 11:13 PM, Federico Leva (Nemo) wrote:
When I think of this, I agree that OCR is the main issue. But it's not necessarily the one which worries me most, because tesseract is something living outside the wiki which can be improved even if the wiki has design issues. If we try really hard, we may face unsolvable integration problems in the OCR<->DjVU<->Wikisource food chain; but so far the issue is rather that we never tried seriously.[1]
The problem is that we are stuck in the notion that "it must be a wiki". The wiki is just one tool. Captchas could be another. The goal is to make the contents of books available in a more correct, more reliable and useful form. To scale things up, we should have the ambition to handle all books in the Internet Archive. (Books from other sources, such as Google, can be copied to the Internet Archive.)
Our use of OCR today is indeed "outside the wiki"; it is a one-time operation for us. But it shouldn't be. When a book page is proofread, the OCR software should learn from this. Aha, it wasn't "arn", it was "am". And when the OCR software has improved, all other pages should be evaluated again. Maybe the arn/am error was found in more places? It sounds like an impossible job to process millions of pages again every day, but that's where an algorithm designer starts. Maybe we can index the patterns, so all possible arn/am patterns can be found in a second and quickly reprocessed. As you proofread one page, a hundred other pages in dozens of books are also improved. With this kind of application in mind, a wiki to proofread one page or a captcha to proofread one word are just two kinds of tools to collect the human contribution to the improvement of the OCR engine and to the library.
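A tiny Python sketch of that indexing idea; the page texts and the list of "suspect" patterns are made up for illustration.

    from collections import defaultdict

    pages = {
        "book1/p12": "The farmer carne home late.",
        "book1/p13": "A warm barn stood by the road.",
        "book2/p07": "I arn glad to see you.",
    }
    SUSPECT_PATTERNS = ["arn", "rn", "cl", "li"]   # assumed common OCR confusions

    # Map each suspect pattern to the pages that contain it.
    index = defaultdict(set)
    for page_id, text in pages.items():
        for pattern in SUSPECT_PATTERNS:
            if pattern in text:
                index[pattern].add(page_id)

    # A correction learned on one page ("arn" -> "am") instantly yields every
    # other candidate page worth re-OCRing or flagging for a human check:
    print(sorted(index["arn"]))   # all three sample pages contain "arn"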