Hei,
I'm from the French chapter, Wikimédia France, and I want to announce some good news for Wikisource. Some months ago Wikimédia France signed a partnership ("small" but significant in France) with the French National Library (BnF), which is giving Wikisource 1400 texts with their OCR [1].
Today these books arrived on Commons [2] and are ready to be checked by the Wikisource community. The list is available at [3]. The books span many centuries, different levels of OCR confidence and different subjects, but most of them are in French.
Now there is work for 800 years, says one person, 20 years, says another. I don't know, we'll see :-)
Wikimedially, Sébastien/Seb35 (I'm presently at Wikimania, Gdańsk, Poland)
[1] http://www.wikimedia.fr/wikim%C3%A9dia-france-and-french-national-library-enter-partnership-wikisource
[2] http://commons.wikimedia.org/wiki/Category:Books_provided_by_the_BNF
[3] http://fr.wikisource.org/wiki/Wikisource:Dialogue_BnF/Liste_de_textes_fourni...
Congratulations !!!!
Wikisource-l mailing list Wikisource-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikisource-l
Congratulations, this is an amazing event for the whole Wikisource community. Hopefully we'll use it to show other libraries that such things are possible, and we should all take this as a model.
It would be really important to document everything in a common and visible location, where we can read about and learn from this experience. Where would this be?
Aubrey (In Gdansk too)
2010/7/10 Michael Jörgens joergens.mic@googlemail.com
Congratulations !!!!
Ouch, I didn't see this when I was in Gdansk.
Seb35, 10/07/2010 21:42:
I'm from the French chapter, Wikimédia France, and I want to announce some good news for Wikisource. Some months ago Wikimédia France signed a partnership ("small" but significant in France) with the French National Library (BnF), which is giving Wikisource 1400 texts with their OCR [1].
Great!
Today these books arrived on Commons [2] and are ready to be checked by the Wikisource community. The list is available at [3]. The books span many centuries, different levels of OCR confidence and different subjects, but most of them are in French.
How did you select them? What will the BNF do with the proofread texts? Can you share the agreement text (even if not translated and/or in some private wiki)? Thank you, Nemo
Tue, 13 Jul 2010 17:17:12 +0200, Federico Leva (Nemo) nemowiki@gmail.com wrote:
How did you select them?
I don't know exactly how they were selected. It seems the BnF wanted to give us a sample of books with different characteristics; I don't know whether the Wikisource community provided any wishlist.
What will the BNF do with the proofread texts?
Nothing in the agreement addresses that. This agreement is more of an experiment to see what sort of partnership might (or might not) be possible in the future. The BnF gave us about 10% of its annual digitization output (I think, but I'm not sure), and they have no staff and no money to review the OCR they use.
Can you share the agreement text (even if not translated and/or in some private wiki)?
The agreement text is on our private wiki (for WMFR members), in French. I will ask about a public version or an internal-wiki version.
~ Seb35 [^_^]
Thanks for your answer, Seb35. As you can imagine, everyone is eager to know as many details as possible about this project, to see if and how it can scale and be applied elsewhere.
In the Italian chapter there is currently a discussion about possibly investing money in a problem we all know: the transcription and proofreading of texts. At Wikimania many of us also talked a lot about Wikisource (unfortunately, AFAIK only two Wikisource admins were present, me and Lars from sv.source) and about the practical problem that all these cultural heritage projects (the BnF project with fr.source or the Enciclopedia Cilena project with es.source) have to cope with minimal support from the Wikisource communities. This is not because there is no will to help, but because these communities are far smaller than Wikipedia's, which can support and help much better (see the "Wikipedian in Residence" project with the British Museum).
I would really like to ask everyone what they think about this (big) issue. Do you have any suggestions? What do you think we can do about it? Do you think that national chapters could help financially by paying some people to transcribe and proofread?
Aubrey
Thank you, Seb35!
Andrea Zanni, 15/07/2010 18:47:
I would really like to ask everyone what they think about this (big) issue. Do you have any suggestions? What do you think we can do about it? Do you think that national chapters could help financially by paying some people to transcribe and proofread?
May I mention http://strategy.wikimedia.org/wiki/Proposal:Make_Wikisource_scale ? There are some references, numbers and so on.
Nemo
2010/7/15 Federico Leva (Nemo) nemowiki@gmail.com
May I mention http://strategy.wikimedia.org/wiki/Proposal:Make_Wikisource_scale ? There are some references, numbers and so on.
Nemo
Let me join in to tell you that it.source has also, in a way, entered this project: Vigneron pointed out to us an Italian work among those belonging to the project, and I immediately created a page http://it.wikisource.org/wiki/Indice:Ardigo_-_Scritti_vari.djvu that uses the imported Commons file.
This also lets us judge the quality of the resolution and the resulting accuracy of the automatic OCR interpretation... which is not good at all. It confirms that digitizing the images plus a standard OCR pass is very far from completing "the digitization process" (as a rough estimate, it is about 5-10% of the work...)
On 07/15/2010 06:47 PM, Andrea Zanni wrote:
I would really like to ask to all what they think about this (big) issue.
Wikisource is a project that could be of great importance in the future, if it manages to grow. Scanned books should be useful references in Wikipedia. But it's far too early to say whether it will be successful, because it is still too small.
In proofreading from scanned images, it's still very small and growing fast, because it really only started very recently. Only 3 languages have proofread more than 50,000 pages and 8 more languages have between 1000 and 10,000 pages. http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
While 50,000 pages is big for Wikipedia, it only means 250 books of 200 pages each. Such a book is 1 cm thick (0.4 inches), so the entire bookshelf of 50,000 pages fits in 2.50 metres of shelving, less than a normal bookcase. Swedish Wikisource now has 5,500 proofread pages, but these belong to 237 different volumes, many being small pamphlets or individual issues of a newspaper. Only 15 volumes are fully proofread books with more than 100 pages. Many of the volumes are books that someone uploaded and started to proofread, but didn't quite finish. Many were uploaded because they were available (scanned by a library), not because they would be really useful as source documents. Of our 15 fully proofread books, I can name 5 that should be really useful as sources.
The Wikisource communities are very small, counting 12 contributors (making at least 5 edits per month) in Italian, 25 in Polish, 60 in German, 100 in French and 100 in English. For most languages, every single new contributor can make a huge difference. This might attract some users who have plenty of time and want to make a big difference, such as those who write hundreds of Wikipedia articles. But as time goes by and the projects scale up, you will find very few such contributors.
Even if we don't pay for their work, volunteers are a finite resource and we shouldn't waste their time. It's easy to waste time by manually proofreading text that has very poor OCR quality. Running a new OCR might save hours of work. It's even easier to waste time by proofreading a book that nobody finds useful.
That's where I think we should start: Which books are really useful? How do we determine that? Unfortunately, link search in Wikipedia does not count links to Wikisource, so it's hard to get good statistics.
The first two books that I put on Wikisource in the fall of 2005 were small encyclopedias in German and English. This was an experiment and it worked well as an experiment. But these small encyclopedias were next to useless, because they contained so much less information than was already in Wikipedia. These 1000 bytes on Elbe, http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work/Elbe were useless because Wikipedia in October 2005 already had 5 times as much (today, that is 20 times as much).
For Arabic and other languages where Wikipedia is still rather small, finding and scanning such a small 5 volume encyclopedia could be a great help, but for the languages where Wikipedia is bigger (and this is true for all languages where Wikisource is now active, except maybe Armenian), we need to look for more specialized reference works to be really useful as sources.
I'm making an experiment now with proofreading a newspaper. Not just articles, but entire issues. I have done three full weeks from January 1836. Fortunately, each daily issue is only 4 pages in 3 columns, or 75 kilobytes in total, e.g. http://sv.wikisource.org/wiki/Post-_och_Inrikes_Tidningar_1836-01-05
But it still takes a lot of work. On my own, I'm not able to proofread one issue per day, so I'm already lagging behind. Newspapers are useful as sources. Wikipedia often references articles in current newspapers, and it would be great to have complete year runs going back in time. Still, I doubt that we can find enough proofreading volunteers to cover any substantial timespan.
On 21 Jul 2010, at 06:01, Lars Aronsson wrote:
Wikisource is a project that could be of great importance in the future, if it manages to grow. Scanned books should be useful references in Wikipedia. (...)
Wikisource may aim towards being a repository of sources for Wikipedia. But IMHO it is not its only goal. In French Wikisource, we proofread some novels and short stories... and we would like them to be available in e-pub and pdf: now many people read on e-readers and we must be here to give them free texts to read. Don't you think so ?
Best regards,
2010/7/21 Hélène Pedrosa-Masson edhral@free.fr
Wikisource may aim towards being a repository of sources for Wikipedia. But IMHO it is not its only goal. In French Wikisource, we proofread some novels and short stories... and we would like them to be available in e-pub and pdf: now many people read on e-readers and we must be here to give them free texts to read. Don't you think so ?
Best regards,
Edhral
I too think that Wikisource should be much more a "container for arts and history" than a "container for up-to-date knowledge", since most works are old (a century or more). This is why, IMHO, the small field of interest that brought me into the source world is a rather interesting topic: equitation ("the *art* and practice of riding") is much more an art than a science, just like poems and literature, old chronicles and old newspapers. This also means that users will be very few, just as readers of old or ancient books are very few. Therefore, from Wikipedia's point of view, source can be a useful "source" only for some particular topics.
No matter: the whole server space devoted to Wikisource is so small... the whole unzipped dump of it.source "good content" is roughly 250 MB, i.e. less than three full-size (100 MB) Commons files. ;-)
Alex
Wikisource may aim towards being a repository of sources for Wikipedia. But IMHO it is not its only goal. In French Wikisource, we proofread some novels and short stories... and we would like them to be available in e-pub and pdf: now many people read on e-readers and we must be here to give them free texts to read. Don't you think so ?
I think that Wikisource should be a repository for public domain knowledge and literature. As Wikipedia is a sum of many specialized encyclopedias, probably Wikisource is (should be) the sum of specialized libraries... We work with public domain texts, and that includes documents and grey literature, maybe from public sector and legal literature.
Anyway, it would be very important to have an automatic tool for exporting to ePub or PDF, but IMHO the community is really too small for that. Probably developers from Wikipedia, PediaPress and Source (ThomasV?) would be able to work together to create a tool that is as automatic as possible. It is an important goal; my only problem is that I can't really help with it. I'm not a developer and I don't know how to contribute to such a technical issue.
Of course, in the Italian community we're talking a lot about this (also within Wikimedia Italy), but I do think that we lack the people and the skills to do something real and significant.
If there's someone out there who's working on this, we are ready to test ;-)
Aubrey
Hélène Pedrosa-Masson, 21/07/2010 08:13:
On 21 Jul 2010, at 06:01, Lars Aronsson wrote:
Wikisource is a project that could be of great importance in the future, if it manages to grow. Scanned books should be useful references in Wikipedia. (...)
Wikisource may aim towards being a repository of sources for Wikipedia. But IMHO it is not its only goal.
I agree. Well, I think it isn't its goal at all.
Lars Aronsson, 21/07/2010 06:01:
For Arabic and other languages where Wikipedia is still rather small, finding and scanning such a small 5 volume encyclopedia could be a great help, but for the languages where Wikipedia is bigger (and this is true for all languages where Wikisource is now active, except maybe Armenian), we need to look for more specialized reference works to be really useful as sources.
Dictionaries! Wiktionary is the place where you can still make small "easy" contributions (as on Wikiquote, but not on the bigger Wikipedias), and in fact it has the highest contributors/editor ratio among Wikimedia projects, as Erik Moeller showed at Wikimania (but the same applies to Wikiquote). The problem is that building a dictionary from scratch is a nightmare. If you look at the Wiktionary statistics, you see that several Wiktionaries started to grow significantly when they automatically imported some PD dictionary and reached a critical mass (the tipping point). It could even be sensible to actually buy the rights to old (but not PD) out-of-print dictionaries (even if they are not available in digital form and have to be proofread on Wikisource) to improve smaller Wiktionaries where PD dictionaries are not available. Such countries may have a greater need for dictionaries (I mean, Italian has had dictionaries since 1583, lots of them are PD, and you can buy a copy of a five-year-old dictionary for 20 €, but it's not the same in every country, I suppose).
Nemo
On 07/21/2010 01:56 PM, Federico Leva (Nemo) wrote:
Dictionaries! Wiktionary is the place where you still can give small "easy" contributions (as well as Wikiquote, but not bigger Wikipedias)
Do you have any examples of dictionaries that have been proofread in Wikisource and were found to be useful for Wiktionary?
What were the parameters? Was the OCR of good quality? Was the proofreading done by one person, or a small or large team? Did proofreading take more or less time (per page) than for other books? Did proofreaders have Wiktionary in mind?
Lars Aronsson, 21/07/2010 17:32:
On 07/21/2010 01:56 PM, Federico Leva (Nemo) wrote:
Dictionaries! Wiktionary is the place where you still can give small "easy" contributions (as well as Wikiquote, but not bigger Wikipedias)
Do you have any examples of dictionaries that have been proofread in Wikisource and were found to be useful for Wiktionary?
No, I don't: this is a proposal. If I remember correctly, the Merriam-Webster dictionary imported by en.wikt was already available somewhere; fr.wikt imported something, but I don't know what (I'm not even sure they did); ru.wikt imported some dictionary which I don't remember, because it's quite difficult to find info on this. On it.source someone started http://it.wikisource.org/wiki/Indice:Vocabolario_degli_accademici_della_Crus... obviously having Wikisource and not Wiktionary in mind (the 1623 Vocabolario della Crusca is the most important dictionary in the history of the Italian language, but not so helpful for Wiktionary now). On it.wikt we found some PD dictionaries http://it.wiktionary.org/wiki/Wikizionario:Importazione_dizionari_PD from IA (by Google Books), and the OCR is quite messy: http://it.wiktionary.org/wiki/Wikizionario:Importazione_dizionari_PD/Rigutin... http://it.wiktionary.org/wiki/Wikizionario:Importazione_dizionari_PD/Zambald...
Nemo
2010/7/21 Federico Leva (Nemo) nemowiki@gmail.com
Lars Aronsson, 21/07/2010 17:32:
On 07/21/2010 01:56 PM, Federico Leva (Nemo) wrote:
Dictionaries! Wiktionary is the place where you still can give small "easy" contributions (as well as Wikiquote, but not bigger Wikipedias)
I tested a simple script that parses the text of a proofread page and outputs the list of distinct words in the page. My aim was mainly to find my mistakes when working on ancient, or unknown, languages: see http://pt.wikisource.org/wiki/P%C3%A1gina:Bem_cavalgar.djvu/40 (a 15th-century Portuguese book) and its talk page, where there's an example of such a bot-created list of words. This rough script could be expanded to produce the whole list of distinct words in a book, each one linked to the proofread page where the word is used. Could this trick be useful as a link between Wikisource work and the Wiktionaries?
Alex Brollo, 22/07/2010 07:21:
I tested a simple script that parses the text of a proofread page and outputs the list of distinct words in the page. My aim was mainly to find my mistakes when working on ancient, or unknown, languages: see http://pt.wikisource.org/wiki/P%C3%A1gina:Bem_cavalgar.djvu/40 (a 15th-century Portuguese book) and its talk page, where there's an example of such a bot-created list of words. This rough script could be expanded to produce the whole list of distinct words in a book, each one linked to the proofread page where the word is used. Could this trick be useful as a link between Wikisource work and the Wiktionaries?
I think it isn't. "Linguistico" folks have done this (http://linguistico.sourceforge.net/pages/start.html , their dictionary is used by OpenOffice, Firefox etc.) and anyway a list of words is not so useful for Wiktionary, what is needed are definitions.
Nemo
2010/7/22 Federico Leva (Nemo) nemowiki@gmail.com
I think it isn't. "Linguistico" folks have done this (http://linguistico.sourceforge.net/pages/start.html , their dictionary is used by OpenOffice, Firefox etc.) and anyway a list of words is not so useful for Wiktionary, what is needed are definitions.
Nemo
Well, I understand that a "rough" list of words isn't so useful; but a list of words, *each one linked to a full bibliographic source pointing to the image and text of the original page*, could perhaps be of some interest. No matter if it isn't... I simply won't think about it any more.
Alex
On Thu, Jul 22, 2010 at 6:25 PM, Alex Brollo alex.brollo@gmail.com wrote:
Well, I understand that a "rough" list of words isn't so useful; but a list of words, each one linked to a full bibliographic source pointing to the image and text of the original page, could perhaps be of some interest. No matter if it isn't... I simply won't think about it any more.
I think Wiktionary does want to include examples of the words 'in use', and Wikisource can provide this.
Linking to Wikisource is encouraged on English Wiktionary. e.g.
http://en.wiktionary.org/wiki/demirep
If you create a list of words used in a book, it would be beneficial to also record how many times each word is used.
-- John Vandenberg
2010/7/26 John Vandenberg jayvdb@gmail.com
I think Wiktionary does want to include examples of the words 'in use', and Wikisource can provide this.
Linking to Wikisource is encouraged on English Wiktionary. e.g.
http://en.wiktionary.org/wiki/demirep
If you create a list of words used in a book, it would be beneficial to also record how many times each word is used.
Thanks John, yes, it's pretty simple to do this type of statistics. The trick is really simple, and, in my opinion, anyone could implement it in a Python script much better than mine. It consists simply of a routine that converts a string into a Python list where "word characters" and "other text characters" are separated, given the list of "word characters" as a parameter (or, equivalently, the list of "non-word characters"). For example, "This could be a piece of raw wikitext splitted by [[python]] routine" is converted into the list ["This"," ","could"," ","be"," ","a"," ","piece"," ","of"," ","raw"," ","wikitext"," ","splitted"," ","by"," [[","python","]] ","routine"], where "words" and "non-words" regularly alternate, and a simple "".join() on the list gives back *exactly* the source string.
Simply by selecting the "words" from such a list, you can do anything you like with them.
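The routine described above, together with the per-word counts John suggested, can be sketched in a few lines of Python. This is only an illustration and not the script mentioned in this message: the function names are hypothetical, and the regular-expression class \w stands in for the "word characters" parameter, which is an assumption.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of \w characters (letters, digits, underscore).
# The script described in the thread takes the character list as a parameter.
WORD = re.compile(r"(\w+)")


def split_chunks(text):
    """Split text into word and non-word chunks, in their original order."""
    # The capturing group makes re.split keep the words it splits on;
    # empty strings at the edges are dropped.
    return [chunk for chunk in WORD.split(text) if chunk]


def word_counts(text):
    """Count how many times each word occurs, ignoring case."""
    words = [chunk for chunk in split_chunks(text) if WORD.fullmatch(chunk)]
    return Counter(word.lower() for word in words)


source = "This could be a piece of raw wikitext splitted by [[python]] routine"
chunks = split_chunks(source)
# Joining the chunks restores the source string exactly.
print("".join(chunks) == source)
print(word_counts(source).most_common(3))
```

Because nothing is discarded during the split, the same list can serve both for statistics and for writing a corrected page back.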
Alex
On 07/26/2010 04:52 AM, John Vandenberg wrote:
I think Wiktionary does want to include examples of the words 'in use', and Wikisource can provide this.
Wiktionary may need many things: coverage of common words as well as examples of how to use uncommon words.
From the Swedish Wikisource, I extracted the body text and made a word frequency list, which I put on the Swedish Wiktionary with each word in brackets, so I could see which ones were red links. This doesn't indicate whether Wiktionary covers all different meanings of each word, but at least it makes sure Wiktionary has something about each of the most common words. From the top word (2.7 million occurrences) the first red link is now for a word with 6,300 occurrences or 400 times less common than the top. For a good dictionary, we need to extend this to maybe 40,000 (words that occur only 67 times in Wikisource), so we have a long way to go, but at least we know which words to start with.
http://sv.wiktionary.org/wiki/Anv%C3%A4ndare:LA2/Ordfrekvens_Wikipedia_20100...
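As a rough illustration of the "each word in brackets" idea above, a frequency table can be turned into a wikitext list in which every word is a link, so that missing entries show up as red links once the page is saved. The list format, the function name and the toy counts below are assumptions, not the format used on the page linked above.

```python
from collections import Counter


def frequency_wikilist(counts, limit=None):
    """Format the most common words as a numbered wikitext list of links."""
    lines = []
    for word, occurrences in counts.most_common(limit):
        # [[word]] renders as a red link when the wiki has no such entry.
        lines.append(f"# [[{word}]] ({occurrences})")
    return "\n".join(lines)


# Toy counts for illustration only, not real Wikisource figures.
toy_counts = Counter({"och": 5, "att": 3, "hund": 1})
print(frequency_wikilist(toy_counts, limit=2))
```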
2010/7/28 Lars Aronsson lars@aronsson.se
Wiktionary can need many things, coverage of common words as well as examples of how to use uncommon words.
From the Swedish Wikisource, I extracted the body text and made a word frequency list,
This is very interesting. Can you tell us more details? Has the job been documented somewhere (in English; Swedish is "a little difficult" for me...)? I can produce lists with my rough script, but it works on raw wiki code and the result is "dirty": it contains markup words, and obviously all the wrong words too (searching for wrong words was my first aim...). Did you perhaps work on an HTML dump?
On 07/29/2010 07:02 AM, Alex Brollo wrote:
This is very interesting. Can you tell us more details? Has the job been documented somewhere (in English; Swedish is "a little difficult" for me...)? I can produce lists with my rough script, but it works on raw wiki code and the result is "dirty": it contains markup words, and obviously all the wrong words too (searching for wrong words was my first aim...). Did you perhaps work on an HTML dump?
My code for extracting the body text from the XML dumps has not been published. But Erik Zachte has published his code for extracting "readable text", and maybe you can use that. See http://stats.wikimedia.org/scripts.zip It's only a lot of regular expressions and substitutions.
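To give an idea of what "a lot of regular expressions and substitutions" can look like, here is a minimal Python sketch that removes the most common wiki markup from one page of wikitext. It is neither the unpublished code nor the published scripts mentioned above; the function name and the choice of patterns are assumptions, and nested templates, tables and many other constructs are not handled.

```python
import re


def strip_wiki_markup(wikitext):
    """Return an approximation of the readable text of a wikitext string."""
    text = re.sub(r"<!--.*?-->", " ", wikitext, flags=re.DOTALL)  # comments
    text = re.sub(r"<[^>]+>", " ", text)           # HTML-like tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)    # templates (not nested)
    # [[target|label]] becomes label, [[target]] becomes target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)              # bold and italic quotes
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace


print(strip_wiki_markup("''Raw'' [[python|Python]] text {{tl|x}} with [[link]]<br/>here"))
```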
After the body text has been extracted, you can either fold case (so Madrid becomes madrid) or not, and you can either remove punctuation (so "e.g." becomes "e g") or not, depending on how you want to treat proper names and abbreviations. I use simple "sed" expressions for this. If you don't fold case and don't remove punctuation, you will get a lot of false entries where sentences meet, e.g. both "this." and "this", both "after" and "After".
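The effect of the two choices can be seen on a small sample. The sketch below uses Python rather than sed, and the function name, its two flags and the sample sentence are all hypothetical.

```python
import re
from collections import Counter


def count_words(text, fold_case=True, strip_punctuation=True):
    """Count whitespace-separated tokens after optional normalization."""
    if fold_case:
        text = text.lower()
    if strip_punctuation:
        # Replace every character that is neither a word character
        # nor whitespace with a space, so "e.g." becomes "e g".
        text = re.sub(r"[^\w\s]", " ", text)
    return Counter(text.split())


sample = "After this. And after this, e.g. that."
raw = count_words(sample, fold_case=False, strip_punctuation=False)
clean = count_words(sample)
print(sorted(raw))    # "After" and "after", "this." and "this," stay separate
print(sorted(clean))  # merged entries, plus "e" and "g" from "e.g."
```

Folding case merges "After" with "after" at the cost of also merging proper names such as Madrid with any identical common word, which is the trade-off described above.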
2010/7/29 Lars Aronsson lars@aronsson.se
My code for extracting the body text from the XML dumps has not been published. But Erik Zachte has published his code for extracting "readable text", and maybe you can use that. See http://stats.wikimedia.org/scripts.zip It's only a lot of regular expressions and substitutions.
Thanks, Lars, for the details! From the XML dump: that is what I wanted to know (I do the same). HTML is also interesting as a source, since "absolutely not well-formed wiki syntax" is replaced by "well-formed HTML syntax", but so far I haven't explored it.
Thanks for the link, too.
Alex
wikisource-l@lists.wikimedia.org