Hi everyone,
The next strategic planning office hours are:
Wednesday, 04:00-05:00 UTC, which is:
- Tuesday (8-9pm PST)
- Tuesday (11pm-12am EST)
There has been a lot of tremendous work on the strategy wiki the past
few months, and Task Forces are finishing up their work.
Office hours will be a great opportunity to discuss the work that's
happened as well as the work to come.
As always, you can access the chat by going to
https://webchat.freenode.net and filling in a username and the channel
name (#wikimedia-strategy). You may be prompted to click through a
security warning. It's fine. More details at:
http://strategy.wikimedia.org/wiki/IRC_office_hours
Thanks! Hope to see many of you there.
____________________
Philippe Beaudette
Facilitator, Strategy Project
Wikimedia Foundation
philippe(a)wikimedia.org
mobile: 918 200-WIKI (9454)
Imagine a world in which every human being can freely share in
the sum of all knowledge. Help us make it a reality!
http://wikimediafoundation.org/wiki/Donate
Digitization projects sometimes state their OCR quality as a
percentage in the range 80-100%, depending on the print quality,
image quality, and which OCR software they used. I guess that is
the percentage of characters correctly interpreted. When you
outsource digitization, the OCR quality can be a parameter of
the delivery. Anyway, I see so many OCR errors that I doubt
these estimates are accurate. Are there any known cases where
statements about OCR quality have been questioned?
One problem with estimating the OCR quality is that you
compare what you have (the actual OCR output) against
something you don't have (the perfectly proofread page).
You can make samples, but in Wikisource we have more than
just samples. We have complete works that have been
fully proofread. And a version history that shows what
we started out with. Yes, I think it is important to
save an initial version of the raw OCR text before you
start to do any proofreading.
Do we have any software that can compare two versions
of a page and tell what percentage of characters were
the same in both versions, i.e. the OCR quality?
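Something as simple as Python's standard difflib would seem to give
that kind of number, comparing the oldest revision (the raw OCR) with
the current one. A rough sketch only; the subdomain and page title
are placeholders:

    import difflib, json, urllib.parse, urllib.request

    API = "http://sv.wikisource.org/w/api.php"   # any subdomain

    def get_text(title, oldest=False):
        """Fetch the wikitext of the newest (or oldest) revision of a page."""
        params = {"action": "query", "prop": "revisions", "rvprop": "content",
                  "rvlimit": "1", "titles": title, "format": "json"}
        if oldest:
            params["rvdir"] = "newer"   # list from the oldest revision, i.e. the raw OCR
        data = json.load(urllib.request.urlopen(
            API + "?" + urllib.parse.urlencode(params)))
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]

    title = "Sida:Some_page.djvu/1"     # placeholder
    raw_ocr = get_text(title, oldest=True)
    proofread = get_text(title)

    # Fraction of characters that match between the two versions (0.0-1.0),
    # i.e. roughly the "OCR quality" figure.
    print("%.1f%%" % (100 * difflib.SequenceMatcher(None, raw_ocr, proofread).ratio()))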
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Dear all,
I have a question concerning the search function on Wikisource. For several
purposes it is very useful to search for a term not within the whole of
Wikisource, nor within just the page of a book that is currently displayed, but
rather within the whole book (and only that book). There is, of course, the
option of exporting the whole book to a PDF and searching that, but if the book
is split up into, say, 500 wiki pages, this becomes an extremely time-consuming
task. I am wondering whether there is any solution to this problem, either by
automating the 'create book' function so that it automatically gathers all
pages that belong to a particular book, or by restricting the search directly
to one particular book.
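One possible workaround might be to gather the pages through the
MediaWiki API and search the text locally. A rough Python sketch; the
subdomain, namespace number and title prefix are placeholders that
would need adapting:

    import json, re, urllib.parse, urllib.request

    API = "http://en.wikisource.org/w/api.php"
    PREFIX = "Some Book.djvu"     # the title the book's pages share, without "Page:"
    NAMESPACE = 104               # the Page: namespace (the number may differ per project)
    TERM = re.compile("searchterm", re.IGNORECASE)

    def api(params):
        params["format"] = "json"
        return json.load(urllib.request.urlopen(
            API + "?" + urllib.parse.urlencode(params)))

    # 1. List every page of the book (up to 500; more would need the
    #    API's continuation parameters).
    pages = api({"action": "query", "list": "allpages", "apnamespace": NAMESPACE,
                 "apprefix": PREFIX, "aplimit": "500"})["query"]["allpages"]

    # 2. Fetch the wikitext of each page and search it locally.
    for p in pages:
        data = api({"action": "query", "prop": "revisions",
                    "rvprop": "content", "titles": p["title"]})
        text = next(iter(data["query"]["pages"].values()))["revisions"][0]["*"]
        if TERM.search(text):
            print(p["title"])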
Thanks in advance and kind regards,
Stefan Roski
As I'm uploading and proofreading texts, I'm surprised how slow Google
is to pick up the new content. As far as I can see, there's nothing
that blocks search engines from indexing Page: or Index: pages, so
it should not be entirely necessary to transclude content into pages
in the main namespace, right?
For example, I'm googling the exact phrase "exportable effekter"
(antiquated Swedish) which since April 3 (25 days) has been here,
http://sv.wikisource.org/wiki/Sida:Post-_och_Inrikes_Tidningar_1835-12-31_3…
In my experience, Google is very quick to pick up new content in
Wikipedia. I assume it tracks the recent changes page.
Is it a problem that the URL ends in ".jpg"? Would search engines
avoid or delay indexing it, assuming it is an image?
Part of my problem is that I'm proofreading entire newspapers, and
so far I have only transcluded a few articles. Maybe I should make a
main namespace page for each day's full issue. Would that help me?
When you proofread The New York Times (English) or Die Gartenlaube
(German Wikisource), do you proofread entire pages and issues, and
do you transclude everything that you proofread?
From the Index: page, the <pagelist/> generates normal HTML links
to each page, which is fine. But from the individual pages, the
links to the previous (<), next (>) and index (^) pages are created
only by JavaScript. Is this a problem for search engines? Is it
a problem for blind readers? Is there any good reason not to
generate standard HTML links for these navigation tabs?
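One crude way to check some of this is to fetch a page the way a
crawler would (plain HTTP, no JavaScript) and look at what is actually
in the HTML. A small sketch; the page and the strings searched for are
placeholders:

    import urllib.request

    def fetch(url):
        return urllib.request.urlopen(url).read().decode("utf-8", "replace")

    # Does robots.txt keep crawlers away from Sida:/Page: URLs?
    print(fetch("http://sv.wikisource.org/robots.txt"))

    # Fetch a proofread page as a crawler sees it (no JavaScript executed).
    html = fetch("http://sv.wikisource.org/wiki/Sida:Some_page.jpg")   # placeholder

    # Is the page marked noindex, and are the prev/next/index links
    # present in the static HTML at all?
    print("noindex present:", "noindex" in html.lower())
    for neighbour in ("Sida:Some_page_2.jpg", "Index:Some_index"):     # placeholders
        print(neighbour, "linked:", neighbour in html)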
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Wikimania is an annual global event devoted to Wikimedia projects
around the globe (including Wikipedia, Wikibooks, Wikinews,
Wiktionary, Wikispecies, Wikimedia Commons, and MediaWiki). The
conference is a community gathering, giving the editors, users
and developers of Wikimedia projects an opportunity to meet each
other, exchange ideas, report on research and projects, and
collaborate on the future of the projects. The conference is open
to the public, and is a chance for educators, researchers,
programmers and free culture activists who are interested in the
Wikimedia projects to learn more and share ideas about them.
This year's conference will be held JULY 9-11, 2010 in Gdansk,
Poland, at the Polish Baltic Philharmonic. For more information, please
visit the official Wikimania 2010 site:
http://wikimania2010.wikimedia.org/
Wikimania 2010 will be a mix of submitted talks, open space
meetings, birds of a feather groups, and lightning talks.
Submissions will be discussed and selected in an informal process
on the wiki. If your submission is not added to the schedule, you
will still have many opportunities to bring topics forward
on-site.
IMPORTANT DATES
* Deadline for submitting workshop, tutorial, panel and
presentation proposals: May 20
* Notification of acceptance: May 25 (workshops), May 31
(panels, tutorials, presentations)
* All proposals and presentations will be welcome in the
Open Space track of the conference, whether or not they
are accepted in this initial process.
PROGRAM COMMITTEE
Submissions will be reviewed informally by a team of volunteers.
TRACKS
This year Wikimania will offer three tracks for submissions for
members of wiki communities and interested observers to share
their own experiences and thoughts and to present new ideas:
People and Community
The People and Community track provides a unique forum for
discussing topics related to people using/building wikis.
Relevant topics include, but are not restricted to, the
following:
* Wiki Community: Conflict resolution and community dynamics;
reputation and identity;
* Wiki Outreach: Promotion of wikis and Wikimedia projects among
the general public;
* North meets south, east meets west: How can people of different
cultural backgrounds create an encyclopedia according to common
rules? The same subject in the eyes of different cultures.
* Special: Wikipedia in Central/Eastern Europe: this theme will
provide a forum to present and discuss the latest progress of
Wikis in the central/eastern European community.
Knowledge and Collaboration
The Knowledge and Collaboration track aims to promote research
and find exciting ideas related to knowledge...
* Wiki Content: New ways to improve content quality, credibility;
legal issues and copyrights (is free knowledge free?); use of
the content in education, journalism, research;
* Semantic Wikis: The use of semantic web technologies, linked
data; semantic annotation and metadata (in particular manual
vs. automated approaches).
Infrastructure Track
The Infrastructure track at Wikimania will provide a forum where
both researchers and practitioners can share new approaches and
applications, and explore how to make wiki access ever more
ubiquitous:
* MediaWiki development: issues related to MediaWiki development
and extensions;
* Moving beyond MediaWiki: what other Wiki-like platforms exist;
what tools and features do we need for collaboration on
different types of knowledge?
* Mobile Wikis: The Web is moving off the desktop and onto mobile
phones; how do we use wikis on mobile devices? Wiki-based
Augmented Reality (AR) applications; location-based services;
* User Interface Design: Usability and user experience;
accessibility, adaptive interfaces and personalization; novel
UI designs.
WIKISYM 2010
Please note that Wikimania 2010 is co-located with WikiSym, The
International Symposium on Wikis and Open Collaboration. More
information about WikiSym can be found on the conference website:
http://www.wikisym.org/
SUBMIT A PROPOSAL
To submit a proposal for a presentation, workshop, panel or
tutorial, please visit:
http://bit.ly/Submit2010
Thank you for helping make Wikimania 2010 a successful event. :-)
See you in Gdansk, July 9-11!
--
Marcin Cieslak
Wikimania 2010 Gdansk
Hello all,
I have created new graphs that show the proofreading activity at
Wikisource: the number of pages proofread/validated per day.
http://toolserver.org/~thomasv/stats.html
If we combine all subdomains, we can see that Wikisource proofreads more
than 500 pages per day.
Thomas
Lars Aronsson wrote:
> Hi Thomas,
>
>> The new Ajax-OCR service does not use a robot to create pages; it sends
>> the OCR text directly in the edit box, so that it can be proofread by
>> the user. It has its own job queue, which can be seen here :
>> http://toolserver.org/~thomasv/ocr.php
>
> Oh, great, I didn't know this existed. For which languages
> does it work? Does it scale to other languages? Does your
> OCR engine get any feedback from proofreading? Is there
> any documentation of how this works?
>
It is installed on 11 subdomains.
See OCR.js at http://wikisource.org/wiki/Wikisource:Shared_Scripts
>> If you want to upload or improve
>> OCR, you should update the OCR layer of the DjVu file. Thus, you do not
>> need to create dozens of pages with a robot.
>
> Yes, in my dreams. And ultimately the new Djvu would
> be fed back to the place (Internet Archive, Google, ...)
> where it came from. But how would this work?
>
> If (in my science fiction dreams) Commons had an API
> that would accept new OCR for a page of a Djvu file,
> your Ajax routine as well as the standard proofreading
> form could do this right away. One major problem is that
> our proofreading (and some OCR software) loses the
> image coordinates of the words in the text.
This is not a dream; it is a very common operation.
I believe there is a help page at en.ws that describes how
to update the text layer of a DjVu file. Once you've done this,
you just need to upload the modified DjVu as a new version
of the file. The fact that image coordinates are lost in the
process is not a problem for Wikisource.
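For example, the whole operation can be scripted around djvused from
DjVuLibre. A rough sketch in Python, where the page number, file names
and page size are placeholders (the real page size can be read with
djvused's own "size" command):

    import os, subprocess, tempfile

    def set_page_text(djvu_file, page, text, width=2550, height=3300):
        """Replace the hidden text layer of one page with plain text
        (page-level only, no word coordinates)."""
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        sexp = '(page 0 0 %d %d "%s")\n' % (width, height, escaped)
        with tempfile.NamedTemporaryFile("w", suffix=".sexp",
                                         delete=False, encoding="utf-8") as f:
            f.write(sexp)
            sexp_file = f.name
        try:
            subprocess.check_call([
                "djvused", djvu_file, "-e",
                "select %d; remove-txt; set-txt %s; save" % (page, sexp_file)])
        finally:
            os.remove(sexp_file)

    # Put the proofread text into page 5, then upload book.djvu to Commons
    # as a new version of the file.
    set_page_text("book.djvu", 5,
                  open("page5_proofread.txt", encoding="utf-8").read())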
First: Choose an appropriate subject line for your contribution!
"Wikisource-l Digest" isn't helpful.
Second: Don't give a full quote of the message (or digest) you are referring to.
Thank you
Dr Klaus Graf
A quick comment on Lars's last post:
I spoke at the Museums and the Web conference last Tuesday in Denver,
and someone from the Library of Congress came. They aren't a Museum,
but they are working on a small-scale wikipedia project to help
illustrate articles in WP with deep citations into some of their more
interesting public domain holdings (which are a bit museum-like).
She mentioned that she was most interested in pursuing a further
Wikisource project, as they have many unique works (journals,
manuscripts and the like) which are only available in their original
language -- say, a journal of a French botanist -- and which deserve
to be translated into other languages.
They would like to help publish digital scans and existing
native-language text (cleaned-up OCR) to Wikisource, in the hope that
translations can be made into English and other languages.
They are interested in identifying the 'most interesting works' in
their untranslated collections, and are happy to have some discussion
about what makes works interesting. One of the things I like about
this idea is that, as curators of one of the world's largest
international libraries, they have a broad sense of 'notability' and
interest in a Wikisource sense, which we currently lack on the
project... so this could help drive style guide improvements as well.
Thoughts? If anyone is specifically interested in this collaboration,
let me know and I'll put you in touch with the organizer. Of course
once this is more than a pipe dream there will be a public project
page... but early interest now could help frame the initial proposal.
SJ
On Wed, Apr 21, 2010 at 8:56 AM,
<wikisource-l-request(a)lists.wikimedia.org> wrote:
>
> Today's Topics:
>
> 1. Re: PDF/Djvu to Index (ThomasV)
> 2. Re: PDF/Djvu to Index (Cecil)
> 3. Strategic Planning Office Hours (Philippe Beaudette)
> 4. Experience, funding, outreach (Lars Aronsson)
> 5. Re: Experience, funding, outreach (Sydney Poore)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 13 Apr 2010 14:56:38 +0200
> From: ThomasV <thomasV1(a)gmx.de>
> Subject: Re: [Wikisource-l] PDF/Djvu to Index
> To: "discussion list for Wikisource, the free library"
> <wikisource-l(a)lists.wikimedia.org>
>
> Lars Aronsson wrote:
>> ThomasV wrote:
>>
>>> the problem is that djvu pages on Commons
>>> do not have a parsable format.
>>>
>>
>> Many of them are. For example,
>> http://commons.wikimedia.org/wiki/File:Swedish_patent_14_Mj%C3%B6lqvarn.pdf
>> contains:
>>
>> {{Information
>> |Description=Swedish patent 14: Mjölqvarn
>> |Source=Digitized by [http://nordiskapatent.se/ NordiskaPatent.se]
>> |Date=January 23, 1885
>> |Author=R. Setz, J. Schweiter, Clus, Switzerland
>> |Permission=
>> |other_versions=
>> }}
>>
>> This looks very formalized and parsable to me.
>> I filled it in when I uploaded the file to
>> Commons, and exactly the same fields need to be
>> filled in manually in the newly created Index page.
>>
>> Maybe I should design a tool or bot that asks for
>> these fields once, and then uploads the file and
>> creates the Index page, based on that information.
>> And my question was: Has anybody already done that?
>>
>>
> Not to my knowledge.
> It is possible to request the text with the API; why don't you try it?
>
>>
>>> What you describe is _already_ implemented :
>>> when a page is created, its text is extracted
>>> from the text layer of the corresponding djvu or pdf.
>>> All you need to do is create djvu files with a proper text layer.
>>>
>>
>> You are correct, it does indeed work, but only
>> after I action=purge the PDF file on Commons.
>> It never worked for me on the first try,
>> without any purge. And I was misled by an
>> earlier bug where action=purge didn't help,
>> so it took me a while before I tested this.
>>
>> So why is the purge necessary? If OCR text
>> extraction ever fails, why is this not detected
>> and automatically retried?
>>
> Purge is necessary only for files that were uploaded previously, when
> text extraction was not performed. Note that text-layer extraction for
> PDF files is new.
>
>> When I try to create
>> http://sv.wikisource.org/wiki/Sida:Swedish_patent_14_Mj%C3%B6lqvarn.pdf/1
>> there is a character encoding error in the OCR text.
>> It looks as if the PDF contains 8-bit data, which
>> is loaded into the UTF-8 form without conversion.
>> Cut-and-paste from Acrobat Reader works fine.
>>
> Yes, there is a conversion problem with PDF; it works better with DjVu.
>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 13 Apr 2010 16:43:47 +0300
> From: Cecil <cecilatwp(a)gmail.com>
> Subject: Re: [Wikisource-l] PDF/Djvu to Index
> To: "discussion list for Wikisource, the free library"
> <wikisource-l(a)lists.wikimedia.org>
>
> 2010/4/13 ThomasV <thomasV1(a)gmx.de>
>
>> [...]
>> Not to my knowledge.
>> It is possible to request the text with the API; why don't you try it?
>>
>
> Two problems.
>
> 1. I don't think all projects are using this template, as it does not really
> fit books: German Wikisource, at least, has special templates for books,
> single pages, DjVu files, PDF files and so on. It mostly uses those for its
> Commons uploads rather than the non-specific Information template, because
> that template does not have the parameters needed to describe book data
> (author, publisher, place of publication, year of first publication, edition,
> year of publication of this edition, ...). And AFAIK de.WS is not
> the only project that uses specialized templates for its Commons uploads.
>
> 2. I'm not sure about this, but I think the Index file has the same fields in
> all projects that use the extension. That would mean that the
> Information template does not contain the right data for filling the
> Index page. At least the Index file on de.WS has separate fields for author,
> publisher, year of publication and place of publication, and we usually
> also link locally to the author and the title page. So from your
> example above only one parameter (the source) is really usable (and
> I usually linked to the Commons file in the source parameter on the
> WS Index page anyway). I'm not sure how much time the parse request would
> take, but it does not look worth the effort (both programming it and later
> using it) considering its usable return values.
>
> IMO you could create a lot of Index files in the time you would spend
> figuring out whether it is possible to extract, parse and interpret the data
> from Commons. There are too many uploads that use no template at all, use
> other templates, or use this template but fill it out in an unusable way
> (everybody has a slightly different style even when using templates); and
> even then it lacks half the needed information, while the rest still needs
> formatting. The benefit is quite small compared to the amount of work it
> requires.
>
> But hey, if you have spare time, it would still be interesting to know
> whether you can get the data in a way that would not slow down users with
> not-so-fast internet connections.
>
> Cecil
>
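For what it's worth, requesting the description text through the API,
as ThomasV suggested, is not much code. A minimal sketch that pulls
the wikitext of the Commons file page from Lars's example and picks
out the {{Information}} fields with a naive regex:

    import json, re, urllib.parse, urllib.request

    API = "http://commons.wikimedia.org/w/api.php"
    TITLE = "File:Swedish patent 14 Mjölqvarn.pdf"

    params = {"action": "query", "prop": "revisions", "rvprop": "content",
              "titles": TITLE, "format": "json"}
    data = json.load(urllib.request.urlopen(
        API + "?" + urllib.parse.urlencode(params)))
    wikitext = next(iter(data["query"]["pages"].values()))["revisions"][0]["*"]

    # Very naive: grab "|Field=value" lines from the {{Information}} template.
    fields = dict(re.findall(r"^\|\s*(\w+)\s*=\s*(.*)$", wikitext, re.MULTILINE))
    print(fields.get("Description"), fields.get("Author"), fields.get("Date"))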