I've crossposted my response to the Foundation and Wikisource lists
since it could interest people there.
>Ray Saintonge wrote:
>>>On 9/20/06, Delirium <delirium(a)hackish.org> wrote:
>>>>I guess as a reader I don't see the benefit in *not* covering
>>>>everything. I agree there is a slant towards more coverage of recent
>>>>news events, but that's simply because they're easier to cover. The
>>>>solution, IMO, is not to cover recent events less, but to cover older
>>>>events more. I want to know the equivalent of this stuff for other time
>>>>periods! Were there short-lived but at the time massively-covered
>>>>events in the 1890s, equivalent to today's frenzies over child
>>>>kidnappings? What about the thousands of political scandals, major and
>>>>minor, that have at various times shortened governments' tenures, forced
>>>>cabinet reshuffles, etc., etc.? It's all good info we're missing!
>>>Problem is that a lot of the data that would be useful in answering
>>>your question is stored on microfilm and there isn't really a quick
>>>way to scan that.
>>This is a Wikisource function, but that doesn't make it easier. I have
>>most of the first 20 years of McClure's Magazine. It was a monthly that
>>became famous for muckraking journalism, and exposing the behaviour of
>>big companies and government administration in the pre-WWI era. 1,200
>>pages per year for 20 years gives 24,000 pages, and is a daunting task.
>>Weeklies and dailies don't make things any easier.
>While it would certainly be nice to have it all scanned, I don't think
>it's necessary. We already cite lots of sources that aren't available
>on the internet---recently published books, journal articles, etc.---so
>I don't see why it would be a bigger problem that old news articles are
>only available in archives, on microfilm, or via digital subscription.
>Ain't nothin' wrong with citing sources that require a visit to a
>library to access.
This is certainly a fair comment. Of course, recent publications have
copyright constraints that block any kind of scanning. For the sake of
discussion I am limiting my comments to material whose public domain
status is unquestioned; that is more than enough material to keep us
busy.
Some of my old bound volumes of "McClure's", "Scientific American",
"Popular Science", and other odd volumes have library markings and
indications that they were discarded by some public or college library.
I have no objection to people visiting libraries, but there's no
guarantee that a nearby library will have the material sought. Project
Gutenberg already includes six issues of "McClure's", a far from
complete but substantial run of "Scientific American" from when it was
a weekly,
and no "Popular Science". ("Popular Science" in the 19th century had
far more in-depth articles than its present incarnation.) In general, I
don't think we should be duplicating the efforts of PG; there's more
than enough work for everybody to do.
Other important magazines like [[The Smart Set]], where H. L. Mencken
wrote, are much more difficult to find. We do need to stay within the
realm of the possible. Making information freely available is not a
simple task; it will likely take the co-operation and co-ordination of
many players who will each establish where they can work best. I would
love to be able to create direct links from a WMF project to a specific
spot in a book that has been digitized by another player without having
to contend with a lot of proprietary restrictions being applied to
public domain books.
The task is enormous.
All active members of Wikimedia projects are now invited to vote in
the 2006 Election to the Board of Trustees of the Wikimedia
Foundation. One week after the start, we have received over 1,000
votes. Thank you for your interest!
On most Wikimedia Foundation projects, the election is advertised at
the top of the website; on some it is not yet. By my estimate, 8-10%
of projects have not yet learned that their members are invited to
vote (see below).
Local sysops are therefore asked to make sure their wiki has been
informed of this election; if it has not, please edit
[[MediaWiki:Sitenotice]] and add a brief notice about it. If you are
not a sysop but visit such a project frequently, please let the local
sysops know.
You may want to adapt the Meta version or another project's notice.
Here are some examples you can reuse:
I came across some projects that had not yet been informed, and
checked a sample:
- the 25 largest Wikipedias, plus one I visited by chance: 3 lacked the notice
- 5 large Wiktionaries: all had a sitenotice
- 3 relatively large Wikinews: all had a sitenotice
So 3/31, about 10%, of the projects checked carried no news of the
election; this result makes me uneasy. One reason may be that the
sitenotice had been used for other purposes: on some projects only
older news (such as the Wikimania scholarships) remained. In any case,
there remains terra incognita, for several reasons, which all of us
are invited to develop to assure the integrity of our global
community.
Wikimedia Election Committee, 2006
* vox populi, vox dei *
Via Slashdot - could be useful for future digitization efforts. Anyone
know how good it is compared to the latest ScanSoft engine?
Announcing Tesseract OCR
By Eric Case - 12:25 PM
Post by Luc Vincent, Uber Tech Lead
We wanted to let you all know that a few months ago we quietly
released - or actually re-released - an Optical Character Recognition
(OCR) engine into open source. You might wonder why Google is
interested in OCR. In a nutshell, we are all about making information
available to users, and when this information is in a paper document,
OCR is the process by which we can convert the pages of this document
into text that can then be used for indexing.
This particular OCR engine, called Tesseract, was in fact not
originally developed at Google! It was developed at Hewlett Packard
Laboratories between 1985 and 1995. In 1995 it was one of the top 3
performers at the OCR accuracy contest organized by University of
Nevada in Las Vegas. However, shortly thereafter, HP decided to get
out of the OCR business and Tesseract has been collecting dust in an
HP warehouse ever since. Fortunately some of our esteemed HP
colleagues realized a year or two ago that rather than sit on this
engine, it would be better for the world if they brought it back to
life by open sourcing it, with the help of the Information Science
Research Institute at UNLV. UNLV was happy to oblige, but they in turn
asked for our help in fixing a few bugs that had crept in since 1995
(ever heard of bit rot?)... We tracked down the most obvious ones and
decided a couple of months ago that Tesseract OCR was stable enough to
be re-released as open source.
A few things to know about Tesseract OCR: for now it only supports the
English language, and does not include a page layout analysis module
(yet), so it will perform poorly on multi-column material. It also
doesn't do well on grayscale and color documents, and it's not nearly
as accurate as some of the best commercial OCR packages out there.
Yet, as far as we know, despite its shortcomings, Tesseract is far
more accurate than any other Open Source OCR package out there. If you
know of one that is more accurate, please do tell us!
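For digitization experiments of the kind discussed above, a minimal
invocation might look like the following. This is a sketch, not taken
from the announcement: it assumes a locally built `tesseract` binary
from the open-source release, and the input filename `page_001.tif` is
hypothetical. The early release expects a pre-binarized (black and
white) TIFF image and recognizes English only.

```shell
# Run Tesseract on a binarized, single-column TIFF scan.
# The second argument is the output base name; Tesseract
# appends ".txt" to it.
tesseract page_001.tif page_001

# Inspect the recognized text.
cat page_001.txt
```

Because the engine has no page layout analysis yet, multi-column
magazine pages such as those in "McClure's" would need to be cropped
into single columns beforehand (for example with ImageMagick's
`convert -crop`) to get usable results.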
We are grateful to all the people at HP who made it possible to
release Tesseract into open source, and especially John Burns, who
championed and babysat the project. We would also like to thank the
original Tesseract development team, a partial list of whom is here.
Last but not least, many thanks to our friends at UNLV's ISRI,
including Tom Nartker, Kazem Taghva, Julie Borsack and Steve Lumos,
for all their help with this project.
Peace & Love,
I'm currently in Bloomington for a database workshop. There I met
Stacy Kowalczyk, who is working on Indiana University's Digital
I've already given Stacy a brief summary of Wikisource and where I see
potential for collaboration (metadata, translation, proofreading..). I
will still be here until Friday, so if you guys can come up with
useful questions to ask or suggestions to relay, I can do so.
Peace & Love,