On Wed, Jul 21, 2010 at 5:47 PM, David Goodman <dgoodmanny(a)gmail.com> wrote:
> Sure, but first, is this capable of being done at all? I have never
> seen a method of bibliographic control that can cope with the complete
> range of publications, even just print publications. Perhaps we need
> to proceed within narrow domains.
I assume that by range you mean the number of publications in a domain, and
that by domain you mean the type of publication, be it a book, webpage or
The generic nature of a markup such as wiki template syntax allows us to
easily adapt the same application to new domains. The challenge of the range
within a domain is largely one of resolving ambiguities, which can be
settled with policies that carefully adjudicate troublesome cases.
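To make the point about generic markup concrete, here is a minimal sketch, not
a proposal for the actual implementation. The function name to_template is
hypothetical, and the template and field names are only illustrative; the same
few lines serve a book or a webpage by changing nothing but the template name.

```python
def to_template(name, fields):
    """Render a dict of fields as a wiki template call (illustrative only)."""
    parts = ["{{" + name]
    for key, value in fields.items():
        # A raw pipe would be read as a parameter separator, so replace it
        # with the {{!}} magic word that MediaWiki expands to a literal pipe.
        safe = str(value).replace("|", "{{!}}")
        parts.append(f" |{key}={safe}")
    parts.append("}}")
    return "".join(parts)


# The same function covers two different domains of publication.
book = to_template("cite book", {"title": "Example Title", "year": 2010})
web = to_template("cite web", {"title": "Example Page", "url": "http://example.org/"})
print(book)
print(web)
```

A new domain then costs a new template page and a policy for its fields, not a
new application.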
> Second, is this capable of being done by crowd-sourcing, or does it
> require enforceable standards? The work of Open Library is not a
> promising model, being an uncontrolled mix, done to many different
> standards. Actually, within the domain of scientific journal articles
> from the last 10 years in Western languages, the best current method
> seems to be a mechanical algorithm, the one used by Google Scholar.
> True, it does not aggregate perfectly--but it does aggregate better
> than any other existing database. And it does not get them all--nor
> could it no matter how much improved, for many of the versions that
> are actually available are off limits to its crawlers.
In my conception the enforceable standards are to emerge in the meta pages
of this project based on the actual issues that the community encounters.
Googlebot has accounts with many online journals, which gives it access to
the deep web. When you search Google Scholar, the relevance algorithm is
actually comparing your query to the content of PDF pages that you do not
have permission to access. Of course, Google can't access them all, but many
publishers have found it in their interest to give Google a complimentary
account since it drives
We can rely on individuals, particularly academics, who have access to the
deep web to help us curate the bibliography. And we can rely on the massive
number of personal bibliographies already out there to help us get good
Cleaning up the mass of bibliographic content that I anticipate would be
uploaded by users would require the writing of bots in coordination with the
creation of policy pages.
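As a rough illustration of the kind of bot I have in mind, here is a minimal
sketch that flags likely duplicate entries. Everything in it is an assumption
on my part: the record layout (a dict with "title" and "year"), the function
names, and the matching rule, which in practice would be set on the policy
pages and not hard-coded.

```python
import re
import unicodedata


def match_key(record):
    """Build a crude comparison key from a record's title and year."""
    title = unicodedata.normalize("NFKD", record.get("title", ""))
    # Drop combining marks so accented and unaccented spellings compare equal.
    title = "".join(ch for ch in title if not unicodedata.combining(ch))
    # Lowercase, then collapse punctuation and whitespace runs to one space.
    title = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    return (title, str(record.get("year", "")))


def group_duplicates(records):
    """Return groups of records that share the same comparison key."""
    groups = {}
    for rec in records:
        groups.setdefault(match_key(rec), []).append(rec)
    return [group for group in groups.values() if len(group) > 1]


uploads = [
    {"title": "The Origin of Species", "year": 1859},
    {"title": "the origin  of species.", "year": "1859"},
    {"title": "An Unrelated Work", "year": 1900},
]
for group in group_duplicates(uploads):
    print(group)
```

A bot like this would only flag candidates; whether two flagged entries really
are the same work is the sort of ambiguity the policies would have to settle.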
Getting rid of copyrighted material would be handled in the same manner, I
presume. After major content publishers see what we are doing, I am sure
they will let us know their opinion about what we can and cannot do. It
seems likely that they will overstep their bounds, and as I have seen on
Wikipedia, the community members will happily ignore them. Or, if the
community thinks the requests are actually grounded in the law, it will
comply.