[Wikisource-l] [Wikitech-l] Wikisource bugs

Lars Aronsson lars at aronsson.se
Mon Jul 5 12:46:08 UTC 2010


On 07/05/2010 10:45 AM, ThomasV wrote:
> I do not agree with you on namespaces.
>
> I think that the "Page" namespace is the
> best way to handle the separation between
> the physical object (a book and its pages)
> and the logical object that we present to
> readers (the text, divided into sections or chapters)

I agree that having two structures (physical pages
and chapters) is a challenge (I even wrote a paper
about this, eleven years ago), but the introduction
of the Page: namespace is not without problems.

Moving the physical structure to its own namespace is
based on the assumption that a separate presentation
structure (the chapters of the book) exists and is the
more important one. But for odd formats such as
dictionaries or newspapers, this is far from obvious.

Should each dictionary entry become a chapter of its own?
One dictionary might have 150,000 entries.

For newspapers, it's easy to agree that each major news
article can be a chapter of its own. But maybe not each
small advertisement? Should the whole ad section be one
chapter? People might want to search these ads. They can
be far more important to current readers than the news.
So treating them as whitespace is not a solution.

In such cases, it might be best to just proofread the
physical page and keep it as it is. Even for ordinary
books, only the physical structure exists while they are
being proofread, which can take months or years. During
that time the Page: namespace is what is exposed to the
public reader. So maybe it should be dressed up with the
green {{header}} to look nicer?

Already today, we have the problem that searches
using the site's own search box will show content in
the Page: namespace, rather than the transcluded
chapters in the main namespace (related to
https://bugzilla.wikimedia.org/show_bug.cgi?id=18861 )
and there are no links from the found Page: to the
chapters that transclude its text, unless you bother
to use "what links here".

Wikistats (Erik Zachte) also reports user activity based on
the main namespace. It's odd that on Wikisource, the "other"
namespaces have far more editing activity than the main one.

For a dictionary, creating 150,000 tiny pages that each
transclude 2 lines of text is not a good match for the
current wiki technology. Having dozens of <section.../>
tags in each page will also look very clumsy. It would
be more convenient if the section markers were much smaller
and treated like anchor points. Search should also
return the closest preceding anchor point (even if that
is on a preceding page), rather than the page URL.
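
A minimal sketch of that "closest preceding anchor" lookup, assuming
each physical page carries a sorted list of (character offset, label)
anchors. The page titles, offsets and the function name are
hypothetical, made up for illustration; nothing like this exists in
MediaWiki search today.

```python
from bisect import bisect_right

# Hypothetical data: one entry per physical page, in book order.
# Each page lists its anchors as (character offset, label), sorted by offset.
PAGES = [
    ("Page:Dictionary.djvu/10", [(0, "aardvark"), (120, "abacus")]),
    ("Page:Dictionary.djvu/11", [(300, "abandon")]),
    ("Page:Dictionary.djvu/12", []),
]


def closest_preceding_anchor(pages, page_index, offset):
    """Return the label of the last anchor at or before a search hit.

    The hit is given as (page_index, offset). If the hit's own page has
    no anchor at or before that offset, earlier pages are checked, so an
    entry that starts on one page and continues on the next still
    resolves to its own anchor. Returns None if no anchor precedes the hit.
    """
    for i in range(page_index, -1, -1):
        anchors = pages[i][1]
        if i == page_index:
            # On the hit's own page, only anchors at or before the offset count.
            offsets = [pos for pos, _ in anchors]
            k = bisect_right(offsets, offset)
            if k:
                return anchors[k - 1][1]
        elif anchors:
            # On any earlier page, the last anchor is the closest one.
            return anchors[-1][1]
    return None
```

With this data, a hit at offset 50 on the second page comes before
that page's only anchor, so it resolves to "abacus" from the page
before it.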

The Bible, being one of the oldest texts on Wikisource,
is a good test case. It consists of 2 testaments, 66 books,
1189 chapters and 31,103 verses. When printed on paper,
it typically fits on 1200 physical pages. Today we typically
create one wiki page per book or per chapter, e.g.
http://en.wikisource.org/wiki/Bible_(King_James)/Matthew
and this is what turns up in searches, since it was imported
from existing e-texts, rather than being proofread in Wikisource.
These 66 or 1189 wiki pages have headlines for each chapter
and anchor points for each verse, but these are not presented
in the search results. Imagine you could search "candle under bushel"
and up comes "Matthew 5:15", even if you had a proofread
but not yet transcluded version divided into 1200 wiki
pages in the Page: namespace. Today search turns up things
such as "Page:The Granite Monthly Volume 5.djvu/82",
which simply isn't pretty.
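
As a toy illustration of the Bible example, the sketch below searches
proofread pages and reports verse anchors instead of page titles. The
{{a|...}} marker syntax, the page titles and the function names are
assumptions invented for this sketch, not existing Wikisource markup
or search behaviour.

```python
import re

# Hypothetical short anchor marker, e.g. {{a|Matthew 5:15}}.
ANCHOR = re.compile(r"\{\{a\|([^}]+)\}\}")

# Two made-up physical pages; verse 5:14 runs across the page break.
PAGES = {
    "Page:Bible.djvu/1001": "{{a|Matthew 5:14}} Ye are the light of the world. "
                            "A city that is set on an hill",
    "Page:Bible.djvu/1002": "cannot be hid. {{a|Matthew 5:15}} Neither do men "
                            "light a candle, and put it under a bushel, "
                            "but on a candlestick;",
}


def segments(pages):
    """Yield (anchor label, text) pairs; a segment may span a page break."""
    # Sorting titles as strings is enough for this toy data; real page
    # order would have to come from the index of the scanned book.
    text = " ".join(pages[title] for title in sorted(pages))
    # With one capture group, split() returns
    # [text before first anchor, label1, text1, label2, text2, ...].
    parts = ANCHOR.split(text)
    for label, body in zip(parts[1::2], parts[2::2]):
        yield label, body


def search(pages, query):
    """Return the anchors whose text contains every word of the query."""
    words = query.lower().split()
    hits = []
    for label, body in segments(pages):
        body_words = set(re.findall(r"[a-z]+", body.lower()))
        if all(word in body_words for word in words):
            hits.append(label)
    return hits
```

Here search(PAGES, "candle under bushel") reports "Matthew 5:15", and
a search for "hid" reports "Matthew 5:14" even though that word sits
on the following physical page.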

In my eyes, this means: 1) many problems (e.g. search) are
generic problems, not connected with ProofreadPage, and
2) the existing ProofreadPage (PR2) may work okay for
traditional books with chapters, but it can also co-exist
with an alternative ProofreadPage that works better for
dictionaries and newspapers.

Next, consider digitizing old maps with Wikisource, and
matching them (through coordinate transformation) with
OpenStreetMap.


-- 
   Lars Aronsson (lars at aronsson.se)
   Aronsson Datateknik - http://aronsson.se




