On 25 November 2014 at 11:33, Andrea Zanni zanni.andrea84@gmail.com wrote:
How would I do that now? Wikisource pages are not structured data (though
Wikimedia Commons image metadata will soon be!), so there is no clear way to use the Wikisource API to extract just the relevant transcribed text on the page as a field. And on top of that, any text you do extract this way will be full of templates and other markup that has no meaning outside the context of Wikisource. I don't see a way to easily extract just the plain text that is meaningful and relevant (along with other fielded data, like what page or text it belongs to).
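To illustrate what I mean: the most direct route today is to ask the standard MediaWiki API for the raw wikitext of a page, and what comes back is the templated source, not a clean transcription field. A minimal sketch (the page title below is a made-up example):

```python
# Sketch: fetch the raw wikitext of a Wikisource page via the standard
# MediaWiki API. The page title here is only a hypothetical example.
import requests

API = "https://en.wikisource.org/w/api.php"

params = {
    "action": "parse",
    "page": "Page:Example.djvu/1",   # hypothetical page title
    "prop": "wikitext",
    "format": "json",
}

resp = requests.get(API, params=params).json()
wikitext = resp["parse"]["wikitext"]["*"]

# What you get back is the full wikitext, complete with proofreading
# templates, header/footer noise, and formatting templates -- not a
# clean "transcription" field.
print(wikitext[:500])
```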
Wikisource as a "structured" repository is what we have been asking for since the dawn of time :-) The problem, as usual, is that when things are left to volunteer developers, things go slooooowly. I do think this is fundamental: an ideal Wikisource would ingest and understand many metadata standards, and would give them back out as well.
As for the Wikimedia API, I wrote this awful script: https://github.com/Aubreymcfato/ws_scraper Please come and make it better :-D
Awesome! I'll definitely give it a whirl.
It just scrapes the data from the HTML (it is localized to it.source, but a quick glance at the HTML source of your own Wikisource could help you, especially if you use microformats) and puts it in a CSV. If you take the HTML you can also keep the formatted text. (I also dream of a Wikisource that understands Markdown, but that's too far :-)
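In essence the approach is something like the sketch below (not the real code from the repo, just the gist: the URL and the ws-* class names are examples, so check the HTML of your own wiki):

```python
# Rough illustration of the scraping approach, NOT the actual ws_scraper code:
# fetch the rendered HTML of a work page, pick out a few fields via the
# "ws-*" microformat classes (class names and the page URL are assumptions --
# inspect the HTML of your own Wikisource), and append them to a CSV.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://it.wikisource.org/wiki/Esempio"   # hypothetical page
soup = BeautifulSoup(requests.get(url).text, "html.parser")

def text_of(css_class):
    node = soup.find(class_=css_class)
    return node.get_text(strip=True) if node else ""

body = soup.find("div", class_="mw-parser-output")
row = {
    "title":  text_of("ws-title"),    # assumed microformat class
    "author": text_of("ws-author"),   # assumed microformat class
    "text":   body.get_text(" ", strip=True) if body else "",
}

with open("works.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "author", "text"])
    if f.tell() == 0:          # new file: write the header once
        writer.writeheader()
    writer.writerow(row)
```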
You have a good point, though. One of the differences between Wikisource and most other platforms is that it is actually richly formatted. It's kind of a shame to strip all that formatting information out when extracting the transcriptions. (Though many destinations wouldn't know what to do with formatted text anyway.)