Re: [Wikisource-l] What is our next major hurdle, or where we need most development assistance

25 Nov 2014

On 25 November 2014 at 11:33, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:

...

 How would I do that now? Wikisource pages are not structured data (though
 Wikimedia Commons image metadata will soon be!), so there is not a clear
 way to use the Wikisource API to extract just the relevant transcribed text
 on the page as a field. And on top of that, any text you do extract this
 way will be full of templates and other code that has no meaning outside of
 the context of Wikisource. I don't see a way to easily extract just the
 plain text that is meaningful and relevant (along with other fielded data,
 like what page or text it belongs to).

 Wikisource as a "structured" repository is what we have been asking for
 since the dawn of time :-)
 The problem, as usual, is that if things are left to volunteer developers,
 things will go slooooowly.
 I do think this is fundamental: an ideal Wikisource would ingest and
 understand many types of metadata standards, and would give them back as well.

 As for the Wikimedia API, I did this awful script:
 https://github.com/Aubreymcfato/ws_scraper
 Please come and make it better :-D

 Awesome! I'll definitely give it a whirl. 

...
 It just scrapes the data from the HTML (it is localized to it.source, but
 a quick glance at the HTML source of your own ws could help you, especially
 if you use microformats) and puts it into a CSV.
 If you take the HTML you can also get the formatted text.
 (I also wonder about a Wikisource that understands Markdown, but that's too
 far off :-)

You have a good point, though. One of the differences between Wikisource
and most other platforms is that it is actually richly formatted. It's kind
of a shame to strip all that formatting information out when extracting the
transcriptions. (Though many destinations wouldn't know what to do with
formatted text anyway.)
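
For what it's worth, here is a minimal sketch of the scrape-the-HTML-into-a-CSV idea using only the Python standard library. It is not taken from ws_scraper; the class name, function names, column names, and sample page are all made up for illustration, and it assumes you already have the rendered HTML of each page in hand. It deliberately throws away the formatting, which is exactly the trade-off mentioned above.

```python
import csv
import io
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML fragment with markup removed."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())


def pages_to_csv(pages):
    """pages: iterable of (title, html) pairs; returns the CSV as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "text"])  # hypothetical column names
    for title, html in pages:
        writer.writerow([title, html_to_text(html)])
    return buffer.getvalue()


# Hypothetical sample page, standing in for HTML fetched from a Wikisource.
sample = [
    ("Page:Example.djvu/1",
     "<div><p>Call me <i>Ishmael</i>.</p><style>p{}</style></div>"),
]
csv_text = pages_to_csv(sample)
```

Keeping the page title next to the text is a stand-in for the "other fielded data" a real export would want; anything pulled from microformats would need selectors specific to each Wikisource.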
