Re: [Wikisource-l] What is our next major hurdle, or where we need most development assistance

25 Nov 2014

...
  How would I do that now? Wikisource pages are not structured data (though
 Wikimedia Commons image metadata will soon be!), so there is not a clear
 way to use the Wikisource API to extract just the relevant transcribed text
 on the page as a field. And on top of that, any text you do extract this
 way will be full of templates and other code that has no meaning outside of
 the context of Wikisource. I don't see a way to easily extract just the
 plain text that is meaningful and relevant (along with other fielded data,
 like what page or text it belongs to).

Wikisource as a "structured" repository is what we have been asking for
since the dawn of time :-)
The problem, as usual, is that if things are left to volunteer developers,
things will go slooooowly.
I do think this is fundamental: an ideal Wikisource would ingest and
understand many metadata standards, and would give them back as well.

As for the Wikimedia API, I wrote this awful script:
https://github.com/Aubreymcfato/ws_scraper
Please come and make it better :-D

It just scrapes the data from the HTML and writes it to a CSV file. It is
localized to it.source, but a quick glance at the HTML source of your own
Wikisource should help you adapt it, especially if you use microformats.
If you take the HTML you can also get the formatted text.
(I also dream of a Wikisource that understands Markdown, but that's too
far off :-)
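
For anyone who wants to try the same approach, here is a minimal sketch of
the idea in Python, using only the standard library. It is not the code from
ws_scraper. It assumes the transcribed text sits inside an element with the
class "mw-parser-output" and that the HTML is well formed; the helper names
(TextExtractor, extract_text, write_csv) and the "page"/"text" column names
are made up for illustration, so check the HTML of your own Wikisource first.

```python
import csv
import io
from html.parser import HTMLParser

# Tags with no closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Tags whose content is never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect the text found inside any element carrying target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0        # nesting depth inside the target (0 = outside)
        self.skip_depth = 0   # nesting depth inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
        if self.depth and tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.skip_depth:
            self.chunks.append(data)


def extract_text(html, target_class="mw-parser-output"):
    """Return the plain text of the target element, whitespace-normalized."""
    parser = TextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def write_csv(rows, out):
    """Write (page, text) pairs to an open text file, with a header row."""
    writer = csv.writer(out)
    writer.writerow(["page", "text"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical page name and markup, standing in for a fetched page.
    html = ('<div id="nav">Menu</div>'
            '<div class="mw-parser-output"><p>Some <i>transcribed</i> text</p></div>')
    buf = io.StringIO()
    write_csv([("Page:Example/1", extract_text(html))], buf)
    print(buf.getvalue())
```

Counting nesting depth by hand keeps the sketch free of dependencies, but it
breaks on unbalanced markup; a real HTML library would be more forgiving.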

Aubrey
