Hello all,
is there a query language for wiki syntax?
(NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages.
In this way, we could apply a crowd-sourcing approach to knowledge
extraction from Wikis.
There must be thousands of data scraping approaches. But is there one
amongst them that has developed a "wiki scraper language"?
Maybe with some sort of fuzziness involved, if the pages are too messy.
I have not yet worked with the XML transformation of the wiki markup:
  action=expandtemplates
    generatexml - Generate XML parse tree
Is it any good for issuing XPath queries?
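
As a rough sketch of the kind of query I have in mind: the Python below
runs a simple XPath expression over a hand-written parse tree and collects
the parameters of one template. The element names (root, template, title,
part, name, value) and the sample data are my assumptions about what the
parse tree looks like; I have not checked them against real output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the element names are an assumption about the
# shape of the generated parse tree, not verified output.
SAMPLE = (
    "<root>Intro text "
    "<template><title>Infobox person</title>"
    "<part><name>name</name>=<value>Ada Lovelace</value></part>"
    "<part><name>born</name>=<value>1815</value></part>"
    "</template></root>"
)

def template_params(xml_text, template_title):
    """Return the name/value pairs of the first template with this title."""
    root = ET.fromstring(xml_text)
    # ".//template" is the XPath subset that ElementTree supports:
    # it selects every <template> element at any depth.
    for tpl in root.findall(".//template"):
        title = (tpl.findtext("title") or "").strip()
        if title != template_title:
            continue
        params = {}
        for part in tpl.findall("part"):
            name = (part.findtext("name") or "").strip()
            value = (part.findtext("value") or "").strip()
            params[name] = value
        return params
    return {}

print(template_params(SAMPLE, "Infobox person"))
# {'name': 'Ada Lovelace', 'born': '1815'}
```

Nested templates inside a value would need a recursive walk; this sketch
only reads the plain text of each value.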
Thank you very much,
Sebastian
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
It's now possible to request a parsed wiki page via the MediaWiki API.
You need node.js (and all the dependencies) installed. See the
README file in the modules/parse directory. It simply shells out to
node.js. A more clever version might make use of a node.js daemon, but
it was just easier to reuse the STDIN/STDERR/STDOUT model this way.
It's implemented as a 'property' of a title, i.e.:
http://wiki.ivy.local/w/api.php?action=query&prop=parsetree&titles=Main%20P…
If I've read the docs right, this should work well with caching.
It doesn't work with format=xml, for some reason.
https://bugzilla.wikimedia.org/show_bug.cgi?id=34058
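
For anyone who wants to script the request, here is a minimal Python
sketch that builds such a URL. The helper name parsetree_url, the page
title "Some Page" and the format=json parameter are my assumptions for
illustration; only action=query and prop=parsetree come from the example
above.

```python
from urllib.parse import quote, urlencode

def parsetree_url(api_base, title):
    """Build a query URL asking for the parse tree of one page title."""
    params = {
        "action": "query",
        "prop": "parsetree",   # the new property described above
        "titles": title,
        "format": "json",      # assumption: any format other than xml
    }
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return api_base + "?" + urlencode(params, quote_via=quote)

# Hypothetical page title, for illustration only.
print(parsetree_url("http://wiki.ivy.local/w/api.php", "Some Page"))
# http://wiki.ivy.local/w/api.php?action=query&prop=parsetree&titles=Some%20Page&format=json
```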
--
Neil Kandalgaonkar <neilk(a)wikimedia.org>
The current code of the Visual Editor reimplements many basic editing
functions that browsers already provide: text selection, alignment,
cursor movement and appearance, and even cursor blinking.
This is a major problem for internationalization. These functions work
very differently in different languages. Most browsers already support
most of the needed functionality in textarea and in content-editable.
Overriding them in the Visual Editor means that it will be unusable in
any language except English.
To name just a few basic examples:
* Cursor shape: In bidirectional environments the cursor has an arrow
pointing in the writing direction. In the current VE it's just a
stick. See https://en.wikipedia.org/wiki/Text_cursor#Bi-directional_text
* Cursor position: VE always positions the cursor as if the text were
left-to-right. The same goes for text selection.
* Word boundary: Ctrl-left/right arrow moves the cursor word by word.
Browsers support all Unicode scripts for word boundary identification,
but in the current VE it only works with ASCII.
* Character segmentation and cursor movement: A letter with one or
more diacritics is one character segment according to Unicode (UAX
#29), even if internally it's composed of several Unicode characters.
For example, <שָּׁ> is composed of four characters, but only one arrow
key press is supposed to move the cursor over it. Browsers have done
this for years, but in the Visual Editor four key presses are
needed.
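
To make the segmentation point concrete, here is a small Python sketch
that counts code points versus cursor stops for that letter. The function
simple_clusters is a name I made up, and it is only a rough approximation
of UAX #29: it attaches combining marks to the preceding base character
and ignores the other rules (Hangul syllables, joiners, and so on).

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified approximation of UAX #29 grapheme clusters: only
    characters in the Unicode "Mark" categories (Mn, Mc, Me) are
    attached to the preceding character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Shin followed by three marks (qamats, dagesh, shin dot), written
# with escapes so the code point order is explicit.
shin = "\u05E9\u05B8\u05BC\u05C1"

print(len(shin))                   # 4 code points
print(len(simple_clusters(shin)))  # 1 cursor stop
```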
This is just a small sample of the potential issues.
Bugs are already reported for some of these issues, but the basic
question is - why reimplement it all in the first place? Why not use
what the browsers offer as much as possible?
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore