Hello all,
is there a query language for wiki syntax?
(NOTE: I really do not mean the Wikipedia API here.)
I am looking for an easy way to scrape data from Wiki pages.
In this way, we could apply a crowd-sourcing approach to knowledge
extraction from Wikis.
There must be thousands of data scraping approaches. But is there one
amongst them that has developed a "wiki scraper language"?
Maybe with some sort of fuzziness involved, if the pages are too messy.
I have not yet worked with the XML transformation of the wiki markup:
  action=expandtemplates
  generatexml - Generate XML parse tree
Is it any good for issuing XPath queries?
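For illustration, what I have in mind is roughly this sketch (Python;
the exact API response layout and the parse-tree element names are
assumptions on my part, not verified):

import requests
from lxml import etree

# Ask the API to expand templates and also return the XML parse tree
# (generatexml is the option listed in the API help quoted above).
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "expandtemplates",
        "text": "{{Infobox person|name=Ada Lovelace}}",
        "generatexml": 1,
        "format": "json",
    },
)
data = resp.json()
# Assumption: the parse tree sits under a "parsetree" key with its
# content in "*"; adjust after inspecting the real response.
xml = (data.get("parsetree") or data["expandtemplates"])["*"]

tree = etree.fromstring(xml.encode("utf-8"))
# Hypothetical XPath query: the titles of all templates used in the text.
for title in tree.xpath("//template/title"):
    print("".join(title.itertext()))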
Thank you very much,
Sebastian
--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
Hi!
sumanah from the MWiki forums suggested I ask for help here.
I'm building an application and would like to add wiki support to it, so
I was wondering: how would it be possible to parse wikitext with
MediaWiki?
--
Hannes
We, the Visual Editor team, have decided to move away from the custom
WikiDom format in favor of plain HTML DOM, which is already used
internally in the parser. The mapping of WikiText to the DOM was very
pragmatic so far, but now needs to be cleaned up before being used as an
external interface. Here are a few ideas for this.
Wikitext can be divided into shorthand notation for HTML elements and
higher-level features like templates, media display or categories.
The shorthand portion of wikitext maps quite directly to an HTML DOM.
Details like the handling of unbalanced tags while building the DOM
tree, remembering extra whitespace or wiki vs. html syntax for
round-tripping need to be considered, but appear to be quite manageable.
This should be especially true if some normalization in edge cases can
be tolerated. We plan to localize normalization (and thus mostly avoid
dirty diffs) by serializing only modified DOM sections while using the
original source for unmodified DOM parts. Attributes are used to track
the original source offsets of DOM elements.
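To illustrate the idea (only a sketch in Python with made-up field
names, not actual parser code): the serializer can emit the original
source slice for untouched subtrees and only re-serialize what was
actually edited.

# Node stands in for whatever DOM representation we end up with;
# 'modified' means "this subtree contains an edit", and 'src' models the
# source-offset attributes mentioned above.
class Node:
    def __init__(self, markup_open="", markup_close="", text="",
                 children=None, modified=False, src=None):
        self.markup_open, self.markup_close = markup_open, markup_close
        self.text = text                # leaf content, if any
        self.children = children or []
        self.modified = modified
        self.src = src                  # (start, end) offsets into the wikitext

def serialize(node, source):
    # Untouched subtree with known offsets: reuse the original wikitext
    # verbatim, so normalization elsewhere cannot produce dirty diffs here.
    if not node.modified and node.src is not None:
        start, end = node.src
        return source[start:end]
    # Modified (or offset-less) node: re-serialize from the DOM.
    inner = node.text + "".join(serialize(c, source) for c in node.children)
    return node.markup_open + inner + node.markup_close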
Higher-level features can be represented in the HTML DOM using different
extension mechanisms:
* Introduce custom elements with specific attributes:
<template href="Template:Bla" args="..." />
For display or WYSIWYG editing these elements then need to be
expanded with the template contents, thumbnail html and so on.
Unbalanced templates (table start/row/end) are very difficult
to expand.
* Expand higher-level features to their presentational DOM, but
identify and annotate the result using custom attributes. This is the
approach we have taken so far in the JS parser [1]. Template
arguments and similar information are stored as JSON in data
attributes, which made their conversion to the JSON-based WikiDom
format quite easy.
Both are custom solutions for internal use. For an external interface, a
standardized solution would be preferable. HTML5 microdata [2] seems to
fit our needs quite well.
Assuming a template that expands to a div and some content, this would
be represented like this:
<div itemscope
itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate' >
<h2>A static header from the template</h2>
<!-- The template argument 'name', expanded in the template -->
<p itemprop='name' content='The wikitext name'>The rendered name</p>
</div>
In this case, an expanded template argument within (for example) an
infobox is identified inside the template-provided HTML structure, which
could enable in-place editing.
Unused arguments (which are not found in the template expansion) or
unexpanded templates can be represented using non-displaying meta elements:
<div itemscope
itemtype='http://en.wikipedia.org/wiki/Template:Sometemplate'
id='uid-1' >
<h2>A static header from the template</h2>
<!-- The template argument 'name', expanded in the template -->
<p itemprop='name' content='The wikitext name'>The rendered name</p>
<meta itemprop='firstname' content='The wikitext firstname'>
</div>
The itemref mechanism can be used to tie together template data from a
single template that does not expand to a single subtree:
<div itemscope itemref='uid-1'>
<!-- Some more template output from expansion of
http://en.wikipedia.org/wiki/Template:Sometemplate -->
</div>
The itemtype attributes in these examples all point to the template
location, which normally contains a plain-text documentation of the
template parameters and their semantics. The most common application of
microdata however references standardized schemas, often from
http://schema.org as those are understood by Google [3], Microsoft, and
Yahoo!. A mapping of semi-structured template arguments to a standard
schema is possible as demonstrated by http://dbpedia.org/. It appears to
be feasible to provide a similar mapping directly as microdata within
the template documentation, which could then potentially be used to add
standard schema information to regular HTML output when rendering a page.
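As a rough sketch of what such annotations buy external consumers
(Python with lxml; the file name is made up, and itemref handling and
nested scopes are ignored for brevity):

from lxml import html

page = html.fromstring(open("Some_article.html").read())

# Collect (itemtype, {itemprop: value}) pairs from every itemscope.
for scope in page.xpath("//*[@itemscope]"):
    itemtype = scope.get("itemtype", "")
    props = {}
    for prop in scope.xpath(".//*[@itemprop]"):
        # In the markup above, 'content' carries the original wikitext
        # value; fall back to the rendered text otherwise.
        props[prop.get("itemprop")] = prop.get("content") or prop.text_content()
    print(itemtype, props)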
The visual editor could also use schema information to customize the
editing experience for templates or images. Inline editing of fields in
infoboxes with schema-based help is one possibility, but in other cases
a popup widget might be more appropriate. Additional microdata in
template documentation sections could provide layout or other UI
information for these widgets.
There are still quite a few loose ends, but I think the general
direction of reusing standards as far as possible and hooking into the
thriving HTML5 ecosystem has many advantages. It allows us to reuse
quite a few libraries and infrastructure, and makes our own developments
(and data of course) more useful to others.
So- I hope you made it here without falling asleep!
What do you think about these ideas?
Gabriel
References:
[1]: http://www.mediawiki.org/wiki/Future/Parser_development
[2]:
http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html
[3]:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=99170&topic…
This text is on the wiki at
http://www.mediawiki.org/wiki/Future/HTML5_DOM_with_microdata
(Continuing the crunching, huh? But this message is only 4 pages long.)
Forked into the new thread.
I'm afraid the latest mailing list messages were considered TL;DR by
most of the subscribers, so I will put this at the beginning: is there
any point in continuing our discussion on the subject? Platonides is
constructive company, but he seems to be the only one participating.
Is the community truly interested in reworking the markup?
I have some knowledge and code assets that I will be happy to
contribute; I will gladly take part in discussions or help improve the
situation in some other way. But if the Wikimedia team has different
views on the evolution of the markup, it's fruitless to spend so much
time talking behind closed doors.
My reply follows.
On 08.02.2012 2:27, Platonides wrote:
> Nobody proposed to change the template in that way? :)
You mean that nobody has actually studied markup usability?
> If you start creating inline, block and mixed template modes, I
> suspect the syntax will end up being chaotic (I'm thinking of
> concrete cases in MW syntax).
True, that's why I propose only two modes: block and inline, both with
clear distinctions and features.
> That assumes that there's a non-ambiguous way to express that in
> natural language (plus that it is easily parseable by a machine).
Yes, with a few simple rules added, an unambiguous language can be
created. I'm sure most business e-mails and official documents can be
processed by a machine without much effort. And we're talking about an
even more formalized language here - text markup.
> So, how do you split {{About Bijection, injection and surjection}} ?
If that is supposed to be a long caption (4 words and a comma), then
just use quotes - like in natural handwriting: {{About "Bijection,
injection and surjection"}}
> The point of using an additional character not used in normal
> language is precisely for working with the metalanguage.
I disagree, it only means that this subject has not yet been researched
enough.
>>> Also, there are colons as parameters. How would you write as the
>>> parameter the article [[Gypsy: A Musical Fable]] or [[Batman:
>>> Year One]] ? By banning ':' in titles?
>> Have I said something about colons and links? Links are fine with
>> colons or any other symbols.
> You mentioned colons for template arguments. I'm acting as the
> devil's advocate, asking you how to provide those titles as parameters
> to a template.
Uh, I have mistyped "comma" instead of "colon". Let me correct this:
1. {{About Something}}
2. {{About Something, of kind}}
3. {{About "Something, something and something", of kind}}
4. {{About "Something, something and something", "of kind, kind and kind"}}
As you can see, no character is banned from the title, while in the
current pipe-centric approach I don't think it's possible to have pipes
there without a headache.
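To show this is trivial to parse, here is a minimal sketch of the quote
rule (Python; my own naming, not taken from any existing parser):

def split_template_args(body):
    # Split the argument part of {{...}} on commas, treating
    # "double-quoted" chunks as single arguments, so commas inside
    # quotes are preserved.
    args, current, in_quotes = [], [], False
    for ch in body:
        if ch == '"':
            in_quotes = not in_quotes   # toggle quoting, drop the quote itself
        elif ch == ',' and not in_quotes:
            args.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    args.append(''.join(current).strip())
    return args

# Argument part of example 3 above (the template name, 'About', would be
# taken off the front beforehand):
print(split_template_args('"Something, something and something", of kind'))
# -> ['Something, something and something', 'of kind']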
>> But if we're touching this, pipes in links are not that intuitive
>> either. Pipes are actually not present on many keyboard layouts, but
>> even apart from that it's more natural to use an equality sign. Or a
>> double one, for the purpose of text markup.
> It's consistent with the use of pipes in templates (which do use
> equal in that way to name parameters). Although link syntax was
> probably earlier.
Right, and pipes should not appear in templates either. It's too
special a symbol.
> So is [[Batman Forever]] your syntax for [[Batman Forever|Batman
> Forever]] or [[Batman|Forever]]? So many cases are bad, KISS.
I do not see your point. The processing is straightforward (a short
code sketch follows after the use cases below):
1. The link contains == - it separates the address from the title.
2. The link contains no == but contains a space - the first space
separates the address from the title.
3. There is neither == nor ' ' - the link is titleless. This means that:
* local links get their titles from the page name, not the page address
(this is important and differs from the current MediaWiki implementation
for the better)
* remote links can also get their title from <title> after fetching the
first 4 KiB of that page, or something similar
Use cases:
* "[[http://google/search?q=%61%62%63 Google it]]" - for external links
== delimiter won't be used at all
* "See this [[page]]" - current wikitext is the same
* "See [[page that page]]" vs. current [[page|that page]]. Looks more
clean and easier to type (space is present on all keyboards and is quite
large in size). This covers not less than half local links.
* "See [[Some page==this page]]" vs. current [[Some page|this page]].
This case has less drastic differences than previous 3 but a pipe is
still both special to English layouts and less noticeable to human eye
than double equality sign.
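To make the rules above concrete (a minimal Python sketch; escaping
with ~== is omitted, and this is not my actual PHP implementation):

def split_link(body):
    # Split the inside of [[...]] into (address, title) using the
    # proposed rules: '==' wins, then the first space, else no title.
    if '==' in body:
        address, title = body.split('==', 1)          # rule 1
    elif ' ' in body:
        address, title = body.split(' ', 1)           # rule 2
    else:
        address, title = body, None                   # rule 3: titleless
    return address, title

print(split_link('page'))                   # ('page', None)
print(split_link('page that page'))         # ('page', 'that page')
print(split_link('Some page==this page'))   # ('Some page', 'this page')
print(split_link('http://google/search?q=%61%62%63 Google it'))
                                            # (the URL, 'Google it')
print(split_link('2 + 2 = 5'))              # ('2', '+ 2 = 5')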
Does "KISS" mean that every use case should be created with uniform but
because of this equally inconvenient syntax? I agree that more complex
cases should have correspondingly more complex syntax but this scaling
must be adequate. By placing pipe everywhere not only cross-language
usability is reduced but the fact that it's redundant in some cases (#1
and #3 items above) is ignored.
>> 4. Finally, in very rare cases when both a space and an equality
>> symbol are necessary, a special markup-wise (!) escape symbol can be
>> used.
> As an example: [[2 + 2 = 5]]
Your example contains no double equality symbol and is treated as a
space-separated title: [[2| + 2 = 5]] in current wikitext.
> Would you remove === headings?
No, headings are consistent because the first heading level starts with
a double equality sign.
>> Currently wikitext uses the terrible "<nowiki>stuff</nowiki>", but it
>> doesn't always work, and HTMLTidy comes in handy with its &lt; and
>> &gt;. And some places (such as link titles) cannot be escaped at all.
> Really? I think you can.
Give some examples and we will examine their adequacy.
> Your proposal for forcing people to edit the URLs is very bad. You
> can't just paste, you need to go through changing every = in it (which
> is a frequent character) to ~==.
No, no, no, you have got a completely wrong idea. You don't have to
escape a SINGLE = because it is not special. You only need to escape a
double ==. How many double == have you seen in links? I have seen them
used on my local bookstore's site, but that's surely an exception.
> Pipes are banned from titles.
Great, let's make machine's life easier.
> I'm not sure this is a good analogy. Copy-pasting chunks of code looks
> like copying phrases from other articles to make your own. That
> should be original. OTOH, reusing the existing LaTeX template is much
> more appropriate than writing your own from scratch trying to copy the
> style of the provided one.
For such things, templates must be created that reduce the number of
entities identical across all of their use cases to a minimum. In
MediaWiki this is done using {{templates and=parameters}}, and this is
good. If you were talking about copy-pasting these templates, their
parameters and empty values - that is fine. But if it was about
copy-pasting the same code with all the rendering tricks (&nbsp;,
{{iejrhgy}} and other cryptic things) - that is bad.
> Even if I write a program from scratch, I should make it consistent
> with other tools. That means appropriate arguments would be sort
> -r --ignore-case --sort=month ./myfile instead of sort<- !case (sort
> as month) \\\\./myfile\\\\
Standardizing is fine unless it starts looking unnatural. The following
example might be argued with, but I can't think of another one quickly:
tar -czf file.tar.gz .
While this uses standard CLI syntax and is in the true *nix ideology,
this is what (among other things) separates POSIX from Windows. For
instance, I could write:
tar file.tar.gz .
...and the program would detect the -czf arguments on its own:
-f is simply implied because there are 2 unnamed arguments (without a
leading -X)
-c because the target file doesn't exist
-z because the target file has the extension .gz
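In code, the heuristic I have in mind is roughly this (a toy sketch,
not a patch to the real tar):

import os

def infer_tar_flags(args):
    # Guess -c/-z/-f from two plain arguments, as described above.
    flags = []
    if len(args) == 2:                    # two unnamed arguments: -f implied
        target, source = args
        if not os.path.exists(target):    # target doesn't exist yet: -c
            flags.append('c')
        if target.endswith('.gz'):        # .gz extension: -z
            flags.append('z')
        flags.append('f')
    return flags

print(infer_tar_flags(['file.tar.gz', '.']))   # ['c', 'z', 'f']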
It's the same with templates or other markup: while {{About page=Earth
kind=planet}} or something similar is fine, {{About Earth, planet}} or
some other form is more appropriate in this particular use case.
> You are giving many attributions to the machine. Personally, I would
> spit out an error, just in case they were e.g. in different units.
Yes, this is one of the ways, and I would opt for it if we want to have
a strict syntax.
> But you are making up your syntax, then requiring the system to adapt
> for you.
Can you elaborate more on this point?
>>> The goal of wikitext is to make html editing easy.
>> HTML editing? I thought wikitext was about text editing. Why not
>> edit HTML using HTML?
> Because it's considered cumbersome. (Actually, it's presentational
> editing, but as the presentation is obtained by using HTML as an
> intermediate language...)
Indeed, HTML is cumbersome; that's why wikitext and all the other text
markups were invented. But they don't have to copy HTML syntax - just
the opposite.
> And you have complicated the originally clean syntax of 1, 2, 3
Clean syntax for whom? For Englishmen? And are hashes actually clean? If
so, why don't we use them in our e-mail messages?
> Would html links become italic? (that was a problem of wikicreole, it
> was defined as 'italic unless in links')
Not at all, because we are talking about a context-specific grammar.
Addresses in links can hold no formatting, and thus all but the
context-ending tokens (]], space and ==) are ignored there.
And yes, a context-specific grammar is more than regular expressions can
handle. Regexps are good, but this doesn't mean that anything
incompatible with sed is "too complex".
As already mentioned, I am using my own markup processor, written in
PHP, on my projects; it implements all the markup already described,
including the [[http://italic]] (context-specific grammar) case, and its
parsing loop is under 350 lines of code.
> Well, I have to say it seems well thought out, it "doesn't look bad".
Thank you. I have given it a lot of thought and practice, but I'm sure
there are still things to improve. I would be ecstatic if my experience
can help the world's largest free knowledge community.
Thanks again for your mail, Platonides.
Signed,
P. Tkachenko
Hi wikitext-l!
I've read http://www.mediawiki.org/wiki/Future/Parser_plan recently,
and the plans seemed strange and scary to me.
In several places, things like the following are said:
...rich text editor which will let most editors and commenters
contribute text without encountering source markup...
...further reducing the need of advanced editors to work with markup
directly...
...by integrating well with a rich text editor, we reduce the
dependence of editors and commenters on dealing with low-level
wikitext...
..."oh there's that funky apostrophe thing in this old-style page".
Most editors will never need to encounter it...
Such plans seem very scary to me, as I think PLAIN TEXT is one of the
MOST IMPORTANT features of wiki software! And you basically say you want
to move away from it and turn MediaWiki into another Word, with all the
problems of "WYSIWYdnG" (...Is What You don't Get) editors. I don't
think I need to explain the advantages of plain-text markup to the core
developers of MediaWiki... :)
I've patched the parser code slightly (in mediawiki4intranet) and I
understand it's not perfect, so I support any effort to create a new
parser - but not if it involves fully moving away from markup in the
future...
So, my question is - is this all true, or did I misunderstand the
plans?
Hi Good People.
I'd like to thank everyone for helping me with labs, code reviews and other difficulties.
Recent Search Related Activity:
1. Branched the project to svn:https://svn.wikimedia.org/svnroot/mediawiki/trunk/lucene-search-3
2. Upgraded the code from Lucene 2.4.0 to 2.9.1 last December, and I've been reviewing and committing to svn.
3. I've migrated the project from Ant to Maven.
4. We have placed the Maven-based code in continuous integration on Jenkins, with JUnit, PMD & coverage reports in place.
5. With the help of some excellent volunteers, I've set up a lab to test the build using the Simple English Wikipedia.
6. One major setback is that no proper testing or deployment is possible for updates. For this reason I've not closed any of the bugs I've worked on. (Access to the production machines is considered too sensitive now that there are labs, but at this time labs do not have the capacity to replicate production, and setting up such a lab has been unsuccessful. Once the scripts are sanitized and production search is put into Puppet, it may become possible; for now the labs environment is a far cry from production in terms of both content and updating.)
7. I've done some rough analysis and design for the next version of search, which will feature computational linguistics support for the many languages used in the Wikipedias. It will also feature search analytics (optimizing ranking) and innovative content analytics for ranking, including objective metrics on neutral point of view (via sentiment analysis), notability (via semantic algorithms) and checking of external links (anti-link-spam).
8. We are trying to relicense the search code so that the Lucene community in the Apache projects will become more involved. It may also be necessary to relicense MWdumper, since the two projects are related.
In the pipeline:
1. Testing and integration into Lucene of an ANTLR grammar which parses wiki-syntax tables. Once successful, it will also be integrated
into SOLR and become the prototype for more difficult wiki syntax analysis tasks.
2. I've started on some of the NLP tasks:
a. a Transducer of Devanagari scripts to IPA. (in HFST)
b. a Transducer of English to IPA, with the common goal of indexing named entities based on their sound in a language-agnostic fashion
(also in HFST).
c. Extraction of phonetics data from English Wiktionary.
d. conversion of CMU pronunciation dictionary to IPA.
e. Extraction of bi-lingual lexicons from Wiktionary and conversion to Apertium Formats.
f. Unsupervised learning of morphologies using minimum description length.
g. Sentence boundary detection (SVM and MaxEnt Models).
h. Topological text alignment algorithm.
3. A Maven based POM for building and packaging SOLR + our extension for distributed use.
4. A repository for NLP artifacts built from WikiContent.
Oren Bochman
MediaWiki Search Lead.
(branched from the long thread)
Dear Jay and list,
if there are 97-99% compatible alternate parsers, that is in fact
something I should know about!
I tried to figure out which are those 4-5 you mentioned but from
http://www.mediawiki.org/wiki/Alternative_parsers it's not so easy.
FlexBisonParse? Magnus's?
So I thought I'd ask everyone: which are the top alternate parsers (in
terms of compatibility with wikitext syntax)?
Thanks
Mihály
On 6 February 2012 22:32, Jay Ashworth <jra(a)baylink.com> wrote:
> ----- Original Message -----
>> From: "Mihály Héder" <hedermisi(a)gmail.com>
>
>> By following this list I hope I gathered how they plan to tackle this
>> really hard problem:
>> - a functional decomposition of what the current parser does into a
>> separate tokenizer, an AST (aka WOM, or now just DOM) builder and a
>> serializer. Also, AST building might be further decomposed into the
>> builder part and error handling according to the HTML specs.
>> - in architectural terms, all this will be a separate component, unlike
>> the old PHP parser, which is really hard to take out from the rest of
>> the code.
>> In this setup there is hope that the tokenizing task can be specified
>> with a set of rules, thus effectively creating a wikitext tokenizing
>> standard (already a great leap forward!)
>> Then the really custom stuff (because wikitext still lacks a formal
>> grammar) can be encapsulated in AST building.
>
> As I noted in a reply I wrote on this thread a few minutes ago (but it
> was kinda buried): there are between 4 and 7 projects at varying
> stages of seriousness that are already in the works, some of them
> having posted to this list one or more times.
>
> At least a couple of them had as a serious goal producing a formalized,
> architecturally cleaner parser that could be dropped into Mediawiki.
>
> The framing of your reply suggests that you needed to know that and
> didn't.
>
> Cheers,
> -- jra
> --
> Jay R. Ashworth Baylink jra(a)baylink.com
> Designer The Things I Think RFC 2100
> Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII
> St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
>