On Mon, May 2, 2011 at 12:04 AM, Tim Starling tstarling@wikimedia.org wrote:
Can someone please tell me, in precise technical terms, what is wrong with Wikia's WYSIWYG editor and why we can't use it?
I have heard that it has bugs in it, but I have not been told exactly what these bugs are, why they are more relevant for Wikimedia than for Wikia, or why they can't be fixed.
Years ago, we talked dismissively about WYSIWYG. We discussed the features that a WYSIWYG editor would have to have, pointing out how difficult they would be to implement and how we didn't have the manpower to pull off such a thing. Now that Wikia has gone ahead and implemented those exact features, what is the problem?
The most fundamental problem with Wikia's editor remains its fallback behavior when some structure is unsupported:
"Source mode required
Rich text editing has been disabled because the page contains complex code."
Here's an example of unsupported code, the presence of which makes a page permanently uneditable by the rich editor until it's removed:
<table> <tr><td>a</td></tr> </table>
You can try this out now at http://communitytest.wikia.com/
It will at least let you edit other *sections* that don't contain anything that scares it, but if the nasty bit is somewhere in what you want to edit, it just doesn't recover.
There are some smart things in what they're doing: annotating the markup ought to be a big help in hooking up the rendered HTML bits back to the original source. The way they hold template invocations and plugins as standalone placeholders within the rich text is pretty good (and could be a bit better if it could display some content and provide even more advanced invocation editing tools, which is all detail work).
But if it just gives up on entire pages, we've got a problem because to handle Wikipedia we need to handle lllooonnnggg pages that tend to include lots of complex templates which pull in funky code of their own.
At a minimum, assuming the other round-tripping problems are all resolved and the treatment of templates and extensions can be improved, it would need to be changed to recognize uneditable chunks and present them as placeholders too -- like the templates, you should be able to dive into source and edit them if need be, but they ought not destroy the rest of the page.
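The placeholder idea above can be sketched in a few lines (hypothetical Python, not any actual editor's code): unsupported chunks become opaque nodes that are displayed as placeholders and re-emitted byte-for-byte, so the rest of the page stays richly editable.

```python
# Hypothetical sketch: chunks the rich editor cannot handle become opaque
# placeholder nodes that round-trip their source verbatim, instead of
# disabling rich editing for the whole page.

class RichNode:
    """A chunk the editor understands and can re-serialize."""
    def __init__(self, text):
        self.text = text
    def serialize(self):
        return self.text

class OpaquePlaceholder:
    """A chunk the editor does NOT understand: shown as a placeholder,
    editable only as raw source, emitted back byte-for-byte."""
    def __init__(self, source):
        self.source = source
    def serialize(self):
        return self.source  # never rewritten, so nothing is destroyed

def serialize(nodes):
    return "".join(n.serialize() for n in nodes)

# A page mixing editable prose with an unsupported construct:
page = [
    RichNode("Some ''editable'' prose.\n"),
    OpaquePlaceholder("<table> <tr><td>a</td></tr> </table>\n"),
    RichNode("More editable prose.\n"),
]
assert "<table>" in serialize(page)  # unsupported chunk survives untouched
```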
Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it, or what else do we need? I wrote up some notes a few weeks ago, which need some more collation & updating from the preliminary experiments I'm doing, and I would very much appreciate more feedback from you Tim and from everyone else who's been poking about in parser & editing land:
http://www.mediawiki.org/wiki/Wikitext.next
And also some of Trevor's notes which I have poked at:
http://www.mediawiki.org/wiki/Visual_Editor_design
I've got some aggressive ideas about normalizing how we deal with template expansion to work at the parse tree level; this can be friendlier to some levels of caching, splitting portions of parsing between PHP and optimized native code, or even mixing some things between pre-parsed text and client-side work. Most importantly, I'm interested in making sure we have a relatively clean hierarchical relationship between parts of the document, which we can use to much more reliably hook up parts of the rendered HTML output:
* maintain an abstract parse tree that can be hooked up fully to both the original source text *and* the live output DOM
* do section, paragraph, or table-cell editing inline directly on a view page, with predictable replacements
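A minimal sketch of how those two pieces could connect (hypothetical Python; all names here are illustrative, not an existing MediaWiki API): every parse-tree node carries its source byte span plus a stable id stamped onto its rendered HTML, so a click in the live DOM maps straight back to the editable source slice.

```python
# Hypothetical sketch: parse-tree nodes record (a) the span they came from
# in the wikitext source and (b) a stable id emitted into the rendered HTML,
# so a browser click on rendered output can be traced back to the exact
# source slice to edit and replace in place.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str          # e.g. 'section', 'paragraph', 'table-cell'
    start: int         # offset into the original wikitext
    end: int
    node_id: int
    children: list = field(default_factory=list)

def render(node, source):
    inner = "".join(render(c, source) for c in node.children) \
            or source[node.start:node.end]
    # the data attribute ties the live DOM back to this parse-tree node
    return f'<div data-mw-node="{node.node_id}">{inner}</div>'

def source_slice_for(root, node_id, source):
    """Given an id clicked in the rendered DOM, return the editable source."""
    stack = [root]
    while stack:
        n = stack.pop()
        if n.node_id == node_id:
            return source[n.start:n.end]
        stack.extend(n.children)
    return None

src = "Intro text.\n\nA paragraph to edit."
tree = Node('section', 0, len(src), 1, [
    Node('paragraph', 0, 11, 2),
    Node('paragraph', 13, len(src), 3),
])
assert source_slice_for(tree, 3, src) == "A paragraph to edit."
```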
It may well be that this is too expansive and we'll want to contract to something that's more like Wikia's annotated parser output -- in most cases it should give us similar information but it'll probably be harder to replace parts of the page at runtime in JavaScript.
Another goal beyond editing itself is normalizing the world of 'alternate parsers'. Several have been announced recently, and we now have a large array of them available, all a little different. We even use mwlib ourselves in the PDF/ODF export deployment, and while we don't maintain that engine we need to coordinate a little with the people who do so that new extensions and structures get handled.
A new visual editor that's built around a normalized, defined parser could be a great help; other folks will be able to use compatible parsers instead of mostly-similar parsers.
For the moment I'm mostly schooling myself on the current state of the world and setting up experimental tools to aid in debugging extra parser/editor-related goodies (e.g. the inspector tool I'm fiddling with at http://en.wikipedia.org/wiki/User:Brion_VIBBER/vector.js ), but hope to get some of these projects moving forward after Berlin.
-- brion
Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it or what else do we need?
We want something simpler and easier to use. That is not what Wikia has. I could hardly stand trying it out for a few minutes.
Fred
On Mon, May 2, 2011 at 7:33 PM, Fred Bauder fredbaud@fairpoint.net wrote:
Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it or what else do we need?
We want something simpler and easier to use. That is not what Wikia has. I could hardly stand trying it out for a few minutes.
So, why not use my WYSIFTW approach? It will only "parse" the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough).
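That round-trip guarantee can be sketched in a few lines (hypothetical Python, not the actual WYSIFTW code): a fragment is only lifted into rich editing if re-serializing its parsed form reproduces the source byte-for-byte, whitespace included; anything that fails the check simply stays raw wikitext.

```python
# Hypothetical sketch of the WYSIFTW safety rule: lift a fragment into rich
# editing only if serialize(parse(fragment)) == fragment exactly; otherwise
# leave it as plain wikitext so nothing is ever silently rewritten.
import re

def parse_bold(fragment):
    # toy grammar: recognize only '''bold''' spans
    m = re.fullmatch(r"'''(.*)'''", fragment)
    return ('bold', m.group(1)) if m else None

def serialize_bold(node):
    return "'''" + node[1] + "'''"

def lift(fragment):
    node = parse_bold(fragment)
    if node is not None and serialize_bold(node) == fragment:
        return node          # safe: guaranteed to round-trip unaltered
    return ('raw', fragment) # unsafe: keep as untouched wikitext

assert lift("'''bold'''") == ('bold', 'bold')
assert lift("'''unbalanced") == ('raw', "'''unbalanced")
```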
Today's featured article parses in 2 sec in Chrome, so it's fast enough for most situations using a current browser, and it also supports section editing. There's basic functionality for most things, even a one-click "insert reference" function. There's also still lots missing, but nothing fundamental, mostly time-sink functions like "insert table column" etc.
Magnus
On Mon, May 2, 2011 at 12:55 PM, Magnus Manske magnusmanske@googlemail.com wrote:
On Mon, May 2, 2011 at 7:33 PM, Fred Bauder fredbaud@fairpoint.net wrote:
Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it or what else do we need?
We want something simpler and easier to use. That is not what Wikia has. I could hardly stand trying it out for a few minutes.
So, why not use my WYSIFTW approach? It will only "parse" the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough).
There's a lot I like about the WYSIFTW tool:
* replacing the section edits inline is kinda nice
* folding of extensions and templates is intelligent and allows you to edit them easily (unlike Wikia's, which drops in opaque placeholders, currently requiring you to switch the *entire* section to source mode to change them at all) -- some infoboxes for instance show up as basically editable tables of parameter pairs, which is pretty workable!
* popup menus on links, images, etc. provide access to detail controls without cluttering up their regular view
I've added a side-by-side view of a popular article (top of [[w:Barack Obama]]) with its WYSIFTW editing view and the Wikia editor (which just gives up and shows source) at:
http://www.mediawiki.org/wiki/Wikitext.next#Problems
There are cases, though, where WYSIFTW gets confused, such as a <ref> with multi-line contents -- it doesn't realize that the lists, templates, etc. are inside the ref rather than outside, which messes up the folding.
These sorts of things are why I think it'd be a win to use a common wikitext->AST parser for both rendering and editing tasks: if they're consistent we eliminate a lot of such odd edge cases. It could also make it much easier to do fine-grained editing; instead of invoking the editor on an entire section at a time, we could click straight into a paragraph, table, reference, etc, knowing that the editor and the renderer both are dividing the page up the same way.
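The nesting point can be illustrated with a toy parser (hypothetical Python, far simpler than real wikitext): when renderer and editor share one wikitext-to-tree parser, a multi-line <ref> owns its contents as children, so an editor folding the ref can never mistake the inner list for top-level markup.

```python
# Hypothetical sketch: a single shared parse puts a <ref>'s multi-line body
# *inside* the ref node, so both the renderer and the editor agree that the
# list belongs to the reference, not to the surrounding page.
import re

def parse(text):
    """Toy recursive parse: each <ref>...</ref> becomes one node owning its body."""
    nodes, pos = [], 0
    for m in re.finditer(r"<ref>(.*?)</ref>", text, re.S):
        if m.start() > pos:
            nodes.append(('text', text[pos:m.start()]))
        nodes.append(('ref', parse(m.group(1))))  # body lives INSIDE the ref
        pos = m.end()
    if pos < len(text):
        nodes.append(('text', text[pos:]))
    return nodes

tree = parse("Cite.<ref>\n* item one\n* item two\n</ref> After.")
kinds = [k for k, _ in tree]
assert kinds == ['text', 'ref', 'text']   # the list is not a top-level sibling
```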
-- brion
Magnus Manske wrote:
So, why not use my WYSIFTW approach? It will only "parse" the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough).
Magnus
Crazy idea: What if it was an /extensible/ editor? You could later add a module to enable lists, or "enable graphic <ref>", but also instruct it on how to present to the user some crazy template with a dozen parameters...
best idea so far ...
On 03. 05. 2011 00:29, Platonides wrote:
Magnus Manske wrote:
So, why not use my WYSIFTW approach? It will only "parse" the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough).
Magnus
Crazy idea: What if it was an /extensible/ editor? You could later add a module to enable lists, or "enable graphic <ref>", but also instruct it on how to present to the user some crazy template with a dozen parameters...
On Mon, May 2, 2011 at 3:29 PM, Platonides Platonides@gmail.com wrote:
Magnus Manske wrote:
So, why not use my WYSIFTW approach? It will only "parse" the parts of the wikitext that it can turn back, edited or unedited, into wikitext, unaltered (including whitespace) if not manually changed. Some parts may therefore stay as wikitext, but it's very rare (except lists, which I didn't implement yet, but they look intuitive enough).
Magnus
Crazy idea: What if it was an /extensible/ editor? You could later add a module to enable lists, or "enable graphic <ref>", but also instruct it on how to present to the user some crazy template with a dozen parameters...
Generically a nice idea.
Specific to Wikipedia / WMF projects - all the extensions you might consider adding are pretty much required for our internal uptake of the tool, as our pages are the biggest / oldest / crustiest ones likely to have to be managed...
On 03/05/11 04:25, Brion Vibber wrote:
The most fundamental problem with Wikia's editor remains its fallback behavior when some structure is unsupported:
"Source mode required
Rich text editing has been disabled because the page contains complex code."
I don't think that's a fundamental problem, I think it's a quick hack added to reduce the development time devoted to rare wikitext constructs, while maintaining round-trip safety. Like you said further down in your post, it can be handled more elegantly by replacing the complex code with a placeholder. Why not just do that?
CKEditor makes adding such placeholders really easy. The RTE source has a long list of such client-side modules, added to support various Wikia extensions.
Here's an example of unsupported code, the presence of which makes a page permanently uneditable by the rich editor until it's removed:
<table> <tr><td>a</td></tr> </table>
You can try this out now at http://communitytest.wikia.com/
Works for me.
http://communitytest.wikia.com/wiki/Brion%27s_table
Beyond that let's flip the question the other way -- what do we *want* out of WYSIWYG editing, and can that tool provide it, or what else do we need? I wrote up some notes a few weeks ago, which need some more collation & updating from the preliminary experiments I'm doing, and I would very much appreciate more feedback from you Tim and from everyone else who's been poking about in parser & editing land:
Some people in this thread have expressed concerns about the tiny breakages in wikitext backwards compatibility introduced by RTE, despite the fact that RTE has aimed for, and largely achieved, precise backwards compatibility with legacy wikitext.
I find it hard to believe that those people would be comfortable with a project which has as its goal a broad reform of wikitext syntax.
Perhaps there are good arguments for wikitext syntax reform, but I have trouble believing that WYSIWYG support is one of them, since the problem appears to have been solved already by RTE, without any reform.
Another goal beyond editing itself is normalizing the world of 'alternate parsers'. There've been several announced recently, and we've got such a large array now of them available, all a little different. We even use mwlib ourselves in the PDF/ODF export deployment, and while we don't maintain that engine we need to coordinate a little with the people who do so that new extensions and structures get handled.
I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face.
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
-- Tim Starling
On Mon, May 2, 2011 at 8:28 PM, Tim Starling tstarling@wikimedia.org wrote:
I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face.
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
People want to write their own parsers because they don't want to use PHP. They want to parse in C, Java, Ruby, Python, Perl, Assembly and every other language other than the one that it wasn't written in. There's this, IMHO, misplaced belief that "standardizing" the parser or markup would put us in a world of unicorns and rainbows where people can write their own parsers on a whim, just because they can. Other than "making it easier to integrate with my project," I don't see a need for them either (and tbh, the endless discussions grow tedious).
I don't see any problem with keeping the parser in PHP, and as you point out with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate.
-Chad
On Mon, May 2, 2011 at 8:38 PM, Chad innocentkiller@gmail.com wrote:
People want to write their own parsers because they don't want to use PHP. They want to parse in C, Java, Ruby, Python, Perl, Assembly and every other language other than the one that it wasn't written in.
s/wasn't/was/
-Chad
2011-05-03 02:38, Chad wrote:
On Mon, May 2, 2011 at 8:28 PM, Tim Starling tstarling@wikimedia.org wrote:
I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face.
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
People want to write their own parsers because they don't want to use PHP. They want to parse in C, Java, Ruby, Python, Perl, Assembly and every other language other than the one that it wasn't written in. There's this, IMHO, misplaced belief that "standardizing" the parser or markup would put us in a world of unicorns and rainbows where people can write their own parsers on a whim, just because they can. Other than "making it easier to integrate with my project," I don't see a need for them either (and tbh, the endless discussions grow tedious).
My motivation for attacking the task of creating a wikitext parser is, aside from it being an interesting problem, a genuine concern for the fact that such a large body of data is encoded in such a vaguely specified format.
I don't see any problem with keeping the parser in PHP, and as you point out with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate.
But most of the parser's work consists of running regexp pattern matching over the article text, doesn't it? Regexp pattern matching is implemented by native functions. Does the Zend engine have a slow regexp implementation? I would have guessed that the main reason the parser is slow is the algorithm, not its implementation.
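A rough illustration of the counter-argument (hypothetical Python, not the MediaWiki parser): even when the regex engine itself is native and fast, a parser that makes many passes and does per-match bookkeeping in the host language spends most of its time in the interpreter around the regex calls, not inside the regex library.

```python
# Rough, synthetic illustration: interpreter overhead around many regex
# passes can dwarf the time spent inside the native regex engine itself.
import re
import timeit

text = "a " * 20_000  # synthetic article body

def one_native_pass():
    # a single native scan: nearly all time is inside the C regex engine
    return len(re.findall(r"a", text))

def many_passes_with_callbacks():
    # repeated passes with a host-language callback per match: the
    # interpreter work around each regex call dominates the runtime
    total = 0
    for _ in range(10):
        total += len(re.sub(r"a", lambda m: m.group(0), text))
    return total

t1 = timeit.timeit(one_native_pass, number=1)
t2 = timeit.timeit(many_passes_with_callbacks, number=1)
# t2 is typically one to two orders of magnitude larger than t1
```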
Best Regards,
Andreas Jonsson
On 11-05-03 03:40 AM, Andreas Jonsson wrote:
2011-05-03 02:38, Chad wrote: [...]
I don't see any problem with keeping the parser in PHP, and as you point out with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate.
But most of the parser's work consists of running regexp pattern matching over the article text, doesn't it? Regexp pattern matching is implemented by native functions. Does the Zend engine have a slow regexp implementation? I would have guessed that the main reason the parser is slow is the algorithm, not its implementation.
Best Regards,
Andreas Jonsson
regexps might be fast, but when you have to run hundreds of them all over the place and do stuff in-language then the language becomes the bottleneck.
regexps might be fast, but when you have to run hundreds of them all over the place and do stuff in-language then the language becomes the bottleneck.
some oprofile data shows that pcre is a few percent of execution time - and there's really lots of Zend internals that stand in the way - memory management (HPHP implements it as C++ object allocations via jemalloc), symbol resolutions (native calls in C++), etc.
Domas
samples  %        image name            app name              symbol name
492400   9.6648   libphp5.so            libphp5.so            _zend_mm_alloc_int
451573   8.8634   libc-2.7.so           libc-2.7.so           (no symbols)
347812   6.8268   libphp5.so            libphp5.so            zend_hash_find
345665   6.7847   no-vmlinux            no-vmlinux            (no symbols)
330513   6.4873   libphp5.so            libphp5.so            _zend_mm_free_int
225755   4.4311   libpcre.so.3.12.1     libpcre.so.3.12.1     (no symbols)
159925   3.1390   libphp5.so            libphp5.so            zend_do_fcall_common_helper_SPEC
137709   2.7029   libphp5.so            libphp5.so            _zval_ptr_dtor
127233   2.4973   libxml2.so.2.6.31     libxml2.so.2.6.31     (no symbols)
111249   2.1836   libphp5.so            libphp5.so            zend_hash_quick_find
93994    1.8449   libphp5.so            libphp5.so            _zend_hash_quick_add_or_update
84693    1.6623   libphp5.so            libphp5.so            zend_assign_to_variable
84256    1.6538   fss.so                fss.so                (no symbols)
56474    1.1085   libphp5.so            libphp5.so            execute
49959    0.9806   libphp5.so            libphp5.so            zend_hash_destroy
48450    0.9510   libz.so.1.2.3.3       libz.so.1.2.3.3       (no symbols)
46967    0.9219   libphp5.so            libphp5.so            ZEND_JMPZ_SPEC_TMP_HANDLER
46523    0.9131   libphp5.so            libphp5.so            _zend_hash_add_or_update
45747    0.8979   libphp5.so            libphp5.so            zend_str_tolower_copy
39154    0.7685   libphp5.so            libphp5.so            zend_fetch_dimension_address
35356    0.6940   libphp5.so            libphp5.so            ZEND_RECV_SPEC_HANDLER
33381    0.6552   libphp5.so            libphp5.so            compare_function
32660    0.6410   libphp5.so            libphp5.so            _zend_hash_index_update_or_next_insert
31815    0.6245   libphp5.so            libphp5.so            zend_parse_va_args
31689    0.6220   libphp5.so            libphp5.so            ZEND_SEND_VAR_SPEC_CV_HANDLER
31554    0.6193   libphp5.so            libphp5.so            _emalloc
30404    0.5968   libphp5.so            libphp5.so            _get_zval_ptr_var
29812    0.5851   libphp5.so            libphp5.so            ZEND_ASSIGN_REF_SPEC_CV_VAR_HANDLER
28092    0.5514   libphp5.so            libphp5.so            ZEND_DO_FCALL_SPEC_CONST_HANDLER
27760    0.5449   libphp5.so            libphp5.so            zend_hash_clean
27589    0.5415   libphp5.so            libphp5.so            zend_fetch_var_address_helper_SPEC_CONST
26731    0.5247   libphp5.so            libphp5.so            _zval_dtor_func
24732    0.4854   libphp5.so            libphp5.so            ZEND_ASSIGN_SPEC_CV_VAR_HANDLER
24732    0.4854   libphp5.so            libphp5.so            ZEND_RECV_INIT_SPEC_CONST_HANDLER
22587    0.4433   libphp5.so            libphp5.so            zend_send_by_var_helper_SPEC_CV
22176    0.4353   libphp5.so            libphp5.so            _efree
21911    0.4301   libphp5.so            libphp5.so            .plt
21102    0.4142   libphp5.so            libphp5.so            ZEND_SEND_VAL_SPEC_CONST_HANDLER
19556    0.3838   libphp5.so            libphp5.so            zend_fetch_property_address_read_helper_SPEC_UNUSED_CONST
18568    0.3645   libphp5.so            libphp5.so            zend_get_property_info
18348    0.3601   libphp5.so            libphp5.so            zend_std_get_method
18279    0.3588   libphp5.so            libphp5.so            zend_get_hash_value
17944    0.3522   libphp5.so            libphp5.so            php_var_unserialize
17461    0.3427   libphp5.so            libphp5.so            _zval_copy_ctor_func
17187    0.3373   libtidy-0.99.so.0.0.0 libtidy-0.99.so.0.0.0 (no symbols)
16341    0.3207   libphp5.so            libphp5.so            zend_get_parameters_ex
16103    0.3161   libphp5.so            libphp5.so            zend_std_read_property
15662    0.3074   libphp5.so            libphp5.so            zend_hash_copy
14678    0.2881   libphp5.so            libphp5.so            zend_binary_strcmp
14556    0.2857   apc.so                apc.so                my_copy_hashtable_ex
14279    0.2803   libphp5.so            libphp5.so            _zend_mm_realloc_int
13993    0.2747   oprofiled             oprofiled             (no symbols)
13680    0.2685   libphp5.so            libphp5.so            dom_nodelist_length_read
13265    0.2604   libphp5.so            libphp5.so            zval_add_ref
13166    0.2584   libphp5.so            libphp5.so            zend_objects_store_del_ref_by_handle
13084    0.2568   libphp5.so            libphp5.so            ZEND_INIT_METHOD_CALL_SPEC_CV_CONST_HANDLER
13030    0.2558   libphp5.so            libphp5.so            zend_assign_to_object
11822    0.2320   libphp5.so            libphp5.so            ZEND_INSTANCEOF_SPEC_CV_HANDLER
11511    0.2259   libphp5.so            libphp5.so            zend_fetch_property_address_read_helper_SPEC_CV_CONST
11425    0.2242   libphp5.so            libphp5.so            _estrndup
11340    0.2226   libphp5.so            libphp5.so            zendi_smart_strcmp
11227    0.2204   libphp5.so            libphp5.so            ZEND_JMPZ_SPEC_VAR_HANDLER
11174    0.2193   libphp5.so            libphp5.so            ZEND_FETCH_CLASS_SPEC_CONST_HANDLER
11080    0.2175   libphp5.so            libphp5.so            _zend_hash_init
10908    0.2141   libphp5.so            libphp5.so            zend_object_store_get_object
10623    0.2085   libphp5.so            libphp5.so            zend_assign_to_variable_reference
10577    0.2076   libphp5.so            libphp5.so            zend_hash_index_find
10231    0.2008   libphp5.so            libphp5.so            ZEND_JMP_SPEC_HANDLER
10227    0.2007   libphp5.so            libphp5.so            ZEND_RETURN_SPEC_CONST_HANDLER
9400     0.1845   libphp5.so            libphp5.so            _safe_emalloc
8973     0.1761   libphp5.so            libphp5.so            ZEND_BOOL_SPEC_TMP_HANDLER
8652     0.1698   libphp5.so            libphp5.so            zend_lookup_class_ex
8504     0.1669   libphp5.so            libphp5.so            ZEND_JMPZ_EX_SPEC_TMP_HANDLER
8489     0.1666   libphp5.so            libphp5.so            zend_call_function
8448     0.1658   libphp5.so            libphp5.so            convert_to_boolean
8307     0.1630   libphp5.so            libphp5.so            ZEND_JMPZ_SPEC_CV_HANDLER
8297     0.1629   libphp5.so            libphp5.so            zend_hash_rehash
8092     0.1588   libphp5.so            libphp5.so            ZEND_INIT_METHOD_CALL_SPEC_UNUSED_CONST_HANDLER
7855     0.1542   libphp5.so            libphp5.so            ZEND_RETURN_SPEC_VAR_HANDLER
7659     0.1503   libphp5.so            libphp5.so            instanceof_function_ex
7552     0.1482   libphp5.so            libphp5.so            ZEND_FE_FETCH_SPEC_VAR_HANDLER
7383     0.1449   libphp5.so            libphp5.so            ZEND_FETCH_OBJ_R_SPEC_UNUSED_CONST_HANDLER
7036     0.1381   libphp5.so            libphp5.so            is_identical_function
7012     0.1376   libphp5.so            libphp5.so            php_is_type
6907     0.1356   libphp5.so            libphp5.so            zend_hash_get_current_data_ex
6901     0.1355   libphp5.so            libphp5.so            ZEND_SEND_REF_SPEC_CV_HANDLER
6881     0.1351   libphp5.so            libphp5.so            concat_function
6860     0.1346   libphp5.so            libphp5.so            zend_hash_del_key_or_index
6843     0.1343   libphp5.so            libphp5.so            php_pcre_match_impl
6648     0.1305   libphp5.so            libphp5.so            zend_isset_isempty_dim_prop_obj_handler_SPEC_VAR_CV
6600     0.1295   libphp5.so            libphp5.so            ZEND_ASSIGN_DIM_SPEC_CV_UNUSED_HANDLER
6538     0.1283   libphp5.so            libphp5.so            _phpi_pop
6306     0.1238   libphp5.so            libphp5.so            zend_get_constant_ex
6254     0.1228   libphp5.so            libphp5.so            zif_strtr
5901     0.1158   libphp5.so            libphp5.so            zend_fetch_class
5829     0.1144   libphp5.so            libphp5.so            zif_dom_nodelist_item
5809     0.1140   libphp5.so            libphp5.so            sub_function
5805     0.1139   libphp5.so            libphp5.so            zend_std_write_property
5789     0.1136   libphp5.so            libphp5.so            ZEND_RETURN_SPEC_CV_HANDLER
5753     0.1129   libphp5.so            libphp5.so            _ecalloc
5678     0.1114   libmysqlclient.so.15.0.0 libmysqlclient.so.15.0.0 (no symbols)
5650     0.1109   libphp5.so            libphp5.so            ZEND_ADD_ARRAY_ELEMENT_SPEC_CONST_UNUSED_HANDLER
5470     0.1074   libphp5.so            libphp5.so            ZEND_FETCH_W_SPEC_CONST_HANDLER
5262     0.1033   libphp5.so            libphp5.so            ZEND_SEND_VAL_SPEC_TMP_HANDLER
5259     0.1032   libphp5.so            libphp5.so            ZEND_ASSIGN_SPEC_CV_TMP_HANDLER
5128     0.1007   libphp5.so            libphp5.so            ZEND_FETCH_DIM_W_SPEC_CV_CV_HANDLER
Hi,
I'm not sure what you are profiling, but when repeatedly requesting a preview of an article containing 200000 bytes of data consisting of the pattern "a a a a a a " I got the below results. (The php parser doesn't seem to depend on perl regexps.)
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name           symbol name
994      23.4933  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
545      12.8811  libphp5.so         zendparse
369      8.7213   libphp5.so         lex_scan
256      6.0506   libc-2.11.2.so     memcpy
137      3.2380   libphp5.so         zend_hash_find
135      3.1907   libphp5.so         _zend_mm_alloc_canary_int
105      2.4817   libphp5.so         __i686.get_pc_thunk.bx
90       2.1272   libphp5.so         _zend_mm_free_canary_int
67       1.5835   libphp5.so         zif_strtr
59       1.3945   libphp5.so         zend_mm_add_to_free_list
48       1.1345   libphp5.so         zend_mm_remove_from_free_list
/Andreas
2011-05-03 14:40, Domas Mituzas wrote:
some oprofile data shows that pcre is a few percent of execution time - and there's really lots of Zend internals that stand in the way - memory management (HPHP implements it as C++ object allocations via jemalloc), symbol resolutions (native calls in C++), etc.
Domas
[oprofile output snipped]
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Hi!
I'm not sure what you are profiling,
Wikipedia :)
but when repeatedly requesting a preview of an article containing 200000 bytes of data consisting of the pattern "a a a a a a " I got the results below. (The PHP parser doesn't seem to depend on Perl regexps.)
I'm sure nothing profiles better than a synthetic edge case. What do you mean by it not depending on Perl regexps? It is the top symbol in your profile.
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name           symbol name
994      23.4933  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
545      12.8811  libphp5.so         zendparse
369      8.7213   libphp5.so         lex_scan
256      6.0506   libc-2.11.2.so     memcpy
137      3.2380   libphp5.so         zend_hash_find
135      3.1907   libphp5.so         _zend_mm_alloc_canary_int
105      2.4817   libphp5.so         __i686.get_pc_thunk.bx
90       2.1272   libphp5.so         _zend_mm_free_canary_int
67       1.5835   libphp5.so         zif_strtr
59       1.3945   libphp5.so         zend_mm_add_to_free_list
48       1.1345   libphp5.so         zend_mm_remove_from_free_list
Domas
2011-05-03 22:50, Domas Mituzas skrev:
Hi!
I'm not sure what you are profiling,
Wikipedia :)
but when repeatedly requesting a preview of an article containing 200000 bytes of data consisting of the pattern "a a a a a a " I got the results below. (The PHP parser doesn't seem to depend on Perl regexps.)
I'm sure nothing profiles better than a synthetic edge case. What do you mean by it not depending on Perl regexps? It is the top symbol in your profile.
The discussion was concerning parser performance, so a profile of only parser execution would have been most relevant. In my profiling data, parser execution dominates, and as you can see, it's mostly regexp evaluation. (With "php parser", I was referring to "zendparse" and "lex_scan", which don't seem to use libpcre. I.e., almost all calls to libpcre are made from the wikitext parser.)
/Andreas
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name           symbol name
994      23.4933  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
545      12.8811  libphp5.so         zendparse
369      8.7213   libphp5.so         lex_scan
256      6.0506   libc-2.11.2.so     memcpy
137      3.2380   libphp5.so         zend_hash_find
135      3.1907   libphp5.so         _zend_mm_alloc_canary_int
105      2.4817   libphp5.so         __i686.get_pc_thunk.bx
90       2.1272   libphp5.so         _zend_mm_free_canary_int
67       1.5835   libphp5.so         zif_strtr
59       1.3945   libphp5.so         zend_mm_add_to_free_list
48       1.1345   libphp5.so         zend_mm_remove_from_free_list
Domas
Hi!
The discussion was concerning parser performance,
the discussion was concerning overall parser performance, not just edge cases.
so a profile of only parser execution would have been most relevant.
Indeed, try parsing decent articles with all their template trees.
In my profiling data, parser execution dominates, and as you can see, it's mostly regexp evaluation.
Indeed, because you don't have anything that would invoke callbacks or branching in the code.
(With "php parser", I was referring to "zendparse" and "lex_scan", which doesn't seem to use libpcre. I.e., almost all calls to libpcre is made from the wikitext parser.)
I usually don't know what "php parser" is, opcode caches have been around for the past decade.
Domas
2011-05-03 13:25, Daniel Friesen skrev:
On 11-05-03 03:40 AM, Andreas Jonsson wrote:
2011-05-03 02:38, Chad skrev: [...]
I don't see any problem with keeping the parser in PHP, and as you point out with HipHop support on the not-too-distant horizon the complaints about performance with Zend will largely evaporate.
But most of the parser's work consists of running regexp pattern matching over the article text, doesn't it? Regexp pattern matching is implemented by native functions. Does the Zend engine have a slow regexp implementation? I would have guessed that the main reason the parser is slow is the algorithm, not its implementation.
Best Regards,
Andreas Jonsson
Regexps might be fast, but when you have to run hundreds of them all over the place and do stuff in-language, the language becomes the bottleneck.
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected.
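[Editor's note: the trade-off above can be sketched with a small experiment. This is a rough illustration, not MediaWiki code; the link pattern, input sizes, and replacement callback are invented. A single native regex pass over markup-free text stays inside the regex engine, while markup-dense text of the same size pays for one interpreted callback per construct:]

```python
import re
import time

# Hypothetical link pattern, loosely modelled on [[target|label]] wikitext links.
LINK = re.compile(r"\[\[([^]|]*)(?:\|([^]]*))?\]\]")

def render_links(text):
    # The lambda is the interpreted "glue" that runs once per match.
    return LINK.sub(lambda m: m.group(2) or m.group(1), text)

plain = "a a a " * 33000     # ~200 KB, no markup: the engine scans and never matches
marked = "[[a|b]] " * 25000  # similar size, but every token triggers the callback

t0 = time.perf_counter(); render_links(plain);  t_plain = time.perf_counter() - t0
t0 = time.perf_counter(); render_links(marked); t_marked = time.perf_counter() - t0

print(render_links("[[Foo|bar]] baz"))   # "bar baz"
print(f"plain: {t_plain:.4f}s  marked: {t_marked:.4f}s")
```

On typical interpreters the marked-up input is much slower per byte, which is the "more markup, the slower it will run" effect; the markup-free case approximates the parser's top speed in bytes/second.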
/Andreas
On 04/05/11 15:52, Andreas Jonsson wrote:
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected.
PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text, they keep adding references and links and things. So we don't really care about the parser's "top speed".
-- Tim Starling
2011-05-04 08:13, Tim Starling skrev:
On 04/05/11 15:52, Andreas Jonsson wrote:
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected.
PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text, they keep adding references and links and things. So we don't really care about the parser's "top speed".
We are talking about different things. I don't consider callbacks made when processing "magic words" or "parser functions" to be part of the actual parsing. The reference case of no-markup input is interesting to me as it marks the maximum throughput of the MediaWiki parser, and is what you would compare alternative implementations to. But, obviously, if the Barack Obama article takes 22 seconds to render, there are more severe problems than parser performance at the moment.
Best Regards,
/Andreas Jonsson
On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
2011-05-04 08:13, Tim Starling skrev:
On 04/05/11 15:52, Andreas Jonsson wrote:
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected.
PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text, they keep adding references and links and things. So we don't really care about the parser's "top speed".
We are talking about different things. I don't consider callbacks made when processing "magic words" or "parser functions" to be part of the actual parsing. The reference case of no-markup input is interesting to me as it marks the maximum throughput of the MediaWiki parser, and is what you would compare alternative implementations to. But, obviously, if the Barack Obama article takes 22 seconds to render, there are more severe problems than parser performance at the moment.
It's a little more complicated than that, and obviously you haven't spent a lot of time looking at profiling output from parsing the Barack Obama article if you say that — what, if not the parser, is slowing down the processing of that article?
Consider the following:
1. Many things that you would exclude from "parsing" like reference tags and what-not call the parser themselves.

2. Regardless of whether you include the actual callback in your measurements of parser run time, you need to consider them. Identifying structures that require callbacks, as well as structures that don't (such as links, templates, images, and what not) takes time. While you might reasonably exclude ifexist calls and so on from parser run time, you most certainly cannot reasonably exclude template calls, link processing, nor the extra time taken by the preprocessor to identify such structures.
As Domas says, real world data is king. As far as I know, in the case of 'a a a a', even if you repeat it for a few MB, virtually no PHP code is run, because the preprocessor uses strcspn to identify structures requiring preprocessing. That's implemented in C — in fact, for 'a a a' repeated for a few MB, it's my (probably totally wrong) understanding that the PHP code runs in more or less constant time. It's the structures that appear in real articles that make the parser slow.
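[Editor's note: the strcspn idea can be sketched in Python. This is an illustrative model, not the preprocessor's actual code; the special-character set and function names are invented. In C the call is strcspn(p, "{[<"); here one native regex call plays the same role, consuming an arbitrarily long run of ordinary text in a single step, so interpreted code only runs once per markup character rather than once per byte:]

```python
import re

# Native "skip plain text" step, analogous to strcspn(p, "{[<") in C:
# one call consumes a whole run of ordinary characters with no
# interpreter work per byte.
SKIP = re.compile(r"[^{\[<]*")

def scan(text):
    """Return how many times interpreted code has to wake up while scanning."""
    pos, wakeups = 0, 0
    while pos < len(text):
        pos = SKIP.match(text, pos).end()   # native scan over plain text
        if pos < len(text):
            wakeups += 1                    # hit '{', '[' or '<': slow path runs
            pos += 1
    return wakeups

print(scan("a a a " * 1000))        # 0: markup-free input never leaves the fast path
print(scan("[[a]] [[a]] [[a]] "))  # 6: each '[' costs one interpreted step
```

This is why 'a a a' repeated for megabytes costs nearly constant interpreter time: the number of wake-ups tracks the amount of markup, not the length of the text.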
—Andrew
-- Andrew Garrett Wikimedia Foundation agarrett@wikimedia.org
2011-05-06 03:27, Andrew Garrett skrev:
On Thu, May 5, 2011 at 3:21 AM, Andreas Jonsson andreas.jonsson@kreablo.se wrote:
2011-05-04 08:13, Tim Starling skrev:
On 04/05/11 15:52, Andreas Jonsson wrote:
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes. This is at least the case for the articles that are the easiest for the core parser, which are articles that contain no markup. The more markup, the slower it will run. It is possible that this slowdown will be lessened if compiled with HipHop. But the top speed of the parser (in bytes/second) will be largely unaffected.
PHP execution dominates for real test cases, and HipHop provides a massive speedup. See the previous HipHop thread.
http://lists.wikimedia.org/pipermail/wikitech-l/2011-April/052679.html
Unfortunately, users refuse to write articles consisting only of hundreds of kilobytes of plain text, they keep adding references and links and things. So we don't really care about the parser's "top speed".
We are talking about different things. I don't consider callbacks made when processing "magic words" or "parser functions" to be part of the actual parsing. The reference case of no-markup input is interesting to me as it marks the maximum throughput of the MediaWiki parser, and is what you would compare alternative implementations to. But, obviously, if the Barack Obama article takes 22 seconds to render, there are more severe problems than parser performance at the moment.
It's a little more complicated than that, and obviously you haven't spent a lot of time looking at profiling output from parsing the Barack Obama article if you say that — what, if not the parser, is slowing down the processing of that article?
Consider the following:
1. Many things that you would exclude from "parsing" like reference tags and what-not call the parser themselves.

2. Regardless of whether you include the actual callback in your measurements of parser run time, you need to consider them. Identifying structures that require callbacks, as well as structures that don't (such as links, templates, images, and what not) takes time. While you might reasonably exclude ifexist calls and so on from parser run time, you most certainly cannot reasonably exclude template calls, link processing, nor the extra time taken by the preprocessor to identify such structures.
As Domas says, real world data is king. As far as I know, in the case of 'a a a a', even if you repeat it for a few MB, virtually no PHP code is run, because the preprocessor uses strcspn to identify structures requiring preprocessing. That's implemented in C — in fact, for 'a a a' repeated for a few MB, it's my (probably totally wrong) understanding that the PHP code runs in more or less constant time. It's the structures that appear in real articles that make the parser slow.
I'm sorry, I misunderstood the original statement that HipHop would make _parsing_ significantly faster and questioned that on false premises, because I'm thinking of the parser and the preprocessor as distinctly different components.
Let me explain: as I see it, the first step in formalizing wikitext syntax is to analyze and write a parser that can be used as a drop-in replacement after preprocessing. The stuff that is preprocessed cannot be integrated with the parser without sacrificing compatibility. Preprocessing is problematic. It breaks the one-to-one relationship between the wikitext and the syntax tree (i.e., it is impossible to serialize a syntax tree back to the same wikitext that generated it). Therefore, in a second step, it should be analyzed how the preprocessed constructions can be integrated with the parser and how to minimize the damage from this change.
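[Editor's note: a toy sketch of that round-trip problem. All names here are invented for illustration; this is not MediaWiki's preprocessor. Once a template is textually expanded before parsing, the syntax tree only contains the expansion, so serializing the tree cannot recover the original source:]

```python
# Assume the hypothetical template {{sig}} expands to "~~Andreas~~".
TEMPLATES = {"sig": "~~Andreas~~"}

def preprocess(src):
    # Textual expansion BEFORE parsing, as the current design does.
    out = src
    for name, body in TEMPLATES.items():
        out = out.replace("{{%s}}" % name, body)
    return out

def parse(text):
    # A trivial stand-in "syntax tree": just the list of words.
    return text.split()

def serialize(tree):
    return " ".join(tree)

src = "signed by {{sig}}"
tree = parse(preprocess(src))
print(serialize(tree))          # "signed by ~~Andreas~~"
print(serialize(tree) == src)   # False: the original {{sig}} is unrecoverable
```

The tree round-trips to the *expanded* text, never to the source the user wrote, which is exactly why preprocessing breaks the one-to-one relationship described above.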
I had not analyzed the parts of the core parser that I consider "preprocessing", and it came as a surprise to me that it was as slow as the Barack Obama benchmark shows. But integrating template expansion with the parser would solve this performance problem, and is therefore in itself a strong argument for working towards replacing it. I will write about this on wikitext-l.
Best Regards,
Andreas Jonsson
On 06/05/11 17:13, Andreas Jonsson wrote:
I had not analyzed the parts of the core parser that I consider "preprocessing", and it came as a surprise to me that it was as slow as the Barack Obama benchmark shows. But integrating template expansion with the parser would solve this performance problem, and is therefore in itself a strong argument for working towards replacing it. I will write about this on wikitext-l.
That benchmark didn't have any templates in it, I expanded them with Special:ExpandTemplates before I started. So it's unlikely that a significant amount of the time was spent in the preprocessor.
It was a really quick benchmark, with no profiling or further testing whatsoever. It took a few minutes to do. You shouldn't base architecture decisions on it, it might be totally invalid. It might not be a parser benchmark at all. I might have made some configuration error, causing it to test an unrelated region of the code.
All I know is, I sent in wikitext, the CPU usage went to 100% for a while, then HTML came back.
I've spent a lot of time profiling and optimising the parser in the past. It's a complex process. You can't just look at one number for a large amount of very complex text and conclude that you've found an optimisation target.
-- Tim Starling
Ohi,
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes.
Well, you did an edge case - a long line. Actually, try replacing spaces with newlines, and you will get 25x cost difference ;-)
But the top speed of the parser (in bytes/seconds) will be largely unaffected.
Damn!
Domas
2011-05-04 08:41, Domas Mituzas skrev:
Ohi,
The time it takes to execute the code that glues together the regexps will be insignificant compared to actually executing the regexps for any article larger than a few hundred bytes.
Well, you did an edge case - a long line. Actually, try replacing spaces with newlines, and you will get 25x cost difference ;-)
A single long line containing no markup is indeed an edge case, but it is a good reference case since it is the input where the parser will run at its fastest.
Replacing the spaces with newlines will cause a tenfold increase in the execution time. Sure, in relative terms less time is spent executing regexps, but in absolute numbers, more time is spent there.
/Andreas
samples  %       app name           symbol name
283      8.6044  libphp5.so         zend_hash_quick_find
188      5.7160  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
177      5.3816  libphp5.so         zend_parse_va_args
165      5.0167  libphp5.so         zend_do_fcall_common_helper_SPEC
160      4.8647  libphp5.so         __i686.get_pc_thunk.bx
131      3.9830  libphp5.so         zend_hash_find
127      3.8614  libphp5.so         _zval_ptr_dtor
87       2.6452  libc-2.11.2.so     memcpy
82       2.4932  libphp5.so         _zend_mm_alloc_canary_int
79       2.4019  libphp5.so         zend_get_hash_value
72       2.1891  libphp5.so         _zend_mm_free_canary_int
59       1.7939  libphp5.so         zend_std_read_property
55       1.6722  libphp5.so         execute
52       1.5810  libphp5.so         suhosin_get_config
51       1.5506  libphp5.so         zend_fetch_property_address_read_helper_SPEC_UNUSED_CONST
48       1.4594  libphp5.so         zendparse
But the top speed of the parser (in bytes/seconds) will be largely unaffected.
Damn!
Domas
Hi!
A single long line containing no markup is indeed an edge case, but it is a good reference case since it is the input where the parser will run at its fastest.
Bubblesort will also have O(N) complexity sometimes :-)
Replacing the spaces with newlines will cause a tenfold increase in the execution time. Sure, in relative terms less time is spent executing regexps, but in absolute numbers, more time is spent there.
Well, this is not fair - you should sum up all Zend symbols if you compare that way - there are no debugging symbols for libpcre, so you get an aggregated view. That's the same as saying that 10 is a smaller number than 7, just because you can factorize it ;-)
Comparing apples and oranges doesn't always help; that kind of hand-waving may impress others, but some have spent more time looking at that data than just ranting in a single mailing-list thread ;-)
Cheers, Domas
----- Original Message -----
From: "Andreas Jonsson" andreas.jonsson@kreablo.se
Subject: Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)
My motivation for attacking the task of creating a wikitext parser is, aside from it being an interesting problem, a genuine concern for the fact that such a large body of data is encoded in such a vaguely specified format.
Correct: Until you have (at least) two independently written parsers, both of which pass a test suite 100%, you don't have a *spec*.
Or more to the point, it's unclear whether the spec or the code rules, which can get nasty.
Cheers, -- jra
On Mon, May 2, 2011 at 5:28 PM, Tim Starling tstarling@wikimedia.orgwrote:
On 03/05/11 04:25, Brion Vibber wrote:
The most fundamental problem with Wikia's editor remains its fallback behavior when some structure is unsupported:
"Source mode required
Rich text editing has been disabled because the page contains complex code."
I don't think that's a fundamental problem, I think it's a quick hack added to reduce the development time devoted to rare wikitext constructs, while maintaining round-trip safety. Like you said further down in your post, it can be handled more elegantly by replacing the complex code with a placeholder. Why not just do that?
Excellent question -- how hard would it be to change that?
I'm fairly sure that's easier to do with an abstract parse tree generated from source (don't recognize it? stash it in a dedicated blob); I worry it may be harder trying to stash that into the middle of a multi-level HTML translation engine that wasn't meant to be reversible in the first place (do we even know if there's an opportunity to recognize the problem component within the annotated HTML or not? Is it seeing things it doesn't recognize in the HTML, or is it seeing certain structures in the source and aborting before it even gets there?).
Like many such things, this might be better resolved by trying it and seeing what happens -- I don't want us to lock into a strategy too early when a lot of ideas are still unresolved.
I'm very interested in making experimentation easy; for my pre-exploratory work I'm stashing things into a gadget which adds render/parse tree/inspector modes to the editing page:
http://www.mediawiki.org/wiki/File:Parser_Playground_demo.png (screenshot & links)
I've got this set up as a gadget on mediawiki.org now and as a user script on en.wikipedia.org (loaded on User:Brion_VIBBER/vector.js) just for tossing random pages in and getting a better sense of how things break down. Currently parser variant choices are:
* the actual MediaWiki parser via API (parse tree shows the preprocessor XML; side-by-side mode doesn't have a working inspector mode though)
* a really crappy FakeParser class I threw together, able to handle only a few constructs. Generates a JSON parse tree, and the inspector mode can match up nodes in side-by-side view of the tree & HTML.
* PegParser using the peg.js parser generator to build the source->tree parser, and the same tree->html and tree->source round-trip functions as FakeParser. The peg source can be edited and rerun to regen the new parse tree. It's fun!
These are a long way off from the level of experimental support we're going to want, but I think people are going to benefit from trying a few different things and getting a better feel for how source, parse trees, and resulting HTML really will look.
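[Editor's note: to give a feel for the source -> tree -> HTML pipeline described above, here is a deliberately tiny sketch, in Python rather than JavaScript. The node types and the two constructs handled are invented; the real FakeParser/PegParser gadgets are far richer. The key property being demonstrated is that tree -> source reproduces the input exactly for anything the grammar recognizes:]

```python
import re

# One token per bold span, link, or run of plain text.
TOKEN = re.compile(r"'''(.+?)'''|\[\[(.+?)\]\]|([^'\[]+)")

def parse(src):
    """source -> JSON-ish parse tree (a list of typed nodes)."""
    tree = []
    for bold, link, text in TOKEN.findall(src):
        if bold:
            tree.append({"type": "bold", "text": bold})
        elif link:
            tree.append({"type": "link", "target": link})
        else:
            tree.append({"type": "text", "text": text})
    return tree

def to_html(tree):
    """tree -> HTML rendering."""
    out = []
    for node in tree:
        if node["type"] == "bold":
            out.append("<b>%s</b>" % node["text"])
        elif node["type"] == "link":
            out.append('<a href="/wiki/%s">%s</a>' % (node["target"], node["target"]))
        else:
            out.append(node["text"])
    return "".join(out)

def to_source(tree):
    """tree -> source: the round-trip direction an editor needs."""
    out = []
    for node in tree:
        if node["type"] == "bold":
            out.append("'''%s'''" % node["text"])
        elif node["type"] == "link":
            out.append("[[%s]]" % node["target"])
        else:
            out.append(node["text"])
    return "".join(out)

src = "'''bold''' text with a [[Link]]"
tree = parse(src)
print(to_html(tree))
print(to_source(tree) == src)   # round-trips cleanly for the constructs it knows
```

Anything the grammar doesn't recognize would, in the placeholder approach discussed earlier in the thread, become an opaque node that serializes back to its original bytes instead of breaking the whole page.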
(Template expansion isn't yet presented in this system, and that's going to be where the real fun is. ;)
Some people in this thread have expressed concerns about the tiny breakages in wikitext backwards compatibility introduced by RTE, despite the fact that RTE has aimed for, and largely achieved, precise backwards compatibility with legacy wikitext.
I find it hard to believe that those people would be comfortable with a project which has as its goal a broad reform of wikitext syntax.
Perhaps there are good arguments for wikitext syntax reform, but I have trouble believing that WYSIWYG support is one of them, since the problem appears to have been solved already by RTE, without any reform.
Well, Wikia's RTE still doesn't work on high-profile Wikipedia article pages, so that remains unproven...
That said, an RTE that doesn't require changing core parser behavior yet *WILL BE A HUGE BENEFIT* to getting it into use sooner, and still leaves future reform efforts open.
I'm *VERY OPEN* to the notion of doing the RTE using either a supplementary source-level parser (which doesn't have to render all structures 100% the same as the core parser, but *needs* to always create sensible structures that are useful for editors and can round-trip cleanly) or an alternate version of the core parser with annotations and limited transformations (eg like how we don't strip comments out when producing editable source, so we need to keep them in the output in some way if it's going to be fed into an HTML-ish editing view).
A supplementary parser that deals with all your editing fun, but doesn't play super nice with open...close templates is probably just fine for a huge number of purposes.
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
I'm not convinced that a giant blob of MediaWiki is suitable as a reusable library, but would love to see it tried.
-- brion
On Mon, May 2, 2011 at 5:55 PM, Brion Vibber brion@pobox.com wrote:
On Mon, May 2, 2011 at 5:28 PM, Tim Starling tstarling@wikimedia.orgwrote:
I don't think that's a fundamental problem, I think it's a quick hack added to reduce the development time devoted to rare wikitext constructs, while maintaining round-trip safety. Like you said further down in your post, it can be handled more elegantly by replacing the complex code with a placeholder. Why not just do that?
Excellent question -- how hard would it be to change that?
I'm fairly sure that's easier to do with an abstract parse tree generated from source (don't recognize it? stash it in a dedicated blob); I worry it may be harder trying to stash that into the middle of a multi-level HTML translation engine that wasn't meant to be reversible in the first place (do we even know if there's an opportunity to recognize the problem component within the annotated HTML or not? Is it seeing things it doesn't recognize in the HTML, or is it seeing certain structures in the source and aborting before it even gets there?).
Like many such things, this might be better resolved by trying it and seeing what happens -- I don't want us to lock into a strategy too early when a lot of ideas are still unresolved.
Had a quick chat with Tim in IRC -- we're definitely going to try poking at the current state of the Wikia RTE a bit more.
I'll start merging it to our extensions SVN so we've got a stable clone of it that can be run on stock trunk. Little changes should be mergeable back to Wikia's SVN, and we'll have something available for stock distributions that's more stable than the old FCK extension, and that we can start experimenting with along with other stuff.
Another good thing in this code is the client-side editor plugins; once one gets past the raw "shove stuff in/out of the markup format", most of the hard work and value of an editor actually comes in the helpers for working with links, images, tables, galleries, etc -- dialogs, wizards, helpers for dragging things around. That's all stuff that we can examine and improve or build on.
-- brion
Tim Starling wrote:
Another goal beyond editing itself is normalizing the world of 'alternate parsers'. There've been several announced recently, and we've got such a large array now of them available, all a little different. We even use mwlib ourselves in the PDF/ODF export deployment, and while we don't maintain that engine we need to coordinate a little with the people who do so that new extensions and structures get handled.
I know that there is a camp of data reusers who like to write their own parsers. I think there are more people who have written a wikitext parser from scratch than have contributed even a small change to the MediaWiki core parser. They have a lot of influence, because they go to conferences and ask for things face-to-face.
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
It's unambiguously a fundamental goal that content on Wikimedia wikis be able to be easily redistributed, shared, and spread. A wikisyntax that's impossible to adequately parse in other environments (or in Wikimedia's environment, for that matter) is a critical and serious inhibitor to this goal.
MZMcBride
On Tue, May 3, 2011 at 2:15 PM, MZMcBride z@mzmcbride.com wrote:
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
And how is using the parser with HipHop going to be any more difficult than using it with Zend?
-Chad
On Tue, May 3, 2011 at 10:25 AM, Chad innocentkiller@gmail.com wrote:
On Tue, May 3, 2011 at 2:15 PM, MZMcBride z@mzmcbride.com wrote:
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
And how is using the parser with HipHop going to be any more difficult than using it with Zend?
It's slightly more difficult, but it definitely isn't any easier. The point here is that only having one implementation of the parser, which can change at any time, which also defines the spec (and I use the word spec here really loosely), is something that inhibits the ability to share knowledge.
Requiring people to use our PHP implementation, whether or not it is compiled to C, is ludicrous.
- Ryan
It is much easier to embed it in other languages, once you get a shared object with Parser methods exposed ;-)
Which would also require the linking application to be GPL licensed, which is less than ideal. We shouldn't limit the licensing of applications that want to write wikitext. An alternative implementation can be licensed in any way the author sees fit.
- Ryan
Which of course allows me to fork the thread and ask why does MediaWiki have to be GPL licensed.
I was just talking about this in IRC :). We could re-license the parser to be LGPL or BSD so that other implementations can use our parser more freely.
- Ryan
Hi!
I was just talking about this in IRC :). We could re-license the parser to be LGPL or BSD so that other implementations can use our parser more freely.
This is how WMF staff treats volunteers:
[21:17:23] <Ryan_Lane> domas: and now I took your BSD idea, and didn't give you credit
[21:17:38] * Ryan_Lane wins
[21:17:51] <yuvipanda_> FLAWLESS VICTORY
[21:17:55] <yuvipanda_> except for the IRC logs
Domas
This is how WMF staff treats volunteers:
[21:17:23] <Ryan_Lane> domas: and now I took your BSD idea, and didn't give you credit
[21:17:38] * Ryan_Lane wins
[21:17:51] <yuvipanda_> FLAWLESS VICTORY
[21:17:55] <yuvipanda_> except for the IRC logs
You are evil Domas. For those interested, check the logs higher up, where I discuss licensing way before Domas forked the email thread. He steals my licensing idea, I steal his specific license. :D
- Ryan
If you guys still think about ideas as things that can be stolen, perhaps you should check out the open source movement. Here's a good reference:
http://en.wikipedia.org/wiki/Open_source
On Tue, May 3, 2011 at 11:21 AM, Ryan Lane rlane32@gmail.com wrote:
This is how WMF staff treats volunteers:
[21:17:23] <Ryan_Lane> domas: and now I took your BSD idea, and didn't give you credit
[21:17:38] * Ryan_Lane wins
[21:17:51] <yuvipanda_> FLAWLESS VICTORY
[21:17:55] <yuvipanda_> except for the IRC logs
You are evil Domas. For those interested, check the logs higher up, where I discuss licensing way before Domas forked the email thread. He steals my licensing idea, I steal his specific license. :D
- Ryan
On Wed, May 4, 2011 at 1:35 AM, Peter Youngmeister py@wikimedia.org wrote:
If you guys still think about ideas as things that can be stolen,
I'm pretty sure that was meant as a joke.
Oh wait, what if *this* was also meant as a joke and I didn't get it? :|
/me hopes the OT messages end with this and next one is on topic.
So, getting back on topic, if we are seriously considering telling people to use a hphp compiled version of the parser as a library, we should re-license it in a more compatible license. LGPL is good, BSD is likely better.
Thoughts? Also, for re-licensing, what level of approval do we need? All authors of the parser, or the current people in an svn blame?
- Ryan
Thoughts? Also, for re-licensing, what level of approval do we need? All authors of the parser, or the current people in an svn blame?
Current people are doing 'derivative work' on previous authors' work. I think all are needed. Pain oh pain.
Domas
On 3 May 2011 21:15, Domas Mituzas midom.lists@gmail.com wrote:
Thoughts? Also, for re-licensing, what level of approval do we need? All authors of the parser, or the current people in an svn blame?
Current people are doing 'derivative work' on previous authors' work. I think all are needed. Pain oh pain.
This is the other reason to reduce it to mathematics, which can then be freely reimplemented.
- d.
I think the idea that we might break the existing PHP "parser" out into a library for general use is rather silly.
The "parser" is not a parser, it's a macro expander with a pile of regular expressions used to convert shorthand HTML into actual HTML. The code that it outputs is highly dependent on the state of the wiki's configuration and database content at the moment of "parsing". It is also useless to anyone wanting to do anything other than render a page into HTML, because the output is completely opaque as to where any of it was derived. Dividing the "parser" off into a library would require a substantial amount of MediaWiki code to be ported too just to get it working. On its own, it would be essentially useless.
So, it's probably not an issue what license this hypothetical code would be released under.
- Trevor
On Tue, May 3, 2011 at 1:25 PM, David Gerard dgerard@gmail.com wrote:
On 3 May 2011 21:15, Domas Mituzas midom.lists@gmail.com wrote:
Thoughts? Also, for re-licensing, what level of approval do we need? All authors of the parser, or the current people in an svn blame?
Current people are doing 'derivative work' on previous authors' work. I
think all are needed. Pain oh pain.
This is the other reason to reduce it to mathematics, which can then be freely reimplemented.
- d.
On Tue, May 3, 2011 at 1:33 PM, Trevor Parscal tparscal@wikimedia.org wrote:
I think the idea that we might break the existing PHP "parser" out into a library for general use is rather silly.
Well, if that's the case, why was it brought up in the discussion to begin with? Here's the comment Tim made:
"Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?"
He tends to think it's an option. Domas mentioned in IRC that he made a standalone version of the parser a while back, as well.
The "parser" is not a parser, it's a macro expander with a pile of regular expressions used to convert shorthand HTML into actual HTML. The code that it outputs is highly dependent on the state of the wiki's configuration and database content at the moment of "parsing". It is also useless to anyone wanting to do anything other than render a page into HTML, because the output is completely opaque as to where any of it was derived. Dividing the "parser" off into a library would require a substantial amount of MediaWiki code to be ported too just to get it working. On its own, it would be essentially useless.
The parser has a configuration state, takes wikitext in, and gives back html. It pulls additional data from the database in these steps as well, yes. However, I don't see how this would be different than any other implementation of the parser. All implementations will require configuration state, and will need to deal with things like templates and extensions.
Though I prefer the concept of alternative parsers (for all the reasons mentioned in the other threads), I do think having our reference implementation available as a library is a good concept. I feel that making it available in a suitable license is ideal.
- Ryan
On Tue, May 3, 2011 at 10:56 PM, Ryan Lane rlane32@gmail.com wrote:
The parser has a configuration state, takes wikitext in, and gives back html. It pulls additional data from the database in these steps as well, yes. However, I don't see how this would be different than any other implementation of the parser. All implementations will require configuration state, and will need to deal with things like templates and extensions.
Not all implementations will want to output HTML, though. Like Neil said in the other thread, some implementations will want to output other formats (HTML for mobile, or PDF) or just want to analyze stuff (metadata from infoboxes/templates for Google or OpenStreetMap). What we have right now is mostly (the preprocessor is nicely separate now, but still) a black box that eats wikitext, reads additional data from places, and spits out HTML. A truly reusable component would at least produce something like an abstract syntax tree that can be rendered or traversed by different consumers to produce different results. Reducing the external dependencies is hard, I agree with that part. However, some components of the (hypothetically broken-up) parser don't necessarily need to know as much, so some gains could possibly be made there.
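A rough sketch of the idea in Python (the node shapes and renderer names here are invented for illustration, not anything that exists in MediaWiki): the same tiny AST walked by two independent consumers, one producing HTML and one producing plain text.

```python
# Hypothetical AST nodes: ("text", str), ("bold", [children]), ("link", target).
# Two consumers traverse the same tree to produce different outputs.

def render_html(node):
    kind = node[0]
    if kind == "text":
        return node[1]
    if kind == "bold":
        return "<b>" + "".join(render_html(c) for c in node[1]) + "</b>"
    if kind == "link":
        return '<a href="/wiki/%s">%s</a>' % (node[1], node[1])
    raise ValueError("unknown node kind: %r" % kind)

def render_plain(node):
    kind = node[0]
    if kind == "text":
        return node[1]
    if kind == "bold":
        # plain-text consumer drops the formatting entirely
        return "".join(render_plain(c) for c in node[1])
    if kind == "link":
        return node[1]
    raise ValueError("unknown node kind: %r" % kind)

# AST a parser might produce for: '''Hello''' [[World]]
ast = [("bold", [("text", "Hello")]), ("text", " "), ("link", "World")]

html = "".join(render_html(n) for n in ast)
plain = "".join(render_plain(n) for n in ast)
```

A metadata-extraction tool would be a third walker over the same tree, which is exactly what the current black-box output makes impossible.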
Roan Kattouw (Catrope)
Op 3 mei 2011, om 22:56 heeft Ryan Lane het volgende geschreven:
On Tue, May 3, 2011 at 1:33 PM, Trevor Parscal tparscal@wikimedia.org wrote:
On its own, it would be essentially useless.
The parser has a configuration state, takes wikitext in, and gives back html. It pulls additional data from the database in these steps as well, yes. However, I don't see how this would be different than any other implementation of the parser. All implementations will require configuration state, and will need to deal with things like templates and extensions.
Though I prefer the concept of alternative parsers (for all the reasons mentioned in the other threads), I do think having our reference implementation available as a library is a good concept. I feel that making it available in a suitable license is ideal.
- Ryan
Afaik the parser does not need a database or extension hooks for minimal but fully operational use.
{{unknown templates}} default to red links, {{int:messages}} default to <unknown>, <tags> and {{#functions}} default to literals, {{MAGICWORDS}} to red links, etc...
If a user of the parser doesn't have any of these (either because none exist, or because no registry/database is configured at all), it would fall back to behaving as if they are nonexistent. Not a problem?
By having this available as a parser library, sites that host blogs and forums could potentially use wikitext to format their comments and forum threads (to avoid visitors having to learn, for example, wikitext for their wiki, the WYSIWYM WYMeditor for WordPress, and BBCode for a forum).
Instead they could all use the same syntax. And within a wiki context, magic words, extensions, int messages etc. would be fed from the wiki database; outside, they would just be static.
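A toy sketch of that fallback behaviour in Python (the regex and markup here are simplifications invented for illustration; the real parser handles parameters, nesting, and much more): with no template registry configured, {{Foo}} degrades to a red link to the missing template, and with a registry it expands.

```python
import re

def expand(text, templates=None):
    """Expand {{Name}} invocations; unknown templates fall back to red links."""
    templates = templates or {}

    def repl(match):
        name = match.group(1)
        if name in templates:
            return templates[name]
        # fallback when no registry/database knows the template:
        # render it as a red link, as if it simply doesn't exist
        return '<a class="new" href="/wiki/Template:%s">%s</a>' % (name, name)

    return re.sub(r"\{\{([^{}|]+)\}\}", repl, text)

without_db = expand("Hello {{Foo}}")
with_db = expand("Hello {{Foo}}", {"Foo": "World"})
```

The point being that the degradation is graceful: the same input parses either way, just with less resolved.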
-- Krinkle
On 4 May 2011 08:19, Krinkle krinklemail@gmail.com wrote:
Op 3 mei 2011, om 22:56 heeft Ryan Lane het volgende geschreven:
On Tue, May 3, 2011 at 1:33 PM, Trevor Parscal tparscal@wikimedia.org wrote:
On its own, it would be essentially useless.
The parser has a configuration state, takes wikitext in, and gives back html. It pulls additional data from the database in these steps as well, yes. However, I don't see how this would be different than any other implementation of the parser. All implementations will require configuration state, and will need to deal with things like templates and extensions.
Though I prefer the concept of alternative parsers (for all the reasons mentioned in the other threads), I do think having our reference implementation available as a library is a good concept. I feel that making it available in a suitable license is ideal.
- Ryan
Afaik the parser does not need a database or extension hooks for minimal but fully operational use.
{{unknown templates}} default to red links, {{int:messages}} default to <unknown>, <tags> and {{#functions}} default to literals, {{MAGICWORDS}} to red links, etc...
If a user of the parser doesn't have any of these (either because none exist, or because no registry/database is configured at all), it would fall back to behaving as if they are nonexistent. Not a problem?
I agree a parser would not need a database but it would need a standard interface or abstraction that in the full MediaWiki would call to the database. Offline readers would implement this interface to extract the wikitext from their compressed format or direct from an XML dump file.
Some datamining tools might just stub this interface and deal with the bare minimum.
Extension hooks are more interesting. I might assume offline readers want as close results to the official sites as possible so will want to implement the same hooks.
Other non-wikitext or non-page data from the database would also go into the same interface/abstraction, or a separate one.
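Something like the following interface, say (all names here are invented for illustration, nothing real): the parser asks a page store for wikitext, and each environment plugs in its own backend, whether the full MediaWiki database, an offline reader's dump file, or a bare stub for datamining.

```python
from abc import ABC, abstractmethod
from typing import Optional

class PageStore(ABC):
    """Abstraction the parser would call instead of the database."""

    @abstractmethod
    def get_wikitext(self, title: str) -> Optional[str]:
        """Return the page's wikitext, or None if the page doesn't exist."""

class DictStore(PageStore):
    """Bare-minimum stub backend: pages held in an in-memory dict,
    as a datamining tool might do."""

    def __init__(self, pages):
        self.pages = pages

    def get_wikitext(self, title):
        return self.pages.get(title)

store = DictStore({"Main Page": "Hello '''world'''"})
```

An offline reader would implement the same interface over its compressed format or an XML dump, and the parser wouldn't know the difference.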
Andrew Dunbar (hippietrail)
By having this available as a parser library, sites that host blogs and forums could potentially use wikitext to format their comments and forum threads (to avoid visitors having to learn, for example, wikitext for their wiki, the WYSIWYM WYMeditor for WordPress, and BBCode for a forum).
Instead they could all use the same syntax. And within a wiki context, magic words, extensions, int messages etc. would be fed from the wiki database; outside, they would just be static.
-- Krinkle
On 4 May 2011 06:33, Trevor Parscal tparscal@wikimedia.org wrote:
I think the idea that we might break the existing PHP "parser" out into a library for general use is rather silly.
The "parser" is not a parser, it's a macro expander with a pile of regular-expressions used to convert short-hand HTML into actual HTML. The
Oh don't be silly. It may not be an LALR(1) parser or an LL parser or even a recursive descent parser but last I checked parsing was the act of breaking down a text into its elements, which the parser does. It just does it in a pretty clunky way. Whether it stores the results in an AST or in bunches of random state all over the place doesn't mean it's doing something other than parsing.
A more accurate argument is that it's not just a parser, since it goes directly on to transforming the input into HTML, which is the equivalent of code generation.
code that it outputs is highly dependent on the state of the wiki's configuration and database content at the moment of "parsing". It is also useless to anyone wanting to do anything other than render a page into HTML, because the output is completely opaque as to where any of it was derived. Dividing the "parser" off into a library would require a substantial amount of MediaWiki code to be ported too just to get it working. On its own, it would be essentially useless.
It seems we're getting bogged down in semantics, because in MediaWiki we use the word "parser" in two incompatible ways: 1) the PHP classes which convert wikitext to HTML; 2) a hypothetical or postulated part of MediaWiki, which does not yet exist, that would generate an intermediate form (AST) between wikitext and HTML.
So the first thing we need to do is decide which of these two concepts of parser we're talking about.
Would it be useful to have a library that can convert wikitext to HTML? Yes. Would it be useful to have a library that can convert wikitext to an AST? Unclear. Would it be useful to have a library that can convert such an AST to HTML? Because of the semantic soup, nobody has even brought this up yet.
So, it's probably not an issue what license this hypothetical code would be released under.
- Trevor
I'm pretty sure the offline wikitext parsing community would care about the licensing as a separate issue to what kind of parser technology it uses internally.
Andrew Dunbar (hippietrail)
On 4 May 2011 07:37, Nikola Smolenski smolensk@eunet.rs wrote:
On 05/03/2011 10:10 PM, Ryan Lane wrote:
should re-license it in a more compatible license. LGPL is good, BSD is likely better.
What advantages does BSD offer over LGPL?
Putting the reference implementation of a format under an extremely permissive licence helps spread the format. (This is why Ogg Vorbis went from LGPL to BSD.) This assumes spreading the MediaWiki wikitext format is a good thing.
- d.
----- Original Message -----
From: "Peter Youngmeister" py@wikimedia.org
If you guys still think about ideas as things that can be stolen, perhaps you should check out the open source movement. Here's a good reference:
Aw, c'mon, Peter. No strawmen; it's late.
The reasons why many programmers prefer GPL to BSD -- to keep the work they've invested long hours in for free from being submerged in someone's commercial project with no recompense to them -- which GPL forbids and BSD does not -- are widely understood.
Myself, I'm firmly convinced after 30 years in this business that, for all its faults, the choice of the GPL by Linus changed the face of computing (and damned near everything else) just as much as the Apollo project's investment in microelectronics gave us all PCs to run them on in the first place.
You're welcome to disagree (though not on this list; there are lots of places better suited to license advocacy), but it would probably be good not to scoff at people for holding that view. 'Specially when you're using their code. :-)
Cheers, -- jra
On Tue, May 3, 2011 at 11:55 PM, Jay Ashworth jra@baylink.com wrote:
The reasons why many programmers prefer GPL to BSD -- to keep the work they've invested long hours in for free from being submerged in someone's commercial project with no recompense to them -- which GPL forbids and BSD does not -- are widely understood.
GPL forbids use in a commercial project? Huh?
The difference between GPL and BSD (in a nutshell) is that GPL is copyleft while BSD is permissive (i.e. not copyleft). This means you can use BSD code in proprietary and/or closed source software. It doesn't have anything to do with being commercial or not. The main advantage of using a BSD license is that it is widely compatible with other licenses, unlike the GPL. The main advantage of the GPL is that you retain more control over reuse and know that your code will only be used in open source software.
Also keep in mind there are several variations of BSD licenses. Please don't use the original one as it has problems. The Simplified BSD License/FreeBSD License is my personal favorite.
Ryan Kaldari
On Tue, May 3, 2011 at 2:11 PM, Domas Mituzas midom.lists@gmail.com wrote:
Which of course allows me to fork the thread and ask why does MediaWiki have to be GPL licensed.
Because all it takes is one developer with substantial contributions who doesn't want to relicense, and then you have to rewrite all their contributions and everything based on their contributions if you want to change the license. That's what a viral license is, after all. Of course, an independent component could be non-GPL-licensed, if it was written from scratch.
On Tue, May 3, 2011 at 6:14 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Tue, May 3, 2011 at 2:11 PM, Domas Mituzas midom.lists@gmail.com wrote:
Which of course allows me to fork the thread and ask why does MediaWiki have to be GPL licensed.
Because all it takes is one developer with substantial contributions who doesn't want to relicense, and then you have to rewrite all their contributions and everything based on their contributions if you want to change the license. That's what a viral license is, after all. Of course, an independent component could be non-GPL-licensed, if it was written from scratch.
Who do we consider significant? Would it be possible to get consensus on a relicensing?
-Chad
On Tue, May 3, 2011 at 6:16 PM, Chad innocentkiller@gmail.com wrote:
Who do we consider significant? Would it be possible to get consensus on a relicensing?
As far as I know, the way the GPL works makes it effectively impossible to relicense a large project to something more permissive. You'd have to get permission from literally everyone who made nontrivial contributions, or else rewrite their code. But if there's serious interest in this, someone should get an official opinion from Wikimedia's lawyers on how (or if) it could be done.
Personally, I don't see any problem with a parser library being GPL. You can still link it with proprietary code as long as you don't distribute the result, so it would be fine for research projects or similar that rely on proprietary components. You can always *use* GPLd code however you like. If you want to *distribute* proprietary (or otherwise GPL-incompatible) code that depends on my volunteer contributions, I'm happy to tell you to go jump off a bridge.
On Tue, May 3, 2011 at 6:56 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Tue, May 3, 2011 at 6:16 PM, Chad innocentkiller@gmail.com wrote:
Who do we consider significant? Would it be possible to get consensus on a relicensing?
As far as I know, the way the GPL works makes it effectively impossible to relicense a large project to something more permissive. You'd have to get permission from literally everyone who made nontrivial contributions, or else rewrite their code. But if there's serious interest in this, someone should get an official opinion from Wikimedia's lawyers on how (or if) it could be done.
Personally, I don't see any problem with a parser library being GPL. You can still link it with proprietary code as long as you don't distribute the result, so it would be fine for research projects or similar that rely on proprietary components. You can always *use* GPLd code however you like. If you want to *distribute* proprietary (or otherwise GPL-incompatible) code that depends on my volunteer contributions, I'm happy to tell you to go jump off a bridge.
I was just speculating. I don't have any problems with the GPL :)
-Chad
Personally, I don't see any problem with a parser library being GPL. You can still link it with proprietary code as long as you don't distribute the result, so it would be fine for research projects or similar that rely on proprietary components. You can always *use* GPLd code however you like. If you want to *distribute* proprietary (or otherwise GPL-incompatible) code that depends on my volunteer contributions, I'm happy to tell you to go jump off a bridge.
You'd have an issue with a proprietary application using the wikitext parser as a library? You really find the LGPL completely unacceptable in this situation?
Seems like kind of a hardline position to take. That same application could make API calls to MediaWiki, using it in essentially the same way, without the license restrictions. Also, GPL, in our use case, is fairly ineffective. Even if an application makes PHP calls directly into MediaWiki, that application doesn't necessarily need to be GPL, since there is no actual linking occurring. Not all MediaWiki extensions are GPL, for instance.
Essentially, this will just limit a C version of the software, which is slightly lame.
Meh. If we have a GPL library, I'll just wrap it in a wsgi python library to act as a shim.
- Ryan
On Tue, May 3, 2011 at 7:45 PM, Ryan Lane rlane32@gmail.com wrote:
You'd have an issue with a proprietary application using the wikitext parser as a library? You really find the LGPL completely unacceptable in this situation?
I prefer to license my own code under GPL instead of the the LGPL, yes. I'm not dogmatic about it, it's just a personal preference. If people want to release proprietary code, that's fine by me, but they can do it without my help.
Seems like kind of a hardline position to take. That same application could make API calls to MediaWiki, using it in essentially the same way, without the license restrictions.
Yep, it's possible to hack around the GPL in some cases. I'm okay with that. I don't think proprietary code is immoral or anything, so I'm fine with just making it more difficult. Doesn't have to be impossible.
Also, GPL, in our use case, is fairly ineffective. Even if an application makes PHP calls directly into MediaWiki, that application doesn't necessarily need to be GPL, since there is no actual linking occurring.
The only mention of the word "link" in the GPLv3 terms and conditions, outside an example, is in the phrase "link or combine" in section 13:
http://www.gnu.org/licenses/gpl.html
GPLv2 doesn't use it at all in the terms and conditions:
http://www.gnu.org/licenses/gpl-2.0.html
Linking has no special status in the GPL -- it's just a question of what legally constitutes a derivative work. If a C program that dynamically links to a library is legally a derivative work of that library, a PHP program that dynamically calls functions from another PHP program is almost surely a derivative work too. The decision would be made by a judge, who wouldn't have the faintest idea of the technical details and therefore would only care about the general effect.
Not all MediaWiki extensions are GPL, for instance.
This has been discussed before:
http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048436.html
The opinion of the lawyers employed by the FSF and SFLC implies that all typical MediaWiki extensions and skins must be licensed GPL-compatibly. The SFLC did a detailed analysis of Wordpress plugins, and concluded they all had to be GPL for reasons that apply identically to MediaWiki:
http://wordpress.org/news/2009/07/themes-are-gpl-too/
Our README file says that MediaWiki extensions have to be GPL also. However, we don't enforce any of this at mediawiki.org, since most developers don't seem to be in favor of it (although I personally am). As far as I know, no one's consulted Wikimedia lawyers about it, although by the interpretation of the FSF/SFLC Wikimedia is hosting copyright-infringing extensions at mediawiki.org.
Meh. If we have a GPL library, I'll just wrap it in a wsgi python library to act as a shim.
There is no interpretation of the GPL that I'm aware of that would say linking is not allowed, but calling a Python library function is allowed. Either both create a derivative work, or neither does.
The only mention of the word "link" in the GPLv3 terms and conditions, outside an example, is in the phrase "link or combine" in section 13:
http://www.gnu.org/licenses/gpl.html
GPLv2 doesn't use it at all in the terms and conditions:
http://www.gnu.org/licenses/gpl-2.0.html
Linking has no special status in the GPL -- it's just a question of what legally constitutes a derivative work. If a C program that dynamically links to a library is legally a derivative work of that library, a PHP program that dynamically calls functions from another PHP program is almost surely a derivative work too. The decision would be made by a judge, who wouldn't have the faintest idea of the technical details and therefore would only care about the general effect.
See the gnu faq on this:
http://www.gnu.org/licenses/gpl-faq.html#LinkingWithGPL
If you link, you must use a GPL compatible license.
Not all MediaWiki extensions are GPL, for instance.
This has been discussed before:
http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048436.html
The opinion of the lawyers employed by the FSF and SFLC implies that all typical MediaWiki extensions and skins must be licensed GPL-compatibly. The SFLC did a detailed analysis of Wordpress plugins, and concluded they all had to be GPL for reasons that apply identically to MediaWiki:
MediaWiki extensions aren't necessarily a derivative work. I'd argue that they fall within the "borderline case" described in the faq:
http://www.gnu.org/licenses/gpl-faq.html#GPLAndPlugins
No need to go through this again though, the thread you linked to already showed that it isn't conclusive either way.
There is no interpretation of the GPL that I'm aware of that would say linking is not allowed, but calling a Python library function is allowed. Either both create a derivative work, or neither does.
See my first link to the gnu faq. It isn't that linking isn't allowed, it's that code that directly links to it would need to be licensed in a compatible way. If I link to the library with a python wsgi shim, the shim itself needs to be GPL, but applications accessing the http wsgi interface would not need to be.
- Ryan
On Wed, May 4, 2011 at 8:04 PM, Ryan Lane rlane32@gmail.com wrote:
See the gnu faq on this:
http://www.gnu.org/licenses/gpl-faq.html#LinkingWithGPL
If you link, you must use a GPL compatible license.
Yes, but that's not specific to linking. Nothing in the license proper distinguishes between linking and similar ways of combining two programs. The FAQ's answers discuss linking because that's what the FAQ's questions ask, not because linking is different from other types of combining. (Probably it only mentions linking because the authors of the FAQ were mainly familiar with compiled code.)
I don't think we actually disagree on this, though.
MediaWiki extensions aren't necessarily a derivative work. I'd argue that they fall within the "borderline case" described in the faq:
http://www.gnu.org/licenses/gpl-faq.html#GPLAndPlugins
No need to go through this again though, the thread you linked to already showed that it isn't conclusive either way.
Right. If we care, which apparently we don't, we'd need to ask Wikimedia's lawyers for an official opinion.
See my first link to the gnu faq. It isn't that linking isn't allowed, it's that code that directly links to it would need to be licensed in a compatible way. If I link to the library with a python wsgi shim, the shim itself needs to be GPL, but applications accessing the http wsgi interface would not need to be.
Oh, sorry, I should have looked up wsgi before I replied. :) Yeah, if you make a shim that parses things in response to an HTTP request, users of the HTTP interface don't necessarily have to be licensed GPL-compatibly according to any position I know of.
On Wed, May 4, 2011 at 6:57 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Linking has no special status in the GPL -- it's just a question of what legally constitutes a derivative work. If a C program that dynamically links to a library is legally a derivative work of that library,
It isn't. A C program which *contains* a library is legally a derivative work of that library.
a PHP program that dynamically calls functions from another PHP program is almost surely a derivative work too. The decision would be made by a judge, who wouldn't have the faintest idea of the technical details and therefore would only care about the general effect.
Galoob v. Nintendo: "the infringing work must incorporate a portion of the copyrighted work in some form."
----- Original Message -----
From: "Anthony" wikimail@inbox.org
On Wed, May 4, 2011 at 6:57 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Linking has no special status in the GPL -- it's just a question of what legally constitutes a derivative work. If a C program that dynamically links to a library is legally a derivative work of that library,
It isn't. A C program which *contains* a library is legally a derivative work of that library.
Static linking fits that description. Dynamic linking -- though the FSF would really like it to -- does not.
Cheers, -- jra
On Fri, May 6, 2011 at 9:41 AM, Jay Ashworth jra@baylink.com wrote:
----- Original Message -----
From: "Anthony" wikimail@inbox.org
On Wed, May 4, 2011 at 6:57 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Linking has no special status in the GPL -- it's just a question of what legally constitutes a derivative work. If a C program that dynamically links to a library is legally a derivative work of that library,
It isn't. A C program which *contains* a library is legally a derivative work of that library.
Static linking fits that description. Dynamic linking -- though the FSF would really like it to -- does not.
I'm not sure if that's true or not. There's certainly an argument to be made that dynamic linking creates a derivative work *at the time it is linked*. Also, there's an even stronger argument that using the GPL header files to compile the unlinked program creates a derivative work. (If you want to reverse engineer the header files then you can get around that problem, but that's a lot of extra work, and in most cases, such as this one, you might as well convert the library into a standalone program that can be used via a pipe.)
----- Original Message -----
From: "Anthony" wikimail@inbox.org
On Fri, May 6, 2011 at 9:41 AM, Jay Ashworth jra@baylink.com wrote:
----- Original Message -----
From: "Anthony" wikimail@inbox.org
On Wed, May 4, 2011 at 6:57 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
Linking has no special status in the GPL -- it's just a question of what legally constitutes a derivative work. If a C program that dynamically links to a library is legally a derivative work of that library,
It isn't. A C program which *contains* a library is legally a derivative work of that library.
Static linking fits that description. Dynamic linking -- though the FSF would really like it to -- does not.
I'm not sure if that's true or not. There's certainly an argument to be made that dynamic linking creates a derivative work *at the time it is linked*. Also, there's an even stronger argument that using the GPL header files to compile the unlinked program creates a derivative work. (If you want to reverse engineer the header files then you can get around that problem, but that's a lot of extra work, and in most cases, such as this one, you might as well convert the library into a standalone program that can be used via a pipe.)
Feist v Rural; header files are *factual data*; no creativity there.
The interoperability exception is to the DMCA, I think, not to section 107.
None of our opinions matter until there's caselaw, of course, and there isn't. But I think the arguments against headers being copyrightable -- and thereby dynamic linking not being a violation of the GPLv2 -- are pretty strong, myself.
Cheers, -- jr 'IANAL' a
On Fri, May 6, 2011 at 10:35 AM, Jay Ashworth jra@baylink.com wrote:
Feist v Rural; header files are *factual data*; no creativity there.
I disagree that there is no creativity in a header file. It's certainly not an open and shut case.
None of our opinions matter until there's caselaw, of course, and there isn't.
Right. That's the main point I was making. The safer option would be to convert the library into a standalone app (which in this case makes a lot of sense anyway), and then just pipe data to/from it (or use files, or whatever). Wikitext in, HTML out. This would be *much* simpler than converting MediaWiki in its entirety into a standalone parser which you could pipe to. At least, it was last time I tried to do it, which admittedly was several years ago. And I really don't see how you could argue that this creates a derivative work. It's no different than piping to/from gzip, and I don't think anyone argues that *that* creates a derivative work.
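A gzip-style pipe interface like the one described above is only a few lines in any language. Here's a sketch in Python, where `wikitext2html` is a hypothetical converter binary: any program that reads wikitext on stdin and writes HTML on stdout would work the same way.

```python
import subprocess

def render(wikitext: str, converter: str = "wikitext2html") -> str:
    """Pipe wikitext into a standalone converter and read back HTML.

    The converter runs in its own process, so no linking (static or
    dynamic) against its code ever takes place -- exactly the gzip model.
    """
    result = subprocess.run(
        [converter],
        input=wikitext.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise if the converter exits non-zero
    )
    return result.stdout.decode("utf-8")
```

For testing you can substitute any stdin-to-stdout filter (e.g. `cat`) for the hypothetical converter.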
On Fri, May 6, 2011 at 10:38 AM, Jay Ashworth jra@baylink.com wrote:
From: "Anthony" wikimail@inbox.org On Tue, May 3, 2011 at 6:56 PM, Aryeh Gregor
You can always *use* GPLd code however you like.
Does "use" include "prepare a derivative work"?
As long as you don't distribute it, sure. The GPL was, is, and will always be *a license to distribute*. GPL doesn't even forbid you to modify and make available as a web app; you need to release under AGPL if you want to restrict that.
It doesn't need to forbid it. It only needs to fail to permit it. "You may not propagate or modify a covered work except as expressly provided under this License." "or modify".
So where does the GPL expressly provide for modifying the program without licensing the derivative under the GPL?
Yes, there's nothing in the GPL which requires you to release that GPL derivative to the public. That's what the AGPL does. But if one of your employees or volunteers gets a hold of it and puts it up on a P2P hosting site, then it's out there, and you're going to have a hell of a time suing people for copyright infringement for copying your work which is a derivative of a GPL work and not getting sued yourself for violating the GPL.
If you want to *distribute* proprietary (or otherwise GPL-incompatible) code that depends on my volunteer contributions, I'm happy to tell you to go jump off a bridge.
Copyright law gives the author an exclusive right to *prepare* derivative works, not just to *distribute* derivative works. What in the GPL gives you permission to prepare a proprietary derivative work which you do not distribute?
Citation? Note that such a citation must take into account whether there's any *use* in so doing in a non-computer-code environment.
A citation for what? The fact that copyright law recognizes an exclusive right of an author to *prepare* derivative works? Title 17, Section 106(2) of the US Code (http://www.law.cornell.edu/uscode/html/uscode17/usc_sec_17_00000106----000-....).
The other half of my statement was a question.
I'd like to respectfully ask that this thread be taken offlist, perhaps to a wiki page or a private thread among those who are interested.
There's no active intent to change any licensing right now, and general discussion of software licenses and edge cases is pretty far off topic.
Thanks!
-- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
On Fri, May 6, 2011 at 2:14 PM, Brion Vibber brion@pobox.com wrote:
I'd like to respectfully ask that this thread be taken offlist, perhaps to a wiki page or a private thread among those who are interested.
There's no active intent to change any licensing right now, and general discussion of software licenses and edge cases is pretty far off topic.
Thanks!
I'm going to have to agree with Brion and Bryan here. Please can the interested parties take this offlist?
-Chad
On Fri, May 6, 2011 at 2:37 PM, Chad innocentkiller@gmail.com wrote:
On Fri, May 6, 2011 at 2:14 PM, Brion Vibber brion@pobox.com wrote:
I'd like to respectfully ask that this thread be taken offlist, perhaps to a wiki page or a private thread among those who are interested.
There's no active intent to change any licensing right now, and general discussion of software licenses and edge cases is pretty far off topic.
Thanks!
I'm going to have to agree with Brion and Bryan here. Please can the interested parties take this offlist?
What are we taking offlist, and where offlist are we taking it?
On Fri, May 6, 2011 at 2:39 PM, Anthony wikimail@inbox.org wrote:
On Fri, May 6, 2011 at 2:37 PM, Chad innocentkiller@gmail.com wrote:
On Fri, May 6, 2011 at 2:14 PM, Brion Vibber brion@pobox.com wrote:
I'd like to respectfully ask that this thread be taken offlist, perhaps to a wiki page or a private thread among those who are interested.
There's no active intent to change any licensing right now, and general discussion of software licenses and edge cases is pretty far off topic.
Thanks!
I'm going to have to agree with Brion and Bryan here. Please can the interested parties take this offlist?
What are we taking offlist, and where offlist are we taking it?
The hypothetical license discussion, and anywhere but here.
-Chad
Possibly an etherpad? http://MeetingWords.com/LicenseDiscussion
On 06 May 2011, at 11:42 AM, Chad wrote:
On Fri, May 6, 2011 at 2:39 PM, Anthony wikimail@inbox.org wrote:
On Fri, May 6, 2011 at 2:37 PM, Chad innocentkiller@gmail.com wrote:
On Fri, May 6, 2011 at 2:14 PM, Brion Vibber brion@pobox.com wrote:
I'd like to respectfully ask that this thread be taken offlist, perhaps to a wiki page or a private thread among those who are interested.
There's no active intent to change any licensing right now, and general discussion of software licenses and edge cases is pretty far off topic.
Thanks!
I'm going to have to agree with Brion and Bryan here. Please can the interested parties take this offlist?
What are we taking offlist, and where offlist are we taking it?
The hypothetical license discussion, and anywhere but here.
-Chad
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Tue, May 3, 2011 at 6:56 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
You can still link it with proprietary code as long as you don't distribute the result, so it would be fine for research projects or similar that rely on proprietary components.
What happens if one of your employees or volunteers distributes the result?
You can always *use* GPLd code however you like.
Does "use" include "prepare a derivative work"?
If you want to *distribute* proprietary (or otherwise GPL-incompatible) code that depends on my volunteer contributions, I'm happy to tell you to go jump off a bridge.
Copyright law gives the author an exclusive right to *prepare* derivative works, not just to *distribute* derivative works. What in the GPL gives you permission to prepare a proprietary derivative work which you do not distribute?
----- Original Message -----
From: "Anthony" wikimail@inbox.org
On Tue, May 3, 2011 at 6:56 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
You can still link it with proprietary code as long as you don't distribute the result, so it would be fine for research projects or similar that rely on proprietary components.
What happens if one of your employees or volunteers distributes the result?
I don't believe there's any GPL caselaw on "span of administrative control".
You can always *use* GPLd code however you like.
Does "use" include "prepare a derivative work"?
As long as you don't distribute it, sure. The GPL was, is, and will always be *a license to distribute*. GPL doesn't even forbid you to modify and make available as a web app; you need to release under AGPL if you want to restrict that.
If you want to *distribute* proprietary (or otherwise GPL-incompatible) code that depends on my volunteer contributions, I'm happy to tell you to go jump off a bridge.
Copyright law gives the author an exclusive right to *prepare* derivative works, not just to *distribute* derivative works. What in the GPL gives you permission to prepare a proprietary derivative work which you do not distribute?
Citation? Note that such a citation must take into account whether there's any *use* in so doing in a non-computer-code environment.
Cheers, -- jra
Can we stop discussing this issue? I believe that most MediaWiki developers are in fact not interested in changing the status quo with regards to licensing, so there is no point in discussing it.
Bryan
On Fri, May 6, 2011 at 10:55 AM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
Can we stop discussing this issue? I believe that most MediaWiki developers are in fact not interested in changing the status quo with regards to licensing, so there is no point in discussing it.
Agreed. Furthermore, I don't think there's any sense at all in programmers arguing about things like whether the FSF's interpretation of the definition of derivative works is or is not correct. Nobody really knows, but the people who bear the responsibility of making an informed guess for MediaWiki-related issues (should it become necessary) are Wikimedia's lawyers, not us. Our opinions on the legal issues are not relevant here.
On Fri, May 6, 2011 at 10:55 AM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
Can we stop discussing this issue? I believe that most MediaWiki developers are in fact not interested in changing the status quo with regards to licensing, so there is no point in discussing it.
That there isn't going to be a license change is exactly *why* it needs to be discussed. If dynamic linking is fine, then a dynamically linked library is appropriate. On the other hand, if it isn't (or, at least, if it's not clear that it is), then something more like gzip would be better.
"Dynamic linking" implies we have something to dynamically link in the first place. A parser library consisting of compiled PHP in this particular case.
Let's just cross this hypothetical bridge when we come to it, shall we?
- Trevor
On Fri, May 6, 2011 at 11:14 AM, Anthony wikimail@inbox.org wrote:
On Fri, May 6, 2011 at 10:55 AM, Bryan Tong Minh bryan.tongminh@gmail.com wrote:
Can we stop discussing this issue? I believe that most MediaWiki developers are in fact not interested in changing the status quo with regards to licensing, so there is no point in discussing it.
That there isn't going to be a license change is exactly *why* it needs to be discussed. If dynamic linking is fine, then a dynamically linked library is appropriate. On the other hand, if it isn't (or, at least, if it's not clear that it is), then something more like gzip would be better.
On Fri, May 6, 2011 at 2:24 PM, Trevor Parscal tparscal@wikimedia.org wrote:
"Dynamic linking" implies we have something to dynamically link in the first place. A parser library consisting of compiled PHP in this particular case.
Let's just cross this hypothetical bridge when we come to it, shall we?
I guess, but I'm not sure it'll ever come up.
"Would it be useful to have a library that can convert wikitext to HTML? Yes."
Would it be even more useful to have a standalone program, with minimal dependencies, that can convert wikitext to HTML? Hell yeah.
Granted, that's only half the problem. The other (and much more difficult) problem is how to convert a *set* of pages (templates and whatnot) into a single chunk of wikitext, which can then be fed into the wikitext to HTML parser. But even without that part it would still be quite useful to have that standalone wikitext to HTML converter.
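The "set of pages into one chunk of wikitext" step is essentially recursive template substitution. A toy sketch of it, ignoring template parameters, parser functions, and noinclude/includeonly handling, which is where the real difficulty lies:

```python
import re

def expand_templates(wikitext, pages, depth=0, max_depth=40):
    """Naively splice {{Name}} invocations into one chunk of wikitext.

    `pages` maps template names to their wikitext. Unknown templates are
    left in place (the red-link case); a depth limit guards against
    runaway recursion, much as MediaWiki's expansion limits do.
    """
    if depth >= max_depth:
        return wikitext

    def replace(match):
        body = pages.get(match.group(1).strip())
        if body is None:
            return match.group(0)  # missing template: leave as-is
        return expand_templates(body, pages, depth + 1, max_depth)

    # parameterless invocations only: {{Name}}, no pipes or nesting
    return re.sub(r"\{\{([^{}|]+)\}\}", replace, wikitext)
```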
Anthony wrote:
Granted, that's only half the problem. The other (and much more difficult) problem is how to convert a *set* of pages (templates and whatnot) into a single chunk of wikitext, which can then be fed into the wikitext to HTML parser. But even without that part it would still be quite useful to have that standalone wikitext to HTML converter.
An XML dump?
On Tue, May 3, 2011 at 10:48 AM, Domas Mituzas midom.lists@gmail.com wrote:
It's slightly more difficult, but it definitely isn't any easier
It is much easier to embed it in other languages, once you get shared object with Parser methods exposed ;-)
Building it with HipHop will be harder -- but that's something that can be packaged.
However, I strongly agree that having only a poorly-specified single-implementation markup language for all of Wikipedia & Wikimedia's redistributable data is **not where we want to be** long term.
And even if the PHP-based parser is callable from elsewhere, it's not going to be a good convenient fit for every potential user. It's still worthwhile to hammer out clearer, more consistent document formats for the future, so that other people doing other things that we aren't even thinking of have the flexibility to do those things however they'll need to.
-- brion
On 05/03/2011 07:45 PM, Ryan Lane wrote:
It's slightly more difficult, but it definitely isn't any easier. The point here is that only having one implementation of the parser, which can change at any time, which also defines the spec (and I use the word spec here really loosely), is something that inhibits the ability to share knowledge.
I was thinking whether it would be possible to have two-tier parsing? Define what is valid wikitext, express it in BNF, write a parser in C and use it as a PHP extension. If the parser encounters invalid wikitext, enter the quirks mode AKA the current PHP parser.
I assume that >90% of wikis' contents would be valid wikitext, and so the speedup should be significant. And if someone needs to reuse the content outside of Wikipedia, they can use >90% of the content very easily, and the rest not harder than right now.
The only disadvantage that I see is that every addition to wikitext would have to be implemented in both parsers.
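The two-tier scheme is simple to express. In this sketch both parsers are hypothetical callables; the strict one is expected to raise on wikitext outside the formal grammar, and the permissive legacy parser is the fallback:

```python
def parse(wikitext, strict_parser, quirks_parser):
    """Two-tier parsing: try the fast, spec-conforming parser first,
    and fall back to the permissive legacy parser on invalid input."""
    try:
        return strict_parser(wikitext)
    except ValueError:  # strict parser rejects out-of-grammar wikitext
        return quirks_parser(wikitext)
```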
----- Original Message -----
From: "Nikola Smolenski" smolensk@eunet.rs
I was thinking whether it would be possible to have two-tier parsing? Define what is valid wikitext, express it in BNF, write a parser in C and use it as a PHP extension. If the parser encounters invalid wikitext, enter the quirks mode AKA the current PHP parser.
I assume that >90% of wikis' contents would be valid wikitext, and so the speedup should be significant. And if someone needs to reuse the content outside of Wikipedia, they can use >90% of the content very easily, and the rest not harder than right now.
Yeah, I made this suggestion, oh, 2 or 3 years ago... and I was never able to get the acceptable percentage down below 100.0%.
Cheers, -- jra
Chad wrote:
On Tue, May 3, 2011 at 2:15 PM, MZMcBride z@mzmcbride.com wrote:
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
And how is using the parser with HipHop going to be any more difficult than using it with Zend?
The point is that the wikitext and its parsing should be completely separate from MediaWiki/PHP/HipHop/Zend.
I think some of the bigger picture is getting lost here. Wikimedia produces XML dumps that contain wikitext. For most people, this is the only way to obtain and reuse large amounts of content from Wikimedia wikis (especially as the HTML dumps haven't been re-created since 2008). There needs to be a way for others to be able to very easily deal with this content.
Many people have suggested (with good reason) that this means that wikitext parsing needs to be reproducible in other programming languages. While HipHop may be the best thing since sliced bread, I've yet to see anyone put forward a compelling reason that the current state of affairs is acceptable. Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't overcome the legitimate issues that re-users have (such as programming in a language other than PHP, banish the thought).
For me, the idea that all that's needed is a faster parser in PHP is a complete non-starter.
MZMcBride
On 03/05/11 19:44, MZMcBride wrote:
Chad wrote:
On Tue, May 3, 2011 at 2:15 PM, MZMcBridez@mzmcbride.com wrote:
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
And how is using the parser with HipHop going to be any more difficult than using it with Zend?
The point is that the wikitext and its parsing should be completely separate from MediaWiki/PHP/HipHop/Zend.
I think some of the bigger picture is getting lost here. Wikimedia produces XML dumps that contain wikitext. For most people, this is the only way to obtain and reuse large amounts of content from Wikimedia wikis (especially as the HTML dumps haven't been re-created since 2008). There needs to be a way for others to be able to very easily deal with this content.
Many people have suggested (with good reason) that this means that wikitext parsing needs to be reproducible in other programming languages. While HipHop may be the best thing since sliced bread, I've yet to see anyone put forward a compelling reason that the current state of affairs is acceptable. Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't overcome the legitimate issues that re-users have (such as programming in a language other than PHP, banish the thought).
For me, the idea that all that's needed is a faster parser in PHP is a complete non-starter.
MZMcBride
I agree completely.
I think it cannot be emphasized enough that what's valuable about Wikipedia and other similar wikis is the hard-won _content_, not the software used to write and display it at any given time, which is merely a means to that end.
Fashions in programming languages and data formats come and go, but the person-centuries of writing effort already embodied in Mediawiki's wikitext format needs to have a much longer lifespan: having a well-defined syntax for its current wikitext format will allow the content itself to continue to be maintained for the long term, beyond the restrictions of its current software or encoding format.
-- Neil
On 05/03/2011 08:28 PM, Neil Harris wrote:
On 03/05/11 19:44, MZMcBride wrote:
...
The point is that the wikitext and its parsing should be completely separate from MediaWiki/PHP/HipHop/Zend.
I think some of the bigger picture is getting lost here. Wikimedia produces XML dumps that contain wikitext. For most people, this is the only way to obtain and reuse large amounts of content from Wikimedia wikis (especially as the HTML dumps haven't been re-created since 2008). There needs to be a way for others to be able to very easily deal with this content.
Many people have suggested (with good reason) that this means that wikitext parsing needs to be reproducible in other programming languages. While HipHop may be the best thing since sliced bread, I've yet to see anyone put forward a compelling reason that the current state of affairs is acceptable. Saying "well, it'll soon be much faster for MediaWiki to parse" doesn't overcome the legitimate issues that re-users have (such as programming in a language other than PHP, banish the thought).
For me, the idea that all that's needed is a faster parser in PHP is a complete non-starter.
MZMcBride
I agree completely.
I think it cannot be emphasized enough that what's valuable about Wikipedia and other similar wikis is the hard-won _content_, not the software used to write and display it at any given time, which is merely a means to that end.
Fashions in programming languages and data formats come and go, but the person-centuries of writing effort already embodied in Mediawiki's wikitext format needs to have a much longer lifespan: having a well-defined syntax for its current wikitext format will allow the content itself to continue to be maintained for the long term, beyond the restrictions of its current software or encoding format.
-- Neil
+1 to both MZMcBride and Neil.
So relieved to see things put so eloquently.
Dirk
----- Original Message -----
From: "Neil Harris" neil@tonal.clara.co.uk
I think it cannot be emphasized enough that what's valuable about Wikipedia and other similar wikis is the hard-won _content_, not the software used to write and display it at any given, which is merely a means to that end.
Fashions in programming languages and data formats come and go, but the person-centuries of writing effort already embodied in Mediawiki's wikitext format needs to have a much longer lifespan: having a well-defined syntax for its current wikitext format will allow the content itself to continue to be maintained for the long term, beyond the restrictions of its current software or encoding format.
The project of creating a formal specification for Mediawikitext was one of the primary reasons for the creation of the (largely dormant) wikitext-l list. I fell off shortly after it was created myself, so I don't know how far along that project got -- except that I know that it was decided that since MWtext was not -- and could not be -- a strict subset of Creole, that Creole was a Pretty Nice Idea... and we had no time for it.
For my money, that means the Creole folks lost[1], but what do I know.
Cheers, -- jra
[1] Which is not incompatible with observations then that MWtext has some really unacceptable boners in itself...
----- Original Message -----
From: "MZMcBride" z@mzmcbride.com
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
I'm fairly certain that his intention was "If the parser is HipHop compliant, then the performance improvements that it will realize for those who need them will obviate the need to rewrite the parser in anything, while those who run wikiae small enough not to care won't need to care."
That does *not*, of course, answer the "if you don't have more than one compliant parser, then the code is part of your formal spec, and you *will* get bitten eventually" problem.
Of course, Mediawiki's parser has *three* specs: whatever formal one has been ginned up, finally; the code; *and* 8 or 9 GB of MWtext on the Wikipedias.
Cheers, -- jra
On 11-05-03 08:46 PM, Jay Ashworth wrote:
----- Original Message -----
From: "MZMcBride" z@mzmcbride.com
Now that we have HipHop support, we have the ability to turn MediaWiki's core parser into a fast, reusable library. The performance reasons for limiting the amount of abstraction in the core parser will disappear. How many wikitext parsers does the world really need?
I realize you have a dry wit, but I imagine this joke was lost on nearly everyone. You're not really suggesting that everyone who wants to parse MediaWiki wikitext compile and run HipHop PHP in order to do so.
I'm fairly certain that his intention was "If the parser is HipHop compliant, then the performance improvements that it will realize for those who need them will obviate the need to rewrite the parser in anything, while those who run wikiae small enough not to care won't need to care."
That does *not*, of course, answer the "if you don't have more than one compliant parser, then the code is part of your formal spec, and you *will* get bitten eventually" problem.
Of course, Mediawiki's parser has *three* specs: whatever formal one has been ginned up, finally; the code; *and* 8 or 9 GB of MWtext on the Wikipedias.
Cheers, -- jra
I'm fairly certain myself that his intention was "With HipHop support, since the C that HipHop compiles PHP to can be extracted and re-used, we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and whatnot out of the PHP version of the parser. And because HipHop has better performance, we will no longer have to worry about parser abstractions slowing down the parser and, as a result, increasing the load on large websites like Wikipedia, where they are noticeable. So that won't be in the way of adding those abstractions anymore."
Naturally of course if it's a C library you can build at least an extension/plugin for a number of languages. You would of course have to install the ext/plug though so it's not a shared-hosting thing.
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
----- Original Message -----
From: "Daniel Friesen" lists@nadir-seen-fire.com
I'm fairly certain myself that his intention was "With HipHop support, since the C that HipHop compiles PHP to can be extracted and re-used, we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and whatnot out of the PHP version of the parser. And because HipHop has better performance, we will no longer have to worry about parser abstractions slowing down the parser and, as a result, increasing the load on large websites like Wikipedia, where they are noticeable. So that won't be in the way of adding those abstractions anymore."
What I get for not paying any attention to Facebook Engineering.
*That's* what HipHop does?
Naturally of course if it's a C library you can build at least an extension/plugin for a number of languages. You would of course have to install the ext/plug though so it's not a shared-hosting thing.
True.
But that's still a derivative work.
And from experience, I can tell you that you *don't* want to work with the *output* of a code generator/cross-compiler.
Cheers, -- jra
On 04/05/11 14:07, Daniel Friesen wrote:
I'm fairly certain myself that his intention was "With HipHop support, since the C that HipHop compiles PHP to can be extracted and re-used, we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and whatnot out of the PHP version of the parser. And because HipHop has better performance, we will no longer have to worry about parser abstractions slowing down the parser and, as a result, increasing the load on large websites like Wikipedia, where they are noticeable. So that won't be in the way of adding those abstractions anymore."
Yes that's right, more or less. HipHop generates C++ rather than C though.
Basically you would split the parser into several objects:
* A parser in the traditional sense.
* An output callback object, which would handle generation of HTML or PDF or syntax trees or whatever.
* A wiki environment interface object, which would handle link existence checks, template fetching, etc.
Then you would use HipHop to compile:
* The new parser class.
* A few useful output classes, such as HTML.
* A stub environment class which has no dependencies on the rest of MediaWiki.
Then to top it off, you would add:
* A HipHop extension which provides output and environment classes which pass their calls through to C-style function pointers.
* A stable C ABI interface to the C++ library.
* Interfaces between various high level languages and the new C library, such as Python, Ruby and Zend PHP.
Doing this would leverage the MediaWiki development community and the existing PHP codebase to provide a well-maintained, reusable reference parser for MediaWiki wikitext.
-- Tim Starling
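As a sketch of that three-object split -- the class and method names below are invented for illustration, not MediaWiki's actual API -- with a toy grammar that only knows [[internal links]]:

```python
import re

class WikiEnvironment:
    """Wiki environment interface: answers link-existence queries."""
    def page_exists(self, title):
        return False  # stub with no dependency on the rest of MediaWiki

class HtmlOutput:
    """Output callback object: collects emitted fragments as HTML."""
    def __init__(self):
        self.parts = []
    def text(self, s):
        self.parts.append(s)
    def link(self, title, exists):
        css = "" if exists else ' class="new"'
        self.parts.append(f'<a href="/wiki/{title}"{css}>{title}</a>')
    def result(self):
        return "".join(self.parts)

class Parser:
    """The parser proper: walks the wikitext, querying the environment
    and driving the output object. A PDF or syntax-tree output class
    could be swapped in without touching this code."""
    def __init__(self, env):
        self.env = env
    def parse(self, wikitext, out):
        pos = 0
        for m in re.finditer(r"\[\[([^\[\]|]+)\]\]", wikitext):
            out.text(wikitext[pos:m.start()])
            out.link(m.group(1), self.env.page_exists(m.group(1)))
            pos = m.end()
        out.text(wikitext[pos:])
        return out.result()
```

Usage: `Parser(WikiEnvironment()).parse("See [[Foo]].", HtmlOutput())` -- the point is that the environment and output objects, not the parser, are what a reuser would replace.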
On 4 May 2011 15:16, Tim Starling tstarling@wikimedia.org wrote:
On 04/05/11 14:07, Daniel Friesen wrote:
I'm fairly certain myself that his intention was "With HipHop support since the C that HipHop compiles PHP to can be extracted and re-used we can turn that compiled C into a C library that can be used anywhere by abstracting the database calls and what not out of the php version of the parser. And because HipHop has better performance we will no longer have to worry about parser abstractions slowing down the parser and as a result increasing the load on large websites like Wikipedia where they are noticeable. So that won't be in the way of adding those abstractions anymore."
Yes that's right, more or less. HipHop generates C++ rather than C though.
Basically you would split the parser into several objects:
- A parser in the traditional sense.
- An output callback object, which would handle generation of HTML or PDF or syntax trees or whatever.
- A wiki environment interface object, which would handle link existence checks, template fetching, etc.

Then you would use HipHop to compile:

- The new parser class.
- A few useful output classes, such as HTML.
- A stub environment class which has no dependencies on the rest of MediaWiki.

Then to top it off, you would add:

- A HipHop extension which provides output and environment classes which pass their calls through to C-style function pointers.
- A stable C ABI interface to the C++ library.
- Interfaces between various high level languages and the new C library, such as Python, Ruby and Zend PHP.
Doing this would leverage the MediaWiki development community and the existing PHP codebase to provide a well-maintained, reusable reference parser for MediaWiki wikitext.
+1
This is the single most exciting news on the MediaWiki front since I started contributing to Wiktionary nine years ago (-:
Andrew Dunbar (hippietrail)
-- Tim Starling
* Daniel Friesen lists@nadir-seen-fire.com [Tue, 03 May 2011 21:07:07 -0700]:
Naturally of course if it's a C library you can build at least an extension/plugin for a number of languages. You would of course have to install the ext/plug though so it's not a shared-hosting thing.
~Daniel Friesen (Dantman, Nadir-Seen-Fire)
Latest-generation browser JavaScript implementations are generally faster than Zend PHP; not sure about HipHop. Maybe client-side parsing could reduce server load as well. JavaScript is also an extremely popular and widespread language, and it has good prospects on the server side as well. Dmitriy
On 5/2/11 5:28 PM, Tim Starling wrote:
How many wikitext parsers does the world really need?
That's a tricky question. What MediaWiki calls parsing, the rest of the world calls
1. Parsing
2. Expansion (i.e. templates, magic)
3. Applying local state, preferences, context (i.e. $n, prefs)
4. Emitting
And phases 2 and 3 depend heavily upon the state of the local wiki at the time the parse is requested. If you've ever tried to set up a test wiki that works like Wikipedia or Wikimedia Commons you'll know what I'm talking about.
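Those four phases can be written as an explicit pipeline. The stage functions below are toy stand-ins (the real ones are anything but trivial), but they show the shape of the split: only phase 1 is a pure function of the wikitext, while phases 2 and 3 need live wiki state and per-user context.

```python
def parse(wikitext):
    # 1. parsing: wikitext -> a (trivial) "syntax tree"
    return wikitext.split("{{CURRENTUSER}}")

def expand(tree, wiki_state):
    # 2. expansion: splice template/magic-word values from live wiki state
    return wiki_state["CURRENTUSER"].join(tree)

def localize(text, prefs):
    # 3. local state: apply per-user preferences and context
    return text.upper() if prefs.get("shout") else text

def emit_html(text):
    # 4. emitting: serialize to the target format (HTML here; could be PDF)
    return f"<p>{text}</p>"

def render(wikitext, wiki_state, prefs):
    """Run all four phases as a pipeline."""
    return emit_html(localize(expand(parse(wikitext), wiki_state), prefs))
```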
As for whether the rest of the world needs another wikitext parser: well, they keep writing them, so there must be some reason why this keeps happening. It's true that language chauvinism plays a part, but the inflexibility of the current approach is probably a big factor as well. The current system mashes parsing and emitting to HTML together, very intimately, and a lot of people would like those to be separate.
- if they're doing research or stats, and want a more "pure", more normalized form than HTML or Wikitext.
- if they're Google, and they want to get all the city infobox data and reuse it (this is a real request we've gotten)
- if they're OpenStreetMaps, and the same thing;
- if they're emitting to a different format (PDF, LaTeX, books);
- if they're emitting to HTML but with different needs (like mobile);
And then there's the stuff which you didn't know you wanted, but which becomes easy once you have a more flexible parser.
A couple of months ago I wrote a mini PEG-based wikitext parser in JavaScript, that Special:UploadWizard is using, today, live on Commons.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/reso...
http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/UploadWizard/reso...
While it was a bit of a heavy download (7K compressed) this gave me the ability to do pluralizations in the frontend (e.g. "3 out of 5 uploads complete") even for difficult languages like Arabic. Great!
But the unexpected benefit was that it also made it a snap to add very complicated interface behaviour to our message strings. Actually, right now, with this library plus the ingenious way that wikitext does i18n, we may have one of the best libraries out there for internationalized user interfaces. I'm considering splitting it off; it could be useful for any project that uses translatewiki.
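To make the pluralization example concrete, here is a minimal sketch of the kind of message expansion being described. It is not the UploadWizard library: `expandMessage` is an invented name, and it hard-codes the English plural rule (the whole point of the real library is that it handles harder rules, like Arabic's, per language):

```javascript
// Sketch: expand a wikitext-style i18n message such as
//   "$1 of $2 {{PLURAL:$1|upload|uploads}} complete"
// English-only plural rule, for illustration.
function expandMessage(msg, params) {
  // First substitute the numbered parameters $1, $2, ...
  const out = msg.replace(/\$(\d+)/g, function (m, i) {
    return String(params[Number(i) - 1]);
  });
  // Then resolve {{PLURAL:n|singular|plural}} against the now-literal count.
  return out.replace(/\{\{PLURAL:(\d+)\|([^|}]*)\|([^|}]*)\}\}/g,
    function (m, n, one, other) {
      return Number(n) === 1 ? one : other;
    });
}
```

Calling `expandMessage("$1 of $2 {{PLURAL:$1|upload|uploads}} complete", [3, 5])` yields "3 of 5 uploads complete". The key design point is that the plural choice lives in the translatable message, not in the calling code, so translators control it per language.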
But I don't actually want to use JavaScript for anything but the final rendering stages (I'd rather move most of this parser to PHP) so stay tuned.
Anyway, I think it's obviously possible for us to do some RTE, and some of this stuff, with the current parser. But I'm optimistic that a new parsing strategy will be a huge benefit to our community, our partners, and partners we didn't even know we could have. Imagine doing RTE with an implementation in a JS frontend that is generated from some of the same sources that the PHP backend uses.
For what it's worth: whenever I meet with Wikia employees the topic is always about what MediaWiki and the WMF can do to make their RTE hacks obsolete. That doesn't mean that their RTE isn't the right way forward, but the people who wrote it don't seem to be very strong advocates for it. But I don't want to put words in their mouth; maybe one of them can add more to this thread?
On 04/05/11 06:38, Neil Kandalgaonkar wrote:
On 5/2/11 5:28 PM, Tim Starling wrote:
How many wikitext parsers does the world really need?
That's a tricky question. What MediaWiki calls parsing, the rest of the world calls
- Parsing
- Expansion (i.e. templates, magic)
- Applying local state, preferences, context (i.e. $n, prefs)
- Emitting
And phases 2 and 3 depend heavily upon the state of the local wiki at the time the parse is requested. If you've ever tried to set up a test wiki that works like Wikipedia or Wikimedia Commons you'll know what I'm talking about.
I wasn't saying that the current MediaWiki parser is suitable for reuse, I was saying that it may be possible to develop the MediaWiki parser into something which is reusable.
-- Tim Starling
----- Original Message -----
From: "Tim Starling" tstarling@wikimedia.org
I wasn't saying that the current MediaWiki parser is suitable for reuse, I was saying that it may be possible to develop the MediaWiki parser into something which is reusable.
Aren't there a couple of parsers already which claim 99% compliance or better?
Did anything ever come of trying to assemble a validation suite, All Those Years Ago? Or, alternatively, deciding how many pages it's acceptable to break in the definition of a formal spec?
Cheers, -- jra