On 19/02/13 21:11, MZMcBride wrote:
Hi.
In the context of https://bugzilla.wikimedia.org/show_bug.cgi?id=10621, the concept of using wiki pages as databases has come up. We're already beginning to see this:
- https://en.wiktionary.org/wiki/Module:languages (over 30,000 lines)
- https://en.wikipedia.org/wiki/Module:Convertdata (over 7,400 lines)
At large enough sizes, the in-browser syntax highlighting is currently problematic.
We can disable syntax highlighting over some size.
But it's also becoming clear that the larger underlying problem is that using a single wiki page as a database isn't really scalable or sane.
The performance of #invoke should be OK for modules up to $wgMaxArticleSize (2MB). Whether the edit interface is usable at such a size is another question.
(ParserFunction #switch's performance used to prohibit most ideas of using a wiki page as a database, as I understand it.)
Both Lua and #switch have O(N) time order in this use case, but the constant you multiply by N is hundreds of times smaller for Lua.
Has any thought been given to what to do about this? Will it require manually paginating the data over collections of wiki pages? Will this be something to use Wikidata for?
Ultimately, I would like it to be addressed in Wikidata. In the meantime, multi-megabyte datasets will have to be split up, for $wgMaxArticleSize if nothing else.
-- Tim Starling
So unfortunately I don't have a clear idea of what the problem is, primarily because I don't know anything about the Parser and its inner workings, but as far as having all the data in one page goes, here's something. Maybe this is a bad idea, but how about having a PHP-array content type? In other words, MyNamespace:MyPage would render the entire data structure, but MyNamespace:MyPage/index/test/0 would map to $arr['index']['test'][0]. In the database, it would be stored as individual sub-pages: leaf sub-pages would render exactly like a normal page, while non-leaf pages would build the array from all child sub-pages and display it to the user. Would this solve the problem? Because if so, I've put some thought into it and would be willing to draft an extension giving such a capability.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2015, Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
On 19/02/13 13:56, Tyler Romeo wrote:
Maybe this is a bad idea, but how about having a PHP-array content type? [...]
You can already use subpages to store data. Access is then O(1). The "problem" is that you then have one page per entry.
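For example (the page names here are hypothetical), each entry can be a subpage module returning just its own data:

    -- Module:Languages/en contains only:
    return { name = 'English' }

    -- and a consumer fetches a single entry with:
    local en = require( 'Module:Languages/en' )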
On Tue, Feb 19, 2013 at 5:52 PM, Platonides Platonides@gmail.com wrote:
You can already use subpages to store data. Access is then O(1). [...]
I know. What I'm suggesting is an interface where the sub-pages aggregate up the hierarchy, meaning you can still edit the main top-level page, and the backend will simply update the sub-pages as appropriate.
-- Tyler Romeo
Stevens Institute of Technology, Class of 2015, Major in Computer Science
www.whizkidztech.com | tylerromeo@gmail.com
2013/2/19 Tim Starling tstarling@wikimedia.org
On 19/02/13 21:11, MZMcBride wrote:
Has any thought been given to what to do about this? Will it require manually paginating the data over collections of wiki pages? Will this be something to use Wikidata for?
Ultimately, I would like it to be addressed in Wikidata. In the meantime, multi-megabyte datasets will have to be split up, for $wgMaxArticleSize if nothing else.
I expect that, in time, Wikidata will be able to serve some of those use cases, e.g. the one covered by the languages module on Wiktionary. I am quite excited about the possibilities that access to Wikidata together with Lua will enable within a year or so... :)
Obviously not all use cases should or will be handled by Wikidata, but some of those huge switches can definitely be stored in Wikidata items.
In the long term, Wikidata is probably the way to go on something like this.
In the short term, as far as dividing things up, note that you can implement on-demand loading in Lua easily enough using the __index metamethod.
local obj = {}
setmetatable( obj, { __index = function ( t, k )
    -- This will get called on access of obj[k] if it is not already set.
    -- Do whatever you might need, e.g. require() a submodule,
    -- assign things to t for future lookups, then return the requested k.
end } )
return obj
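For instance, filled in to lazily require() per-key submodules (a sketch only; the submodule page layout here is made up):

    local obj = setmetatable( {}, { __index = function ( t, k )
        -- First access of obj[k]: pull the data from a hypothetical submodule.
        local sub = require( 'Module:MyData/' .. k )
        rawset( t, k, sub )  -- cache it, so __index fires only once per key
        return sub
    end } )

    return obj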
Also note that you can save space, at the expense of some code complexity, by accessing "obj.us_name or obj.name" rather than storing the same string in both fields. Remember that in Lua only nil (unset) and boolean false are considered false; the number 0 and the empty string are both considered true.
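Concretely (the field names are just for illustration):

    local display = obj.us_name or obj.name
    -- The fallback triggers only when us_name is absent (nil). Note that an
    -- empty string is truthy in Lua, so us_name = '' would NOT fall through.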
I wrote:
The performance of #invoke should be OK for modules up to $wgMaxArticleSize (2MB). Whether the edit interface is usable at such a size is another question.
The Wiktionary folk were gnashing their teeth today when they discovered that loading a 742KB module 1200 times in a single page does in fact take a long time, and trips the CPU limit after about 450 invocations. So, sorry for raising expectations about that.
-- Tim Starling
Aren't modules which are already loaded cached? If they load it 1200 times on a single page, how does it manage to affect CPU time that badly?
-- Victor.
On 20/02/13 15:07, Victor Vasiliev wrote:
Aren't modules which are already loaded cached? If they load it 1200 times on a single page, how does it manage to affect CPU time that badly?
Execution of the module chunk seems to be the main reason. I benchmarked it locally at 10.6ms, so 450 of those would be 4.8s.
Lua has a lot of O(N) work to do when a large table literal is executed. I'm experimenting with using large string literals instead:
https://en.wiktionary.org/w/index.php?title=Module:Languages_string_db&action=edit
That module takes about 2us for module chunk execution when I run it locally, and around 30us for each lookup in a tight loop on the server side. But when I use it in a large article, it seems to use about 1.4ms per #invoke, so maybe there's still some overhead that needs to be tracked down.
The idea of storing a database in a large string literal could be made fairly efficient and user-friendly if a helper module were written to do parsing and a binary search.
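A minimal sketch of what such a helper might look like (illustrative only, not the code in the module linked above): store fixed-width records sorted by key, so each probe is a string.sub at a computed offset and a lookup is O(log N) with no per-chunk setup cost.

    local REC = 16  -- total record width in bytes
    local KEY = 8   -- key field width in bytes

    -- Example records, each exactly REC bytes: 8-byte key, 8-byte value,
    -- space-padded and sorted by key.
    local db = table.concat( {
        'de      German  ',
        'en      English ',
        'fr      French  ',
    } )

    local function lookup( key )
        key = key .. string.rep( ' ', KEY - #key )  -- pad the key to fixed width
        local lo, hi = 1, #db / REC
        while lo <= hi do
            local mid = math.floor( (lo + hi) / 2 )
            local off = (mid - 1) * REC
            local k = string.sub( db, off + 1, off + KEY )
            if k == key then
                -- Strip the padding from the value before returning it.
                return ( string.sub( db, off + KEY + 1, off + REC ):gsub( ' +$', '' ) )
            elseif k < key then
                lo = mid + 1
            else
                hi = mid - 1
            end
        end
        return nil  -- key not present
    end

    -- lookup( 'en' ) returns 'English'; lookup( 'xx' ) returns nil.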
-- Tim Starling
On Feb 20, 2013 at 3:54 pm, Tim Starling wrote:
The idea of storing a database in a large string literal could be made fairly efficient and user-friendly if a helper module were written to do parsing and a binary search.
I have implemented the above suggestion with some promising results. Packing a large table in a string and unpacking it on demand appears to work well, and the data is accessed as if it were stored in a standard table. Using the table from Wiktionary Module:Languages mentioned earlier in this thread, testing shows that accessing the packed data is 20 times faster. Info is at
http://test2.wikipedia.org/wiki/User_talk:Johnuniq#Big_tables
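Presumably the packed lookup is wired up behind the __index metamethod shown earlier in the thread, so callers read it like an ordinary table (a guess at the shape, not necessarily the actual code at the link above):

    local data = setmetatable( {}, { __index = function ( t, k )
        local v = lookup( k )  -- e.g. a string-parsing helper like the sketch above
        rawset( t, k, v )      -- cache, so each entry is unpacked at most once
        return v
    end } )

    -- Callers then write data['en'] just as if the full table had been built.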
Johnuniq
On Fri, Feb 22, 2013 at 4:41 AM, Johnuniq wp.johnuniq@gmail.com wrote:
I have implemented the above suggestion with some promising results. [...]
Note that https://gerrit.wikimedia.org/r/#/c/50299/ added a mw.loadData() function that should solve the problem for normal tables. It works like require, but can only handle simple data (no functions, tables with metatables, or tables with tables as keys), the returned data structure is made read-only, and it avoids having to re-execute the module chunk on every #invoke.
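Typical usage (the module names here are hypothetical): the data module returns a plain table, and consumers fetch it with mw.loadData so the chunk is executed at most once per page parse rather than once per #invoke.

    -- Module:Languages/data returns nothing but a table of data:
    return {
        en = { name = 'English' },
        de = { name = 'German' },
    }

    -- Module:Languages, the consuming module:
    local data = mw.loadData( 'Module:Languages/data' )
    local p = {}

    function p.name( frame )
        local entry = data[ frame.args[1] or '' ]
        return entry and entry.name or ''
    end

    return p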
Speaking of which, I need to update the documentation.
On Tuesday, February 19, 2013 at 4:27 AM, Tim Starling wrote:
On 19/02/13 21:11, MZMcBride wrote:
At large enough sizes, the in-browser syntax highlighting is currently problematic.
We can disable syntax highlighting over some size.
https://gerrit.wikimedia.org/r/#/c/49985/ disables the highlighting of symbols if it looks like there may be a lot of them. The patch is against SyntaxHighlight_GeSHi, since the problem is not specific to Lua or Scribunto.
-- Ori Livneh