Is there any way that extension developers can get some sort of notice for
breaking changes, e.g., https://gerrit.wikimedia.org/r/50138? Luckily my
extension's JobQueue implementation hasn't been merged yet, but if it had been,
I would have had no idea that it had been broken by core.
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo(a)gmail.com
Sorry, I forgot to mention that I have the English Wikipedia dump in mind.
wiki writes:
> Hello.
>
> I'm a newbie who wants to start playing with the XML dumps. I've found
> instructions here and there on how to import these. I'd like to seek
> guidance, though, as to how much free disk space one is required to have for
> the MySQL import to succeed. I.e., after I have already installed LAMP +
> MediaWiki and already allocated space for the bzip2 file and the converted
> import statements file, roughly how much more space is needed?
>
> Thank you!
>
> - sam -
>
Hello.
I'm a newbie who wants to start playing with the XML dumps. I've found
instructions here and there on how to import these. I'd like to seek
guidance, though, as to how much free disk space one is required to have for
the MySQL import to succeed. I.e., after I have already installed LAMP +
MediaWiki and already allocated space for the bzip2 file and the converted
import statements file, roughly how much more space is needed?
Thank you!
- sam -
Hi all!
I would like to ask for your input on the question of how non-wikitext content
can be indexed by LuceneSearch.
The background is that full-text search (Special:Search) is nearly useless
on wikidata.org at the moment; see
<https://bugzilla.wikimedia.org/show_bug.cgi?id=42234>.
The reason for the problem appears to be that when rebuilding a Lucene index
from scratch, using an XML dump of wikidata.org, the raw JSON structure used by
Wikibase gets indexed. The indexer is blind; it just takes whatever "text" it
finds in the dump. Indexing JSON does not work at all for full-text search,
especially not when non-ASCII characters are represented as Unicode escape
sequences.
Inside MediaWiki, in PHP, this works like this:
* wikidata.org (or rather, the Wikibase extension) stores non-text content in
wiki pages, using a ContentHandler that manages a JSON structure.
* Wikibase's EntityContent class implements Content::getTextForSearchIndex() so
that it returns the labels and aliases of an entity. Data items thus get indexed
by their labels and aliases.
* getTextForSearchIndex() is used by the default MySQL search to build an index.
It's also (ab)used by things that can only operate on flat text, like the
AbuseFilter extension.
* The LuceneSearch index gets updated live using the OAI extension, which in
turn knows to use getTextForSearchIndex() to get the text for indexing.
So, for anything indexed live, this works, but for rebuilding the search index
from a dump, it doesn't - because the Java indexer knows nothing about content
types, and has no interface for an extension to register additional content types.
To improve this, I can think of a few options:
1) Create a specialized XML dump that contains the text generated by
getTextForSearchIndex() instead of the actual page content. However, that only
works if the dump is created using the PHP dumper. How are the regular dumps
currently generated on WMF infrastructure? Also, would it be feasible to make an
extra dump just for LuceneSearch (at least for wikidata.org)?
2) We could re-implement the ContentHandler facility in Java, and require
extensions that define their own content types to provide a Java-based handler
in addition to the PHP one. That seems like a pretty massive undertaking of
dubious value, but it would allow maximum control over what gets indexed and how.
3) The indexer code (without plugins) should not know about Wikibase, but it may
have hard-coded knowledge about JSON. It could have a special indexing mode for
JSON, in which the structure is deserialized and traversed, and any values are
added to the index (while the keys used in the structure would be ignored). We
may still be indexing useless internals from the JSON, but at least there would
be a lot fewer false negatives.
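To make 3) a bit more concrete, here is a minimal sketch of what such a JSON
indexing mode might look like on the Java side. The class and method names are
made up, and Jackson is only assumed as the JSON library for the sake of the
example; the real indexer may well use something else:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.util.Iterator;

public class JsonTextExtractor {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Returns a flat, space-separated string of all scalar values found in
    // the JSON blob, suitable for feeding into the existing text indexing.
    public static String extractIndexText(String rawJson) throws IOException {
        StringBuilder out = new StringBuilder();
        collectValues(MAPPER.readTree(rawJson), out);
        return out.toString().trim();
    }

    private static void collectValues(JsonNode node, StringBuilder out) {
        if (node.isTextual() || node.isNumber()) {
            // the parser has already decoded \uXXXX escapes at this point
            out.append(node.asText()).append(' ');
        } else if (node.isContainerNode()) {
            // elements() iterates array entries and object field *values*,
            // so the structural keys are skipped automatically
            for (Iterator<JsonNode> it = node.elements(); it.hasNext();) {
                collectValues(it.next(), out);
            }
        }
    }
}

Since the JSON parser decodes the \uXXXX escapes while parsing, the values
collected this way would contain the actual non-ASCII characters rather than
escape sequences, which should already help with the problem described above.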
I personally would prefer 1) if dumps are created with PHP, and 3) otherwise. 2)
looks nice, but it would be hard to keep the Java and the PHP versions from diverging.
So, how would you fix this?
thanks
daniel
Hello!
This is your friendly weekly deployments highlight email.
For the week of March 11th (next week), here are some things to be aware
of:
* Scribunto (Lua) will be available on all wikis as of Wed the 13th
* HTTPS for all logged-in users
This is planned to happen next week, but the exact deployment window
is still to be determined. I will inform wikitech-l and -ambassadors
when it is scheduled.
See this bug for more info:
https://bugzilla.wikimedia.org/show_bug.cgi?id=39380
Best,
Greg
--
| Greg Grossmeier GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg A18D 1138 8E47 FAC8 1C7D |
On Fri, Mar 8, 2013 at 8:35 AM, Tyler Romeo <tylerromeo(a)gmail.com> wrote:
> Is there any way that extension developers can get some sort of notice for
> breaking changes, e.g., https://gerrit.wikimedia.org/r/50138? Luckily my
> extension's JobQueue implementation hasn't been merged yet, but if it had been,
> I would have had no idea that it had been broken by core.
Hi Tyler,
Sorry to hear that there might be a problem here. It's been a pet
peeve of mine that we seem to be a little too eager to break backwards
compatibility in places where it may not be necessary. That said,
let's try to avoid a meta-process discussion before we collectively
understand the example you are bringing up, and focus on the JobQueue.
As near as I can tell from a quick skim of the changeset you're
referencing, Aaron's changes here are purely additive. Am I reading
this wrong? Is there some other changeset that changes/removes
existing interfaces that you meant to reference instead?
Rob
As you probably know, the search in Wikidata sucks big time.
Until we have created a proper Solr-based search and deployed it on that
infrastructure, we would like to implement and set up a reasonable stopgap
solution.
The simplest and most obvious signal for sorting the items would be to
1) make a prefix search
2) weight all results by the number of Wikipedias they link to
This should usually provide the item you are looking for. Currently, the
search order is random. Good luck with finding items like California,
Wellington, or Berlin.
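Just to illustrate the intended ordering, here is a toy sketch in Java with
made-up names. It is not the actual wb_terms schema or query; in the real
implementation the prefix match and the ordering would of course happen in
MySQL:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PrefixSearchSketch {

    // Hypothetical stand-in for a wb_terms row plus a precomputed "weight"
    // (the number of Wikipedias the item links to).
    public static class TermRow {
        public final String entityId;
        public final String label;
        public final int sitelinkCount;

        public TermRow(String entityId, String label, int sitelinkCount) {
            this.entityId = entityId;
            this.label = label;
            this.sitelinkCount = sitelinkCount;
        }
    }

    // 1) prefix search on the label, 2) order the matches by sitelink count.
    public static List<TermRow> search(List<TermRow> terms, String prefix) {
        String p = prefix.toLowerCase();
        List<TermRow> matches = new ArrayList<TermRow>();
        for (TermRow t : terms) {
            if (t.label.toLowerCase().startsWith(p)) {
                matches.add(t);
            }
        }
        Collections.sort(matches, new Comparator<TermRow>() {
            public int compare(TermRow a, TermRow b) {
                // heavier items (more sitelinks) come first
                return Integer.compare(b.sitelinkCount, a.sitelinkCount);
            }
        });
        return matches;
    }
}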
Now, what I want to ask is: what would be the appropriate index structure
for that table? The data is saved in the wb_terms table, which would need
to be extended with a "weight" field. There is already a suggestion (based on
discussions between Tim and Daniel K, if I understood correctly) to change
the wb_terms table index structure (see here <
https://bugzilla.wikimedia.org/show_bug.cgi?id=45529> ), but since we are
changing the index structure anyway it would be great to get it right this
time.
Can anyone jump in? (Looking especially at Asher and Tim.)
Any help would be appreciated.
Cheers,
Denny
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
GNU LibreJS blocks several JavaScript sources around Wikipedia. I was
sent to this list by Kirk Billund. My issue as well as Kirk's replies
follows. I hope you are okay with reading it in this form.
03/05/2013 11:16 - Alexander Berntsen wrote:
>>>> GNU LibreJS[0] reports that several of the JavaScript sources
>>>> embedded by different parts of Wikipedia are proprietary[1].
>>>> Is this a conscious anti-social choice[2], or have you merely
>>>> not set up your source files to properly show their
>>>> licence[3]?
>>>>
>>>> If the latter is the case, please remedy this. If the former
>>>> is the case... please remedy this. It is extremely
>>>> important.[4] In any event I hope to get a reply, as the
>>>> distinction is important to me.
>>>>
>>>> [0] https://www.gnu.org/software/librejs/
>>>> [1] https://www.gnu.org/philosophy/categories.html#ProprietarySoftware
>>>> [2] https://www.gnu.org/philosophy/javascript-trap.html
>>>> [3] https://www.gnu.org/software/librejs/free-your-javascript.html
>>>> [4] https://www.gnu.org/philosophy/why-free.html
On 05/03/13 11:38, Wikipedia information team wrote:
>>> All of the MediaWiki[1] code base that Wikipedia uses is licensed
>>> under the GPL[2], including the JavaScript. Also included in
>>> that is the freely-licensed (MIT) jQuery[3] library. However,
>>> some code is actually written by the individual users, like
>>> English Wikipedia's custom JavaScript[4], which is licensed as
>>> CC-BY-SA-3.0 since all content pages are automatically licensed
>>> that way[5].
>>>
>>> Additionally, our JavaScript is minified[6], so adding comments
>>> is not possible. If you have further concerns, you can respond
>>> to me, email the general Wikimedia technical list[7], or email
>>> a general MediaWiki help list[8].
>>>
>>>
>>> [1] https://www.mediawiki.org/wiki/MediaWiki
>>> [2] https://www.mediawiki.org/wiki/License
>>> [3] https://en.wikipedia.org/wiki/JQuery
>>> [4] https://en.wikipedia.org/wiki/MediaWiki:Common.js
>>> [5] https://en.wikipedia.org/wiki/Wikipedia:Copyrights
>>> [6] https://www.mediawiki.org/wiki/ResourceLoader
>>> [7] https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>> [8] https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
03/05/2013 11:16 - Alexander Berntsen wrote:
>> Is it not possible to insert the licence as part of your build
>> process? What I do with compiled or minified JavaScript is to
>> build everything, and then insert the licence into all files using
>> Bash.
On 05/03/13 12:41, Wikipedia information team wrote:
> Unfortunately I don't fully understand how the minification process
> works, so it would probably be better if you asked your question on
> our technical mailing list
> <https://lists.wikimedia.org/mailman/listinfo/wikitech-l> and
> someone there would be able to give you a more specific answer.
--
Alexander
alexander(a)plaimi.net
http://plaimi.net/~alexander
Interesting article I found about Redis and its poor performance with SSDs
as a swap medium. For whoever might be interested.
http://antirez.com/news/52
--
Tyler Romeo
Stevens Institute of Technology, Class of 2015
Major in Computer Science
www.whizkidztech.com | tylerromeo(a)gmail.com