Tabular data in Wikipedia (Wikispecies -> Wikicommons)


Anthony DiPierro

25 Aug 2004

1:29 a.m.

I think there is a growing sentiment that people do not want to fork Wikipedia to create Wikispecies. This makes a lot of sense. However, there is something important that is needed for Wikispecies which Wikipedia does not provide: efficient access of tabular data.

Think of what we would want to be able to do with Wikispecies. Yes, the simplest of them could be handled by categories, and taxoboxes are a kludgy solution to some of the data input, but now what if I want a list of all endangered species in the phylum chordata? There's just no efficient way to get that information from Wikipedia, even if I have access to the entire database dump.

It is possible that Wikipedia could adapt to handle this type of data, but this is a somewhat fundamental shift in the concept of a wiki. We would essentially need an open access database, where even the table structure itself can be modified, complete with a history mechanism which can somehow allow us to revert poorly thought changes.

There is another benefit to Wikispecies, and it is the same thing we're seeing with the proposal of Wikicommons. Species classification information is largely language-neutral. It would be nice if we could somehow have a single database for all this information, and simply use it on the language-specific pages. Once again, this could be done within Wikipedia though, and this change would be somewhat less of a fundamental shift. In essence, we would simply move the taxoboxes to a common database, in latin, and translate into the local language on the fly (regnum->kingdom, etc.). There is a bit of coding involved here, but once Wikicommons is properly set up a lot of it will already be done.
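A minimal sketch of the on-the-fly translation described above, assuming the shared record stores Latin rank names as keys and each language supplies its own label table. The names `RANK_ORDER`, `RANK_LABELS`, `COELACANTH` and `render_taxobox` are hypothetical, not part of any existing MediaWiki code.

```python
# Latin rank names are the language-neutral keys of the shared record.
RANK_ORDER = ("regnum", "phylum", "classis", "ordo", "familia", "genus", "species")

# Per-language labels (assumed tables; each wiki would maintain its own).
RANK_LABELS = {
    "en": {"regnum": "Kingdom", "phylum": "Phylum", "classis": "Class",
           "ordo": "Order", "familia": "Family", "genus": "Genus",
           "species": "Species"},
    "de": {"regnum": "Reich", "phylum": "Stamm", "classis": "Klasse",
           "ordo": "Ordnung", "familia": "Familie", "genus": "Gattung",
           "species": "Art"},
}

# One shared record, stored once for all languages (illustrative subset).
COELACANTH = {"regnum": "Animalia", "phylum": "Chordata", "genus": "Latimeria"}


def render_taxobox(record, lang):
    """Return (label, value) rows, falling back to the Latin key if untranslated."""
    labels = RANK_LABELS.get(lang, {})
    return [(labels.get(rank, rank), record[rank])
            for rank in RANK_ORDER if rank in record]


for label, value in render_taxobox(COELACANTH, "en"):
    print(f"{label}: {value}")
```

Only the label tables differ per language; the record itself is edited in one place.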

In the end, I'm opposed to creating a Wikispecies, at least as a Wikimedia project. I think our efforts are better focussed on creating a system which can incorporate Wikispecies, Wikipeople, and all the other wikiprojects together under one roof. I think there are a lot of steps along the way, and Wikicommons is probably the logical first one.

Anthony


Daniel Mayer

25 Aug

2:23 a.m.

New subject: Tabular data in Wikipedia (Wikispecies -> Wikicommons)

--- Anthony DiPierro anthonydipierro@hotmail.com wrote:

...

I think there is a growing sentiment that people do not want to fork Wikipedia to create Wikispecies. This makes a lot of sense. However, there is something important that is needed for Wikispecies which Wikipedia does not provide: efficient access of tabular data.

Yep.

...

Think of what we would want to be able to do with Wikispecies. Yes, the simplest of them could be handled by categories, and taxoboxes are a kludgy solution to some of the data input, but now what if I want a list of all endangered species in the phylum chordata? There's just no efficient way to get that information from Wikipedia, even if I have access to the entire database dump.

Well that may be true in some cases but your example could be done if the current category system were extended and an advanced search function added. Such a search page could be used to SELECT ALL [endangered species] FROM [Chordates] to RETURN a [Species] list.

In this example [endangered species], [Chordates], and [Species] would all be categories. However, since [endangered species] would be a subcategory of [Species], there would be no need to have the [Species] category in those articles. Nor would there be a need to have the [Chordates] category in those articles, since they would all presumably have a sub-sub-subcategory of [Chordates] that indicates which genus the animal belongs to.
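A rough sketch of how such a search could resolve categories transitively, so that an article tagged only with its genus still counts as a member of [Chordates]. The category graph, the article data, and the function names `ancestors` and `select_all` are all made up for illustration; this is not how the existing category system is implemented.

```python
# Hypothetical category graph: category -> its parent categories.
PARENTS = {
    "Latimeria": ["Coelacanths"],
    "Coelacanths": ["Chordates"],
    "Panthera": ["Felidae"],
    "Felidae": ["Chordates"],
    "Apis": ["Arthropods"],
    "Endangered species": ["Species"],
}

# Hypothetical articles, each tagged only with its most specific categories.
# The statuses are illustrative placeholders, not authoritative data.
ARTICLES = {
    "Coelacanth": ["Endangered species", "Latimeria"],
    "Tiger": ["Endangered species", "Panthera"],
    "Lion": ["Panthera"],
    "Honey bee": ["Apis"],
}


def ancestors(category, parents):
    """Return the category plus every category reachable through parent links."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def select_all(wanted, source, articles, parents):
    """SELECT ALL [wanted] FROM [source]: titles that fall under both categories."""
    result = []
    for title, categories in articles.items():
        closure = set()
        for category in categories:
            closure |= ancestors(category, parents)
        if wanted in closure and source in closure:
            result.append(title)
    return sorted(result)


print(select_all("Endangered species", "Chordates", ARTICLES, PARENTS))
```

Walking the parent links on every query is simple but slow on a large wiki; a real implementation would probably precompute or index the closure.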

...

It is possible that Wikipedia could adapt to handle this type of data, but this is a somewhat fundamental shift in the concept of a wiki. We would essentially need an open access database, where even the table structure itself can be modified, complete with a history mechanism which can somehow allow us to revert poorly thought changes.

I think that this will be simpler and more wiki than it first appears.

...

There is another benefit to Wikispecies, and it is the same thing we're seeing with the proposal of Wikicommons. Species classification information is largely language-neutral. It would be nice if we could somehow have a single database for all this information, and simply use it on the language-specific pages. Once again, this could be done within Wikipedia though, and this change would be somewhat less of a fundamental shift. In essence, we would simply move the taxoboxes to a common database, in latin, and translate into the local language on the fly (regnum->kingdom, etc.). There is a bit of coding involved here, but once Wikicommons is properly set up a lot of it will already be done.

Putting the taxoboxes in a common database does sound interesting (linking to the scientific names shouldn't be a big deal since each species/taxon article should have the scientific name redirected to it). The element tables are very similar and would also benefit from a common database (sadly, one mistake I made affected about 50 element tables I created, and I noticed and fixed it only well after other-language Wikipedias had started to copy and translate those tables).

We also have the same database design problems with interlanguage links, user accounts, and logins.

-- mav


James R. Johnson

4:22 a.m.

New subject: New Wikipedia Formation

I have a question on new wikipedias. When a new wp is started, is it simply a blank space, or are there a few stubs for the translators to use? I was wondering whether the "100 necessary articles" could be included as a basis for each new wikipedia. That would give each a base size of at least 100 articles and a head start.

James

Angela_

4:28 a.m.

New subject: New Wikipedia Formation

On Tue, 24 Aug 2004 21:22:45 -0400, James R. Johnson modean52@comcast.net wrote:

...

I have a question on new wikipedias. When a new wp is started, is it simply a blank space, or are there a few stubs for the translators to use? I was wondering whether the "100 necessary articles" could be included as a basis for each new wikipedia. That would give each a base size of at least 100 articles and a head start.

There's an ongoing project to provide simplified versions of the 1000 articles every Wikipedia should have. These are not imported automatically to new wikis, but can be taken from http://simple.wikipedia.org/wiki/Wikipedia:List_of_articles_all_languages_sh... and translated.

Angela.

Pierre Abbat

4:52 a.m.

New subject: New Wikipedia Formation

On Tuesday 24 August 2004 21:22, James R. Johnson wrote:

...

I have a question on new wikipedias. When a new wp is started, is it simply a blank space, or are there a few stubs for the translators to use? I was wondering whether the "100 necessary articles" could be included as a basis for each new wikipedia. That would give each a base size of at least 100 articles and a head start.

AFAICT it just consists of [[Main Page]]. When I started writing jbo:, I promptly moved it to [[ralju papri]], translated general article names (such as "music" and "geography") from other Main Pages, and copied the coelacanth article (which I wrote the original version of in Lojban before translating to English). The taxobox had all the templates missing, so I added them.

phma

-- li fi'u vu'u fi'u fi'u du li pa

Magnus Manske

11:30 a.m.

New subject: Tabular data in Wikipedia (Wikispecies -> Wikicommons)

Daniel Mayer wrote:

...

We also have the same database design problems with interlanguage links, user accounts, and logins.

So we should fix these. Create a single database that contains
* interlanguage links (for all languages)
* user accounts/logins
* raw data for species/elements/whatever

Interlanguage links would have to be extracted and removed from the articles and put into an extra textbox on editing, which IMHO would be an improvement.

Local user accounts can stay as they are, but have a "user_global_id" field that, when non-zero, uses the global instead of the local one. User accounts with the same name *and* password can be automatically merged into a global account once.
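A minimal sketch of the account scheme in the paragraph above, assuming local accounts are plain records and that a `user_global_id` of 0 means "still purely local". `LocalUser`, `merge_matching_accounts` and `effective_identity` are hypothetical names, and comparing stored password hashes stands in for whatever password check the real software would use.

```python
from dataclasses import dataclass


@dataclass
class LocalUser:
    wiki: str
    name: str
    password_hash: str
    user_global_id: int = 0  # 0 = no global account; the local one is used


def merge_matching_accounts(users):
    """One-off merge: accounts sharing name AND password get one global id."""
    groups = {}
    for user in users:
        groups.setdefault((user.name, user.password_hash), []).append(user)
    next_id = 1
    for members in groups.values():
        if len(members) > 1:  # a single account has nothing to merge with
            for user in members:
                user.user_global_id = next_id
            next_id += 1


def effective_identity(user):
    """Use the global account when user_global_id is non-zero, else the local one."""
    if user.user_global_id:
        return ("global", user.user_global_id)
    return (user.wiki, user.name)


users = [
    LocalUser("en", "alice", "hash-1"),
    LocalUser("de", "alice", "hash-1"),  # same name and password: merged
    LocalUser("fr", "alice", "hash-2"),  # same name, different password: stays local
]
merge_matching_accounts(users)
for user in users:
    print(user.wiki, effective_identity(user))
```

The account on "fr" keeps working as before, which matches the case where a name is already taken by someone else on another wiki.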

Raw data would at least need these fields
* type (species, element, etc.)
* ID (latin species name, for example)
* key
* value
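To make the four fields concrete, here is a small sketch using SQLite from the Python standard library. The table name `raw_data`, the index, the helper `endangered_in_phylum`, and all sample rows are assumptions for illustration; the status values are placeholders rather than real conservation data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE raw_data ('
    ' "type" TEXT NOT NULL,'   # species, element, etc.
    ' "id" TEXT NOT NULL,'     # e.g. the Latin species name
    ' "key" TEXT NOT NULL,'
    ' "value" TEXT NOT NULL)'
)
# Index so that lookups by key/value do not scan the whole table.
conn.execute('CREATE INDEX raw_data_kv ON raw_data ("type", "key", "value")')

conn.executemany(
    "INSERT INTO raw_data VALUES (?, ?, ?, ?)",
    [
        ("species", "Latimeria chalumnae", "phylum", "Chordata"),
        ("species", "Latimeria chalumnae", "status", "endangered"),
        ("species", "Panthera leo", "phylum", "Chordata"),
        ("species", "Panthera leo", "status", "other"),
        ("species", "Apis mellifera", "phylum", "Arthropoda"),
        ("element", "Fe", "atomic_number", "26"),
    ],
)


def endangered_in_phylum(connection, phylum):
    """List species ids that have both phylum=<phylum> and status=endangered."""
    rows = connection.execute(
        'SELECT a."id" FROM raw_data AS a'
        ' JOIN raw_data AS b ON a."type" = b."type" AND a."id" = b."id"'
        ' WHERE a."type" = \'species\''
        ' AND a."key" = \'phylum\' AND a."value" = ?'
        ' AND b."key" = \'status\' AND b."value" = \'endangered\''
        ' ORDER BY a."id"',
        (phylum,),
    )
    return [row[0] for row in rows]


print(endangered_in_phylum(conn, "Chordata"))
```

Each extra condition costs one more self-join, which is the usual trade-off of a key/value layout against a fixed table per type.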

Personally, I would include the wikicommons project into that design, so we have a central database for images and multimedia at the same place.

Magnus

James R. Johnson

5:34 p.m.

New subject: Wiki-login [was: RE: Tabular data in Wikipedia (Wikispecies-> Wikicommons)]

So, can I simply have one login, and access every wiki-project, and every language wiki, without having to re-register for every single one?

James


Magnus Manske

8:08 p.m.

New subject: Wiki-login [was: RE: Tabular data in Wikipedia (Wikispecies-> Wikicommons)]

James R. Johnson wrote:

...

So, can I simply have one login, and access every wiki-project, and every language wiki, without having to re-register for every single one?

That's the idea. The exception, of course, is if your user name is already used by someone else on another wiki. For these cases, it would stay just as it is now.

Magnus

Tim Starling

6:14 a.m.

New subject: WikiDB (was Re: Tabular data in Wikipedia (Wikispecies -> Wikicommons))

Anthony DiPierro wrote:

...

I think there is a growing sentiment that people do not want to fork Wikipedia to create Wikispecies. This makes a lot of sense. However, there is something important that is needed for Wikispecies which Wikipedia does not provide: efficient access of tabular data.

I was thinking along similar lines a couple of months ago, when considering the needs of Wikiquote, and to a lesser extent Wiktionary. The basic plan I came up with was:

* Each record is an arbitrary list of key-value pairs.
* User-designed edit forms and search forms constrain data entry to particular fields
* User-designed display templates are used to format the data and integrate it with other parts of the wiki
* Each field has an index, to allow fast searching and report generation.
* Indexes may be sparse, with most records not containing a given field. I believe this can be efficiently handled by creating a separate table for each key, on demand at runtime.

The technical design is subject to constraints such as:
* Robust against frivolous or malicious addition of new fields to a small number of records
* Peer review and the reverting of any change must be easy
* Capable of efficiently storing a number of different schemas (represented by different data entry, search and display forms) in the one database.

Challenges which still have to be addressed include:
* Possibility of slow insert times
* Malicious or accidental destruction of fields which are commonly used for indexed retrieval -- this may make records hard to find and revert.
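The plan above can be sketched in a few lines, with heavy simplification: an in-memory dict per key stands in for the "separate table for each key", and a per-record revision list stands in for the wiki history. The class `WikiDB` and its methods are hypothetical; nothing like this exists in MediaWiki as far as this thread says.

```python
from collections import defaultdict


class WikiDB:
    """Toy store: key-value records, per-key indexes, full revision history."""

    def __init__(self):
        self.history = defaultdict(list)  # record id -> list of revisions
        self.indexes = {}                 # key -> {value -> set of record ids}

    def _index_for(self, key):
        # Created on demand, so a rarely used field costs nothing until it appears.
        return self.indexes.setdefault(key, defaultdict(set))

    def save(self, record_id, fields):
        """Store a new revision and move the record's index entries to match it."""
        revisions = self.history[record_id]
        if revisions:
            for key, value in revisions[-1].items():
                self._index_for(key)[value].discard(record_id)
        revisions.append(dict(fields))
        for key, value in fields.items():
            self._index_for(key)[value].add(record_id)

    def revert(self, record_id):
        """Undo the latest change by saving the previous revision again."""
        revisions = self.history[record_id]
        if len(revisions) < 2:
            raise ValueError("nothing to revert to")
        self.save(record_id, revisions[-2])

    def find(self, key, value):
        """Indexed lookup: ids of records whose current revision has key=value."""
        return sorted(self.indexes.get(key, {}).get(value, ()))


db = WikiDB()
db.save("q1", {"author": "Example Author", "topic": "travel"})
db.save("q1", {"topic": "travel"})          # someone deletes the author field
print(db.find("author", "Example Author"))  # record no longer found via the index
db.revert("q1")
print(db.find("author", "Example Author"))  # found again after the revert
```

Because history is keyed by record id rather than by field, a record whose indexed fields were destroyed can still be reverted, although someone has to notice the damage first; that part of the challenge is not solved by the sketch.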

Far be it from me to co-opt such a generic term as "WikiDB"; I would call this *a* WikiDB module for MediaWiki. The applications for such a package would be extraordinarily broad. All that we need now is for me (or someone else) to get motivated and write it.

-- Tim Starling

Stan Shebs

26 Aug

12:34 a.m.

New subject: WikiDB (was Re: Tabular data in Wikipedia (Wikispecies -> Wikicommons))

Tim Starling wrote:

...

Anthony DiPierro wrote:

...
I think there is a growing sentiment that people do not want to fork Wikipedia to create Wikispecies. This makes a lot of sense. However, there is something important that is needed for Wikispecies which Wikipedia does not provide: efficient access of tabular data.

I was thinking along similar lines a couple of months ago, when considering the needs of Wikiquote, and to a lesser extent Wiktionary. The basic plan I came up with was:

* Each record is an arbitrary list of key-value pairs.
* User-designed edit forms and search forms constrain data entry to particular fields
* User-designed display templates are used to format the data and integrate it with other parts of the wiki
* Each field has an index, to allow fast searching and report generation.
* Indexes may be sparse, with most records not containing a given field. I believe this can be efficiently handled by creating a separate table for each key, on demand at runtime.

It would be cool if there was some way to differentiate language-specific and language-neutral data. A 747's wingspan is the same irrespective of language, so it should be possible to record and edit it in only one place, but the common name of Homo sapiens is per-language.

Stan

Tim Starling

3:54 a.m.

New subject: WikiDB (was Re: Tabular data in Wikipedia (Wikispecies -> Wikicommons))

Stan Shebs wrote:

...

It would be cool if there was some way to differentiate language-specific and language-neutral data. A 747's wingspan is the same irrespective of language, so it should be possible to record and edit it in only one place, but the common name of Homo sapiens is per-language.

You could have a separate field for each language, and create display templates which only display the local language.
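As a sketch of that suggestion, assume language-specific fields carry a language suffix such as `common_name:en`, while fields without a suffix are language-neutral. The suffix convention, the sample record `HUMAN`, and the function `localized_view` are assumptions made for this example only.

```python
# One shared record: neutral fields have no suffix, per-language fields do.
HUMAN = {
    "scientific_name": "Homo sapiens",
    "common_name:en": "human",
    "common_name:de": "Mensch",
}


def localized_view(record, lang):
    """Keep language-neutral fields plus the fields for the requested language."""
    view = {}
    for key, value in record.items():
        base, separator, field_lang = key.partition(":")
        if not separator:
            view[base] = value   # language-neutral, shown everywhere
        elif field_lang == lang:
            view[base] = value   # local-language field, suffix stripped
    return view


print(localized_view(HUMAN, "de"))
```

A display template on each wiki would then only ever see the neutral fields and its own language's fields.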

An interesting extension which I forgot to mention in my original post would be to bring in some aspects of RDBMS design. The obvious one is to allow fields which contain a list of strings or article titles. This could be implemented underneath as a many-to-many relationship involving IDs. So if one of the items in the list is renamed, it appears to the user to have been instantly changed in all lists. Searching for all articles with a given item in its list would be a fast indexed operation. This would be useful for categories, keyword searches, and similar features.
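A small sketch of the many-to-many idea, again using SQLite from the Python standard library. The table names `item` and `record_item` and the helper functions are hypothetical. Lists store item IDs rather than titles, so a rename touches a single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_id INTEGER PRIMARY KEY,
    title   TEXT UNIQUE NOT NULL
);
CREATE TABLE record_item (
    record_id INTEGER NOT NULL,
    item_id   INTEGER NOT NULL REFERENCES item(item_id),
    PRIMARY KEY (record_id, item_id)
);
CREATE INDEX record_item_by_item ON record_item (item_id);
""")


def add_to_list(connection, record_id, title):
    """Add an item (created if needed) to the list field of a record."""
    connection.execute("INSERT OR IGNORE INTO item (title) VALUES (?)", (title,))
    (item_id,) = connection.execute(
        "SELECT item_id FROM item WHERE title = ?", (title,)).fetchone()
    connection.execute(
        "INSERT OR IGNORE INTO record_item (record_id, item_id) VALUES (?, ?)",
        (record_id, item_id))


def list_for(connection, record_id):
    """Titles currently shown in a record's list."""
    rows = connection.execute(
        "SELECT item.title FROM record_item JOIN item USING (item_id)"
        " WHERE record_item.record_id = ? ORDER BY item.title", (record_id,))
    return [row[0] for row in rows]


def records_with(connection, title):
    """Indexed lookup of every record whose list contains the given item."""
    rows = connection.execute(
        "SELECT record_item.record_id FROM record_item JOIN item USING (item_id)"
        " WHERE item.title = ? ORDER BY record_item.record_id", (title,))
    return [row[0] for row in rows]


def rename_item(connection, old_title, new_title):
    """Rename once; every list referencing the item's ID shows the new title."""
    connection.execute(
        "UPDATE item SET title = ? WHERE title = ?", (new_title, old_title))


add_to_list(conn, 1, "Fish")
add_to_list(conn, 1, "Living fossils")
add_to_list(conn, 2, "Fish")
rename_item(conn, "Fish", "Fishes")
print(list_for(conn, 1), records_with(conn, "Fishes"))
```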

-- Tim Starling

wikipedia-l@lists.wikimedia.org


participants (8)

Angela_
Anthony DiPierro
Daniel Mayer
James R. Johnson
Magnus Manske
Pierre Abbat
Stan Shebs
Tim Starling