Wikidata tables

Magnus Manske

18 Apr 2013

7:52 a.m.

Just wondering what the status of exposing all wikidata tables on the toolserver is. Currently, there are a few wb_* tables with item labels, descriptions, aliases, and language links. But the tables (whatever they are called) containing item-to-item connections appear to be missing. Maybe because they were added later?

Magnus

Lydia Pintscher

18 Apr

9:21 a.m.

On Thu, Apr 18, 2013 at 9:52 AM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

...

Magnus Manske

10:44 a.m.

Huh. That could be ... problematic in the future.

Thanks,
Magnus

On Thu, Apr 18, 2013 at 10:21 AM, Lydia Pintscher <lydia.pintscher(a)wikimedia.de> wrote:

...

On Thu, Apr 18, 2013 at 9:52 AM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

As far as I know they're only saved in JSON where usually the article text is stored and not in separate tables.

Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Community Communications for Wikidata

Wikimedia Deutschland e.V.
Obentrautstr. 72
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 Nz. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.

_______________________________________________
Toolserver-l mailing list (Toolserver-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette

Patricia Pintilie

12:49 p.m.

Imagine, Magnus: all the files are being looked over to determine human, machine, and unused accounts. Each needs to be looked over; then proper deletion will clean things out, making more room. Problematic, yes, but it will help if this is taken care of the right way now.

On Apr 18, 2013 5:45 AM, "Magnus Manske" <magnusmanske(a)googlemail.com> wrote:

...

Byrial Jensen

2:57 p.m.

On 18-04-2013 11:21, Lydia Pintscher wrote:

...

On Thu, Apr 18, 2013 at 9:52 AM, Magnus Manske <magnusmanske(a)googlemail.com> wrote:

As far as I know they're only saved in JSON where usually the article text is stored and not in separate tables.

You can see in the pagelinks table which properties and which items an item is connected to by statements, but not how the properties and items are paired together.
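To make the limitation above concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the replicated `pagelinks` table. The column names (`pl_from`, `pl_namespace`, `pl_title`) follow the MediaWiki schema of the time; the namespace number 120 for properties, the page id 1001, and the sample rows are assumptions made purely for illustration.

```python
import sqlite3

# Toy stand-in for the replicated pagelinks table (sample rows are made up).
ITEM_NS = 0        # items live in the main namespace
PROPERTY_NS = 120  # assumed namespace number for properties

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT)"
)
# Hypothetical item page 1001 with two statements: P1 -> Q2 and P3 -> Q4.
conn.executemany(
    "INSERT INTO pagelinks VALUES (?, ?, ?)",
    [
        (1001, PROPERTY_NS, "P1"),
        (1001, PROPERTY_NS, "P3"),
        (1001, ITEM_NS, "Q2"),
        (1001, ITEM_NS, "Q4"),
    ],
)


def linked_titles(page_id, namespace):
    """Return the sorted titles that page_id links to in one namespace."""
    rows = conn.execute(
        "SELECT pl_title FROM pagelinks"
        " WHERE pl_from = ? AND pl_namespace = ? ORDER BY pl_title",
        (page_id, namespace),
    )
    return [title for (title,) in rows]


properties = linked_titles(1001, PROPERTY_NS)
items = linked_titles(1001, ITEM_NS)
print(properties, items)
```

Both lists come back, but nothing in the table says whether P1 belongs with Q2 or with Q4; in this toy data that pairing exists only in the comment, just as on the real system it exists only in the JSON.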

Daniel Schwen

4:21 p.m.

I can (barely) understand why wikitext is not available on the toolserver, but the JSON data you are talking about does not seem copyrightable and much lower in volume. The usefulness of the wikidata mirror on the toolserver is rather low without the actual wikiDATA.

Daniel

On Thu, Apr 18, 2013 at 8:57 AM, Byrial Jensen <byrial(a)vip.cybercity.dk> wrote:

...

DaB.

8:05 p.m.

Hello, At Thursday 18 April 2013 22:00:54 DaB. wrote:

...

but the JSON data you are talking about does not seem copyrightable and much lower in volume.

if this JSON data is stored where the normal wiki-text is, it is impossible for us to replicate it: because we have no access to these WMF servers, there would be no way to separate Wikidata from the rest, and/or we would not have enough disk space.

Sincerely,
DaB.

--
Userpage: [[:w:de:User:DaB.]]
PGP: 0x2d3ee2d42b255885

Daniel Schwen

10:43 p.m.

...

if this JSON data is stored where the normal wiki-text is, it is impossible

To my understanding it is.

...

for us to replicate it: Because we have no access to these wmf-servers, there

IMO that was a questionable design decision. JSON plaintext storage in SQL is shoehorning a do-it-yourself object store onto a classical RDBMS. Postgres at least has hstore. This may even be a genuine use case for one of those hipster databases (NoSQL stores like MongoDB etc.). But who knows what points were taken into consideration when making this decision.

...

would be no way to separate Wikidata from the rest

I don't understand why separating plaintext storage between different projects would be an issue. Is it all lumped into one storage "namespace"? I'm sure nobody at Wikimedia would be the least bit motivated to make this data available to the toolserver, but maybe it will be usable in labs. Otherwise it would be quite a waste of a great opportunity.

...

and/or we have not enough disc-space.

If you can separate it out, I seriously doubt that wikidata would require storage anywhere near as large as the other wikimedia projects (at least in the mid-term).

DaB.

11:19 p.m.

Hello, At Friday 19 April 2013 01:03:25 DaB. wrote:

...

would be no way to separate Wikidata from the rest

as you may know, there is a rev_text_id field in the revision table. This field points to the text table, where the actual text is, or rather should be: the WMF doesn't store the text there, but only a pointer ("DB://cluster25/11458305" for example). If you query different wikis you will see that most of them point to the same cluster or to one with a nearby number. That tells me (and I was also told so before) that all text of all WMF projects is stored together. The task would now be to separate Wikidata from the rest, but the storage area has no record of which wiki a text comes from, which makes the separation very hard. And there is another problem: deleted texts are also in this area, so even more filtering would be needed.

I very much doubt that this situation will change at the TS, and I also doubt that it will be different for WikiLabs. So I guess your best bet here is the API.

Sincerely,
DaB.

--
Userpage: [[:w:de:User:DaB.]]
PGP: 0x2d3ee2d42b255885
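As a small illustration of the pointer format mentioned in this message, the following sketch splits such an address into a cluster name and a numeric id. The function name `parse_text_pointer` is hypothetical, and the layout is inferred solely from the single example quoted above, so real addresses may well take other shapes.

```python
def parse_text_pointer(pointer):
    """Split a pointer like 'DB://cluster25/11458305' into (cluster, id)."""
    prefix = "DB://"
    if not pointer.startswith(prefix):
        raise ValueError("not a DB:// pointer: %r" % pointer)
    parts = pointer[len(prefix):].split("/")
    # Assumed layout: exactly one cluster name followed by one numeric id.
    if len(parts) != 2 or not parts[0] or not parts[1].isdigit():
        raise ValueError("unexpected pointer layout: %r" % pointer)
    return parts[0], int(parts[1])


cluster, blob_id = parse_text_pointer("DB://cluster25/11458305")
print(cluster, blob_id)
```

The parsed pointer names a cluster and a row, but carries no hint of which wiki the text belongs to, which is the separation problem described in the message.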

Platonides

19 Apr

8:29 a.m.

On 19/04/13 01:19, DaB. wrote:

...

Magnus Manske

8:37 a.m.

FWIW, I have implemented a query-able stand-alone web server that keeps all of the wikidata property-item-links in memory. This uses the wikidata dumps which appear to be rather frequent. I'll try to deploy a test version on wikilabs (once I figure out how all that works); it seems to be more favourable to such services than the toolserver.

On Fri, Apr 19, 2013 at 9:29 AM, Platonides <platonides(a)gmail.com> wrote:

...

I think the only hope would be if wikidata were stored under its own cluster (for easier differentiation) and at least one server of that group (the master?) held only that data (so toolserver could get its binlogs).
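One way to picture a service that keeps property-item links in memory, as mentioned at the start of this message, is a pair of dictionaries keyed by claim and by item. This is only a sketch of the general idea and not the implementation referred to in the thread; the class name `LinkIndex`, its methods, and the sample ids are all hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """In-memory index of (item, property, target) links."""

    def __init__(self):
        # (property, target) -> set of items that carry that statement
        self._by_claim = defaultdict(set)
        # item -> set of (property, target) pairs
        self._by_item = defaultdict(set)

    def add(self, item, prop, target):
        self._by_claim[(prop, target)].add(item)
        self._by_item[item].add((prop, target))

    def items_with(self, prop, target):
        """Items that link to target via prop, in ascending order."""
        # .get avoids creating empty entries for unknown keys.
        return sorted(self._by_claim.get((prop, target), ()))

    def claims_of(self, item):
        """All (property, target) pairs of an item, in ascending order."""
        return sorted(self._by_item.get(item, ()))


index = LinkIndex()
# Sample triples as plain integers (made-up ids, not real Wikidata data).
for triple in [(10, 1, 2), (11, 1, 2), (10, 3, 4)]:
    index.add(*triple)

print(index.items_with(1, 2))
print(index.claims_of(10))
```

Unlike the `pagelinks` rows, each entry here keeps the property and its target together, at the cost of holding every link in RAM.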

Kolossos

9:07 a.m.

Sounds fine, but will it be possible to join the data with data from other tables and other projects? These joins are the basis for a lot of tools on the toolserver, and I'm not sure how well joins at the application level will work.

BTW: With the Templatetiger project we already handle tons of information from infoboxes in MySQL on the Toolserver; all this data is highly redundant because we support a lot of languages. So there should be enough space on the toolserver in the mid-term. But I also think that labs is the right place to start such a project.

Greetings
Tim

On 19.04.2013 10:37, Magnus Manske wrote:

...
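An application-level join of the kind questioned in this message could look roughly like the sketch below: item ids arrive from an external service, and the matching rows are then fetched from a local table. The table `article`, its columns, and the ids are invented for the example, and SQLite merely stands in for the MySQL databases on the Toolserver.

```python
import sqlite3


def join_with_local(conn, item_ids):
    """Fetch local rows whose item_id is in item_ids (application-level join)."""
    if not item_ids:
        return []
    # One placeholder per id keeps the values out of the SQL string itself.
    placeholders = ",".join("?" for _ in item_ids)
    query = (
        "SELECT item_id, title FROM article"
        " WHERE item_id IN (%s) ORDER BY item_id" % placeholders
    )
    return conn.execute(query, list(item_ids)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (item_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO article VALUES (?, ?)",
    [(10, "Alpha"), (11, "Beta"), (12, "Gamma")],
)

# Pretend these ids were returned by an external link service.
ids_from_service = [10, 12]
rows = join_with_local(conn, ids_from_service)
print(rows)
```

The whole id list has to pass through the application and back into the query, which is workable for small result sets but becomes expensive for large ones compared with a join done inside the database.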

Daniel Schwen

22 Apr

4:57 p.m.

...

FWIW, I have implemented a query-able stand-alone web server that keeps all of the wikidata property-item-links in memory. This uses the wikidata dumps

That does not sound too terribly scalable. I did the same thing (custom webserver, data kept in memory) for the map labels of my WikiMiniAtlas, and had to abandon this in order to be able to support more languages. And my data is only a subset of what I expect to show up in WikiData. On top of that not being able to join data from there with other DBs is a serious deficiency. Same goes for the suggestion to "just use the API".

Daniel

Patricia Pintilie

19 Apr

9:14 a.m.

OK, so how about we recognize what the overall goal is first, then establish the point that it's trying to convey. Only then can we meet in the middle and set a plan in motion. I can only assist when a plan of action is clear; without a definite plan I'm lost on where to begin. It seems as if I'm doing my natural instinct research, then I get mail from the people I'm reading about... Very interesting this is, because I'm left to think my mind is linked to the problems at hand. TS is my old signature, my server os will pull up my IP searches. Which leads me to believe this is why I am always being brought up in the middle of these outstanding conversations you guys are having, lol. Please send detailed instructions as to how I can help; there should be a file known as Mila.eu also known as ro.eula. Find it and run whatever it has, thanks.

- patiently awaiting your response.
- MilaStarX-TS

On Apr 19, 2013 3:29 AM, "Platonides" <platonides(a)gmail.com> wrote:

...

Patricia Pintilie

9:41 a.m.

Waterfall the Anamorphic Development. On Apr 19, 2013 4:14 AM, "Patricia Pintilie" <pintilieempire(a)gmail.com> wrote:

...

Platonides

9:42 a.m.

Sorry, but your mail doesn't make much sense.

On 19/04/13 11:14, Patricia Pintilie wrote:

...

TS was being used in this thread as an abbreviation of ToolServer. This mailing list is about the Wikimedia Toolserver, so no wonder that it's mentioned a lot here. :)

I don't know if you have a toolserver account, or even if you're a wikimedian. I don't know what you refer to with “my server os will pull up my IP searches” nor where that “file known as Mila.eu also known as ro.eula” is supposed to be.

If you were asking for someone to make a query for you on the toolserver, try including the request in the email, or at least a link to what you want.

Best regards

Kolossos

8:53 a.m.

Hello,

PostgreSQL is now starting to support JSON as well, so we should try to find a way to make Wikidata available to us, and I would prefer to keep using SQL. One way could be minutely diff files; that's the approach OpenStreetMap uses. Alternatively, we could use the API for each updated article. Any central service is better than leaving everyone to fight the problem alone.

I think the support for hierarchical information was the key reason to use JSON instead of a key-value store, a point that I can understand.

Greetings
Tim

On 19.04.2013 00:43, Daniel Schwen wrote:

...
