Just wondering what the status of exposing all wikidata tables on the toolserver is.
Currently, there are a few wb_* tables with item labels, descriptions, aliases, and language links.
But the tables (whatever they are called) containing item-to-item connections appear to be missing. Maybe because they were added later?
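(For reference, here is roughly what already works against the replicated tables. This is only a sketch: the replica host name and the wb_terms columns term_entity_id, term_entity_type, term_type, term_language and term_text are my assumptions about the current setup.)

# Sketch: read the labels of one item from the replicated wb_terms table.
# Host name, database name and column names are assumptions, not verified.
import os
import pymysql

conn = pymysql.connect(
    host="wikidatawiki-p.db.toolserver.org",   # assumed replica host
    db="wikidatawiki_p",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)
with conn.cursor() as cur:
    cur.execute(
        """SELECT term_language, term_text
             FROM wb_terms
            WHERE term_entity_id = %s
              AND term_entity_type = 'item'
              AND term_type = 'label'""",
        (64,),  # numeric part of Q64 (Berlin)
    )
    for language, text in cur.fetchall():
        print(language, text)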
Magnus
On Thu, Apr 18, 2013 at 9:52 AM, Magnus Manske magnusmanske@googlemail.com wrote:
Just wondering what the status of exposing all wikidata tables on the toolserver is.
Currently, there are a few wb_* tables with item labels, descriptions, aliases, and language links.
But the tables (whatever they are called) containing item-to-item connections appear to be missing. Maybe because they were added later?
As far as I know, they're only stored as JSON where the article text usually goes, not in separate tables.
Cheers Lydia
-- Lydia Pintscher - http://about.me/lydia.pintscher Community Communications for Wikidata
Wikimedia Deutschland e.V. Obentrautstr. 72 10963 Berlin www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 Nz. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
Huh. That could be ... problematic in the future.
Thanks, Magnus
On Thu, Apr 18, 2013 at 10:21 AM, Lydia Pintscher < lydia.pintscher@wikimedia.de> wrote:
As far as I know, they're only stored as JSON where the article text usually goes, not in separate tables.
Imagine, Magnus: all the files are being looked over to determine human, machine, and unused accounts. Each needs to be looked over, and then proper deletion will clean things out, making more room. Problematic, yes, but it will help if this is taken care of the right way now.
On Apr 18, 2013 5:45 AM, "Magnus Manske" magnusmanske@googlemail.com wrote:
Huh. That could be ... problematic in the future.
Thanks, Magnus
On 18-04-2013 11:21, Lydia Pintscher wrote:
As far as I know, they're only stored as JSON where the article text usually goes, not in separate tables.
You can see in the pagelinks table which properties and which items an item is connected to by statements, but not how the properties and items are paired together.
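To illustrate the limitation, a quick sketch (the namespace numbers, 0 for items and 120 for properties, and the replica host are assumptions on my part): you get two unpaired sets back, and nothing tells you which property goes with which target item.

# Sketch: list the item and property pages that one item links to via
# pagelinks. The result is two unpaired sets - the pairing is lost.
import os
import pymysql

conn = pymysql.connect(
    host="wikidatawiki-p.db.toolserver.org",   # assumed replica host
    db="wikidatawiki_p",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)
with conn.cursor() as cur:
    cur.execute(
        """SELECT pl_namespace, pl_title
             FROM pagelinks
             JOIN page ON page_id = pl_from
            WHERE page_namespace = 0 AND page_title = %s""",
        ("Q64",),
    )
    rows = cur.fetchall()

items = sorted(t for ns, t in rows if ns == 0)         # target items
properties = sorted(t for ns, t in rows if ns == 120)  # properties used (assumed namespace)
print("items:", items)
print("properties:", properties)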
I can (barely) understand why wikitext is not available on the toolserver, but the JSON data you are talking about does not seem copyrightable, and it is much lower in volume. The usefulness of the Wikidata mirror on the toolserver is rather low without the actual wikiDATA.
Daniel
Hello,
At Thursday 18 April 2013 22:00:54 Daniel Schwen wrote:
but the JSON data you are talking about does not seem copyrightable and much lower in volume.
if this JSON data is stored where the normal wiki-text is, it is impossible for us to replicate it: because we have no access to those WMF servers, there would be no way to separate Wikidata from the rest, and/or we do not have enough disk space.
Sincerely, DaB.
if this JSON data is stored where the normal wiki-text is, it is impossible
To my understanding it is.
for us to replicate it: Because we have no access to these wmf-servers, there
IMO that was a questionable design decision. Storing JSON as plain text in SQL is shoehorning a do-it-yourself object store onto a classical RDBMS. Postgres at least has hstore. This may even be a genuine use case for one of those hipster databases (NoSQL, MongoDB, etc.). But who knows what points were taken into consideration when making this decision. (A rough sketch of what this means in practice is at the end of this mail.)
would be no way to separate Wikidata from the rest
I don't understand why separating plaintext storage between different projects would be an issue. Is it all lumped into one storage "namespace"? I'm sure nobody at Wikimedia would be the least bit motivated to make this data available to the toolserver, but maybe it will be usable in labs. Otherwise it would be quite a waste of a great opportunity.
and/or we have not enough disc-space.
If you can separate it out, I seriously doubt that Wikidata would require storage anywhere near as large in magnitude as the other Wikimedia projects (at least in the mid-term).
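To make the "do-it-yourself object store" point concrete: every question about statements has to be answered client-side. A rough sketch, where entity_blob(entity_id, json_text) is a made-up stand-in for wherever the JSON actually lives:

# Sketch of the DIY object store problem: the database only sees an opaque
# JSON string, so any filtering on statements has to happen in the client.
# entity_blob(entity_id, json_text) is a hypothetical stand-in table.
import json
import os
import pymysql

conn = pymysql.connect(
    host="wikidatawiki-p.db.toolserver.org",   # assumed host
    db="wikidatawiki_p",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)
wanted = []
with conn.cursor() as cur:
    cur.execute("SELECT entity_id, json_text FROM entity_blob")  # hypothetical table
    for entity_id, json_text in cur:
        entity = json.loads(json_text)
        # e.g. keep entities that carry at least one P31 ("instance of") claim
        if "P31" in entity.get("claims", {}):
            wanted.append(entity_id)
print(len(wanted), "entities with a P31 claim")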
Hello,
At Friday 19 April 2013 01:03:25 Daniel Schwen wrote:
would be no way to separate Wikidata from the rest
I don't understand why separating plaintext storage between different projects would be an issue. Is it all lumped into one storage "namespace"? I'm sure nobody at Wikimedia would be the least bit motivated to make this data available to the toolserver, but maybe it will be usable in labs. Otherwise it would be quite a waste of a great opportunity.
as you may know, there is a rev_text_id field in the revision table. This field points to the text table, where the actual text is – or should be. The WMF doesn't store the text there, but only a pointer ("DB://cluster25/11458305" for example). If you query different wikis you will see that most of them point to the same cluster, or to one with a nearby number. That tells me (and I was also told so before) that the text of all WMF projects is stored together. The task would now be to separate Wikidata from the rest – but the storage area has no clue which wiki a given text belongs to, which makes the separating very hard. And there is another problem: deleted texts are also in this area, so even more filtering would be needed. I very much doubt that this situation will change at the TS, and I also doubt that it will be different for WikiLabs. So I guess your best bet is the API here.
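Pulling the statements over the API is at least straightforward. A minimal sketch using action=wbgetentities; the claims layout in the loop is my reading of the current JSON output and may change:

# Sketch: fetch one entity over the API and list its item-valued statements.
# The exact claims layout is assumed from the current wbgetentities output.
import json
import urllib.request

url = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetentities&ids=Q64&format=json")
with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode("utf-8"))

entity = data["entities"]["Q64"]
for prop, statements in entity.get("claims", {}).items():
    for statement in statements:
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and value.get("entity-type") == "item":
            print(prop, "-> Q%d" % value["numeric-id"])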
Sincerely, DaB.
On 19/04/13 01:19, DaB. wrote:
The task would now be to separate Wikidata from the rest – but the storage area has no clue which wiki a given text belongs to, which makes the separating very hard.
I think the only hope would be if Wikidata were stored under its own cluster (for easier differentiation) and at least one server of that group (the master?) only had that (so the toolserver could get its binlogs).
FWIW, I have implemented a queryable stand-alone web server that keeps all of the Wikidata property-item links in memory. It uses the Wikidata dumps, which appear to be rather frequent. I'll try to deploy a test version on wikilabs (once I figure out how all that works); it seems to be more favourable to such services than the toolserver.
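In essence it boils down to something like the following sketch (not the actual code; load_entities() stands in for however the entity JSON gets pulled out of the dump, and the claims layout is assumed from the API output):

# Sketch of an in-memory item -> (property, target item) index built from
# entity JSON. load_entities() is a placeholder for the dump reader; the
# claims layout is assumed from the current wbgetentities output.
import json
from collections import defaultdict

def load_entities(path="entities.ndjson"):
    """Placeholder: one entity JSON object per line, however it was extracted."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def item_links(entity):
    """Yield (property_id, target_numeric_id) for item-valued statements."""
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and value.get("entity-type") == "item":
                yield prop, value["numeric-id"]

links = defaultdict(list)                  # "Q64" -> [("P31", 515), ...]
for entity in load_entities():
    for prop, target in item_links(entity):
        links[entity["id"]].append((prop, target))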
On Fri, Apr 19, 2013 at 9:29 AM, Platonides platonides@gmail.com wrote:
I think the only hope would be if Wikidata were stored under its own cluster (for easier differentiation) and at least one server of that group (the master?) only had that (so the toolserver could get its binlogs).
Sounds fine, but will it be possible to join the data with data from other tables and other projects? These joins are the basis for a lot of tools on the toolserver, and I'm not sure how well application-level joins will work.
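What an application-level join amounts to, roughly, is this kind of glue code; whether it is fast enough is exactly the open question. A sketch, with the replica host name and the wbgetentities output layout as assumptions:

# Sketch of an application-level "join": sitelinks come from the Wikidata
# API, page lengths from the dewiki replica, and Python glues the two
# together. Host name and JSON layout are assumptions.
import json
import os
import urllib.request
import pymysql

ids = ["Q64", "Q1055", "Q1726"]            # Berlin, Hamburg, Munich
url = ("https://www.wikidata.org/w/api.php?action=wbgetentities"
       "&props=sitelinks&format=json&ids=" + "|".join(ids))
with urllib.request.urlopen(url) as resp:
    entities = json.loads(resp.read().decode("utf-8"))["entities"]

# map dewiki page title -> item id
titles = {e["sitelinks"]["dewiki"]["title"].replace(" ", "_"): qid
          for qid, e in entities.items() if "dewiki" in e.get("sitelinks", {})}

conn = pymysql.connect(
    host="dewiki-p.db.toolserver.org",      # assumed replica host
    db="dewiki_p",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)
placeholders = ",".join(["%s"] * len(titles))
with conn.cursor() as cur:
    cur.execute(
        "SELECT page_title, page_len FROM page"
        " WHERE page_namespace = 0 AND page_title IN (%s)" % placeholders,
        list(titles),
    )
    for title, length in cur.fetchall():
        if isinstance(title, bytes):
            title = title.decode("utf-8")
        print(titles.get(title), title, length)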
BTW: with the Templatetiger project we already handle tons of infobox information in MySQL on the Toolserver; all this data is highly redundant because we support a lot of languages. So there should be enough space on the toolserver in the mid-term. But I also think that Labs is the right place to start such a project.
Greetings Tim
On 19.04.2013 10:37, Magnus Manske wrote:
FWIW, I have implemented a queryable stand-alone web server that keeps all of the Wikidata property-item links in memory. It uses the Wikidata dumps, which appear to be rather frequent. I'll try to deploy a test version on wikilabs (once I figure out how all that works); it seems to be more favourable to such services than the toolserver.
FWIW, I have implemented a queryable stand-alone web server that keeps all of the Wikidata property-item links in memory. It uses the Wikidata dumps
That does not sound too terribly scalable. I did the same thing (custom web server, data kept in memory) for the map labels of my WikiMiniAtlas, and had to abandon it in order to be able to support more languages. And my data is only a subset of what I expect to show up in Wikidata. On top of that, not being able to join data from there with other DBs is a serious deficiency. The same goes for the suggestion to "just use the API".
Daniel
OK, so how about we recognize what the overall goal is first. Then establish the point that it's trying to convey. Only then can we meet in the middle and set a plan in motion. I can only assist when a plan of action is clear, with a definite plan; without it I'm lost on where to begin. It seems as if I'm doing my natural instinct research, then I get mail from the people I'm reading about... Very interesting, this, because I'm left to think my mind is linked to the problems at hand. TS is my old signature; my server OS will pull up my IP searches. Which leads me to believe this is why I am always being brought up in the middle of these outstanding conversations you guys are having, lol. Please send detailed instructions as to how I can help; there should be a file known as Mila.eu, also known as ro.eula. Find it and run whatever it has, thanks. - Patiently awaiting your response. -MilaStarX-TS
On Apr 19, 2013 3:29 AM, "Platonides" platonides@gmail.com wrote:
I think the only hope would be if Wikidata were stored under its own cluster (for easier differentiation) and at least one server of that group (the master?) only had that (so the toolserver could get its binlogs).
Waterfall the Anamorphic Development.
Sorry, but your mail doesn't make much sense.
On 19/04/13 11:14, Patricia Pintilie wrote:
TS is my old signature; my server OS will pull up my IP searches. [...] There should be a file known as Mila.eu, also known as ro.eula. Find it and run whatever it has, thanks. -MilaStarX-TS
TS was being used in this thread as an abbreviation of Toolserver. This mailing list is about the Wikimedia Toolserver, so it's no wonder that it's mentioned a lot here. :) I don't know if you have a toolserver account, or even whether you're a Wikimedian. I don't know what you refer to with "my server OS will pull up my IP searches", nor where that "file known as Mila.eu also known as ro.eula" is supposed to be. If you were asking for someone to run a query for you on the toolserver, try including the request in the email, or at least a link to what you want.
Best regards
Hello,
PostgreSQL is now starting to support JSON as well, so we should try to find a way to make Wikidata available to us, and I would still prefer to use SQL.
One way could be minutely diff files; that's the way OpenStreetMap does it. Alternatively, we could use the API for each updated article.
Any central service is better than letting everyone fight the problem alone.
I think support for hierarchical information was the key reason to use JSON instead of a key-value store, a point that I can understand.
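To make the PostgreSQL option concrete: once the JSON sits in a column with a real JSON type, the filtering can move back into the database. A sketch with psycopg2, assuming a hypothetical entities(entity_id, data) table with a JSON column and a PostgreSQL version that already has the -> operator:

# Sketch: with a native JSON column the database can filter on statements
# itself. entities(entity_id, data) is a hypothetical table; the -> operator
# needs a sufficiently new PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=wikidata_test")   # assumed local test database
cur = conn.cursor()
cur.execute(
    """SELECT entity_id
         FROM entities
        WHERE data -> 'claims' -> %s IS NOT NULL""",
    ("P31",),
)
print([row[0] for row in cur.fetchall()])
cur.close()
conn.close()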
Greetings Tim
On 19.04.2013 00:43, Daniel Schwen wrote:
IMO that was a questionable design decision. Storing JSON as plain text in SQL is shoehorning a do-it-yourself object store onto a classical RDBMS. Postgres at least has hstore. This may even be a genuine use case for one of those hipster databases (NoSQL, MongoDB, etc.). But who knows what points were taken into consideration when making this decision.