Hi Magnus, hi Daniel,
I don't think file size should be our primary concern here. What seems big today will be negligible in a few years, and having all the data in one place is simply easier to work with. I am happy to wait another 30 minutes for a download if it saves me from implementing yet another Web service connector in my own code. Compute time is cheap, disk space is cheap, human labour is expensive.
Maybe the whole size discussion is a bit of a red herring here anyway. If we are worried about file size, there might be better ways of reducing it. We could split the contents into several smaller dump files, and not just for descriptions. We already do this when creating RDF dumps, and it would be easy to do the same for JSON; we could set this up immediately if someone needs it (just let me know). However, if we want to provide smaller files, a more effective method would be to split by language rather than by term type: all labels in all languages would still be much bigger than labels+descriptions+aliases in English only, and many applications will not need labels in 300 languages.
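To illustrate the per-language idea, here is a minimal sketch of such a filter, assuming the current JSON dump layout (one entity per line inside one big JSON array). The file names are invented for the example, and a real splitter would of course live in the dump scripts rather than run as post-processing; the output here is simply one entity per line:

    import gzip
    import json

    # File names are invented for this example; the real dumps live under
    # https://dumps.wikimedia.org/wikidatawiki/entities/
    SOURCE = "wikidata-all.json.gz"
    TARGET = "wikidata-terms-en.json.gz"
    KEEP = {"en"}

    def keep_languages(term_map, languages):
        # Keep only the entries whose language code we want.
        return {code: v for code, v in term_map.items() if code in languages}

    with gzip.open(SOURCE, "rt", encoding="utf-8") as src, \
         gzip.open(TARGET, "wt", encoding="utf-8") as dst:
        for line in src:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # the dump wraps one entity per line in a JSON array
            entity = json.loads(line)
            for field in ("labels", "descriptions", "aliases"):
                if field in entity:
                    entity[field] = keep_languages(entity[field], KEEP)
            dst.write(json.dumps(entity, ensure_ascii=False) + "\n")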
Anyway, as I said, I do not mind whether the auto-descriptions are stored like normal descriptions or whether they are added to the dump files "last minute" when generating them. I just need the descriptions in the dumps.
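If the "last minute" route were taken, the merging step could be quite small. A sketch only, where auto_descriptions is a hypothetical lookup from entity id to generated texts, not any existing store:

    def merge_auto_descriptions(entity, auto_descriptions):
        # auto_descriptions is a hypothetical lookup:
        # entity id -> {language code: generated description text}.
        generated = auto_descriptions.get(entity["id"], {})
        descriptions = entity.setdefault("descriptions", {})
        for lang, text in generated.items():
            # Never overwrite a description a human has written.
            descriptions.setdefault(lang, {"language": lang, "value": text})
        return entity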
Cheers,
Markus
On 09.02.2015 12:28, Daniel Kinzler wrote:
> On 09.02.2015 at 12:25, Magnus Manske wrote:
>> But wouldn't it be better to keep the dump as it is, for those who don't
>> want triple the size (just inventing a number here), and to have one
>> separate, or even per-language, dump with just the automated descriptions,
>> for those who want that?
> Possibly. Depends on how much more data this would actually be. Which also
> depends on whether we would omit descriptions in languages that can easily
> be covered by language fallback (e.g. no separate descriptions in de-ch
> and de-at).
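For what it's worth, the fallback Daniel mentions amounts to walking a chain of language codes until a description is found, so a dump could omit entries the chain already covers. A sketch, with invented chains rather than MediaWiki's actual fallback configuration:

    # Invented chains for illustration; not MediaWiki's actual
    # fallback configuration.
    FALLBACK_CHAINS = {
        "de-ch": ["de-ch", "de", "en"],
        "de-at": ["de-at", "de", "en"],
    }

    def resolve_description(descriptions, language):
        # Walk the fallback chain and return the first description found.
        for code in FALLBACK_CHAINS.get(language, [language, "en"]):
            if code in descriptions:
                return descriptions[code]["value"]
        return None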
--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/