With respect to the RDF export, I'd advocate for: 1) an RDF format with one fact per line; 2) the use of a mature, proven RDF generation framework.
Optimizing too early based on a limited and/or biased view of the potential use cases may not be a good idea in the long run. I'd rather keep it simple and standard at the data publishing level, and let consumers access the data easily and optimize processing to their needs.
Also, I should not have to run a preprocessing step for filtering out the pieces of data that do not follow the standard...
Note that I also understand the need for a format that groups all the facts about a subject into one record and serializes them one record per line. It sometimes makes life easier for bulk processing of large datasets. But that's a different discussion.
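To make the two layouts concrete, here is a minimal sketch of how a consumer could turn a one-fact-per-line file into one record per subject. It assumes a simplified N-Triples-like input (whitespace-separated subject and predicate, a trailing " ." on each line) and does not handle the full N-Triples grammar. The example.org IRIs and the function names are made up for illustration, and everything is grouped in memory, which would not work as-is for a multi-gigabyte dump.

```python
import json
from collections import defaultdict


def parse_ntriple(line):
    """Split one simplified N-Triples line into (subject, predicate, object).

    Returns None for blank lines and comments. This is NOT a full parser.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Subject and predicate contain no spaces; the object (a literal) may.
    subject, predicate, rest = line.split(None, 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        # Drop the statement terminator and the space before it.
        obj = obj[:-1].rstrip()
    return subject, predicate, obj


def group_by_subject(lines):
    """Collect all facts about each subject: {subject: {predicate: [objects]}}."""
    records = defaultdict(lambda: defaultdict(list))
    for line in lines:
        triple = parse_ntriple(line)
        if triple is None:
            continue
        subject, predicate, obj = triple
        records[subject][predicate].append(obj)
    return records


def to_json_lines(records):
    """Serialize each subject's facts as one JSON record per line."""
    return [
        json.dumps({"subject": subject, "facts": facts}, sort_keys=True)
        for subject, facts in records.items()
    ]


# Hypothetical sample data, one fact per line.
SAMPLE = [
    '<http://example.org/Q1> <http://example.org/label> "Universe"@en .',
    '<http://example.org/Q1> <http://example.org/instanceOf> <http://example.org/Q2> .',
    '<http://example.org/Q2> <http://example.org/label> "Thing"@en .',
]

if __name__ == "__main__":
    for record in to_json_lines(group_by_subject(SAMPLE)):
        print(record)
```

The point is only that the grouped form can be derived from the standard line-based form by the consumer, so the published dump itself can stay simple and standard.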
-- Nicolas Torzec.
On 8/12/13 1:49 AM, "Markus Krötzsch" markus@semantic-mediawiki.org wrote:
On 11/08/13 22:29, Tom Morris wrote:
On Sat, Aug 10, 2013 at 2:30 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Anyway, if you restrict yourself to tools that are installed by default on your system, then it will be difficult to do many interesting things with a 4.5G RDF file ;-) Seriously, the RDF dump is really meant specifically for tools that take RDF inputs. It is not very straightforward to encode all of Wikidata in triples, and it leads to some inconvenient constructions (especially a lot of reification). If you don't actually want to use an RDF tool and you are just interested in the data, then there would be easier ways of getting it.
A single fact per line seems like a pretty convenient format to me. What format do you recommend that's easier to process?
I'd suggest some custom format that at least keeps single data values in one line. For example, in RDF, you have to do two joins to find all items that have a property with a date in the year 2010. Even with a line-by-line format, you will not be able to grep this. So I think a less normalised representation would be nicer for direct text-based processing. For text-based processing, I would probably prefer a format where one statement is encoded on one line. But it really depends on what you want to do. Maybe you could also remove some data to obtain something that is easier to process.
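The "two joins" point can be illustrated with a small sketch. It assumes a reified layout in which an item points to a statement node, the statement node points to a value node, and the value node carries the time literal. The `ex:` names below are hypothetical and are not the vocabulary of the actual dump; triples are assumed to be already parsed into tuples.

```python
# Hypothetical predicate linking a value node to its time literal.
TIME = "ex:time"

# Hypothetical reified data: item -> statement node -> value node -> literal.
SAMPLE_TRIPLES = [
    ("ex:Q1", "ex:P569s", "ex:Q1S1"),
    ("ex:Q1S1", "ex:P569v", "ex:V1"),
    ("ex:V1", TIME, '"2010-05-01T00:00:00Z"^^xsd:dateTime'),
    ("ex:Q2", "ex:P569s", "ex:Q2S1"),
    ("ex:Q2S1", "ex:P569v", "ex:V2"),
    ("ex:V2", TIME, '"1999-01-01T00:00:00Z"^^xsd:dateTime'),
]


def items_with_date_in_year(triples, year):
    """Return the subjects that reach, in two hops, a time literal in `year`."""
    prefix = '"%d-' % year
    # Selection: value nodes whose time literal starts with the given year.
    value_nodes = {
        s for s, p, o in triples if p == TIME and o.startswith(prefix)
    }
    # Join 1: statement nodes whose object is one of those value nodes.
    statements = {s for s, p, o in triples if o in value_nodes}
    # Join 2: items whose object is one of those statement nodes.
    return {s for s, p, o in triples if o in statements}


if __name__ == "__main__":
    print(items_with_date_in_year(SAMPLE_TRIPLES, 2010))
```

A grep for "2010" on the line-based file would only find the value-node line; the item identifier sits on a different line, two hops away, which is why a less normalised format is easier for plain text tools.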
Markus
Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l