Hi. Currently, the dump service offers two different dumps for Wikidata:
* XML: http://dumps.wikimedia.org/wikidatawiki/latest/
* JSON: http://dumps.wikimedia.org/wikidatawiki/entities/
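As a minimal sketch, fetching the latest of each could look like this (the exact filenames are assumptions based on the directory listings above):

    # Sketch: download the latest of each dump.
    # Filenames are assumptions based on the directory listings above.
    import urllib.request

    urls = [
        "http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles.xml.bz2",
        "http://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz",
    ]
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        print("downloading", name)
        urllib.request.urlretrieve(url, name)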
According to http://www.wikidata.org/wiki/Wikidata:Database_download, the JSON dump is listed as the recommended dump format. Also, at the time of writing, the JSON dump has been generated regularly every week, whereas the XML dump has been delayed for 2+ months.
Going forward, will both dumps continue to be supported? Or will the XML dump be phased out and only the JSON dump remain? Or are these plans still to be determined based on upcoming changes to the dumping infrastructure as per https://phabricator.wikimedia.org/T88728?
If the JSON dump is to be the sole data format, is there any way to address the following omissions?
* '''Non-JSON pages not available''': The JSON dump only provides JSON content-type pages in the main and property namespaces. Pages in other namespaces are not available, including the Main Page. For example, here are the counts from the 2015-03-30 dump (a sketch of how such counts can be derived follows after this list):
  id  name          count
----  ------------  -----
   4  Wikidata      10280
   8  MediaWiki      2244
  10  Template       4701
  12  Help            779
  14  Category       3073
 828  Module          175
1198  Translations  83524
* '''Page metadata not available''': For the JSON pages, page_touched and page_id are not available.
* '''Other tables not provided''': Other tables are not provided, notably categorylinks and page_props.
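For reference, here is a rough sketch of how such per-namespace counts can be derived from the pages-articles XML dump; the dump filename and export schema version are assumptions:

    # Sketch: derive per-namespace page counts from the XML dump.
    # Assumes the standard MediaWiki export schema; the schema version
    # in the namespace URI may differ between dumps.
    import bz2
    import xml.etree.ElementTree as ET
    from collections import Counter

    EXPORT_NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    counts = Counter()
    with bz2.open("wikidatawiki-latest-pages-articles.xml.bz2") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == EXPORT_NS + "ns":
                counts[int(elem.text)] += 1
            elif elem.tag == EXPORT_NS + "page":
                elem.clear()  # drop page content to keep memory bounded

    for ns_id in sorted(counts):
        print(ns_id, counts[ns_id])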
Thanks in advance for any information.
The JSON dump is the preferred format if you want to process the entity data. The JSON dumps give you the current entities (items and properties) in the current canonical format, for further processing.
The XML dumps are an "opaque" exchange format for MediaWiki page content. They are designed to allow content from pages on one wiki to be imported into another wiki(*), including old revisions. They can also be used for backups, since they provide a future-proof way to store your wiki's content. But the format of the page content in the XML dumps is not strictly specified: it can be wikitext, or JSON data, or whatever. The JSON you find embedded in the XML dumps of Wikidata may or may not be compatible with the format in the JSON file, and is subject to change without notice. It is not designed for processing by third parties.
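To illustrate, here is a rough sketch of pulling that embedded content out of the XML dump; the filename and export schema version are assumptions, and the structure of what you get in the <text> element is exactly what is not guaranteed:

    # Sketch: the entity JSON in the XML dump is just opaque page text.
    # The embedded format may change without notice.
    import bz2
    import json
    import xml.etree.ElementTree as ET

    EXPORT_NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    with bz2.open("wikidatawiki-latest-pages-articles.xml.bz2") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != EXPORT_NS + "page":
                continue
            title = elem.findtext(EXPORT_NS + "title")
            text = elem.findtext(EXPORT_NS + "revision/" + EXPORT_NS + "text")
            if title and title.startswith("Q") and text:
                entity = json.loads(text)  # may fail whenever the format changes
                print(title, "->", sorted(entity.keys()))
                break
            elem.clear()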
Wikidata XML dumps will continue to be generated for all pages, including history, as is done for all Wikimedia projects. However, this process often breaks due to the large size of these dumps. If you want to process Wikidata items, you should use the JSON dumps.
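For example, a minimal sketch of streaming entities from the JSON dump; the filename is an assumption, and it relies on the dump being one big JSON array with one entity per line:

    # Sketch: stream entities from the JSON dump one at a time.
    # Relies on the one-entity-per-line layout of the entity dumps.
    import gzip
    import json

    with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # array brackets and blank lines
            entity = json.loads(line)
            label = entity.get("labels", {}).get("en", {}).get("value")
            print(entity["id"], label)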
HTH -- daniel
(*) This is usually disabled for Wikibase entities, to avoid ID conflicts.
Thanks for the clarification. So it looks like the JSON dumps were designed to contain only entity data. I guess they were never meant to include other MediaWiki data, such as other namespaces, page table metadata, other tables, etc.
I guess my main question is: are there any plans to phase out the XML dump? I assume not, but given the recent problems with the XML dump process, I just wanted to make sure.
Thanks again for the help.
There are no plans to phase them out from the Wikidata side. But please be aware that, as Daniel said, you can't rely on the entity format in there being stable. I don't know what the WMF's plans for the XML dumps in general are, or if there are any.
Cheers Lydia
There are no plans to phase out the XML dumps. However, the processes and infrastructure involved in making them could use a bit of a boost. I can't say much about that, though: while the JSON dumps are managed by the Wikidata team at WMDE, the XML dumps are managed by the WMF as part of the general Wikimedia platform.
Thanks for the confirmation, Lydia and Daniel. I just wanted to make sure that the Wikidata XML dump wasn't slated for obsolescence. If there are further issues, I'll send them to the WMF dumps team.