Hi! I have been trying to get a list of all .svg files on commons.wikimedia.org.
Of course I could just parse through Commons, but
* if there were any way to provide a dump with the names of the .svg files that actually exist, that would be a tremendous help for me, and
* by my estimate it would reduce the download size, the CPU load and, most importantly, the hard-disk load by about 70% to 80% compared to browsing and parsing through Commons.
(I assume the CPU usage and heat problems caused by the hard-disk load on my notebook would be much worse than the load on Wikipedia's server. I have already lost two hard disks over the years when downloading larger amounts of files in one go.)
Though I have asked at various places, so far I haven't found a good solution. One suggestion was to download commonswiki-20150417-all-titles, which I did.
But this file contains deleted names and renamed names, and some names have "File:" at the start while others have neither "File:" nor a similar indicator. A small sample yielded 5 correct names, around 7 deleted names and around 7 renamed names. One person in particular tried to help me, but even he couldn't solve this. Besides that, I didn't get much feedback.
Is it possible to get such a dump? Or to get another dump that I could use to update and crosscheck the all-titles file?
Greetings D. Hansen
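For readers following along, here is a minimal sketch of how an all-titles listing could be narrowed down to .svg names locally. It assumes the file holds one tab-separated pair of namespace number and title per line, with 6 being MediaWiki's File: namespace; that layout is an assumption and is not confirmed anywhere in this thread. The function names are made up for illustration.

```python
import gzip
from typing import Iterable, Iterator

FILE_NAMESPACE = "6"  # MediaWiki's namespace number for File: pages


def svg_titles(lines: Iterable[str]) -> Iterator[str]:
    """Yield titles of File: pages whose name ends in .svg (any letter case)."""
    for line in lines:
        # Assumed layout: "<namespace number>\t<title>"
        namespace, sep, title = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # not a two-column line, skip it
        if namespace == FILE_NAMESPACE and title.lower().endswith(".svg"):
            yield title


def svg_titles_from_dump(path: str) -> Iterator[str]:
    """Stream a gzip-compressed listing so it never has to be unpacked on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from svg_titles(handle)
```

Streaming the compressed file line by line keeps disk writes low, which matters given the hard-disk concern above. A filter like this cannot tell whether a title has since been deleted or renamed.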
I can run a database report on Monday. But keep in mind that the wiki isn't static, and what you want changes at a very rapid rate.
Xmldatadumps-l mailing list Xmldatadumps-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
D. Hansen, 19/06/2015 23:09:
One suggestion was to download commonswiki-20150417-all-titles, which I did.
But this file contains deleted names and renamed names, and some names have "File:" at the start while others have neither "File:" nor a similar indicator. A small sample yielded 5 correct names, around 7 deleted names and around 7 renamed names.
That shouldn't happen... Was your sample all from the bottom (or top?) of the list? If so, maybe these are recent files, which are typically more liable to deletion. If this happens throughout the list of titles, then there's something wrong in the query used and the bug should be filed. You can also query https://www.mediawiki.org/wiki/Manual:Page_table#page_title yourself on labsdb, e.g. via http://quarry.wmflabs.org/
Nemo
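A query against the page table along these lines might look like the sketch below. The SQL uses the real page table columns page_namespace, page_title and page_is_redirect; excluding redirects is my own assumption about how to drop names left behind by renames. The in-memory SQLite table and the demo_rows helper are purely hypothetical stand-ins so the query can be tried offline; on Quarry only the SQL text would be pasted.

```python
import sqlite3

# Namespace 6 is File:. Redirects are skipped on the assumption that a
# renamed file leaves a redirect behind under its old name.
SVG_QUERY = """
SELECT page_title
FROM page
WHERE page_namespace = 6
  AND page_is_redirect = 0
  AND page_title LIKE '%.svg'
ORDER BY page_title
"""


def demo_rows():
    """Run the query against a tiny stand-in table (not real Commons data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        "page_namespace INTEGER, page_title TEXT, page_is_redirect INTEGER)"
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [
            (6, "Example.svg", 0),   # live SVG file page
            (6, "Old_name.svg", 1),  # redirect left by a rename
            (0, "Drawing.svg", 0),   # wrong namespace
            (6, "Photo.jpg", 0),     # not an SVG
        ],
    )
    rows = [row[0] for row in conn.execute(SVG_QUERY)]
    conn.close()
    return rows
```

As far as I know, page_title is stored as a binary string on the replicas, so LIKE there is case-sensitive and names ending in ".SVG" would need a second pattern, whereas SQLite's LIKE ignores ASCII case.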
On 20-06-2015 (Sat) at 00:46 +0200, Federico Leva (Nemo) wrote:
That shouldn't happen... Was your sample all from the bottom (or top?) of the list? If so, maybe these are recent files, which are typically more liable to deletion. If this happens throughout the list of titles, then there's something wrong in the query used and the bug should be filed. You can also query https://www.mediawiki.org/wiki/Manual:Page_table#page_title yourself on labsdb, e.g. via http://quarry.wmflabs.org/
Nemo
The files in this directory may be interesting to you (and I need to do some cleanup on them some day too):
http://dumps.wikimedia.org/other/imageinfo
They are produced a few times a month. For each wiki, the names of all images uploaded locally to the project are saved in the <wikiname>-local-wikiqueries.gz file, and those stored on commons but used on the wiki are in <wikiname>-remote-wikiqueries.gz.
You may still find later that some titles have been renamed or removed by the time you look at the contents.
Hope that helps,
Ariel
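Since any list can be stale by the time it is read, one way to cross-check an older list against a newer one is a plain set comparison. This is only a sketch: the crosscheck function and its result keys are hypothetical names, and both inputs are assumed to be title lists that were already extracted and normalized the same way.

```python
from typing import Dict, Iterable, List


def crosscheck(old_titles: Iterable[str], new_titles: Iterable[str]) -> Dict[str, List[str]]:
    """Compare two title lists and report what stayed, vanished, or appeared."""
    old, new = set(old_titles), set(new_titles)
    return {
        "still_present": sorted(old & new),  # in both lists
        "gone": sorted(old - new),           # deleted or renamed since the old list
        "added": sorted(new - old),          # uploaded, or the target of a rename
    }
```

A comparison like this cannot tell a deletion from a rename by itself; both simply show up under "gone".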
OK, here is a report that is a few minutes old: http://tools.wmflabs.org/betacommand-dev/reports/commonswiki_svg_list.txt.7z