The weblinkchecker.py working file deadlinks-wikipedia-pl.dat has a binary format in core. In compat it was a text file. The script works after copying the file to the new directory, but it does not seem to recognize the information about previously reported links. I have been running it on pl.wiki for some years, so I need to preserve this information to avoid reporting those links again.
Is there a way to migrate the data to the new format, or a description of both formats so that I can convert the file?
masti
Hi masti,
On 7 August 2015 at 23:11, masti mastigm@gmail.com wrote:
The weblinkchecker.py working file deadlinks-wikipedia-pl.dat has a binary format in core. In compat it was a text file.
This confuses me. As far as I can see, both compat and core use 'pickle' to write the working file (and as far as I can see, compat always has). There are, however, different versions of the pickle format, which could explain the binary/text difference. Could you post an excerpt of the old file?
The script works after copying the file to the new directory, but it does not seem to recognize the information about previously reported links.
Is there an error message posted, is the file just being overwritten, or is there some other kind of issue?
Best regards, Merlijn
On 08/07/2015 11:50 PM, Merlijn van Deen wrote:
This confuses me. As far as I can see, both compat and core use 'pickle' to write the working file (and as far as I can see, compat always has). There are, however, different versions of the pickle format, which could explain the binary/text difference. Could you post an excerpt of the old file?
(dp0 Vhttp://www.european-athletics.org/european-athletics-awards-night/baldini-sa... p1 (lp2 (VStefano Baldini p3 F1404150662.121833 S'404 Not Found' p4 tp5 a(V2010 w lekkoatletyce p6 F1404569856.34401 S'404 Not Found' p7 tp8 a(VStefano Baldini p9 F1404729661.197063 S'404 Not Found' p10 tp11 a(V2010 w lekkoatletyce p12 F1404761790.987255 S'404 Not Found' p13 tp14 a(VStefano Baldini p15 F1413308613.299339 S'404 Not Found' p16 tp17
The script works after copying the file to the new directory, but it does not seem to recognize the information about previously reported links.
Is there an error message posted, is the file just being overwritten, or is there some other kind of issue?
No, there is no error, and it does not overwrite the file. I am still testing whether it adds the records. But it tries to report dead links that were previously reported, which makes me think the script does not properly recognize the old records.
pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot
It might be due to the changes in the page object format and how the script looks up the information in the pickled object. (I am posting while traveling, so I can't confirm this.)
On 8 August 2015 at 00:01, masti mastigm@gmail.com wrote:
(dp0 Vhttp://www.european-athletics.org/european-athletics-awards-night/baldini-says-goodbye-at-the-giro-al-sas.html p1
(...)
This is indeed the text-based pickle format (protocol 0), representing a Python dict that looks like:
{u'http://www.european-athletics.org/european-athletics-awards-night/baldini-says-goodbye-at-the-giro-al-sas.html': [(u'Stefano Baldini', 1404150662.121833, '404 Not Found'), (u'2010 w lekkoatletyce', 1404569856.34401, '404 Not Found')]
}
which is the same as the format I get when I run weblinkchecker now.
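To check an existing file by hand, something along the following lines could be used. This is only a sketch, written for Python 3: the helper names `load_history` and `summarize`, the example URL, and the temporary file are made up for illustration, and the only thing taken from the excerpt is the structure (a URL mapped to a list of (page title, timestamp, error) tuples). `pickle.load` works out the protocol on its own, so the same loader should read a text-style and a binary-style file alike.

```python
import os
import pickle
import tempfile
import time


def load_history(path):
    """Return the dict stored in a weblinkchecker-style history file."""
    with open(path, 'rb') as f:
        return pickle.load(f)


def summarize(history):
    """Yield (url, number_of_reports, time_of_first_report) per URL."""
    for url, reports in history.items():
        first_seen = min(timestamp for _title, timestamp, _error in reports)
        yield url, len(reports), first_seen


# Stand-in data with the same shape as the excerpt; not real bot output.
sample = {
    'http://example.org/dead': [
        ('Stefano Baldini', 1404150662.121833, '404 Not Found'),
        ('2010 w lekkoatletyce', 1404569856.34401, '404 Not Found'),
    ],
}

# Write the stand-in data to a throwaway file, then read it back.
path = os.path.join(tempfile.mkdtemp(), 'deadlinks-sample.dat')
with open(path, 'wb') as f:
    pickle.dump(sample, f, protocol=0)

for url, count, first_seen in summarize(load_history(path)):
    print(url, count, time.strftime('%Y-%m-%d', time.gmtime(first_seen)))
```

Pointing `load_history` at a copy of the real .dat file (never the original) would show whether the old entries are still readable.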
But it tries to report dead links that were previously reported, which makes me think the script does not properly recognize the old records.
Based on the code, I don't think there is a check for that. The code only checks if the talk page already contains the URL, and this is the same for both compat and core.
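A rough illustration of that kind of check, using plain strings instead of real wiki pages. The function names `already_reported` and `report_if_new` are hypothetical and this is not the script's actual code; it only mirrors the behaviour described above, assuming that reading of the code is right.

```python
def already_reported(talk_text, url):
    """Return True if the URL appears anywhere in the talk page text."""
    return url in talk_text


def report_if_new(talk_text, url, notice):
    """Append the notice unless the URL is already mentioned.

    Returns the (possibly unchanged) talk page text and a flag telling
    whether a new report was added.
    """
    if already_reported(talk_text, url):
        return talk_text, False
    return talk_text + '\n\n' + notice, True


url = 'http://example.org/dead'
notice = 'Dead link found: ' + url

# First run: the talk page does not mention the URL, so it is reported.
text, added = report_if_new('Some older discussion.', url, notice)
print(added)

# Second run: the notice is still on the talk page, so nothing is added.
text, added = report_if_new(text, url, notice)
print(added)
```

With a check of this shape, a link would be reported again whenever the earlier notice is no longer on the talk page (for example after archiving), whatever the .dat file contains.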
Merlijn
The pickle protocol is configurable using the config setting 'pickle_protocol'. In core, it defaults to protocol 2.
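The text/binary difference can be reproduced with the standard library alone. The snippet below is a small sketch in Python 3 with made-up data; it does not touch pywikibot or its config. It pickles the same dict with protocol 0 and with protocol 2 and shows that both load back to the same object.

```python
import pickle

# Made-up entry with the same shape as the history file.
history = {
    'http://example.org/dead': [
        ('Stefano Baldini', 1404150662.121833, '404 Not Found'),
    ],
}

as_text = pickle.dumps(history, protocol=0)    # ASCII opcodes
as_binary = pickle.dumps(history, protocol=2)  # binary opcodes

print(as_text[:4])    # the "(dp0" opening seen in the old file
print(as_binary[:2])  # protocol 2 begins with a binary protocol header

# The loader detects the protocol itself, so both decode identically.
print(pickle.loads(as_text) == pickle.loads(as_binary) == history)
```

So a file that looks like text and one that looks binary can hold exactly the same data, and no separate conversion step should be needed just because of the protocol.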
Thanks for the help. Now I know what to look for.
masti