I've been working with a number of colleagues getting ready to turn HTTPS on by default for various loc.gov domains. This has been fairly successful and we're working through the old legacy apps now.
When that work completes, we'll have somewhere around half a million links which differ only in the URL scheme. What would be the best way to rewrite all of those URLs? I'd like to reduce the window during which users transit from HTTPS -> HTTP -> HTTPS.
If anyone's curious, I've been collecting the links for a few dozen wikis in a somewhat oversized Git repo:
https://github.com/acdha/lc-wikipedia-links
The first site which has completely migrated is the much smaller World Digital Library which has just under four thousand links: https://gist.github.com/acdha/f785b22b356a9842439e
Thanks, Chris
If you use Apache, a rewrite rule is the simplest approach and instructions can be found by searching for "rewrite http to https Apache". A similar process will work with nginx.
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On 01/13/2016 09:09 AM, Chris Adams wrote:
I've been working with a number of colleagues getting ready to turn HTTPS on by default for various loc.gov domains. This has been fairly successful and we're working through the old legacy apps now.
Awesome!
When that work completes, we'll have somewhere around half a million links which differ only in the URL scheme. What would be the best way to rewrite all of those URLs? I'd like to reduce the window during which users transit from HTTPS -> HTTP -> HTTPS.
You can use Pywikibot's replace.py[1], which lets you provide a regex find/replace and can get a list of pages from the API equivalent of Special:LinkSearch.
You should also consider setting up HSTS[2] so that even if users click on an HTTP link, they'll be sent to the HTTPS version of the site.
[1] https://www.mediawiki.org/wiki/Manual:Pywikibot/replace.py
[2] https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security
-- Legoktm
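A minimal sketch of the kind of regex find/replace described above, written in plain Python rather than as a replace.py invocation. The pattern, the `upgrade_scheme` name, and the decision to skip links embedded inside other URLs (such as archive links) are assumptions for illustration and are not part of Pywikibot.

```python
import re

# Hypothetical pattern: http:// links to loc.gov and its subdomains.
# The lookbehind skips URLs embedded in another URL (e.g. archive links);
# the lookahead refuses look-alike hosts such as loc.gov.example.com.
LOC_HTTP = re.compile(
    r"(?<![\w/])http://((?:[a-z0-9-]+\.)*loc\.gov)(?![\w.-])",
    re.IGNORECASE,
)


def upgrade_scheme(wikitext: str) -> str:
    """Rewrite http:// to https:// for loc.gov hosts, leaving other links alone."""
    return LOC_HTTP.sub(r"https://\1", wikitext)
```

The pattern is deliberately conservative: a link followed directly by a dot or hyphen is left untouched, so any real run would still need a manual review of what was skipped.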
On Wed, Jan 13, 2016 at 12:47 PM, Legoktm legoktm.wikipedia@gmail.com wrote:
You can use Pywikibot's replace.py[1], which lets you provide a regex find/replace and can get a list of pages from the API equivalent of Special:LinkSearch.
Thanks - I'll look into that as we get various batches of URLs ready for testing.
You should also consider setting up HSTS[2] so that even if users click on an HTTP link, they'll be sent to the HTTPS version of the site.
Yes – that's on the plan as soon as we finish remediating the older legacy content. I've been using lists from Wikipedia, a sampling of web access logs, etc. to feed a script[1] to find cases where someone used an absolute URL in a <script> tag, etc. We have a couple of subdomains which should be ready for HSTS quickly since they were only used for a single application.
Chris
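For reference, HSTS is delivered as a Strict-Transport-Security response header, and browsers only honour it when it arrives over HTTPS. The following is a minimal sketch of adding that header in a Python WSGI application; the `add_hsts` wrapper and its defaults are assumptions for illustration and say nothing about how the loc.gov servers are actually configured.

```python
def add_hsts(app, max_age=31536000, include_subdomains=False):
    """Wrap a WSGI app so HTTPS responses carry Strict-Transport-Security."""
    value = f"max-age={max_age}"
    if include_subdomains:
        # Only safe once every subdomain is known to work over HTTPS.
        value += "; includeSubDomains"

    def wrapped(environ, start_response):
        def start_with_hsts(status, headers, exc_info=None):
            # Browsers ignore HSTS sent over plain HTTP, so add it on HTTPS only.
            if environ.get("wsgi.url_scheme") == "https":
                headers = list(headers) + [("Strict-Transport-Security", value)]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_hsts)

    return wrapped
```

Because a browser has to see the header once over HTTPS before it starts upgrading requests, HSTS complements rather than replaces rewriting the stored links.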
On Wed, Jan 13, 2016 at 12:47 PM, Legoktm legoktm.wikipedia@gmail.com wrote:
You can use Pywikibot's replace.py[1], which lets you provide a regex find/replace and can get a list of pages from the API equivalent of Special:LinkSearch.
Thanks – I gave this a test using our simplest site (https://gist.github.com/acdha/77354c76bf503b6f455f) to produce a minor edit like this:
https://en.wikipedia.org/w/index.php?title=World_Digital_Library&diff=70...
I had a question about etiquette: is a one-time operation like this considered a bot for the purposes of needing to go through the approval process? I anticipate running this once for each application as it is migrated. Because there will be permanent redirects, users won't be seeing http: URLs any more, so there won't be a need for it to run automatically in the future.
Chris
I imagine you would need to go through the process, yep, since it's kind of a lot of edits that'd need clearing up if something went wrong.
On 15 January 2016 at 13:32, Chris Adams chris@improbable.org wrote:
I had a question about etiquette: is a one-time operation like this considered a bot for the purposes of needing to go through the approval process? I anticipate running this once for each application as it is migrated. Because there will be permanent redirects, users won't be seeing http: URLs any more, so there won't be a need for it to run automatically in the future.
Fix them with a bot, for example AWB https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser.
Before properly answering this question, it's important to know how many links we're talking about. If it's 5000, the fallout is probably manageable; but if it's in the hundreds of thousands on any project (most likely enwiki) there will be rending of garments and gnashing of teeth. All those changes show up on people's watchlists, after all.
Please also ensure that if you're changing the URL, it's not just a http --> https swap, but that the new URL is tested to verify it lands on a real page. There are no doubt plenty of bad links in amongst all those URLs - even government websites rearrange themselves periodically - and replacing a bad link with a more secure bad link is not really helpful.
Risker/Anne
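The check suggested above could be sketched roughly as follows: swap the scheme, request the new URL, and keep the replacement only if it resolves. The function names are hypothetical, a plain 200 is used as the test for "a real page", and the status lookup is passed in as a parameter so the logic can be exercised without touching the network. All of these are assumptions for illustration.

```python
import urllib.error
import urllib.request


def https_variant(url: str) -> str:
    """Return the https:// form of an http:// URL (other URLs unchanged)."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status for url, or 0 if it could not be reached."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 410 for a dead link
    except OSError:
        return 0  # DNS failure, refused connection, timeout, TLS error


def safe_replacement(url: str, status_of=fetch_status):
    """Return the HTTPS URL only if it resolves with 200; otherwise None."""
    candidate = https_variant(url)
    return candidate if status_of(candidate) == 200 else None
```

A link for which `safe_replacement` returns None would be left alone and flagged for a human, rather than swapped for a more secure dead link. Some servers reject HEAD requests, so a real run might need to fall back to GET.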
On 13 January 2016 at 13:32, Max Semenik maxsem.wiki@gmail.com wrote:
Fix them with a bot, for example AWB https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser.
Question; are LOC links handled in a standardised way using a template? Because if so this could be one change, not hundreds of thousands.
(If it's not, I'd really suggest using the same set of edits as an opportunity to restructure them that way, if LOC links are consistent enough for it to be done. That way you'll both avoid this problem in the future if something goes kooky - only have to make one edit! - and have a much easier way of identifying how many there are and where they live)
On Wed, Jan 13, 2016 at 1:49 PM, Risker risker.wp@gmail.com wrote:
Before properly answering this question, it's important to know how many links we're talking about. If it's 5000, the fallout is probably manageable; but if it's in the hundreds of thousands on any project (most likely enwiki) there will be rending of garments and gnashing of teeth. All those changes show up on people's watchlists, after all.
Yes, that's exactly what I'd like to avoid. The first batch of URLs which is ready to go is small (~4K) but the full list is significantly larger and many of those are used on multiple pages so the edit churn would be non-trivial.
Please also ensure that if you're changing the URL, it's not just a http --> https swap, but that the new URL is tested to verify it lands on a real page. There are no doubt plenty of bad links in amongst all those URLs - even government websites rearrange themselves periodically - and replacing a bad link with a more secure bad link is not really helpful.
Yes – part of this project on our side is setting permanent redirects not just for the protocol but also for pages which have moved into a different application. This is the other side of what Oliver Keyes was asking about: there is a mix of legacy applications which are non-trivial to rewrite, but also many thousands of URLs where a simple regex could handle both the protocol change and the switch to the canonical item page in the modern unified app, instead of continuing to use a long-deprecated legacy view. Internally we've been working to chunk that list of URLs into patterns by application / project so they can be reviewed and tested in a reasonable amount of time.
Chris
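One rough way to do the chunking described above is to group the collected URLs by host and first path segment, as a stand-in for "application / project". This is only a sketch: the grouping key and the `group_by_pattern` name are assumptions, and the real patterns may not map cleanly onto path prefixes.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_pattern(urls):
    """Group URLs by (host, first path segment) as a rough proxy for application."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        segments = [segment for segment in parts.path.split("/") if segment]
        key = (parts.hostname or "", segments[0] if segments else "")
        groups[key].append(url)
    return dict(groups)
```

Sorting the resulting groups by size would show which patterns cover the most links, so that each batch can be reviewed, tested, and rewritten on its own.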