Handling of usernames in imported edits in MediaWiki has long been weird (T9240[1] was filed in 2006!).
If the local user doesn't exist, we get a strange row in the revision table where rev_user_text refers to a valid name while rev_user is 0, which typically indicates an IP edit. Someone can later create the name, but rev_user remains 0, so depending on which field a tool looks at, the revision may or may not be considered to actually belong to the newly-created user.
If the local user does exist when the import is done, the edit is attributed to that user regardless of whether it's actually the same user. See T179246[2] for an example where imported edits got attributed to the wrong account in pre-SUL times.
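To make the first problem concrete, here is a minimal sketch of how two tools can disagree about the same history. The rows are hypothetical and only model the two fields mentioned above (rev_user and rev_user_text); the user ID 42 and the helper names are made up for illustration and are not MediaWiki code.

```python
# Hypothetical, simplified revision rows: only the two fields discussed above.
revisions = [
    # Imported before any local "Example" account existed: name set, ID is 0.
    {"rev_id": 1, "rev_user": 0, "rev_user_text": "Example"},
    # Made later by a local account that was created under the same name.
    {"rev_id": 2, "rev_user": 42, "rev_user_text": "Example"},
]


def edits_by_user_id(rows, user_id):
    """A tool that trusts rev_user."""
    return [r["rev_id"] for r in rows if r["rev_user"] == user_id]


def edits_by_user_name(rows, name):
    """A tool that trusts rev_user_text."""
    return [r["rev_id"] for r in rows if r["rev_user_text"] == name]


by_id = edits_by_user_id(revisions, 42)
by_name = edits_by_user_name(revisions, "Example")
print(by_id, by_name)
```

The ID-based lookup skips the imported revision while the name-based lookup claims it for the new account, which is the ambiguity described above.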
In Gerrit change 386625[3] I propose to change that.
- If revisions are imported using the "Upload XML data" method, it will be required to fill in a new field to indicate the source of the edits, which is intended to be interpreted as an interwiki prefix.
- If revisions are imported using the "Import from another wiki" method, the specified source wiki will be used as the source.
- During the import, any usernames that don't exist locally (and can't be auto-created via CentralAuth[4]) will be imported as an otherwise-invalid name, e.g. an edit by User:Example from source 'en' would be imported as "en>Example".[5]
- There will be a checkbox on Special:Import to specify whether the same should be done for usernames that do exist locally (or can be created) or whether those edits should be attributed to the existing/autocreated local user.
- On history pages, log pages, and the like, these usernames will be displayed as interwiki links, much as might be generated by wikitext like "[[:en:User:Example|en>Example]]". No parenthesized 'tool' links (talk, block, and so on) will be generated for these rows.
- On WMF wikis, we'll run a maintenance script to clean up the existing rows with valid usernames and rev_user = 0. The current plan there is to attribute these edits to existing SUL users where possible and to prefix them with a generic prefix otherwise, but we could as easily prefix them all.
- Unfortunately it's impossible to retroactively determine the actual source of old imports automatically or to automatically do anything about imports that were misattributed to a different local user in pre-SUL times (e.g. T179246[2]).
- The same will be done for CentralAuth's global suppression blocks. In this case, on WMF wikis we can safely point them all at Meta.
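The naming rule in the list above can be sketched roughly as follows. This is only an illustration of the decision as described in this message, not the code in the Gerrit change; the function name, the parameter names, and the assign_known_users flag (standing in for the Special:Import checkbox) are all hypothetical.

```python
def attributed_name(username, source_prefix, exists_locally,
                    can_autocreate, assign_known_users):
    """Return the name an imported edit would be stored under.

    All names here are hypothetical. assign_known_users stands in for the
    proposed checkbox: when True, edits by users that exist locally (or can
    be auto-created) are attributed to that local account.
    """
    known = exists_locally or can_autocreate
    if known and assign_known_users:
        return username
    # Otherwise use the deliberately invalid "prefix>name" form.
    return f"{source_prefix}>{username}"


# Unknown user from source 'en': always prefixed.
print(attributed_name("Example", "en", False, False, True))
# Known user, checkbox set: attributed to the local account.
print(attributed_name("Example", "en", True, False, True))
# Known user, checkbox not set: prefixed like everyone else.
print(attributed_name("Example", "en", True, False, False))
```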
If you have comments on this proposal, please reply here or on https://gerrit.wikimedia.org/r/#/c/386625/.
Background: The upcoming actor table changes[6] require some change to the handling of these imported names because we can't have separate attribution to "Example as a non-registered user" and "Example as a registered user" with the new schema. The options we've identified are:
1. This proposal, or something much like it.
2. All the existing rows with rev_user = 0 would have to be attributed to the existing local user (if any), and in the future when a new user is created any existing edits attributed to that name will be automatically attributed to that new account.
3. All the existing rows with rev_user = 0 and an existing local user would have to be re-attributed to different *valid* usernames, probably randomly-generated in some manner, and in the future when a new user is created any existing edits for that name would have to be similarly re-attributed.
4. Like #2, except the creation (including SUL auto-creation) of the same-named account would not be allowed. Thus, an import before the local name exists would forever block that name from being used for an actual local account.
5. Some less consistent combination of the "all the existing rows" and "when a new user is created" options from #2–4.
Of these options, this proposal seems like the best one.
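As a rough illustration of why the background above forces a choice, here is a sketch that assumes the new schema keeps exactly one row per name. The table below is a simplified stand-in built in SQLite, and the column names and uniqueness constraints are my assumption for the example rather than something stated in this message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified stand-in for the actor table: one row per name.
conn.execute(
    "CREATE TABLE actor ("
    " actor_id INTEGER PRIMARY KEY,"
    " actor_user INTEGER UNIQUE,"        # NULL when there is no local account
    " actor_name TEXT NOT NULL UNIQUE)"
)

# A registered "Example" and a prefixed import can coexist ...
conn.execute("INSERT INTO actor (actor_user, actor_name) VALUES (42, 'Example')")
conn.execute("INSERT INTO actor (actor_user, actor_name) VALUES (NULL, 'en>Example')")

# ... but a second, unregistered "Example" cannot be stored separately.
try:
    conn.execute("INSERT INTO actor (actor_user, actor_name) VALUES (NULL, 'Example')")
    collided = False
except sqlite3.IntegrityError:
    collided = True

count = conn.execute("SELECT COUNT(*) FROM actor").fetchone()[0]
print(collided, count)
```

Under that assumption, "Example as a non-registered user" has nowhere to live once "Example" is registered, whereas the prefixed name stays distinct.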
[1]: https://phabricator.wikimedia.org/T9240
[2]: https://phabricator.wikimedia.org/T179246
[3]: https://gerrit.wikimedia.org/r/#/c/386625/
[4]: https://phabricator.wikimedia.org/T111605
[5]: ">" was chosen rather than the more typical ":" because the former is already invalid in all usernames (and page titles). While a colon is *now* disallowed in new usernames, existing names created before that restriction was added can continue to be used (and there are over 12000 such usernames in WMF's SUL) and we decided it'd be better not to suddenly break them.
[6]: https://phabricator.wikimedia.org/T167246
Hi Brad,
2017-10-31 16:52 GMT+02:00 Brad Jorsch (Anomie) bjorsch@wikimedia.org:
> - If revisions are imported using the "Upload XML data" method, it will be required to fill in a new field to indicate the source of the edits, which is intended to be interpreted as an interwiki prefix.

What if that is not possible? How are imports between non-related websites handled? I recently encountered a situation where it was considered easier to upgrade MediaWiki by exporting the old wiki and importing it into the new one.

> - If revisions are imported using the "Import from another wiki" method, the specified source wiki will be used as the source.
> - During the import, any usernames that don't exist locally (and can't be auto-created via CentralAuth[4]) will be imported as an otherwise-invalid name, e.g. an edit by User:Example from source 'en' would be imported as "en>Example".[5]

Why not use "~" like when merging accounts? Sounds like yet another "code" is growing for no obvious reason. If you are worried about conflicts, there shouldn't be any, as the interwiki prefix is different from the shortcut used on SUL.

> - There will be a checkbox on Special:Import to specify whether the same should be done for usernames that do exist locally (or can be created) or whether those edits should be attributed to the existing/autocreated local user.

That sounds good. Ideally we should have a way to match local users to remote users, but generating that mapping might be overkill, especially for large imports.
Strainu
-- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Thu, Nov 2, 2017 at 4:46 PM, Strainu strainu10@gmail.com wrote:
> 2017-10-31 16:52 GMT+02:00 Brad Jorsch (Anomie) bjorsch@wikimedia.org:
>> - If revisions are imported using the "Upload XML data" method, it will be required to fill in a new field to indicate the source of the edits, which is intended to be interpreted as an interwiki prefix.
>
> What if that is not possible? How are imports between non-related websites handled?

It's always possible to enter something, whether an actual interwiki link is defined or not. But why not define one?

> I recently encountered a situation where it was considered easier to upgrade MediaWiki by exporting the old wiki and importing it into the new one.

That seems like a strange situation. But in a case like that, recreate the user table first and no edits should need prefixing.

>> - If revisions are imported using the "Import from another wiki" method, the specified source wiki will be used as the source.
>> - During the import, any usernames that don't exist locally (and can't be auto-created via CentralAuth[4]) will be imported as an otherwise-invalid name, e.g. an edit by User:Example from source 'en' would be imported as "en>Example".[5]
>
> Why not use "~" like when merging accounts? Sounds like yet another "code" is growing for no obvious reason. If you are worried about conflicts, there shouldn't be any, as the interwiki prefix is different from the shortcut used on SUL.

You mean like the appended "~enwiki" used during SUL finalization? Legitimate usernames, including those from SUL finalization, can contain '~', so recognizing imported names would be much more difficult and we'd have to do a lot more work to handle conflicts when they arise.
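To illustrate the point about recognition, here is a small sketch of how a name containing ">" can be detected and turned into the interwiki link form mentioned in the proposal. The helper names are hypothetical and this is not the merged implementation; it only relies on ">" never appearing in a valid username, as footnote [5] of the proposal says.

```python
def split_imported_name(name):
    """Split 'en>Example' into ('en', 'Example').

    Returns (None, name) for ordinary names. Because '>' is invalid in real
    usernames, its presence alone identifies an imported name.
    """
    prefix, sep, rest = name.partition(">")
    if sep:
        return prefix, rest
    return None, name


def interwiki_link(name):
    """Build the wikitext link form from the proposal, or None if not imported."""
    prefix, user = split_imported_name(name)
    if prefix is None:
        return None
    return f"[[:{prefix}:User:{user}|{name}]]"


print(split_imported_name("en>Example"))
print(interwiki_link("en>Example"))
# A '~' suffix cannot be told apart from a legitimate name by inspection.
print(split_imported_name("Example~enwiki"))
```

With "~" as the separator, the last case would need a lookup against existing accounts to decide whether "Example~enwiki" is an imported name or a real one, which is the extra conflict handling referred to above.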
The proposal was approved by TechCom, the code has been merged, and it's live now on the Beta Cluster. I'm running the maintenance script now. Please test things there and report any bugs you encounter, either by replying to this message or by filing it in Phabricator and adding me as a subscriber. Assuming no major errors turn up that can't be quickly fixed, I'll probably start running the maintenance script on the production wikis the week of December 11 (and perhaps on mediawiki.org and testwiki the week before).
If you're curious as to what the history of an existing imported page might look like after the maintenance script is run, see https://commons.wikimedia.beta.wmflabs.org/wiki/Template:Documentation?actio... for an example.
-- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation
I suggested it on T20209#3535024 back in August. Thanks, Brad, for taking care of it :)

Just to add a side note regarding user = 0 with a non-IP value in user_text: I saw it was quite common in the Wikidata recentchanges table a few months ago with rc_type=5 (RC_EXTERNAL), though I no longer see any such rows.
On Thu, Nov 30, 2017 at 7:31 PM, Brad Jorsch (Anomie) <bjorsch@wikimedia.org> wrote:
> The proposal was approved by TechCom, the code has been merged, and it's live now on the Beta Cluster. I'm running the maintenance script now. Please test things there and report any bugs you encounter, either by replying to this message or by filing it in Phabricator and adding me as a subscriber. Assuming no major errors turn up that can't be quickly fixed, I'll probably start running the maintenance script on the production wikis the week of December 11 (and perhaps on mediawiki.org and testwiki the week before).
>
> If you're curious as to what the history of an existing imported page might look like after the maintenance script is run, see https://commons.wikimedia.beta.wmflabs.org/wiki/Template:Documentation?action=history for an example.
On Thu, Nov 30, 2017 at 12:31 PM, Brad Jorsch (Anomie) <bjorsch@wikimedia.org> wrote:
> The proposal was approved by TechCom, the code has been merged, and it's live now on the Beta Cluster. I'm running the maintenance script now. Please test things there and report any bugs you encounter, either by replying to this message or by filing it in Phabricator and adding me as a subscriber. Assuming no major errors turn up that can't be quickly fixed, I'll probably start running the maintenance script on the production wikis the week of December 11 (and perhaps on mediawiki.org and testwiki the week before).
I've now run the script on mediawiki.org, testwiki, test2wiki, and testwikidatawiki. Please let me know about any related errors.
Assuming no error reports, I'll run the script on the rest of the wikis next week.