Halfak added a comment.
@Jsalsman
The libraries I linked to are at a higher level of abstraction than I think you are imagining. The implementation details (like using binary search and indexes) are up to you. But since you asked for some code, here's a simple strategy that generates the answer on demand. Note that I reworked the example to use a real snippet from an article so that I could run it and confirm that it works.
from mw import api
from mw.lib import persistence

# Initialize api session and page state
session = api.Session("https://en.wikipedia.org/w/api.php")
page_state = persistence.State()

# Query for the page's revisions
rev_docs = session.revisions.query(
    titles={"Antoine Beauvilliers"},
    properties={"content", "user", "timestamp", "sha1"},
    direction="newer"
)

# Use the page_state to process the revisions (and store the revision's timestamps)
last_tokens = None
for rev_doc in rev_docs:
    tokens, _, _ = page_state.process(rev_doc.get("*", ""),
                                      rev_doc['timestamp'],
                                      checksum=rev_doc['sha1'])
    last_tokens = tokens

# This gnarly bit of code is just used to find the specific tokens we are looking for
expected = "Of humble parentage, Beauvilliers worked his way up from kitchen boy"
len_expected = len(persistence.tokenization.wikitext_split(expected))
match_ranges = [
    (i, i+len_expected)
    for i in range(len(last_tokens))
    if "".join(t.text for t in last_tokens[i:i+len_expected]) == expected
]

# Print out the tokens and the first revision they appeared in
for start, end in match_ranges:
    for token in last_tokens[start:end]:
        if len(token.text.strip()) == 0:
            continue
        print("'{0}' was added {1}".format(token.text, token.revisions[0]))
The output looks like this:
'Of' was added 2013-05-24T20:07:27Z
'humble' was added 2013-06-01T05:39:29Z
'parentage' was added 2013-05-24T20:07:27Z
',' was added 2013-05-24T20:07:27Z
'Beauvilliers' was added 2013-05-24T20:07:27Z
'worked' was added 2014-08-29T08:26:55Z
'his' was added 2014-08-29T08:26:55Z
'way' was added 2014-08-29T08:26:55Z
'up' was added 2014-08-29T08:26:55Z
'from' was added 2014-08-29T08:26:55Z
'kitchen' was added 2014-08-29T08:26:55Z
'boy' was added 2014-08-29T08:26:55Z
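
If you want to play with the matching and reporting steps without hitting the API, here is a rough offline sketch. It makes some assumptions: `Token` is a hypothetical stand-in that carries only the `text` and `revisions` attributes the snippet reads, the token data is made up, and `find_match_ranges` grows its window until the joined text matches instead of tokenizing the expected string first, so it is a variant of the approach rather than the same code. `group_by_first_revision` is an extra helper (also hypothetical) that collapses consecutive tokens with the same first revision into one line.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for the library's token objects: only the two
# attributes used here (`text` and `revisions`).
Token = namedtuple("Token", ["text", "revisions"])


def find_match_ranges(tokens, expected):
    """Return (start, end) index pairs whose joined token text equals `expected`."""
    ranges = []
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end].text
            if joined == expected:
                ranges.append((start, end + 1))
                break
            # Stop early once the window can no longer match.
            if not expected.startswith(joined):
                break
    return ranges


def group_by_first_revision(tokens):
    """Collapse consecutive non-whitespace tokens that share a first revision."""
    visible = [t for t in tokens if t.text.strip()]
    return [
        (revision, [t.text for t in run])
        for revision, run in groupby(visible, key=lambda t: t.revisions[0])
    ]


# Toy data, made up for illustration (not taken from the article history).
tokens = [
    Token("Intro", ["r1"]), Token(" ", ["r1"]),
    Token("Of", ["r1"]), Token(" ", ["r1"]),
    Token("humble", ["r2"]), Token(" ", ["r1"]),
    Token("parentage", ["r1", "r3"]), Token(".", ["r1"]),
]

for start, end in find_match_ranges(tokens, "Of humble parentage"):
    for revision, words in group_by_first_revision(tokens[start:end]):
        print("{0}: {1}".format(revision, " ".join(words)))
```

With real data you would pass `last_tokens` in place of the toy list. The grouping only merges adjacent tokens, so a word inserted later in the middle of an older phrase splits that phrase into separate runs.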
TASK DETAIL https://phabricator.wikimedia.org/T89763
To: Halfak Cc: Halfak, Jsalsman, jayvdb, Aklapper, Evanontario, pywikipedia-bugs