Halfak added a comment.
@Jsalsman
The libraries I linked to are at a higher level of abstraction than I think you are
imagining. The implementation details (like using binary search and indexes) are up to
you. But since you asked for some code, here's a simple strategy that would generate
the answer on demand. Note that I reworked the example to use a real snippet from an
article so I could run it to make sure it worked too.

from mw import api
from mw.lib import persistence

# Initialize api session and page state
session = api.Session("https://en.wikipedia.org/w/api.php")
page_state = persistence.State()

# Query for the page's revisions, oldest first
rev_docs = session.revisions.query(titles={"Antoine Beauvilliers"},
                                   properties={"content", "user",
                                               "timestamp", "sha1"},
                                   direction="newer")

# Use the page_state to process the revisions (and store each revision's timestamp)
last_tokens = None
for rev_doc in rev_docs:
    tokens, _, _ = page_state.process(rev_doc.get("*", ""),
                                      rev_doc['timestamp'],
                                      checksum=rev_doc['sha1'])
    last_tokens = tokens

# This gnarly bit of code is just used to find the specific tokens we are looking for
expected = "Of humble parentage, Beauvilliers worked his way up from kitchen boy"
len_expected = len(persistence.tokenization.wikitext_split(expected))
match_ranges = [(i, i+len_expected) for i in range(len(last_tokens))
                if "".join(t.text for t in
                           last_tokens[i:i+len_expected]) == expected]

# Print out the tokens and the first revision they appeared in
for start, end in match_ranges:
    for token in last_tokens[start:end]:
        if len(token.text.strip()) == 0: continue
        print("'{0}' was added {1}".format(token.text,
                                           token.revisions[0]))

The output looks like this:
'Of' was added 2013-05-24T20:07:27Z
'humble' was added 2013-06-01T05:39:29Z
'parentage' was added 2013-05-24T20:07:27Z
',' was added 2013-05-24T20:07:27Z
'Beauvilliers' was added 2013-05-24T20:07:27Z
'worked' was added 2014-08-29T08:26:55Z
'his' was added 2014-08-29T08:26:55Z
'way' was added 2014-08-29T08:26:55Z
'up' was added 2014-08-29T08:26:55Z
'from' was added 2014-08-29T08:26:55Z
'kitchen' was added 2014-08-29T08:26:55Z
'boy' was added 2014-08-29T08:26:55Z
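
For anyone who wants to try the phrase-matching step without hitting the API, here is a minimal sketch of it in isolation. The `Token` namedtuple is a hypothetical stand-in that models only the `text` and `revisions` attributes read by the code above, and the `rev-A`/`rev-B`/`rev-C` labels are placeholders, not real revision timestamps. The helper names `find_match_ranges` and `first_appearances` are also made up for illustration. Note that this variant grows a window from each start position instead of tokenizing the expected phrase first, so it is a variation on the approach above, not the same code.

```python
from collections import namedtuple

# Hypothetical stand-in for the tokens returned by page_state.process();
# only the two attributes used in the comment's code are modeled.
Token = namedtuple("Token", ["text", "revisions"])


def find_match_ranges(tokens, expected):
    """Return (start, end) index pairs whose joined token text equals `expected`."""
    ranges = []
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end].text
            if joined == expected:
                ranges.append((start, end + 1))
                break
            # Stop early once the window can no longer become a match
            if not expected.startswith(joined):
                break
    return ranges


def first_appearances(tokens, expected):
    """List (text, earliest revision) for non-whitespace tokens in each match."""
    return [(t.text, t.revisions[0])
            for start, end in find_match_ranges(tokens, expected)
            for t in tokens[start:end]
            if t.text.strip()]


# Placeholder data: revision labels are illustrative only
sample = [
    Token("Of", ["rev-A"]),
    Token(" ", ["rev-A"]),
    Token("humble", ["rev-B", "rev-C"]),
    Token(" ", ["rev-A"]),
    Token("parentage", ["rev-A", "rev-C"]),
    Token(",", ["rev-A"]),
]

for text, revision in first_appearances(sample, "Of humble parentage"):
    print("'{0}' was added {1}".format(text, revision))
```

The early exit on `startswith` keeps most start positions to a single comparison, which may matter on long articles; whether that is worth it over the one-line list comprehension is a judgment call.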
TASK DETAIL
https://phabricator.wikimedia.org/T89763
To: Halfak
Cc: Halfak, Jsalsman, jayvdb, Aklapper, Evanontario, pywikipedia-bugs