Hi Denny, thanks for the questions!
1) The time unit is the article revision (namespace 0). This means that in your
example, the article would be available at T2 and T4. Also adding the page at
T1 or T3 would mean regenerating every page that includes the template, and
the resulting dataset would be significantly larger than the current 7 TB.
If there is a specific need for the complete history at that level of
granularity, the code could be adapted to store every possible change.
2) No, we used only the wikitext available in the static XML dump. The date
match is applied to templates and Lua modules. Regarding the UI message
strings, if you are referring to MediaWiki interface labels, note that
we included only the content of the article, as if you had retrieved the page
with the parameter *action=render*.
3) Thank you for these pointers. I confirm that WikiPDA can be seen as a
downloadable version of Memento, with the bonus of having the templates
matched as of the time each revision was created.
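
To make point 1 concrete, here is a minimal sketch of the date match on your example. It is an illustration only, not the actual pipeline code: the function names `template_version_at` and `snapshots` are hypothetical, and the integer timestamps 1 to 4 simply stand in for T1 to T4.

```python
from bisect import bisect_right


def template_version_at(template_revs, when):
    """Return the latest template revision time <= when, or None if the
    template did not exist yet. template_revs must be sorted ascending."""
    i = bisect_right(template_revs, when)
    return template_revs[i - 1] if i else None


def snapshots(page_revs, template_revs):
    """One snapshot per page revision, paired with the template revision
    that was current at that moment. Template-only changes create no
    snapshot of their own."""
    template_revs = sorted(template_revs)
    return [(p, template_version_at(template_revs, p)) for p in sorted(page_revs)]


# Denny's example: P modified at T2 and T4, T modified at T1 and T3.
page_revs = [2, 4]
template_revs = [1, 3]
print(snapshots(page_revs, template_revs))
```

Under these assumptions P gets two snapshots: the one at T2 is expanded with T as of T1, and the one at T4 with T as of T3. Nothing is produced at T3 because no revision of P was created then.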
On Sat, Sep 12, 2020 at 12:32 AM Denny Vrandečić <vrandecic(a)gmail.com>
wrote:
Three questions:
1) assume a page P with a Template T.
P has been modified at time T2 and T4.
T has been modified at T1 and T3.
Will P be available as of T2 and T4 only, or also as of T3? (at which point
it will be different than at T2 or T4).
2) What about changes to Wikidata, Commons, or UI message strings?
3) Possibly interesting to look into TimeMachine, Memento, and related work
https://www.mediawiki.org/wiki/Extension:TimeMachine
https://www.mediawiki.org/wiki/Extension:Memento
On Fri, Sep 11, 2020 at 2:59 PM Tiziano Piccardi <tiziano.piccardi(a)epfl.ch>
wrote:
> Thanks Federico and WSC for the interest!
> I want to specify that we used only public data released in the XML dump.
> As WSC said, deleted content is not always permanently removed from the
> database, but it is available only to users with privileged access. Our
> goal is not only to release the dataset, but also to give anyone the
> possibility to (1) reproduce the results, and (2) generate the HTML history
> in other languages without any special access requirements.
> Tiziano
> On Fri, Sep 11, 2020 at 9:47 PM WereSpielChequers
> <werespielchequers(a)gmail.com> wrote:
> > I wouldn't use the phrase "Wikipedia’s deliberate policy of permanently
> > deleting the entire history of deleted pages". Quite a few "deleted" pages
> > do actually get restored, and depending on the deletion process it can be
> > quite easy to get much deleted content back. Especially if someone
> > volunteers to reference an unreferenced page or a budding footballer
> > actually gets to play at professional or international level, or indeed a
> > political candidate is elected. Almost all "deleted" content still exists
> > and could be restored by a volunteer admin in the right circumstances.
> > However, Wikipedia's deletion processes are more than a little complex;
> > many articles have incomplete histories because admins have revision
> > deleted particular revisions that include copyright violations and/or some
> > really libellous stuff. Some of the really nasty stuff gets "oversighted" -
> > those revisions are not even visible to administrators.
>
> > There is also the issue that some of the earliest material is not
> > available. Stats on admin actions only go back to December 2004, and
> > while there is some content from before then, I am not sure if all the
> > stuff deleted before then is available.
>
> > Regards
>
> > WSC
>
> > On Fri, 11 Sep 2020 at 10:22, Federico Leva (Nemo) <nemowiki(a)gmail.com>
> > wrote:
>
> > > Robert West, 11/09/20 11:29:
> > > > local instances of MediaWiki,
> > > > enhanced with the capacity of correct historical macro expansion.
> >
> > > Interesting. I see this doesn't include deleted templates. Have you
> > > considered using historical dumps?
> >
> > > «We emphasize that the limitation of deleted pages, templates, and
> > > modules is not introduced by our parsing process. Rather, it is
> > > inherited from Wikipedia’s deliberate policy of permanently deleting
> > > the entire history of deleted pages.»
> >
> > > A relevant task is
> > > https://phabricator.wikimedia.org/T2851
> >
> > > See also the various discussions about Memento, like
> > > https://phabricator.wikimedia.org/T164654
> >
> > > Federico
> >
> > > _______________________________________________
> > > Wiki-research-l mailing list
> > > Wiki-research-l(a)lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l