Thanks, Isaac and Federico. These notes and links are very helpful--and will require some time to process. As for how many years I have to work on this, I'm retired! In truth, I keep hoping that someone on this list will express interest in working on these matters. The questions are all very interesting and quite relevant. The idea of studying removed citations is both complex and compelling.
Greg
On Mon, Aug 26, 2019 at 6:49 AM Isaac Johnson isaac@wikimedia.org wrote:
Regarding data, I have not been a part of these projects but I think that I can help a bit with working links:
- The (I believe) original dataset can also be found here:
https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/
- A newer version of this dataset was produced that also included
information about whether the source was openly available and its topic:
- Meta page: https://meta.wikimedia.org/wiki/Research:Towards_Modeling_Citation_Quality
- Figshare: https://figshare.com/articles/Accessibility_and_topics_of_citations_with_ide...
On Mon, Aug 26, 2019 at 3:53 AM Federico Leva (Nemo) nemowiki@gmail.com wrote:
Greg, 22/08/19 06:19:
I do not know the current status of wikicite or if/when this could be used for this inquiry--either to examine all, or a sensible subset, of the citations.
If I see correctly, you still have not received an answer about the available data.
It's true that the Figshare item for https://meta.wikimedia.org/wiki/Research:Scholarly_article_citations_in_Wiki...
was deleted (I've asked about it on the talk page), but it's trivial to run https://pypi.org/project/mwcites/ and extract the data yourself, at least for citations which use an identifier.
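To illustrate the identifier-extraction idea, here is a minimal sketch of pulling DOIs out of wikitext with a regular expression. This is illustration only, not the mwcites method: mwcites handles many more identifier types (PMIDs, ISBNs, arXiv ids) and edge cases, and the pattern below is a deliberately rough approximation.

```python
import re

# Rough DOI pattern for illustration only; mwcites is far more careful.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s|<>}\]]+')

def extract_dois(wikitext):
    """Return the DOI-like identifiers found in a wikitext string."""
    return DOI_RE.findall(wikitext)

sample = "{{cite journal |doi=10.1371/journal.pone.0038869 |title=Example}}"
print(extract_dois(sample))
```

Run over a dump of revision texts, this gives a per-revision list of cited identifiers to build on.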
Some example datasets produced this way: https://zenodo.org/record/15871 https://zenodo.org/record/55004 https://zenodo.org/record/54799
Once you extract the list of works, the fun begins. You'll need to intersect with other data sources (Wikidata, ORCID, other?) and account for a number of factors until you manage to find a subset of the data which has a sufficiently high signal:noise ratio. For instance you might need to filter or normalise by
- year of publication (some year recent enough to have good data but old enough to allow the work to be cited elsewhere, be archived after embargos);
- country or institution (some probably have better ORCID coverage);
- field/discipline and language;
- open access status (per Unpaywall);
- number of expected pageviews and clicks (for instance using https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews and https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream#Releases; a link from 10k articles on asteroids or proteins is not the same as being the lone link from a popular article, which is not the same as a link buried among a thousand others on a big article);
- time or duration of the addition (with one of the various diff extraction libraries, content persistence data, or possibly a historical eventstream if such a thing is available).
To avoid having to invent everything yourself, maybe you can reuse the method of some similar study, for instance the one on the open access citation advantage or one of the many which studied the gender imbalance of citations and peer review in journals.
However, it's very possible that the noise is just too much for a general computational method. You might consider a more manual approach on a sample of relevant events, for instance the *removal* of citations, which is in my opinion more significant than the addition.* You might extract all the diffs which removed a citation from an article in the last N years (they'll probably be on the order of 10^5 rather than 10^6), remove some mass events or outliers, sample 500-1000 of them at random, and verify the required data manually.
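The extract-then-sample step can be sketched as follows. The revision pairs here are toy placeholders (in practice they would stream from the history dumps), and the DOI regex is a rough stand-in for a proper identifier extractor like mwcites:

```python
import random
import re

# Rough identifier pattern; a real pipeline would use mwcites or similar.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s}|]+')

def removed_identifiers(old_text, new_text):
    """Identifiers present in the parent revision but gone from the child."""
    return set(DOI_RE.findall(old_text)) - set(DOI_RE.findall(new_text))

# Toy (parent, child) wikitext pairs standing in for streamed diffs:
revisions = [
    ("cited {{doi|10.1000/xyz123}} here", "cited nothing here"),
    ("kept 10.5555/abc as-is", "kept 10.5555/abc as-is"),
]
removal_events = [i for i, (old, new) in enumerate(revisions)
                  if removed_identifiers(old, new)]
random.seed(0)  # fixed seed so the sample is reproducible
sampled = random.sample(removal_events, k=min(500, len(removal_events)))
```

The manual verification would then happen only on `sampled`, keeping the hand-checked workload bounded regardless of how many removal events exist.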
As usual, it will be impossible to have an objective assessment of whether a citation was really (in)appropriate in its context according to the (English or whatever) Wikipedia guidelines. To test that too, you could replicate one of the various studies of gender imbalance in peer review, perhaps one of those which tried to assess the impact of a double-blind peer review system on the imbalance. However, because the sources are already published, you'd need to provide the de-gendered information yourself and make sure the participants perform their assessment in some controlled environment where they don't have access to any gendered information (i.e. where you cut them off from the internet).
How many years do you have to work on this project? :-)
Federico
(*) I might add a citation just because it's the first result a popular search engine gives me, after glancing at the abstract and maybe the journal home page; but if I remove an existing citation, hopefully I've at least assessed its content and made a judgement about it, apart from cases of mass removals for specific problems with certain articles or publication venues.
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
-- Isaac Johnson -- Research Scientist -- Wikimedia Foundation