Hi all,

I’m happy to announce the outcome of the Outreachy internship that I’m finishing up: a new tool and public dataset named Citation Detective, which tool developers and researchers can now use in their projects.

Citation Detective contains sentences that have been identified as needing a citation by a machine learning-based classifier published last year by WMF researchers and collaborators. As part of Outreachy, I developed a tool (hosted on Toolforge) that runs through Wikipedia and extracts high-scoring sentences along with contextual information.

As an example use case for this data, I also created a proof of concept that integrates Citation Detective with Citation Hunt. Check out my prototype of Citation Hunt, which uses Citation Detective to import sentences that would not normally be featured in Citation Hunt. The repository for that is here.

The dataset currently includes sentences from ~120,000 randomly selected articles from the English Wikipedia. In future work, we hope to expand it to more Wikipedia language editions and a greater number of articles. A future version of the database could also include more fields, depending on feedback from tool developers and researchers. More use cases for this type of data were identified in a design research project conducted last year by Jonathan Morgan.

You can find more information in our Wiki Workshop submission and on my blog, which documents the whole journey.

Thank you very much!

Kind regards,
Aiko