Hi all,
I’m happy to announce the outcome of an Outreachy internship https://phabricator.wikimedia.org/T233707 that I’m finishing up: a new tool and public dataset named Citation Detective, which tool developers and researchers can now use in their projects.
Citation Detective https://meta.wikimedia.org/wiki/Citation_Detective contains sentences that have been identified as needing a citation, using a machine-learning-based classifier published early last year https://arxiv.org/pdf/1902.11116.pdf by WMF researchers and collaborators. As part of Outreachy, I developed a tool https://github.com/AikoChou/citationdetective (hosted on Toolforge https://tools.wmflabs.org) that runs through Wikipedia and extracts high-scoring sentences along with contextual information.
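To give a rough idea of the extraction step, here is a minimal Python sketch (not the actual Citation Detective code): it fetches an article's wikitext through the MediaWiki API, splits it into sentences, and keeps those scored above a threshold. The score_sentence function is only a placeholder standing in for the published classifier, and the sentence splitting is deliberately naive.

    import re
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_wikitext(title):
        # Fetch the current wikitext of an article via the MediaWiki API.
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": title,
            "format": "json",
            "formatversion": "2",
        }
        page = requests.get(API, params=params).json()["query"]["pages"][0]
        return page["revisions"][0]["slots"]["main"]["content"]

    def score_sentence(sentence, context):
        # Placeholder for the published classifier, which looks at a sentence
        # plus its context and returns the probability that it needs a citation.
        return 0.0

    def extract_candidates(title, threshold=0.5):
        text = fetch_wikitext(title)
        # Very naive sentence splitting, purely for illustration.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        scored = ((s, score_sentence(s, title)) for s in sentences)
        return [(s, p) for s, p in scored if p >= threshold]

The real tool does quite a bit more work around context extraction and storing the results; see the GitHub repository above for the details.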
As an example use case for this data, I also created a proof of concept for integrating Citation Detective and Citation Hunt https://tools.wmflabs.org/citationhunt. Check out my prototype Citation Hunt https://tools.wmflabs.org/aiko-citationhunt, which uses Citation Detective to import sentences that would not normally be featured in Citation Hunt. The repository for that is here https://github.com/AikoChou/citationhunt.
The dataset currently includes sentences from ~120,000 randomly selected articles on the English Wikipedia. In future work, we hope to expand it to more language editions of Wikipedia and to a greater number of articles. The database could also gain additional fields in a future version, based on feedback from tool developers and researchers. More use cases for this type of data were identified in a design research project https://meta.wikimedia.org/wiki/Research:Identification_of_Unsourced_Statements/API_design_research conducted last year by Jonathan Morgan.
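If you would like to explore the data from a Toolforge tool, a query along the lines of the pymysql sketch below should work; note that the host, database, table and column names here are only my assumptions for illustration, so please check the Meta page for the actual schema and connection details.

    import os
    import pymysql

    # Connection details are placeholders; the real database name and host
    # are documented on the Citation Detective Meta page.
    conn = pymysql.connect(
        host="tools.db.svc.eqiad.wmflabs",  # ToolsDB host on Toolforge (assumption)
        db="s12345__citationdetective_p",   # placeholder database name
        read_default_file=os.path.expanduser("~/replica.my.cnf"),  # Toolforge credentials
        charset="utf8mb4",
    )

    # Assumed schema: one row per extracted sentence, with its article,
    # section and classifier score. Adjust names to match the real tables.
    query = """
        SELECT sentence, section, article_title, score
        FROM statements
        WHERE score > 0.7
        ORDER BY score DESC
        LIMIT 10
    """

    with conn.cursor() as cur:
        cur.execute(query)
        for sentence, section, title, score in cur.fetchall():
            print(f"{score:.2f}  [{title} / {section}]  {sentence[:80]}")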
You can find more information in our Wiki Workshop submission https://commons.wikimedia.org/wiki/File:Citation_Detective_WikiWorkshop2020.pdf and in my blog https://rollingmist.home.blog/, which documents the whole journey.
Thank you very much!
Kind regards, Aiko
This looks really cool and valuable. Thanks for your work on it, Aiko!
I took your prototype for a test run, but was a little surprised by the first three tasks it gave me: https://tools.wmflabs.org/aiko-citationhunt/en?id=90bb8e4a https://tools.wmflabs.org/aiko-citationhunt/en?id=d0f3447f https://tools.wmflabs.org/aiko-citationhunt/en?id=d49f1b38
For all three of them, the text appears to already be referenced. (For the second one, the second sentence doesn't have the reference immediately following it, so I can see why that would be a problem, but the tool highlighted the first sentence as well.) Is this a bug, or is the tool telling me that the sources used are unreliable, or am I just misunderstanding something?
Emufarmers
[Sorry for the previous message. Not sure what my email client did.]
What I originally wanted to write:
Thanks for that interesting tool! I tried it and got
https://tools.wmflabs.org/citationhunt/en?id=3712ed79
Then I found two books, from 2014 and 2018, that include entire sections of that English Wikipedia article without citation. So this might be a future case of "no reference in Wikipedia -> Wikipedia content is copied into a book or article -> someone adds that book or article as a reference to Wikipedia -> it must be true!"
Worrisome, but of course not the fault of this useful tool. :)
andre