[Feel free to blame me if you read this more than once]
To whoever may be interested,
I am delighted to announce the first beta release of *StrepHit*:
https://github.com/Wikidata/StrepHit
TL;DR: StrepHit is an intelligent reading agent that understands text and translates it into *referenced* Wikidata statements. It is an IEG project funded by the Wikimedia Foundation.
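To give a concrete picture of the target output, here is a minimal sketch of the shape of a *referenced* statement in the Wikidata JSON data model, written as a Python dict. The subject, property, value, and source URL are illustrative only, not output from an actual StrepHit run:

    # Sketch of a referenced Wikidata statement (illustrative values only).
    statement = {
        "mainsnak": {
            "snaktype": "value",
            "property": "P106",  # occupation
            "datavalue": {
                "value": {"entity-type": "item", "numeric-id": 33999},  # Q33999 = actor
                "type": "wikibase-entityid",
            },
        },
        "type": "statement",
        "references": [{
            "snaks": {
                "P854": [{  # P854 = reference URL
                    "snaktype": "value",
                    "property": "P854",
                    "datavalue": {
                        "value": "https://example.org/biography",  # hypothetical source
                        "type": "string",
                    },
                }],
            },
        }],
    }

The point is that every claim travels together with the source it was extracted from, so it can be verified.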
Key features:
- Web spiders to harvest a collection of documents (a corpus) from reliable sources
- automatic corpus analysis to identify the most meaningful verbs
- sentence and semi-structured data extraction
- training of a machine learning classifier via crowdsourcing
- *supervised and rule-based fact extraction from text* (see the toy sketch after this list)
- Natural Language Processing utilities
- parallel processing
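To make the fact-extraction step tangible, here is a toy Python sketch of the rule-based idea: match a verb pattern in a sentence and emit a candidate claim that keeps the URL it came from. StrepHit's actual extractors are considerably more sophisticated (frame-based, with a classifier trained via crowdsourcing); the pattern, property mapping, and function below are hypothetical illustrations only:

    import re

    # Toy rule: "X was born in Y" -> place of birth (P19).
    # Real StrepHit rules are frame-based; this pattern is illustrative.
    PATTERNS = [
        (re.compile(r"^(?P<subj>.+?) was born in (?P<obj>.+?)\.?$"), "P19"),
    ]

    def extract_facts(sentence, source_url):
        """Return candidate claims, each carrying the URL it came from."""
        facts = []
        for pattern, prop in PATTERNS:
            match = pattern.match(sentence)
            if match:
                facts.append({
                    "subject": match.group("subj"),
                    "property": prop,
                    "object": match.group("obj"),
                    "reference": source_url,  # keeps the claim *referenced*
                })
        return facts

    print(extract_facts("Audrey Hepburn was born in Brussels.",
                        "https://example.org/hepburn"))
    # [{'subject': 'Audrey Hepburn', 'property': 'P19',
    #   'object': 'Brussels', 'reference': 'https://example.org/hepburn'}]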
You can find all the details here:
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint
If you like it, star it on GitHub!
Best,
Marco
Hi Marco,
Where might we find some statistics on the current accuracy of the automated claim and reference extractors? I assume that information must be in there somewhere, but I had trouble finding it.
This is a very ambitious project covering a very large technical territory (which I applaud). It would be great if your results could be synthesized a bit more clearly so we can understand where the weak/strong points are and where we might be able to help improve or make use of what you have done in other domains.
-Ben
Hi Ben,
On 6/15/16 18:24, Benjamin Good wrote:
> Hi Marco,
> Where might we find some statistics on the current accuracy of the automated claim and reference extractors? I assume that information must be in there somewhere, but I had trouble finding it.
The StrepHit pipeline (codebase) is ready, but the project itself is still ongoing. We are not there yet; we will publish performance figures in the final report.
> This is a very ambitious project covering a very large technical territory (which I applaud). It would be great if your results could be synthesized a bit more clearly so we can understand where the weak/strong points are and where we might be able to help improve or make use of what you have done in other domains.
Sure, this will be done in the final report. For now, you can have a look at the midpoint report summary: https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References/Midpoint
Best,
Marco