Hello,
I am working on a project that used to be called wikimark but is now simply
called sensimark because it is more general than just Wikipedia.
Anyway, the project is as follows:
1. Gather an annotated hierarchical dataset of documents, organized into
categories and subcategories (and maybe sub-subcategories), where the leaf
nodes are documents.
2. Train an algorithm on these documents.
3. Guess the most probable labels for a new document.
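To make the three steps concrete, here is a minimal sketch in plain Python. It is not the algorithm sensimark uses: it swaps in a simple bag-of-words model with cosine similarity, and the names `train`, `guess`, `toy_dataset` as well as the tiny dataset itself are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(dataset):
    """Step 2: build one bag of words per (category, subcategory) label.

    `dataset` is an iterable of ((category, subcategory), document_text).
    """
    model = defaultdict(Counter)
    for label, text in dataset:
        model[label].update(tokenize(text))
    return dict(model)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess(model, paragraph):
    """Step 3: rank every label by similarity to the paragraph."""
    query = Counter(tokenize(paragraph))
    scores = {label: cosine(query, bag) for label, bag in model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Step 1: a made-up hierarchical dataset (category, subcategory) -> document.
toy_dataset = [
    (("Science", "Physics"), "energy mass force particle quantum"),
    (("Science", "Biology"), "cell gene species evolution organism"),
    (("Arts", "Music"), "melody rhythm orchestra symphony composer"),
]

model = train(toy_dataset)
ranking = guess(model, "The composer wrote a symphony for the orchestra.")
for (category, subcategory), score in ranking:
    print("{} / {} ~ {}".format(category, subcategory, score))
```

A real dataset would need a stronger representation than raw word counts, but the shape of the pipeline (labelled documents in, ranked labels out) stays the same.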
The algorithm works on paragraphs. The goal is to be able to have something
like the following:
out = wikimark("""Peter Hintjens wrote about the relation between
technology and culture. Without using a scientifical tone of
state-of-the-art review of the anthroposcene antropology, he gives a
fair amount of food for thought. According to Hintjens, technology is
doomed to become cheap. As matter of fact, intelligence tools will
become more and more accessible which will trigger a revolution to
rebalance forces in society.""")
for category, score in out:
    print('{} ~ {}'.format(category, score))
And the output would be:
Art ~ 0.2
Science ~ 0.8
Society ~ 0.4
That is the goal of the project, but we are not there yet.
I have read that classification in encyclopedic terms is a complex matter,
but I have settled on the Wikipedia vital articles and run experiments on
level 3, which are encouraging.
Here is the output of my program before post-processing:
$ curl https://github.com/cultureandempire/cultureandempire.github.io/blob/master/… | ./wikimark.py guess build/
similarity
+-- Technology ~ 0.09932770275501317
| +-- General ~ 0.09932770275501317
+-- Science ~ 0.09905069171042175
| +-- General ~ 0.09905069171042175
+-- Geography ~ 0.09897996204391411
| +-- Continents and regions ~ 0.09914627336640339
| +-- General ~ 0.09881365072142484
+-- Mathematics ~ 0.09897542847422805
| +-- Other ~ 0.09911568655298664
| +-- Arithmetic ~ 0.09883517039546945
+-- Society and social sciences ~ 0.09886767613461538
| +-- Social issues ~ 0.09886767613461538
+-- History ~ 0.09886377525104235
  +-- General ~ 0.09893293240770612
  +-- History by subject matter ~ 0.09884456012696491
  +-- Post-classical history ~ 0.09881383321845605
The algorithm selected the 10 most relevant subcategories out of 73.
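In the output above, each category score is consistent with the mean of its selected subcategory scores (for example, Geography is the mean of its two subcategories). The sketch below shows one way to do that selection and grouping. It is an assumption about the post-selection step, not the actual code of wikimark.py; the helper names `top_subcategories` and `build_tree` are hypothetical, and the "Arts" entry is a made-up low score added only to show the cut-off.

```python
from collections import defaultdict
from statistics import mean


def top_subcategories(scores, k):
    """Keep the k best ((category, subcategory), score) pairs."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def build_tree(selected):
    """Group selected subcategories by category.

    Each category is scored with the mean of its selected subcategories,
    and categories are sorted by that score, best first.
    """
    grouped = defaultdict(list)
    for (category, subcategory), score in selected:
        grouped[category].append((subcategory, score))
    tree = []
    for category, children in grouped.items():
        children.sort(key=lambda child: child[1], reverse=True)
        tree.append((category, mean(s for _, s in children), children))
    tree.sort(key=lambda node: node[1], reverse=True)
    return tree


# Scores copied from the output above, plus one made-up low score.
scores = {
    ("Technology", "General"): 0.09932770275501317,
    ("Geography", "Continents and regions"): 0.09914627336640339,
    ("Geography", "General"): 0.09881365072142484,
    ("Mathematics", "Other"): 0.09911568655298664,
    ("Mathematics", "Arithmetic"): 0.09883517039546945,
    ("Arts", "General"): 0.05,  # hypothetical, dropped by the top-k cut
}

tree = build_tree(top_subcategories(scores, k=5))
for category, score, children in tree:
    print("+-- {} ~ {}".format(category, score))
    for subcategory, sub_score in children:
        print("|   +-- {} ~ {}".format(subcategory, sub_score))
```

With k=10 and all 73 subcategory scores as input, the same two functions would produce a tree shaped like the one printed by the program.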
Now I need to scale this to level 5
<https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/5>, but
it is poorly organized.
Can Wikidata help with Wikipedia vital articles?
ref: https://en.wikipedia.org/wiki/Wikipedia_talk:Vital_articles#Wikidata_integr…
ref: https://github.com/amirouche/sensimark