Hello,

I am working on a project that used to be called wikimark but is now simply called sensimark, because it is more general than just Wikipedia.

Anyway, the project is as follows:

1. Gather an annotated hierarchical dataset of documents, where the inner nodes are categories and subcategories (and maybe sub-subcategories) and the leaf nodes are documents.
2. Train an algorithm on these documents.
3. Guess the most probable labels for a new document.
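
To make the three steps concrete, here is a minimal sketch in Python. It is not the actual sensimark code: it assumes a plain bag-of-words model compared with cosine similarity, and the names tokenize, train, cosine and guess, as well as the toy dataset, are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(dataset):
    """Step 2: build one word-count vector per category.

    dataset is an iterable of (category, document_text) pairs (step 1).
    """
    model = defaultdict(Counter)
    for category, text in dataset:
        model[category].update(tokenize(text))
    return dict(model)


def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if one is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess(model, paragraph):
    """Step 3: score a new paragraph against every category, best first."""
    words = Counter(tokenize(paragraph))
    scores = [(category, cosine(words, counts)) for category, counts in model.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy dataset, purely hypothetical.
dataset = [
    ("Science", "physics experiment energy atom"),
    ("Art", "painting sculpture museum canvas"),
]
model = train(dataset)
for category, score in guess(model, "an experiment about the atom"):
    print('{} ~ {}'.format(category, score))
```

A real dataset would of course need a better representation than raw word counts, but the shape of the pipeline (gather, train, guess) stays the same.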

The algorithm works on paragraphs. The goal is to be able to do something like the following:

out = wikimark("""Peter Hintjens wrote about the relation between technology and culture.
Without using the scientific tone of a state-of-the-art review of Anthropocene anthropology,
he gives a fair amount of food for thought. According to Hintjens, technology is doomed to
become cheap. As a matter of fact, intelligence tools will become more and more accessible which
will trigger a revolution to rebalance forces in society.""")

for category, score in out:
    print('{} ~ {}'.format(category, score))

And the output would be:
Art ~ 0.2
Science ~ 0.8
Society ~ 0.4

That is the goal of the project, but we are not there yet.

I read that classification in encyclopedic terms is a complex matter, but I have settled on the Wikipedia vital articles and run experiments on level 3, which are encouraging.

Here is the output of my program before post-processing:

$ curl https://github.com/cultureandempire/cultureandempire.github.io/blob/master/culture.md | ./wikimark.py guess build/
similarity
 +-- Technology ~ 0.09932770275501317
 |   +-- General ~ 0.09932770275501317
 +-- Science ~ 0.09905069171042175
 |   +-- General ~ 0.09905069171042175
 +-- Geography ~ 0.09897996204391411
 |   +-- Continents and regions ~ 0.09914627336640339
 |   +-- General ~ 0.09881365072142484
 +-- Mathematics ~ 0.09897542847422805
 |   +-- Other ~ 0.09911568655298664
 |   +-- Arithmetic ~ 0.09883517039546945
 +-- Society and social sciences ~ 0.09886767613461538
 |   +-- Social issues ~ 0.09886767613461538
 +-- History ~ 0.09886377525104235
     +-- General ~ 0.09893293240770612
     +-- History by subject matter ~ 0.09884456012696491
     +-- Post-classical history ~ 0.09881383321845605

The algorithm selected the 10 most relevant subcategories out of 73.
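
The numbers in the tree above are consistent with each category score being the mean of its selected subcategory scores (for example, Geography is the mean of its two subcategories). Assuming that rule, the selection and grouping can be sketched as below. This is only an illustration, not the code in wikimark.py; the function name top_tree and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def top_tree(scores, k=10):
    """Keep the k best subcategories and group them under their category.

    scores maps (category, subcategory) to a similarity score.
    Returns a list of (category, category_score, [(subcategory, score), ...])
    sorted by category_score, best first. The category score is assumed to
    be the mean of the subcategory scores that were kept.
    """
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    grouped = defaultdict(list)
    for (category, subcategory), score in best:
        # best is already sorted, so each list stays in descending order
        grouped[category].append((subcategory, score))
    tree = [
        (category, mean(score for _, score in subs), subs)
        for category, subs in grouped.items()
    ]
    tree.sort(key=lambda node: node[1], reverse=True)
    return tree


# Hypothetical scores, keeping the 3 best subcategories out of 4.
example = {
    ("A", "x"): 0.9,
    ("A", "y"): 0.5,
    ("B", "z"): 0.8,
    ("B", "w"): 0.1,
}
for category, score, subs in top_tree(example, k=3):
    print(' +-- {} ~ {}'.format(category, score))
    for subcategory, subscore in subs:
        print(' |   +-- {} ~ {}'.format(subcategory, subscore))
```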

Now I need to scale this to level 5, but that level is poorly organized.

Can Wikidata help organize the Wikipedia vital articles?

ref: https://en.wikipedia.org/wiki/Wikipedia_talk:Vital_articles#Wikidata_integration_to_help
ref: https://github.com/amirouche/sensimark