One of the applications of ORES [1] is determining article quality, that is, what the best assessment of an article would be at a given revision. Users in WikiProjects use ORES data to check whether articles need re-assessment, e.g. if an article is rated "Start" but is now good enough to be a "B" article.
As part of our Q4 goals, we made a dataset of article quality scores for all articles in English Wikipedia [2] (here's the link to download the dataset [3]), and we are publishing it on figshare as something you can cite [4]. We are also working on publishing monthly data so that researchers can track how article quality changes over time [5].
As a pet project of mine, I always wanted to put these data in a database so we can query it and get much more useful data, for example the quality of articles in category 'History_of_Essex' [6] [7]. The weighted sum is a measure of quality: a decimal number between 0 (really a stub) and 5 (definitely a featured article). We also have a prediction column, which is a number in this map [8]; for example, if prediction is 5, it means ORES thinks the article should be a featured article.
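To make the two columns concrete, here is a minimal sketch of how they could be derived from per-class probabilities. It assumes the weighted sum is the probability-weighted average of the class numbers, and that the classes between 0 (Stub) and 5 (FA) follow the usual English Wikipedia order; the function names and the sample probabilities are made up for illustration and are not taken from the dataset or from wikiclass.

```python
# Assumed class-to-number map: only 0 = stub and 5 = featured article are
# stated in this message; the intermediate classes are an assumption.
WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}


def weighted_sum(probabilities):
    """Probability-weighted average of the class numbers (0.0 to 5.0)."""
    return sum(WEIGHTS[cls] * p for cls, p in probabilities.items())


def prediction(probabilities):
    """Number of the single most likely class."""
    best = max(probabilities, key=probabilities.get)
    return WEIGHTS[best]


# Hypothetical probabilities for one revision (they sum to 1).
sample = {"Stub": 0.05, "Start": 0.10, "C": 0.20, "B": 0.40, "GA": 0.15, "FA": 0.10}
print(round(weighted_sum(sample), 2), prediction(sample))
```

Under these assumptions the weighted sum is finer-grained than the prediction: two articles both predicted as "B" can still be ordered by how much probability mass sits on the higher or lower classes.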
I leave more use cases to your imagination :)
I'm looking for a more permanent place to put these data; please tell me if it's useful for you.
[1] ORES is not an anti-vandalism tool; it's an infrastructure for using AI in Wikipedia.
[2] https://phabricator.wikimedia.org/T135684
[3] (117 MB) https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/wp10-s...
[4] https://phabricator.wikimedia.org/T145332
[5] https://phabricator.wikimedia.org/T145655
[6] https://quarry.wmflabs.org/query/12647
[7] https://quarry.wmflabs.org/query/12662
[8] https://github.com/wiki-ai/wikiclass/blob/3ff2f6c44c52905c7202515c5c8b525fb1...
Have fun! Amir
Thanks Amir!
I have long wanted to include wp10 in our search indices to experiment with this data as a relevance signal.
This is now possible with your dataset, and I've built a test index [1] that uses the following signals to rank results:
- incoming links
- weekly pageviews
- wp10
The weights for these signals have not been properly tuned yet, but they can be adjusted at query time with URI query parameters:
- cirrusIncLinksW: weight for a value that ranges from 0 to 1
- cirrusPageViewsW: weight for a value that ranges from 0 to 1
- cirrusWP10W: weight for a value that ranges from 0 to 5
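As a rough illustration, a search URL carrying these three weights could be assembled as follows. The parameter names and the host come from this message; the helper name, the weight values, and the assumption that the weights are passed alongside the usual `search` parameter of index.php are mine.

```python
from urllib.parse import urlencode

# Test index host mentioned in this thread.
BASE_URL = "http://en-wp-bm25-wp10-relforge.wmflabs.org/w/index.php"


def build_search_url(query, inc_links_w, page_views_w, wp10_w):
    """Hypothetical helper: append the three signal weights to a search URL."""
    params = {
        "search": query,
        "cirrusIncLinksW": inc_links_w,
        "cirrusPageViewsW": page_views_w,
        "cirrusWP10W": wp10_w,
    }
    return BASE_URL + "?" + urlencode(params)


# Illustrative weights only: rank by wp10 alone within one category.
url = build_search_url("incategory:History_of_Essex", 0.0, 0.0, 1.0)
print(url)
```

Since the weights are untuned, varying one weight at a time while holding the others at zero is probably the easiest way to see what each signal contributes.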
Examples:
- articles in category 'History_of_Essex' sorted by WP10, best first [2]
- articles in category 'History_of_Essex' sorted by WP10, worst first [3]
I'd love to make this data available in a more convenient way, with query keywords like wp10:0, and then allow playing with other signals such as pageviews.
Concerning internal search ranking, we will soon evaluate how wp10 compares with the existing signals (incoming links and pageviews), and I'd like to use it as a replacement for the naive scoring method we use for autocomplete searches.
Well... everything is at an early stage, but I believe we can do very interesting things with wp10 and search. I still don't know exactly what, or how :)
Thanks!
[1] http://en-wp-bm25-wp10-relforge.wmflabs.org/wiki/Special:Search
[2] http://en-wp-bm25-wp10-relforge.wmflabs.org/w/index.php?search=incategory%3A...
[3] http://en-wp-bm25-wp10-relforge.wmflabs.org/w/index.php?search=incategory%3A...
On 21/09/2016 at 11:11, Amir Ladsgroup wrote: