Bumping this thread - has anyone made progress on this, for example to determine the percentage of enwiki articles that contain one of these standard sections?
(I'm also curious how Danny B - BCCed - generates the lists at https://cs.wiktionary.org/wiki/User:Danny_B./Datamining/Nadpisy that Petr mentioned earlier in this thread.)
On Wed, Jul 22, 2015 at 8:51 AM, Aaron Halfaker ahalfaker@wikimedia.org wrote:
If I were going to do this analysis[1], I'd use mediawiki-utilities to build an XML reader script that uses mwparserfromhell to parse a random sample of articles (1/1000 or so) and extract all headers by level, producing a dataset with the fields <page_id>, <header_level>, <heading>, <normal_header>.
I'd do some simple normalization: convert to lower case, remove punctuation, and reduce all contiguous whitespace to a single space character between "words". Then I'd run an aggregation over that dataset to get your answer.
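For anyone picking this up, here is a minimal sketch of the normalize-then-aggregate step described above. It is not the script proposed in this message: to stay standard-library only it swaps in a regular expression for mwparserfromhell, and it takes already-loaded wikitext instead of reading a dump. The names `heading_rows`, `normalize_heading`, and `article_share` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a heading is a line of the form "== Title ==" with 1-6
# matching "=" on each side. A real wikitext parser handles more cases.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)
PUNCT_RE = re.compile(r"[^\w\s]")


def normalize_heading(heading):
    """Lower-case, drop punctuation, collapse whitespace runs to one space."""
    no_punct = PUNCT_RE.sub("", heading.lower())
    return " ".join(no_punct.split())


def heading_rows(page_id, wikitext):
    """Return (page_id, header_level, heading, normal_header) tuples."""
    rows = []
    for match in HEADING_RE.finditer(wikitext):
        level = len(match.group(1))
        heading = match.group(2)
        rows.append((page_id, level, heading, normalize_heading(heading)))
    return rows


def article_share(pages):
    """Fraction of pages that contain each normalized heading at least once.

    `pages` maps page_id -> wikitext (a hypothetical in-memory sample).
    """
    counts = Counter()
    for page_id, wikitext in pages.items():
        # Count each heading once per page so repeats do not inflate it.
        counts.update({row[3] for row in heading_rows(page_id, wikitext)})
    total = len(pages)
    return {heading: n / total for heading, n in counts.items()}
```

Counting each normalized heading once per page is a deliberate choice here, since the question in this thread is about the percentage of articles containing a section, not the raw number of occurrences.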
If anyone wants to pick this up, I'm happy to advise.
[1] Which I might, but I'm unlikely to find time soon.
-Aaron
On Mon, Jul 13, 2015 at 4:39 PM, Jonathan Morgan jmorgan@wikimedia.org wrote:
You can get section titles (and hierarchy) directly from the API, though I don't know if this approach scales the way you need it to: https://en.wikipedia.org/w/api.php?action=parse&page=Albania&prop=se...
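A rough sketch of how the API route could look in Python. The link above is truncated, so `prop=sections` and `format=json` below are assumptions about the request, not a reconstruction of it; the helper names are made up, and the sketch only builds the URL and reads a response body that has already been fetched.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def sections_url(page):
    """Build an action=parse request URL for one page.

    Assumption: prop=sections and format=json are the parameters wanted.
    """
    params = {
        "action": "parse",
        "page": page,
        "prop": "sections",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def section_titles(response_text):
    """Return (level, title) pairs from a JSON response body.

    Each entry in parse.sections carries the heading text in "line" and
    the heading depth in "level". The "line" value can contain HTML
    markup, which this sketch does not strip.
    """
    data = json.loads(response_text)
    return [(int(s["level"]), s["line"]) for s in data["parse"]["sections"]]
```

This needs one request per page, so it is probably only practical for a sample and not for a whole wiki.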
On Mon, Jul 13, 2015 at 1:52 PM, Amir E. Aharoni amir.aharoni@mail.huji.ac.il wrote:
Yes, that's the idea more or less, but I'm not sure that our search engine is able to search for headings, though I might be wrong. I suspect, however, that it will be necessary to process dumps article by article (or at least a random sample), and in big projects this could be extremely time-consuming. But maybe there's a faster way of which I am not aware?
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
2015-07-13 23:41 GMT+03:00 Pine W wiki.pine@gmail.com:
Would it be possible to run a search on the full text of Wikipedias for lines that start and end with "==", "===", "====", and lines that start with ";", then make a list of those strings, and count the number of times that each title appears in the list?
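The line-matching idea above can be sketched in Python as follows, assuming the text is already available as an iterable of lines (for example, streamed from a dump); this is not a search-engine query. The function names are made up, and the pattern accepts heading levels 2 through 6, which is slightly wider than the three levels listed in the question.

```python
import re
from collections import Counter

# Lines that start and end with the same run of "=" (levels 2-6).
# Assumption: levels 5 and 6 are included so that "=====Foo=====" is not
# misread as a level-4 heading titled "=Foo=".
EQUALS_LINE = re.compile(r"^(={2,6})(.*?)\1[ \t]*$")


def title_from_line(line):
    """Return the title text if the line looks like a heading, else None."""
    line = line.rstrip("\n")
    match = EQUALS_LINE.match(line)
    if match:
        return match.group(2).strip() or None
    if line.startswith(";"):
        # ";" lines are definition-list terms that are often used as
        # pseudo-headings.
        return line[1:].strip() or None
    return None


def count_titles(lines):
    """Count how many times each title string appears in the lines."""
    counts = Counter()
    for line in lines:
        title = title_from_line(line)
        if title is not None:
            counts[title] += 1
    return counts
```

`count_titles(lines).most_common(20)` would then give the twenty most frequent titles in whatever text was fed in.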
Pine
On Jul 13, 2015 10:29 AM, "Jonathan Morgan" jmorgan@wikimedia.org wrote:
Cross-posting this request to wiki-research-l. Anyone have data on frequently used section titles in articles (any language), or know of datasets/publications that examined this?
I'm not aware of any off the top of my head, Amir.
- Jonathan
---------- Forwarded message ----------
From: Amir E. Aharoni amir.aharoni@mail.huji.ac.il
Date: Sat, Jul 11, 2015 at 3:29 AM
Subject: [Wikitech-l] statistics about frequent section titles
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Hi,
Did anybody ever try to collect statistics about frequent section titles in Wikimedia projects?
For Wikipedia, for example, titles such as "Biography", "Early life", "Bibliography", "External links", "References", "History", etc., appear in a lot of articles, and their counterparts appear in a lot of languages.
There are probably similar things in Wikivoyage, Wiktionary and possibly other projects.
Did anybody ever try to collect statistics of the most frequent section titles in each language and project?
-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי http://aharoni.wordpress.com “We're living in pieces, I want to live in peace.” – T. Moore
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
-- Jonathan T. Morgan Senior Design Researcher Wikimedia Foundation User:Jmorgan (WMF)
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l