Hoi Dario,
You ask if I want to help. <grin> I do, and I have things to give and things to ask, so let us do a bit of both for best effect </grin>
On research data. Much of the research data has equivalent information in Wikidata. When you research gender diversity, for instance, articles are identified as being about a "human" and as having a "sex, gender". Where Wikidata does NOT have that information, it should be updated as a matter of principle. The reason is that such an update in Wikidata makes the gender information available, through the "inter language links", for the other languages as well. This enables the same analysis, to some extent, for those other languages.
When you need to query Wikidata: WDQ, the tool that has been querying Wikidata for many, many months, was updated today and it now allows you to query on the "qualifiers" as well [1]. This is why there is an argument to be made for using Wikidata exclusively for data analysis and research.
In the previous research newsletter, the research on Wikidata and the interwiki links between the English and Portuguese Wikipedias was largely dismissed because "Wikidata had changed the game". Wikidata does not change the game when you compare only two languages. What I think I observe in Wikidata is that there are fewer people working on inter language links, not more. I also notice that the number of Wikipedia articles without an item in Wikidata is growing. We have had bots run on the Indian Wikipedias to add items and they took surprisingly long to run.
When you consider "gender diversity", for instance, as a subject for research, what I observe is that the same research is repeated again and again. For me it hardly qualifies as relevant; with WDQ you can have up-to-date information whenever you want it. It starts to qualify for me when it states that the baseline had a given percentage and number of males/females, and pairs this with a later moment at which the percentage and the number of males/females identified have changed.
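To make this concrete, here is a minimal Python sketch of such a baseline comparison. The counts are made-up placeholders, not real Wikidata numbers, and the names `Snapshot` and `compare` are only there for illustration; in practice the counts would come from a query run at each of the two moments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Counts of items about humans with a known gender at one moment."""
    label: str
    male: int
    female: int

    @property
    def total(self) -> int:
        return self.male + self.female

    @property
    def female_share(self) -> float:
        # Percentage of the identified humans that are recorded as female.
        return 100.0 * self.female / self.total if self.total else 0.0


def compare(baseline: Snapshot, later: Snapshot) -> dict:
    """Report how the counts and the percentage moved between two snapshots."""
    return {
        "male_added": later.male - baseline.male,
        "female_added": later.female - baseline.female,
        "share_change": later.female_share - baseline.female_share,
    }


# Placeholder numbers, purely illustrative (an assumption, not measured data).
baseline = Snapshot("baseline", male=800, female=200)
later = Snapshot("after updates", male=900, female=300)
result = compare(baseline, later)
print(result)
```

The point of reporting both the counts and the percentage is that a percentage can move simply because more items were identified, so either number on its own says little.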
When you want to research a specific language, any language at that, all its articles need to be represented by an item. It is best Wikidata practice anyway. The way to work is then to first set the baseline, get the numbers that are relevant to the research, and then do the analysis on the raw data (i.e. Wikipedia). This results in updates in Wikidata, and it allows the same queries to be run again to understand what the numbers mean. Yes, I do understand that you make use of subsets of data to do research. It just happens that WDQ uses its own database, which gets updated from Wikidata. It would be totally unreasonable to think that this database cannot be manipulated. You can also have your own instance of this database and have WDQ run on that (you will be the first one to actually try this, but hey, this is research). So yes, you can preserve your dataset, and yes, you can compare it to what happens in the wild (i.e. outside of the chosen subset as well).
When you research the smaller languages, their needs and their coverage, you have to appreciate that English cannot be the yardstick to measure by. The rest of the world uses meters, and en.wp does not even cover 50% of the subjects that are known to Wikidata. The WMF does know what people search for and do not find. That is to say, the numbers exist but are not available for analysis. When you rank them, you learn what people are looking for. Making Wikidata items out of them is the quickest way to provide initial information in that language on that subject when "Wikidata search" is enabled on a Wikipedia.
Dario, this is actionable information that we do not have. Research that leads to actionable results is imho the most relevant research.
As to studying things to death: given that en.wp is what research is about, the numbers are only relevant to the extent that en.wp is relevant. My point is very much that its relevance is decreasing in favour of all the other languages. The consequence is that investments that are en.wp centred do not have the expected effect elsewhere. Investments in other languages, cultures and countries are likely to have a bigger return, particularly when the investments and the research are about stimulating growth.
Thanks,
GerardM