On Wed, Jun 10, 2015 at 9:48 AM, Luca Bartek <luca.x.bartek(a)gsk.com> wrote:
> I work for GlaxoSmithKline, a pharmaceutical company, and am currently
> preparing for a project we are planning to do with the use of Wikipedia.
Specifically the English-language Wikipedia, I presume? Or are you going to
check other languages as well?
> We will use our computational tools to assess the accuracy of the
> drug-related information that can be found on Wikipedia. We will use the
> open database Open PHACTS as a “gold standard” and compare the information
> on Wikipedia to this.
This reminds me of a report
<https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-05-27/Recent_research#German_study_finds_Wikipedia.27s_pharma_articles_accurate_and_largely_complete>
on a similar study I heard about recently.
> On the drug pages of Wikipedia there is normally a data table on the
> right side with the drug/chemical/pharma related information. Our plan is,
> if this is possible to carry out, to assess the accuracy of this
> information and, if necessary, correct/update it from our database. If
> time constraints allow, I would like to also automatically write some
> very basic articles on drugs which currently do not have an entry on
> Wikipedia.
Personally, as a volunteer editor on the English Wikipedia, I like the idea
of verifying and correcting our information with reference to reliable
sources!
Do be aware of the editing policies if you perform any edits, though. For
the English Wikipedia, I recommend:
- Talk to the people at WikiProject Pharmacology
<https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Pharmacology>! That
would probably be the best place to connect with the people who are already
editing the articles you're interested in, and they can help you avoid
various pitfalls.
   - Don't try to make a "company" account; do the edits as an individual
person. See this part of the username policy
<https://en.wikipedia.org/wiki/Wikipedia:Username_policy#Usernames_implying_shared_use>
in particular for details.
- Review the policy on financial conflicts of interest
<https://en.wikipedia.org/wiki/Wikipedia:Conflict_of_interest#Financial_conflict_of_interest>
and follow it as closely as possible. Expect close scrutiny, as Wikipedia
editors are very wary of corporations trying to use Wikipedia as a vehicle
for advertisement or other PR purposes.
- If you're planning on having a program make these edits in an
automated manner (i.e. without human review of each one to ensure it's
correct and properly formatted), review the bot policy
<https://en.wikipedia.org/wiki/Wikipedia:Bot_policy>.
- Automatic article creation in particular can be problematic as it can
strain the capacity of our editors. Again, the people at WikiProject
Pharmacology
<https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Pharmacology>
   should be able to help; work with them to ensure the to-be-created articles
   will be properly formatted and useful, and have a plan for reviewing them.
For other-language Wikipedias you'd want to look for similar things, but I
don't know them well enough to give you links or to advise you on how their
policies might differ.
> My questions are the following: How do I obtain an API key? On the API
> home page I saw that I may need a special key if I would like to run so
> many queries.
There aren't any special API keys for the usual API accessed via api.php.
Could you reply with a link to the page that recommended one?
The closest thing I can think of is that certain limits such as the number
of pages that can be retrieved in a single request can be raised by using a
logged-in account that has been granted the "bot flag". But this doesn't
give any additional access; it just allows for fetching information using
fewer requests.
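
For instance, here's a minimal sketch of a batched query in Python (using
the requests library; without the bot flag, action=query accepts at most 50
titles per request, bot-flagged accounts get 500):

    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def fetch_page_info(titles):
        # Look up basic page info for a batch of titles in one request.
        # Without the bot flag, up to 50 titles may be joined per call.
        params = {
            "action": "query",
            "titles": "|".join(titles),
            "format": "json",
            "formatversion": "2",
        }
        r = requests.get(API_URL, params=params, timeout=30)
        r.raise_for_status()
        return r.json()["query"]["pages"]

    print(fetch_page_info(["Aspirin", "Ibuprofen", "Paracetamol"]))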
> What are the limitations?
In general, follow the advice in
https://www.mediawiki.org/wiki/API:Etiquette and you should be good.
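
In practice that mostly means identifying yourself and backing off when
asked. A minimal sketch, with a placeholder User-Agent you'd replace with
your own project name and contact address:

    import time
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"
    # Placeholder identifier; substitute your own project and contact info.
    HEADERS = {"User-Agent": "DrugDataCheck/0.1 (your-email@example.com)"}

    def polite_get(params):
        # Pass maxlag so the servers can tell us to back off when they're
        # overloaded, and retry after a pause instead of hammering them.
        params = dict(params, format="json", maxlag="5")
        while True:
            r = requests.get(API_URL, params=params, headers=HEADERS,
                             timeout=30)
            data = r.json()
            if data.get("error", {}).get("code") == "maxlag":
                time.sleep(5)
                continue
            return data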
> Is it possible to carry out the process I have described, or should I find
> a different approach?
Besides fetching the page content via the API, your other option is to
download a database dump <https://dumps.wikimedia.org/> to process articles
offline.
Do note that the API accessed via api.php is concerned with fetching the
content of the page itself; you'll likely have to write your own code to
extract the data you need from the wikitext, or adapt client code from
elsewhere (such as this Python library
<https://github.com/earwig/mwparserfromhell>) to do so.
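
For instance, a minimal sketch that fetches an article's wikitext through
api.php and reads one field out of the drug infobox. The template and
parameter names here ("Drugbox"/"Infobox drug", "CAS_number") are examples
from the English Wikipedia, so check them against the pages you actually
process:

    import requests
    import mwparserfromhell

    API_URL = "https://en.wikipedia.org/w/api.php"

    def get_wikitext(title):
        # Fetch the current wikitext of one article via api.php.
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
            "formatversion": "2",
        }
        r = requests.get(API_URL, params=params, timeout=30)
        r.raise_for_status()
        page = r.json()["query"]["pages"][0]
        return page["revisions"][0]["content"]

    wikicode = mwparserfromhell.parse(get_wikitext("Aspirin"))
    for template in wikicode.filter_templates():
        # Articles may use either the old or the new infobox name.
        if (template.name.matches("Drugbox")
                or template.name.matches("Infobox drug")):
            if template.has("CAS_number"):
                value = str(template.get("CAS_number").value).strip()
                print("CAS number:", value)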
--
Brad Jorsch (Anomie)
Software Engineer
Wikimedia Foundation