On Tue, Jul 20, 2010 at 11:56 AM, Jodi Schneider jodi.schneider@deri.org wrote:
Hi Brian,
On 20 Jul 2010, at 18:02, Brian J Mingus wrote:
On Mon, Jul 19, 2010 at 4:06 PM, Finn Aarup Nielsen fn@imm.dtu.dk wrote:
Hi Brian and others,
I also think bibliographic support would be interesting, for two-way citation tracking and commenting on articles (for example). Beyond that, science articles in particular often contain data worth structuring and putting in a database or a structured wiki, so that we can extract it for meta-analysis and specialized information retrieval. That is what I do in the Brede Wiki, where I use templates to store such data. So if a system such as yours is implemented, we should think of it not just as a bibliographic database but in broader terms: a data wiki.
Although the technology required to make a WikiCite happen will be applicable to a more generalized wiki for storing data I think that is too broad for the current proposal. A WMF analogue to Google Base is an entirely new beast that has its own requirements. I certainly think it's an interesting and worthwhile idea, but I don't feel that we are there yet.
As the 'key' (the wiki page title) I use the (lowercase) title of the article. That might be more reader friendly - but usually longer. I think that KangHsuKrajbichEtAl09 is too camel-cased. Neither the title nor author list + year will be unique, so we need some predictable disambig.
I noticed that AcaWiki is using the title, but I am personally not a fan of it. The motivation for using a key comes from BibTeX. When you cite an entry in a publication in LaTeX, you type \cite{key}. Also, I think most bibliographic formats support such a key. The idea is that there is a universal token that you can type into Google that will lead you to the right item. The predictable disambig is in the format I sent out (which likely needs modification for other kinds of sources). The format is Author1Author2Author3EtAlYYb. Here is a real world example from a pair of very prolific scientists, Deco & Rolls, who published at least three papers together in 2005. In our lab we have really come to love these keys - they are very memorable tokens that you can verbally pass on to other scientists in the midst of a discussion. Eventually, if they enter the key you have given them into Google, they will get the right entry at "WikiCite".
DecoRolls05 - Synaptic and spiking dynamics underlying reward reversal in the orbitofrontal cortex.
DecoRolls05b - Sequential memory: a putative neural and synaptic dynamical mechanism.
DecoRolls05c - Attention, short-term memory, and action selection: a unifying theory.
Citation keys of this sort work, but they have to be decided on by some external system. Who decides which paper is -, b, and c? Publication order would be one way to do it -- but that's complicated, especially with online first publication, or overlapping conferences.
I think whether they're memorable tokens might vary by person... Sure, the author and year will be identifiable, even memorable. But the a, b, c?
If you want to support more than recent works, I'd urge YYYY instead of YY. Then we only have an issue for pre-0 stuff. :)
Also consider differentiating authors from title and year, perhaps with slashes. author1-author2-author3-etal/YYYY/b I'm not convinced that -'s are better than capital letters (author last names can have both)...
The key seems to be a very important point, so it's important that we get it right. My thinking is guided by several constraints. First, I strongly dislike the numeric keys used at sites such as CiteULike and most database sites (such as 7523225). To the greatest degree possible I believe the key should actually convey what is behind the link. On the other hand, the key should not be too long. Numeric keys maximize shortness while telling you nothing, whereas titles as keys are very long and don't give you some of the most important information - the authors and the year it was published. The key format I have suggested does seem to have a flaw: it easily becomes ambiguous, and you must then resort to a token that is not easily memorable. Then again, even though many authors and sets of authors will publish multiple items in a year, the vast majority of works have a unique set of authors for a given year.
I like your suggestion that the abc disambiguator be chosen based on the first date of publication, and I also like the prospect of using slashes since they can't be contained in names. Using the full year is a good idea too. We can combine these to come up with a key that, in principle, is guaranteed to be unique. This key would contain:
1) The first three author names separated by slashes
2) If there are more than three authors, an EtAl
3) Some or all of the date. For instance, if there is only one source by this set of authors that year, we can just use YYYY. However, once another source by that set of authors is added, the key should change to MMDDYYYY or similar. If there are multiple publications on the same day, we can resort to abc. Redirects and disambiguation pages can be set up when a key changes.
Since the slashes are somewhat cumbersome, perhaps we need not make them mandatory, but instead use them only when they are necessary in order to "escape" a name. When none of the authors' names needs escaping - the dominant case - we can stick to the easily legible and nicely compact CamelCase format.
Example keys generated by this algorithm:
KangHsuKrajbichEtAl2009
Author1Author2/Author-Three/2009
Author1Author2AuthorThree10032009
Author1Author2AuthorThree12312009
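The rules above could be sketched in Python roughly as follows. This is only one possible reading of the proposal, not an agreed implementation: the function name `make_key`, its parameters, and the rule that a name is wrapped in slashes whenever it is not purely alphanumeric are all assumptions made for illustration.

```python
from datetime import date

def make_key(surnames, published, year_is_unique=True, same_day_rank=0):
    """Build a citation key from author surnames and a publication date.

    Hypothetical helper: the escaping rule and parameter names are
    assumptions, not part of the proposal itself.
    """
    parts = []
    for name in surnames[:3]:
        # Assumption: a name needs "escaping" with slashes only when it
        # contains something other than letters and digits (e.g. a hyphen).
        parts.append(name if name.isalnum() else f"/{name}/")
    key = "".join(parts)
    if len(surnames) > 3:
        key += "EtAl"
    # YYYY while the author set is unique for that year, else MMDDYYYY.
    key += str(published.year) if year_is_unique else published.strftime("%m%d%Y")
    # Same-day collisions fall back to b, c, ... (the first entry gets no letter).
    if same_day_rank > 0:
        key += "abcdefghijklmnopqrstuvwxyz"[same_day_rank]
    return key
```

Called with `["Author1", "Author2", "Author-Three"]` and any 2009 date, this yields the second example key above; passing `year_is_unique=False` with a date of 3 October 2009 yields the third.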
I have one field to each author so that I can automatically link authors.
This is accomplished via Semantic Forms, using the arraymap parser function. You just provide a comma-separated list of authors, and they each get semantic property definitions and deep linking to all papers published by that author.
Sure -- unless authors have the same name, or use different forms of the name.
One of my coauthors goes by John G. Breslin for disambiguation since his name is common -- but on the institute website he's credited as John Breslin, since that's the only name the system recognizes.
In other words, some authority control will be needed. Libraries have a long history with this. Groups of booklovers do it, too. For instance, here's the LibraryThing page for John Smith: http://www.librarything.com/author/smithjohn Notice that you can split and join authors -- LibraryThing's way of giving users the ability to join and separate. Or see http://www.librarything.com/author/carrolllewis Sometimes there are difficult questions -- such as "Is Lewis Carroll the same as Charles Dodgson?" - which depends on what you mean by "same".
For the scope of the potential problem, look at highly published authors -- for instance the "alternative names" list for Dante: http://www.worldcat.org/identities/lccn-n78-95495
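To make the "split and join" idea concrete, here is a toy Python sketch of an authority file that maps name variants onto one canonical heading. It is not how LibraryThing or any library system actually works; the class `AuthorityFile`, its methods, and the normalization rule are hypothetical and chosen only for illustration.

```python
def normalize(name):
    """Lowercase, drop periods, and collapse whitespace (an assumed rule)."""
    return " ".join(name.replace(".", " ").lower().split())

class AuthorityFile:
    """Toy authority control: variants point at one canonical heading."""

    def __init__(self):
        self._canonical = {}  # normalized variant -> canonical heading

    def join(self, canonical, *variants):
        # Record the canonical form and every variant under one heading.
        for name in (canonical, *variants):
            self._canonical[normalize(name)] = canonical

    def split(self, variant):
        # Detach a variant that turned out to be a different person.
        self._canonical.pop(normalize(variant), None)

    def resolve(self, name):
        # Unknown names resolve to themselves.
        return self._canonical.get(normalize(name), name)
```

After `join("John G. Breslin", "John Breslin")`, both spellings resolve to the same heading; a later `split("John Breslin")` undoes that if the shorter form turns out to belong to someone else.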
LibraryThing is a great example of how to do disambiguation. We can only hope that we can likewise someday have a user community as pedantic and dedicated as theirs ;-) A big part of their success is in providing their users with straightforward tools for doing the disambig work.
I do not include abstracts in my CC-by-sa'ed wiki, since I am not sure how publishers regard the copyright for abstracts. Nor am I sure about the forward cites. Most commercial publishers hide the cites from non-paying viewers. Including cites in CC-by-sa material on a large scale may infringe publishers' copyright. Perhaps it is possible to negotiate with some publishers. We need to talk with 'closed access' publishers before we add such data.
Yes, I have added many nice features to WikiPapers that can unfortunately not make it into the proposed WMF project. Some can, some can't. For example, adding papers to the wiki is via a one click bookmarklet. First, you highlight the title of a paper anywhere on the web, be it a webpage, e-mail, or journal site. Then, you click your "Add to wiki" bookmarklet. On my webserver I am running the citation scraping software from Connotea, CiteULike, and Zotero. I also have a Google Scholar scraper and PubMed importer. You can choose to use one of those sources, or you can choose to merge all of the metadata together. It's automatically added to the wiki for you. Additionally, I have written a bash script that is very adept at getting the pdfs from journals, so it automatically tries to download the pdf and upload it to the wiki for you. I have also implemented the ability to compute the articles that an article cites, and vice versa. With respect to abstracts these scrapers aren't that great. Abstracts usually come from PubMed, whose database you can license, but you cannot change their metadata IIRC.
Ultimately, I think the community will have to take a very careful look at what data can be added to the wiki and design policies accordingly. On Wikipedia I believe copyright enforcement has largely been up to the community, and it takes a long time to converge on appropriate policies. Needless to say, many of the technologies I described in the last paragraph would not be found legal on a public wiki.
I am not sure what 'owner' is in your format. Surely you can't have owners in a Wikimedia/MediaWiki wiki? And 'dateadded' would already be recorded in the revision history.
The 'owner' field is a misnomer, but in lieu of mysql support it lets you know which individuals have that entry in their personal bibliographies. dateadded is needed due to what at least used to be a bug in Semantic MediaWiki.
We probably need to check the final format of the bibliographic template to make sure it is easily translatable to the most common bibliographic formats: bibtex, refman, Z3988 microformat, pubmed, etc.
I have written extensive amounts of Python interchange code between wiki template syntax and BibTeX. I chose BibTeX because it is rather standard, our lab uses it, and it is very similar to template syntax. Also, I use Bibutils to convert from BibTeX to most popular formats, and vice versa for mass import of bibliographies: http://www.scripps.edu/~cdputnam/software/bibutils/
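As a rough illustration of why template syntax and BibTeX map onto each other so easily, here is a minimal Python sketch. It is not the WikiPapers interchange code; the template name `Paper`, the field names, and both helper functions are hypothetical, and it assumes a flat template with no nested templates or pipes inside values.

```python
def parse_template(text):
    """Parse a flat {{Name|field=value|...}} template into (name, fields)."""
    inner = text.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("not a template")
    # Assumption: no nested templates and no '|' inside field values.
    name, *params = [p.strip() for p in inner[2:-2].split("|")]
    fields = {}
    for param in params:
        field, sep, value = param.partition("=")
        if sep:
            fields[field.strip()] = value.strip()
    return name, fields

def to_bibtex(fields, entry_type="article"):
    """Emit a BibTeX entry; the 'key' field becomes the citation key."""
    body = ",\n".join(
        f"  {field} = {{{value}}}"
        for field, value in fields.items()
        if field != "key"
    )
    return f"@{entry_type}{{{fields['key']},\n{body}\n}}"

wikitext = "{{Paper|key=DecoRolls05|author=Deco and Rolls|year=2005}}"
name, fields = parse_template(wikitext)
print(to_bibtex(fields))
```

Going the other way (BibTeX to template) is the same mapping in reverse, which is part of what makes BibTeX a convenient pivot format.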
BibTeX is good for backwards compatibility, but I'd urge a richer data format -- probably based on bibo RDF: http://bibliontology.com/ It's already widely used: http://bibliontology.com/projects
It was probably a mistake for me to describe WikiPapers as designed around BibTeX. In fact, it's designed around mediawiki templates. From templates as your start, you can support any other format for both import and export.
As I understand it, there are issues with Semantic MediaWiki with respect to performance and security that need to be resolved before a large-scale deployment within Wikimedia Foundation projects. I heard that Markus Krötzsch is going to Oxford to work on core SMW, so there might be some changes to SMW in the future. A code audit of SMW is still lacking.
As I was writing a custom Lucene search engine for WikiPapers I realized that it is a perfect replacement for Semantic MediaWiki. Lucene has fields, it supports boolean operators and you can format its output. All that is needed is to write the Lucene backend (perhaps just modifying MWLucene) and write a parser function that supports using templates for formatting of the output of queries. Lucene is extremely fast and can scale to whatever we can imagine doing. That's my proposed plan.
It is not 'necessarily necessary' to make a new Wikimedia project. There has been a suggestion (in the meta or strategy wiki) just to use a namespace in Wikipedia. You could then have a page called http://en.wikipedia.org/wiki/Bib:The_wick_in_the_candle_of_learning
I believe it is necessary. First, the idea is for any mediawiki anywhere (and any software with appropriate extensions) to be able to cite the same source. Secondly, the project would be multilingual.
I think somebody's mentioned OpenLibrary on this thread. In case not: http://openlibrary.org/ Its scope is limited to books, but their interests are similar.
-Jodi
Cheers,
Brian Mingus
Graduate Student
Computational Cognitive Neuroscience Lab
University of Colorado at Boulder
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Hi guys! I'm glad my little post helped re-start such a productive conversation.
Since some people are replying only to the research-l list and some to both research-l and foundation-l (my fault for cc'ing both) maybe we should centralize this discussion (at least of the nitty gritty metadata issues) on the research list for now? thread here: http://lists.wikimedia.org/pipermail/wiki-research-l/2010-July/thread.html
Of course the perennial issue of how to propose a new WMF project is very much a foundation-l topic.
regards, phoebe
On Tue, Jul 20, 2010 at 9:26 PM, Brian J Mingus Brian.Mingus@colorado.edu wrote:
I like your suggestion that the abc disambiguator be chosen based on the first date of publication, and I also like the prospect of using slashes since they can't be contained in names. Using the full year is a good idea too. We can combine these to come up with a key that, in principle, is guaranteed to be unique. This key would contain:
- The first three author names separated by slashes
why not separate by pluses? they don't form part of names either, and don't cause problems with wiki page titles.
- If there are more than three authors, an EtAl
don't think that's necessary if we get the abc part right.
- Some or all of the date. For instance, if there is only one source by this set of authors that year, we can just use YYYY. However, once another source by that set of authors is added, the key should change to MMDDYYYY or similar.
I don't think it is a good idea to change one key as a function of updates on another, except for a generic disambiguation tag.
If there are multiple publications on the same day, we can resort to abc. Redirects and disambiguation pages can be set up when a key changes.
As Jodi pointed out already, the exact date is often not clearly identifiable, so I would go simply for the year. Instead of an alphabetic abc, one could use some function of the article title (e.g. the first three words thereof, or the initials of the first three words), always in lower case.
An even less ambiguous abc would be starting page (for printed stuff) or article number (for online only) but this brings us back to the 7523225 problem you mentioned above.
Since the slashes are somewhat cumbersome, perhaps we need not make them mandatory, but instead use them only when they are necessary in order to "escape" a name. When none of the authors' names needs escaping - the dominant case - we can stick to the easily legible and nicely compact CamelCase format.
Example keys generated by this algorithm:
KangHsuKrajbichEtAl2009
Kang+Hsu+Krajbich+2009+the+wick+in or Kang+Hsu+Krajbich+2009+twi
Also note that the CamelCase key does not yield results in a Google search, whereas the first plussed variant brings up the right work, and the plussed variant with the initialed title tends to bring up at least something written by or cited from these authors.
Author1Author2/Author-Three/2009
Author1+Author2+Author-Three+2009+just+another+article or Author1+Author2+Author-Three+2009+jat
Of course, it does not have to be _exactly_ three authors, nor three words from the title, and it does not solve the John Smith (or Zheng Wang) problem.
Daniel