Thank you! I am sure that this will help the Wikidata team to make the right decision. Also, very interesting numbers.
One stupid question: given the length of these identifiers, and since they are not simple opaque identifiers but rather encode semantics (if I understand it correctly), could a single such identifier encode content or ideas that are potentially covered by copyright or patent law? Is there some background available on that?
On Fri, Sep 23, 2016 at 3:27 AM Egon Willighagen egon.willighagen@gmail.com wrote:
Sebastian, great you found time for it! I didn't :/ (Stats are worth a tweet, IMHO :)
Egon
On Fri, Sep 23, 2016 at 12:20 PM, Sebastian Burgstaller <sebastian.burgstaller@gmail.com> wrote:
Hi Denny,
Sorry, I missed this email. I just did the calculation for InChI string lengths on the 92 million PubChem compounds:

  percentile   99%   99.9%   100%
  length       311   676     4502
That said, there is no upper limit for the length, but 4502 is the longest string in the PubChem database. The other IDs, canonical and isomeric SMILES, have the same distribution shape but are slightly shorter overall.
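For anyone who wants to repeat this on another dataset, here is a minimal sketch of how such length percentiles could be computed. It is not the script that produced the numbers above: the function names `length_percentile` and `summarize` are hypothetical, the nearest-rank percentile definition is an assumption, and the sample strings are placeholders rather than real InChIs.

```python
import math
from fractions import Fraction


def length_percentile(lengths, pct):
    """Nearest-rank percentile: smallest length covering pct% of the strings."""
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths)
    # Fraction(str(pct)) keeps the rank arithmetic exact (avoids float rounding).
    rank = max(1, math.ceil(Fraction(str(pct)) * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(strings, pcts=(99, 99.9, 100)):
    """Map each requested percentile to the string length that covers it."""
    lengths = [len(s) for s in strings]
    return {p: length_percentile(lengths, p) for p in pcts}


if __name__ == "__main__":
    # Placeholder strings of lengths 1..1000, NOT real PubChem data.
    sample = ["x" * n for n in range(1, 1001)]
    print(summarize(sample))
```

For the full 92 million compounds one would probably stream the lengths from a file instead of holding all strings in memory, but the percentile logic stays the same.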
Best, Sebastian
On Sun, Sep 18, 2016 at 9:19 PM, Denny Vrandečić vrandecic@gmail.com wrote:
Can you figure out what a good limit would be for these two use cases? I.e., what would support 99%, 99.9%, and 100%?
On Sun, Sep 18, 2016, 12:27 Egon Willighagen <egon.willighagen@gmail.com> wrote:
Hi all,
sorry for joining the party late...
On Tue, Sep 13, 2016 at 11:39 AM, Sebastian Burgstaller sebastian.burgstaller@gmail.com wrote:
I think this topic might have been discussed many months ago. For certain data types in the chemical compound space (P233 canonical SMILES, P2017 isomeric SMILES, and P234 InChI), a character limit higher than 400 would be really helpful: 1500 to 2000 characters, although I sense that this might cause problems with SPARQL. Are there any plans to implement this? In general, for quality assurance, many string property types would benefit from a fixed maximum string length.
400 characters is not a lot for chemicals... InChIs can indeed be a lot longer. 2k would allow us to capture a lot more chemicals. BTW, this also applies to canonical SMILES, which also have no upper bound. Tannic acid (Q427956) is an example (which, looking at the InChIKey, came up when running the bot :). From working with ChEMBL as RDF I know it has InChIs of length > 1024, which was the max length in Virtuoso... I think it's important for biology and chemistry to increase the limit.
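To compare the limits mentioned in this thread (400 today, 1024 in Virtuoso, 2000 as proposed), a rough sketch like the following could report what share of a set of identifiers fits under each one. The helper names `over_limit` and `coverage` are made up for illustration, and the sample strings are placeholders, not real InChIs or SMILES.

```python
def over_limit(identifiers, limit):
    """Return the identifiers that are longer than `limit` characters."""
    return [s for s in identifiers if len(s) > limit]


def coverage(identifiers, limits=(400, 1024, 2000)):
    """Fraction of identifiers that fit within each candidate limit."""
    total = len(identifiers)
    if total == 0:
        raise ValueError("no identifiers given")
    return {
        limit: sum(len(s) <= limit for s in identifiers) / total
        for limit in limits
    }


if __name__ == "__main__":
    # Placeholder strings with lengths 100, 500, 1500 and 3000 (not real data).
    sample = ["x" * n for n in (100, 500, 1500, 3000)]
    print(coverage(sample))
    print([len(s) for s in over_limit(sample, 2000)])
```

The same check could double as the quality-assurance test Sebastian mentions: anything returned by `over_limit` would be flagged before the bot tries to write it.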
Egon
-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/EgonWillighagen
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata