Sebastian, great you found time for it! I didn't :/ (Stats are worth a tweet, IMHO :)

Egon

On Fri, Sep 23, 2016 at 12:20 PM, Sebastian Burgstaller <sebastian.burgstaller@gmail.com> wrote:
Hi Denny,
Sorry, I missed this email. I just did the calculation for InChI string
lengths on the 92 million PubChem compounds:
  percentile   99%  99.9%  100%
  length       311    676  4502

That said, there is no upper limit on the length; 4502 is just the
longest string in the PubChem database. The other IDs, canonical and
isomeric SMILES, have the same distribution shape but are slightly
shorter overall.
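
For anyone who wants to reproduce numbers like these on their own data, here is a minimal sketch of a nearest-rank percentile calculation over string lengths. It is not the script used for the PubChem run above; the function name length_percentiles and the toy input are assumptions made up for illustration.

```python
import math


def length_percentiles(strings, percentiles=(99.0, 99.9, 100.0)):
    """Return {percentile: length} using the nearest-rank method.

    Hypothetical helper: for each percentile p it reports the smallest
    length L such that at least p% of the strings are no longer than L.
    """
    lengths = sorted(len(s) for s in strings)
    if not lengths:
        raise ValueError("no strings given")
    n = len(lengths)
    result = {}
    for p in percentiles:
        # round() guards against floating-point noise before taking the ceiling
        rank = max(1, math.ceil(round(p * n / 100.0, 9)))
        result[p] = lengths[rank - 1]
    return result


if __name__ == "__main__":
    # Toy input only; a real run would stream the identifiers from a dump.
    sample = [
        "InChI=1S/CH4/h1H4",
        "InChI=1S/H2O/h1H2",
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    ]
    print(length_percentiles(sample))
```

The 100% entry is simply the longest string seen, so it describes the data set at hand rather than a guaranteed upper bound.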

Best,
Sebastian

On Sun, Sep 18, 2016 at 9:19 PM, Denny Vrandečić <vrandecic@gmail.com> wrote:
> Can you figure out what a good limit would be for these two use cases? I.e.
> what would support 99%, 99.9%, and 100%?
>
>
> On Sun, Sep 18, 2016, 12:27 Egon Willighagen <egon.willighagen@gmail.com>
> wrote:
>>
>> Hi all,
>>
>> sorry for joining the party late...
>>
>> On Tue, Sep 13, 2016 at 11:39 AM, Sebastian Burgstaller
>> <sebastian.burgstaller@gmail.com> wrote:
>> > I think this topic might have been discussed many months ago. For
>> > certain data types in the chemical compound space (P233 canonical
>> > SMILES, P2017 isomeric SMILES and P234 InChI) a higher character
>> > limit than 400 would be really helpful: 1500 to 2000 chars (though I
>> > sense that this might cause problems with SPARQL). Are there any plans
>> > for implementing this? In general, for quality assurance, many string
>> > property types would profit from a fixed max string length.
>>
>> 400 characters is not a lot for chemicals... InChIs can indeed be a
>> lot longer. 2k would allow us to capture a lot more chemicals. BTW,
>> this also applies to the canonical SMILES, which also doesn't have an
>> upper bound. Tannic acid (Q427956) is an example (which came up while
>> looking at the InChIKey when running the bot :) From working with
>> ChEMBL as RDF I know it has InChIs of length > 1024, which was the max
>> length in Virtuoso... I think it's important for biology and chemistry
>> to increase the limit.
>>
>> Egon
>>
>> --
>> E.L. Willighagen
>> Department of Bioinformatics - BiGCaT
>> Maastricht University (http://www.bigcat.unimaas.nl/)
>> Homepage: http://egonw.github.com/
>> LinkedIn: http://se.linkedin.com/in/egonw
>> Blog: http://chem-bla-ics.blogspot.com/
>> PubList: http://www.citeulike.org/user/egonw/tag/papers
>> ORCID: 0000-0001-7542-0286
>> ImpactStory: https://impactstory.org/EgonWillighagen
>>
>> _______________________________________________
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata



