Hi all,
I think this topic might have been discussed many months ago. For certain data types in the chemical compound space (P233 canonical SMILES, P2017 isomeric SMILES and P234 InChI), a character limit higher than 400 would be really helpful: 1500 to 2000 chars, although I sense that this might cause problems with SPARQL. Are there any plans to implement this? In general, for quality assurance, many string property types would benefit from a fixed max string length.
Best, Sebastian
Sebastian Burgstaller-Muehlbacher, PhD Research Associate Andrew Su Lab MEM-216, Department of Molecular and Experimental Medicine The Scripps Research Institute 10550 North Torrey Pines Road La Jolla, CA 92037 @sebotic
On 13.09.2016 11:39, Sebastian Burgstaller wrote:
Hi all,
I think this topic might have been discussed many months ago. For certain data types in the chemical compound space (P233 canonical SMILES, P2017 isomeric SMILES and P234 InChI), a character limit higher than 400 would be really helpful: 1500 to 2000 chars, although I sense that this might cause problems with SPARQL. Are there any plans to implement this? In general, for quality assurance, many string property types would benefit from a fixed max string length.
FWIW, I recall that the main reason for the char limit originally was to discourage the use of Wikidata for textual content. Simply put, we did not want Wikipedia articles in the data. Long texts could also make copyright/license issues more relevant (though, in theory, a copyrighted poem could be rather short).
However, given that we now have such a well informed community with established practices and good quality checks, it seems unproblematic to lift the character limit. I don't think there are major technical reasons for having it. Surely, BlazeGraph (the WMF SPARQL engine) should not expect texts to be short, and I would be surprised if they did. So I would not expect problems on this side.
Best, Markus
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Markus' description of the decision for the limit corresponds with mine. I also think that this decision can be revisited. I would still advise caution, due to technical issues, but I am sure that the development team will make a well-informed decision on this. It would be sad if valid use cases could not be supported because of the limit.
(in particular because I expect that character limit to have to change for Wiktionary in Wikidata)
On 16.09.2016 at 19:38, Denny Vrandečić wrote:
Markus' description of the decision for the limit corresponds with mine. I also think that this decision can be revisited. I would still advise caution, due to technical issues, but I am sure that the development team will make a well-informed decision on this. It would be sad if valid use cases could not be supported because of the limit.
I agree, but re-considering this will have to wait until we have a better solution for storing terms. The current mechanism, the wb_terms table, is a massive performance bottleneck, and stuffing more data in there makes me very uncomfortable.
Hi!
However, given that we now have such a well informed community with established practices and good quality checks, it seems unproblematic to lift the character limit. I don't think there are major technical reasons for having it. Surely, BlazeGraph (the WMF SPARQL engine) should not expect texts to be short, and I would be surprised if they did. So I would not expect problems on this side.
I don't think there should be much trouble in this department. Unless one is literally trying to download megabytes of data or millions of items from a query (which we are working on a solution for, but don't have yet), the size of the string doesn't matter much, and there would probably be no noticeable difference between 400 and 2K strings for most queries I can think of. Searching within such strings probably won't work very well, but that's not the intent anyway, as I understand.
The only thing I can think of is that we now store the whole item as a huge blob in the DB (and consequently load it into memory), so if we had a lot of huge strings, it may have a negative performance impact. But I don't think changing a property that usually occurs once per item from 400 bytes to 2K would change anything.
One other use case for this would be citation URLs. For example, to get the number of inhabitants of all Dutch municipalities you need an 800-character (1) permalink from the central bureau of statistics.
So this change would be very welcome indeed!
-- Hay
(1): http://statline.cbs.nl/Statweb/publication/?VW=T&DM=SLNL&PA=37230NED...
Hi all,
sorry for joining the party late...
400 characters is not a lot for chemicals... InChIs can be a lot larger indeed. 2k would allow us to capture a lot more chemicals. BTW, this also applies to the canonical SMILES, which also doesn't have an upper bound. Tannic acid (Q427956) is an example (which looking at the InChIKey came up when running the bot :) From working with ChEMBL as RDF I know it has InChIs of length > 1024, which was the max length in Virtuoso... I think it's important for the biology and chemistry to increase the limit.
Egon
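For anyone who wants to check how far a given limit would get us on their own data, here is a minimal Python sketch. It is only an illustration: the function name `coverage_by_limit` and the sample strings are made up, and the candidate limits (400, 1024, 2000) are simply the numbers mentioned in this thread.

```python
from typing import Dict, Iterable


def coverage_by_limit(strings: Iterable[str], limits: Iterable[int]) -> Dict[int, float]:
    """Return, for each candidate limit, the fraction of strings that fit within it."""
    lengths = [len(s) for s in strings]
    if not lengths:
        raise ValueError("need at least one string")
    return {
        limit: sum(1 for n in lengths if n <= limit) / len(lengths)
        for limit in limits
    }


# Made-up stand-ins for real InChI strings; only their lengths matter here.
sample = ["x" * 120, "x" * 390, "x" * 450, "x" * 1100, "x" * 2500]

for limit, share in coverage_by_limit(sample, [400, 1024, 2000]).items():
    print(f"limit {limit}: {share:.0%} of the sample fits")
```

Run against a real dump of identifiers, this would show directly how many compounds each limit leaves out.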
Can you figure out what a good limit would be for these two use cases? I.e. what would support 99%, 99.9%, and 100%?
On Mon, Sep 19, 2016 at 6:19 AM, Denny Vrandečić vrandecic@gmail.com wrote:
Can you figure out what a good limit would be for these two use cases? I.e. what would support 99%, 99.9%, and 100%?
Yes, this would be extremely helpful. In general I agree that we can now be more relaxed about this than we were at the beginning, because you all understand that Wikidata isn't a place to store long free text. However, I still think we need to have some measures in place. One thing we could maybe do is introduce a new datatype for longer text, but I'm still undecided about this. I still don't feel too good about making every string property several thousand characters long.
Cheers Lydia
Thanks, guys! I am glad to hear that the technical hurdles for implementation seem to be relatively low. Is there a realistic timeline for when this could be done?
I agree with Lydia that not all string properties should allow unlimited (or even very many) chars. It would be nice to determine at property proposal time how many chars a certain property should allow. Alternatively, implementing a new data type would also work for us.
Best, Sebastian
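To make the per-property idea concrete, here is a rough Python sketch of a length check with a default limit and per-property overrides. Everything in it is hypothetical: the names `MAX_LENGTH_BY_PROPERTY` and `check_length` are invented for illustration, the limit values are placeholders, and nothing here reflects how Wikibase actually validates input.

```python
# Hypothetical per-property limits; the numbers are placeholders for
# illustration, not values anyone in this thread has agreed on.
DEFAULT_MAX_LENGTH = 400
MAX_LENGTH_BY_PROPERTY = {
    "P233": 2000,   # canonical SMILES
    "P2017": 2000,  # isomeric SMILES
    "P234": 2000,   # InChI
}


def check_length(property_id: str, value: str) -> None:
    """Raise ValueError if value exceeds the limit configured for property_id."""
    limit = MAX_LENGTH_BY_PROPERTY.get(property_id, DEFAULT_MAX_LENGTH)
    if len(value) > limit:
        raise ValueError(
            f"{property_id}: value has {len(value)} characters, limit is {limit}"
        )


# A 1500-character value passes for P234 but would fail for a property
# that only has the default limit.
check_length("P234", "x" * 1500)
```

The point of the lookup table is that the limit becomes a property-level setting decided once, rather than a new datatype.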
On 19.09.2016 18:12, Lydia Pintscher wrote:
Yes, this would be extremely helpful. In general I agree that we can now be more relaxed about this than we were at the beginning, because you all understand that Wikidata isn't a place to store long free text. However, I still think we need to have some measures in place. One thing we could maybe do is introduce a new datatype for longer text, but I'm still undecided about this. I still don't feel too good about making every string property several thousand characters long.
I am not excited about having another new datatype for this. The proposed difference of 400 vs. 2000 chars does not seem so fundamental, and the limits are rather arbitrary too, so naming these things in special ways seems like too much detail at the user level. Datatypes should be used if they have a benefit to the user (easier input, better display) and not to enforce constraints. There are very many relevant constraints, and length is hardly the most important one, so we should not give it the prominence of having its own type.
Best,
Markus
Hi Denny,
Sorry, I missed this email. I just did the calculation for InChI string lengths on the 92 million PubChem compounds:

Percentile   99%   99.9%   100%
Length       311   676     4502

That said, there is no upper limit on the length; 4502 is simply the longest string in the PubChem database. The other IDs, canonical and isomeric SMILES, have the same distribution shape but are slightly shorter overall.
Best, Sebastian
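For reference, numbers of this kind can be computed along the following lines. This is a minimal sketch using the nearest-rank percentile definition, not the script that produced the figures above; the function name `length_percentile` and the sample data are assumptions made for illustration.

```python
import math
from typing import List


def length_percentile(lengths: List[int], percent: float) -> int:
    """Nearest-rank percentile: the smallest length such that at least
    `percent` percent of all values are less than or equal to it."""
    if not lengths:
        raise ValueError("need at least one length")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths)
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up lengths standing in for len(inchi) over a real compound dump.
sample_lengths = list(range(1, 201))

for p in (50, 99, 100):
    print(f"{p}% of strings are at most {length_percentile(sample_lengths, p)} chars")
```

With the 100th percentile equal to the longest string, a limit chosen from such a table states exactly how much of the data it leaves out.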
Sebastian, great you found time for it! I didn't :/ (Stats are worth a tweet, IMHO :)
Egon
Thank you! I am sure that this will help the Wikidata team to make the right decision. Also, very interesting numbers.
One stupid question: given the length of these identifiers, and since they are not simple opaque identifiers but rather encode semantics (if I understand it correctly), could a single such identifier encode content or ideas that are potentially covered by copyright or patent law? Is there some background available on that?
On Fri, Sep 23, 2016 at 5:53 PM, Denny Vrandečić vrandecic@gmail.com wrote:
One stupid question: given the length of these identifiers, and since they are not simple opaque identifiers but rather encode semantics (if I understand it correctly), could a single such identifier encode content or ideas that are potentially covered by copyright or patent law? Is there some background available on that?
Not the InChI. The standard itself is meant to be reused as much as possible and the software is open source.
Some information here: http://jcheminf.springeropen.com/articles/10.1186/1758-2946-5-7
Egon
-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/u/egonwillighagen
(I half expected that link to be paywalled - fortunately it wasn't.)
Thanks!
On Fri, Sep 23, 2016 at 9:10 AM Egon Willighagen egon.willighagen@gmail.com wrote:
On Fri, Sep 23, 2016 at 5:53 PM, Denny Vrandečić vrandecic@gmail.com wrote:
One stupid question: due to the length of these identifiers, and since they are not simple intransparent identifiers but rather encode semantics - if I understand it correctly - could a single such identifier be encoding content or ideas which are potentially covered by copyright or patent law? Is there some background available on that?
Not the InChI. The standard itself is meant to be reused as much as possible and the software is open source.
Some information here: http://jcheminf.springeropen.com/articles/10.1186/1758-2946-5-7
Egon
On Fri, Sep 23, 2016 at 3:27 AM Egon Willighagen < egon.willighagen@gmail.com> wrote:
Sebastian, great you found time for it! I didn't :/ (Stats are worth a tweet, IMHO :)
Egon
On Fri, Sep 23, 2016 at 12:20 PM, Sebastian Burgstaller < sebastian.burgstaller@gmail.com> wrote:
Hi Denny, Sorry, I missed this email. just did the calculation for InChI string lengths on the 92 Mio PubChem compounds: 99% 99.9% 100% 311 676 4502
That said, there is not upper limit for the length, but 4502 is the longest string in the PubChem database. The other IDs, canonical and isomeric SMILES have the same distribution shape, but are overall slightly shorter.
Best, Sebastian
On Sun, Sep 18, 2016 at 9:19 PM, Denny Vrandečić vrandecic@gmail.com wrote:
Can you figure out what a good limit would be for these two use
cases? I.e.
what would support 99%, 99.9%, and 100%?
On Sun, Sep 18, 2016, 12:27 Egon Willighagen <
egon.willighagen@gmail.com>
wrote:
Hi all,
sorry for joining the party late...
On Tue, Sep 13, 2016 at 11:39 AM, Sebastian Burgstaller sebastian.burgstaller@gmail.com wrote: > I think this topic might have been discussed many months ago. For > certain data types in the chemical compound space (P233, canonical > smiles, P2017 isomeric smiles and P234 Inchi key) a higher
character
> limit than 400 would be really helpful (1500 to 2000 chars (I sense > that this might cause problems with SPARQL)). Are there any plans
on
> implementing this? In general, for quality assurance, many string > property types would profit from a fixed max string length.
400 characters is not a lot for chemicals... InChIs can be a lot larger indeed. 2k would allow us to capture a lot more chemicals.
BTW,
this also applies to the canonical SMILES, which also doesn't have an upper bound. Tannic acid (Q427956) is an example (which looking at
the
InChIKey came up when running the bot :) From working with ChEMBL as RDF I know it has InChIs of length > 1024, which was the max length
in
Virtuoso... I think it's important for the biology and chemistry to increase the limit.
Egon
-- E.L. Willighagen Department of Bioinformatics - BiGCaT Maastricht University (http://www.bigcat.unimaas.nl/) Homepage: http://egonw.github.com/ LinkedIn: http://se.linkedin.com/in/egonw Blog: http://chem-bla-ics.blogspot.com/ PubList: http://www.citeulike.org/user/egonw/tag/papers ORCID: 0000-0001-7542-0286 ImpactStory: https://impactstory.org/EgonWillighagen
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi everyone,
I've been thinking more about this and we also discussed this within the development team. Here's my thinking at this point:
* We do have data that you all want to see in Wikidata that is currently prevented by the limit. That is not good.
* I agree that the general understanding of all of us is very good when it comes to Wikidata not being the place to store long free texts. However, I still fear that new people in particular do not understand this initially. We could mitigate this by, for example, giving the user a hint when their input is getting too long even if it is still within the limit. Twitter does this in a nice way when you are getting close to the 140 character limit. However, that is not implemented right now.
* I do worry about licensing and copyright issues, especially with the following properties: https://www.wikidata.org/wiki/Property:P2795 https://www.wikidata.org/wiki/Property:P1683 https://www.wikidata.org/wiki/Property:P1684 https://www.wikidata.org/wiki/Property:P2315
I took a rough survey of the properties that look potentially troublesome to me, and it seems they are all monolingual text. I am not worried about increasing the limit for external identifier and URL. It looks like string is also OK-ish at this point in time.
Based on this my proposal is to increase string and URL and potentially external identifier if you request it. One open question is still what the new limit should be.
Cheers Lydia
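For illustration, the "hint before you hit the limit" idea from the message above could look roughly like the sketch below. This is not how Wikibase does it (as noted, nothing like this is implemented); the limits in `LIMITS`, the 90% threshold in `WARN_FRACTION` and the names `LengthCheck` and `check_length` are all made up for the example.

```python
from dataclasses import dataclass

# Hypothetical per-datatype limits. The thread has not settled on real
# numbers, so these values are placeholders only.
LIMITS = {
    "string": 1500,
    "url": 1500,
    "external-id": 1500,
    "monolingualtext": 400,
}

# Assumption: start hinting once 90% of the limit is used.
WARN_FRACTION = 0.9


@dataclass
class LengthCheck:
    ok: bool        # False once the value exceeds the hard limit
    remaining: int  # characters left before the limit (negative if over)
    hint: str       # message for the editor, empty if none is needed


def check_length(value: str, datatype: str) -> LengthCheck:
    """Compare a value against its datatype limit and build a soft hint."""
    limit = LIMITS[datatype]
    remaining = limit - len(value)
    if remaining < 0:
        return LengthCheck(False, remaining, f"Too long by {-remaining} characters.")
    if len(value) >= WARN_FRACTION * limit:
        return LengthCheck(True, remaining, f"{remaining} characters left.")
    return LengthCheck(True, remaining, "")
```

With a check like this the hard limit stays in place for every datatype, while editors see a warning early enough to realize that a long free text probably does not belong in the field.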
On Sat, Oct 8, 2016 at 11:07 AM, Lydia Pintscher < lydia.pintscher@wikimedia.de> wrote:
Based on this my proposal is to increase string and URL and potentially external identifier if you request it. One open question is still what the new limit should be.
For small compounds this is answered by Sebastian's analysis... 5K would cover all currently known small molecules. 1K would cover 99.9%.
Lydia, do I understand that a formal request needs to be filed? Who will do that?
Egon
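To make the 99% / 99.9% / 100% question concrete, here is a minimal sketch of how such a limit can be derived from a list of string lengths. It is not the analysis that was actually run against PubChem; the function name `limit_for_coverage` and the toy lengths in the demo are assumptions for illustration only.

```python
import math
from fractions import Fraction


def limit_for_coverage(lengths, fraction):
    """Smallest limit L such that at least `fraction` of the strings have length <= L."""
    ordered = sorted(lengths)
    if not ordered:
        raise ValueError("need at least one length")
    # Parse the decimal text so that 0.999 means exactly 999/1000 and the
    # ceiling below is not thrown off by binary floating point rounding.
    exact = Fraction(str(fraction))
    if not 0 < exact <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Number of strings that have to fit within the limit.
    needed = math.ceil(exact * len(ordered))
    return ordered[needed - 1]


if __name__ == "__main__":
    # Toy data, NOT real PubChem numbers: mostly short strings plus a long tail.
    toy_lengths = [27] * 990 + [800] * 9 + [4502]
    for coverage in (0.99, 0.999, 1):
        print(coverage, limit_for_coverage(toy_lengths, coverage))
```

Because identifier lengths are typically heavily skewed, the 100% limit is driven by a handful of outliers, which is why the 99.9% and 100% answers can be so far apart.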
On Sat, Oct 8, 2016 at 11:14 AM, Egon Willighagen egon.willighagen@gmail.com wrote:
For small compounds this is answered by Sebastian's analysis... 5K would cover all currently known small molecules. 1K would cover 99.9%.
Ok. That is for strings, correct? Input for other use cases?
Lydia, do I understand that a formal request needs to be filed? Who will do that?
I'll handle that part for string. It was mostly about me not wanting to increase the limit on external identifiers without a request. I have not seen one but I might have overlooked it.
Cheers Lydia
On Sat, Oct 8, 2016 at 11:19 AM, Lydia Pintscher < lydia.pintscher@wikimedia.de> wrote:
On Sat, Oct 8, 2016 at 11:14 AM, Egon Willighagen egon.willighagen@gmail.com wrote:
For small compounds this is answered by Sebastian's analysis... 5K would cover all currently known small molecules. 1K would cover 99.9%.
Ok. That is for strings, correct? Input for other use cases?
Ah, those numbers are for https://www.wikidata.org/wiki/Property:P234 ...
Egon
On Sat, Oct 8, 2016 at 11:23 AM, Egon Willighagen egon.willighagen@gmail.com wrote:
Ah, those numbers are for https://www.wikidata.org/wiki/Property:P234 ...
External identifier then. Cool. And for string like in https://www.wikidata.org/wiki/Property:P233? Sebastian's initial email says 1500 to 2000. Is this still a good number after this discussion?
Cheers Lydia
On Sat, Oct 8, 2016 at 11:28 AM, Lydia Pintscher < lydia.pintscher@wikimedia.de> wrote:
On Sat, Oct 8, 2016 at 11:23 AM, Egon Willighagen egon.willighagen@gmail.com wrote:
Ah, those numbers are for https://www.wikidata.org/wiki/Property:P234 ...
External identifier then. Cool. And for string like in https://www.wikidata.org/wiki/Property:P233? Sebastian's initial email says 1500 to 2000. Is this still a good number after this discussion?
Yes, that would cover more than 99.9% of all InChIs in PubChem. (See Sebastian's reply earlier in this thread.)
Egon
Probably a silly question but ... did you all consider creating a datatype for molecule representation? This seems to be a use case very similar to mathematical formulas. Essentially we're not dealing with a raw string but with a representation of molecular formulas, with its own encoding ...
Changing the limit seems to be a poor workaround compared to a dedicated datatype - nobody seems to have found a relevant use case and it seems to me that we're essentially abusing strings for storing blobs ...
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
That was discussed and declined a while ago, see https://phabricator.wikimedia.org/T126862. Though I think the proposed realization was presentational rather than functional. I'll have to re-read the discussion, though.
Dear Thomas,
On Sat, Oct 8, 2016 at 12:07 PM, Thomas Douillard < thomas.douillard@gmail.com> wrote:
Probably a silly question but ... did you all consider creating a datatype for molecule representation? This seems to be a use case very similar to mathematical formulas. Essentially we're not dealing with a raw string but with a representation of molecular formulas, with its own encoding ...
The InChI is actually not a structural representation, but a derived unique identifier.
What you propose would, however, apply to the SMILES. That one is generally of about the same size as the InChI, and there your solution sounds like a great idea!
Egon
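As a rough illustration of what a dedicated SMILES datatype could do beyond a plain string, the sketch below runs a cheap structural check on a value. It is nowhere near a real SMILES parser (a cheminformatics toolkit would be needed for that) and it does not reflect any existing Wikibase datatype; the function name `looks_like_smiles` is made up for the example. It only assumes that a SMILES string contains no whitespace, that branch parentheses balance, and that bracket atoms `[...]` are closed and not nested.

```python
def looks_like_smiles(value: str) -> bool:
    """Cheap structural check: no whitespace, balanced (), closed and non-nested []."""
    if not value or any(ch.isspace() for ch in value):
        return False
    depth = 0           # current nesting depth of branch parentheses
    in_bracket = False  # True while inside a [...] bracket atom
    for ch in value:
        if ch == "[":
            if in_bracket:
                return False  # bracket atoms cannot be nested
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False  # closing bracket without an opening one
            in_bracket = False
        elif ch == "(" and not in_bracket:
            depth += 1
        elif ch == ")" and not in_bracket:
            depth -= 1
            if depth < 0:
                return False  # branch closed before it was opened
    return depth == 0 and not in_bracket
```

A check along these lines would catch truncated values, for example a SMILES cut off at a character limit in the middle of a branch, which a plain string datatype accepts silently.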
Lydia and Wikidatans,
If we consider atoms, nano-scale entities, and entities at other scales, I wonder whether Wikidata's string datatypes will need even higher character limits (for example, in anticipation of data for a realistic virtual earth/universe at the a) street-view, b) neuronal/cellular, c) molecular, and d) nano/atomic levels, etc.)?
Scott http://twitter.com/WorldUnivAndSch
Hey folks :)
Andy and Pasleim just brought this topic to my attention again. Sorry for having dropped the ball a bit. I've created https://phabricator.wikimedia.org/T154660 with a strawman proposal for the still open question of which length it should be. Please add your arguments there.
Cheers Lydia