Hi, does anyone have a good regex for chemical formulae?
These samples below were found with \d,\d,\d\S+-\d-\S+, which seems pretty good at avoiding false positives (every match is OK), but I don't know how many formulae it misses. Do you know a better one without false positives? The template occurrences are incidental; I have to find the formulae anywhere in the text. I don't need a perfect one, just a "good enough" one, because I don't want to collect them, I just want to exclude them from spelling corrections.
- |IUPACnév = (6a''R'',9''R'')-''N'',''N''-dietil-7-metil-4,6,6a,7,8,9-hexahidroindol-[4,3-''fg'']kinolin-9-karboxamid - |IUPACnév = 2-(2,4-difluorophenyl)-<br />1,3-''bis''(1''H''-1,2,4-triazol-1-yl)propan-2-ol - az E- és Z-4,5,9-tritiadodeka-1,6,11-trién - |IUPACnév = ''6-(2-fluorfenil)-2-metil-9-nitro-<BR>2,5-diazabiciklo[5.4.0]undeka-<BR>5,8,10,12-tetraén-3-on'' - |IUPACnév = (8''S'',10''S'')-10-(4-amino-5-hidroxi-6-metil-<br />tetrahidro-2''H''-pirán-2-iloxi)<br />-6,8,11-trihidroxi-8-(2-hidroxiacetil)<br /
-1-metoxi-7,8,9,10-tetrahidrotetracén<br />-5,12-dion
- |IUPACnév = ''6-(2-chlorophenyl)-9-nitro-<BR>2,5-diazabicyclo[5.4.0]undeca-<BR>5,8,10,12-tetraen-3-one'' - | IUPACnév = (2''R'',4''S'')-''rel''-4-[4-[4-[4-[ [2-(2,4-dichlorophenyl)- 2-(1H-1,2,4-triazol-1-ylmethyl)- 1,3-dioxolan-4- - |IUPACnév = 4-[2-ethoxy-5-(4-ethylpiperazin-1-yl)sulfonyl-phenyl]-<br />9-methyl-7-propyl- 3,5,6,8-tetrazabicyclo[4.3.0]<br />nona-3,7,9-trien-2-o ne - |IUPACnév = (6''R''-transz)-6-(1,3-benzodioxol-5-il)- 2,3,6,7,12,12a-hexahidro-2-metil-pirazino [1', 2':1,6] pirido[3,4-''b'']indol-1,4-dion - |IUPACnév = (7''S'',9''E'',11''S'',12''R'',13''S'',14''R'',15''R'',16''R'',17''S'',18''S'',19''E'',21''Z'')-2,15,17,27,29-pentahidroxi-11-metoxi-3 ,7,12,14,16,18,22-heptametil-26-{(E)-[(4-metilpiperazin-1-il)imino]metil}-6,23-dioxo-8,30-dioxa-24-azatetraciklo[23.3.1.1<sup>4,7</sup>.0<sup>5,28</s up>]triakonta-1(28),2,4,9,19,21,25(29),26-oktaén-13-il-acetát - |Szinoníma = <small>5,6,9,17,19,21-hexahidroxi-23-metoxi-2,4,12,16,18,20,22-heptametil-8-[''N''-(4-metil-1-piperazinil)formimidoil]-2,7-(epoxip entadeka[1,11,13]triénimino)-nafto[2,1-''b'']furán-1,11(2''H'')-dion-21-acetát</small> - |IUPACnév = (''S'')-2-[4-[2-(4-amino-2-oxo-3,5,7-triazabiciklo[4.3.0] nona-3,8,10-trién-9-il)etil] benzoil] aminopentándisav - | IUPACName = 1,17-dihydroxy-10,13-dimethyl-17-[2-(4-methylpiperazin-1-yl)acetyl]-7,8,9,11,12,14,15,16-octahydro-6''H''-cyclopenta[a]phenanthren-3- one - | IUPACNév = 1,1,1-Trichloro-2-methyl-2-propanol - | MásNév = 1,1,1-trichloro-2-methyl-2-propanol, chlorbutol, chloreton, chloretone, chlortran, trichloro-tert-butyl alcohol, 1,1,1-trichloro-tert-bu tyl alcohol, 2-(trichloromethyl)propan-2-ol, 1,1,1-trichloro-2-methyl-2-propanol, tert-Trichlorobutyl alcohol, trichloro-tert-butanol, trichlorisobut - |IUPACnév = ''(RS)''-[4-(4-amino-6,7-dimethoxy- quinazolin-2-yl) piperazin-1-yl]- (2,5-dioxabicyclo[4.4.0] deca-6,8,10-trien-4-yl) methanone - |IUPACnév = 
3,5-dihidroxi-7-[6-hidroxi-2-metil-8-(2-metilbutanoiloxi)-1,2,6,7,8,8a-hexahidronaftalén-1-il]-heptánsav - |IUPACnév = 9-nitro-6-fenil-2,5-diazabiciklo[5.4.0]undeka-5,8,10,12-tetraén-3-on - |IUPACName=2-metil-3-[(2''E'')-3,7,11,15-tetrametilhexadec-2-én-1-il]naftokinon - | IUPAC_name = (8''S'',9''S'',10''S'',13''S'',14''S'',17''S'')-17-hidroxi-10,<br>13,17-trimetil-7,8,9,11,12,14,15,16-<br>oktahidro-6''H''-ciklopent a[a]fenantrén-3-on - | 17916||C.I. Reactive black 1.||Cibacron Black BG||<nowiki>Cobalt,4-[[6-[(4-amino-6-chloro-1,3,5-triazin-2-yl)amino]-1-hydroxy-3-sulfo-2-naphthale nyl]azo]-3-hydroxy-7-nitro-1-naphthalenesulfonicacid complex (9CI)</nowiki> l)azo]-2,7-naphthalenedisulfonic acid complex - | IUPACName = (3β,5''Z'',7''E'')-9,10-secocholesta-<br>5,7,10(19)-trien-3-ol - | IUPACName = 2,2′-''bisz''-(8-formil-1,6,7-trihidroxi-5-izopropil-3-metilnaftalin) - | IUPACName = 2,5-anhidro-1,4,6-trideoxi-6-(trimetilammónio)-<small>D</small>-''ribo''-hexit - |IUPACnév = (6''R'',7''R'',''Z'')-7-(2-(2-aminothiazol-4-yl)-<br />2-(methoxyimino)acetamido)-3-((6-hydroxy-2-methyl-5-oxo-<br />2,5-dihydro-1,2,4 -triazin-3-ylthio)methyl)-8-oxo-5-thia-<br />1-aza-bicyclo[4.2.0]oct-2-ene-2-carboxylic acid - |IUPACnév = 4,4-difluoro-''N''-{(1''S'')-3-[3-(3-isopropyl-5-methyl-4''H''-1,2,4-triazol-4-yl)- - |IUPACnév = (5-metil-2-oxo-2''H''-1,3-dioxol-4-il)metil-4-(2-hidroxipropán-2-il)-2-propil-1-({4-[2-(2''H''-1,2,3,4-tetrazol-5-il)fenil]fenil}metil )-1''H''-imidazol-5-karboxilát - | IUPACName =<small>7-α-D-Glucopyranosyl-9,10-dihydro-<br />3,5,6,8-tetrahydroxy-1-methyl-9,10-<br />dioxoanthracenecarboxylic acid - |IUPACnév = 2-[(4''S'')-4-<nowiki>[[</nowiki>(1''S'')-1-ethoxycarbonyl-3-phenyl-propyl]amino]-<br>3-oxo-2-azabicyclo[5.4.0]undeca-7,9,11-trien-2-y l]acetic acid - |IUPACName= <small>(2''R'',6''S'',7a''R'')-2-<br/>[(1''E'',3''E'',5''E'',7''E'',9''E'',11''E'',13''E'',15''E'')-<br/>16-[(1''R'',4''R'')-4-Hidroxi- 
<br/>2,6,6-trimetil-1-ciklohex-<br/>2-enil]-1,5,10,14-<br/>tetrametilhexadeka-<br/>1,3,5,7,9,11,13,15-<br/>oktaenil]-4,4,7a-trimetil-2,5,6,7-<br/>tet rahidrobenzofurán-6-ol</small> - | 4-[18-(4-hidroxy-2,6,6-trimetil-1-ciklohexenil) -3,7,12,16-tetrametil-oktadeca -1,3,5,7,9,11,13,15,17-nonaenil] -3,5,5-trimetil-ciklohex-2-en-1-o l - | IUPACName = <small>''(R)''-3,5,5-Trimetil-4-[3,7,12,16-<br />tetrametil-18-(2,6,6-trimetilciklohex-<br />1-enyl)-octadeca-1,3,5,7,9,11,13,15,17 -<br />nonaenil]-ciklohex-3-enol</small> - |IUPACName=(1S,4S,6R)-1-[(1E,3E,5E,7E,9E,11E,13E,15E,17E)-18-[(1S,4S,6R)-4-Hydroxy-2,2,6-trimethyl-7-oxabicyclo[4.1.0]heptan-1-yl]-3,7,12,16-tetram ethyloctadeca-1,3,5,7,9,11,13,15,17-nonaenyl]-2,2,6-trimethyl-7-oxabicyclo[4.1.0]heptan-4-ol - | IUPACName = 6,6,9-trimetil-3-pentil-6''H''-benzo[c]kromén-1-ol - | IUPACName = 6,6,9-trimetil-3-propil-6''H''-benzo[''c'']kromén-1-ol - |IUPACnév = 6a,7,8,10a-tetrahydro-1-hydroxy-6,6-dimethyl-3-pentyl- 6H-Dibenzo[b,d]pyran-9-methanol - | [[IUPAC]] név || (6a''S'',10a''S'')-6,6,9-trimethyl-3-propyl-6a,7,8,10a-<br>tetrahydro-6''H''-benzo[''c'']chromen-1-ol - 3,4,5-tris(2-methylpropanoyloxy)oxan-2-yl]oxy<br />-4-(2-methylpropanoyloxy)-5-(2 - |IUPACName={[3-phenyl-1-{[(2''S'',3''R'',4''S'',5''S'',6''R'') -3,4,5-trihydroxy-6-(hydroxymethyl)-
Couldn't you match against a set of categories?
Best xqt
On 09.08.2016 at 19:44, Bináris wikiposta@gmail.com wrote:
No, I don't want to exclude whole pages, just the formula strings themselves.
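A minimal sketch of that idea in Python, assuming the spelling fixes can be expressed as a function applied to plain text. The names `correct_outside` and `fix` are hypothetical and not part of pywikibot; the pattern is the one from the first message.

```python
import re

# The pattern from the first message in this thread.
FORMULA_RE = re.compile(r"\d,\d,\d\S+-\d-\S+")


def correct_outside(text, fix):
    """Apply fix() only to the parts of text that FORMULA_RE does not match.

    fix is a hypothetical callable (str -> str) standing in for the
    spelling corrections; matched formula strings are copied unchanged.
    """
    parts = []
    pos = 0
    for match in FORMULA_RE.finditer(text):
        parts.append(fix(text[pos:match.start()]))  # ordinary text: correct it
        parts.append(match.group())                 # formula: leave untouched
        pos = match.end()
    parts.append(fix(text[pos:]))                   # trailing ordinary text
    return "".join(parts)


if __name__ == "__main__":
    sample = "teh name 6,6,9-trimetil-3-pentil-6''H''-benzo[c]kromén-1-ol is teh one"
    print(correct_outside(sample, lambda s: s.replace("teh", "the")))
```

Only the matched string is protected, so any part of a formula that lies before the first `\d,\d,\d` group would still be passed to the corrections.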
2016-08-10 6:07 GMT+02:00 info info@gno.de:
Couldn't you match against a set of categories?
pywikibot mailing list pywikibot@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikibot
On Wed, Aug 10, 2016 at 12:44 AM, Bináris wikiposta@gmail.com wrote:
Hi, does anyone have a good regex for chemical formulae?
These samples below were found with \d,\d,\d\S+-\d-\S+, which seems pretty good at avoiding false positives (every match is OK), but I don't know how many formulae it misses.
You could find gaps in your regex by using a regex like /(IUPACName|IUPACnév|MásNév|Szinoníma) *=(.*)$/ to grab potential matches, and then testing your regex on \2 (the captured field value; \1 is the field name). That should show you any cases that your regex is not matching.
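A rough Python sketch of this gap check, under a few assumptions: the page text is already available as a string, the field pattern is written with balanced parentheses, and the field value is read from the second capture group (the first holds the field name). `find_gaps` is a hypothetical helper name, and the field list is only the one given in this message.

```python
import re

# Grabs "<field> = <value>" up to the end of each line.
FIELD_RE = re.compile(
    r"(IUPACName|IUPACnév|MásNév|Szinoníma) *=(.*)$", re.MULTILINE
)
# The formula pattern under test, from the first message in this thread.
FORMULA_RE = re.compile(r"\d,\d,\d\S+-\d-\S+")


def find_gaps(wikitext):
    """Return the field values in which FORMULA_RE finds no match at all."""
    gaps = []
    for match in FIELD_RE.finditer(wikitext):
        value = match.group(2).strip()
        if value and not FORMULA_RE.search(value):
            gaps.append(value)
    return gaps


if __name__ == "__main__":
    page = (
        "| IUPACName = 1,1,1-Trichloro-2-methyl-2-propanol\n"
        "|IUPACnév = 2-(2,4-difluorophenyl)propan-2-ol\n"
    )
    for value in find_gaps(page):
        print("not matched:", value)
```

Values that continue on the next line (for example after a `<br />`) are cut at the line end here, so multi-line names would need extra handling.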
2016-08-10 7:38 GMT+02:00 John Mark Vandenberg jayvdb@gmail.com:
You could find gaps in your regex by using a regex like /(IUPACName|IUPACnév|MásNév|Szinoníma) *=(.*)$/ to grab potential matches, and then testing your regex on \2 (the captured field value; \1 is the field name). That should show you any cases that your regex is not matching.
Good idea! Why didn't I think of it myself? :-) So nobody had a ready-to-use one.
On Wed, Aug 10, 2016 at 12:44 AM, Bináris wikiposta@gmail.com wrote:
Hi, does anyone have a good regex for chemical formulae?
These samples below were found with \d,\d,\d\S+-\d-\S+, which seems pretty good at avoiding false positives (every match is OK), but I don't know how many formulae it misses. Do you know a better one without false positives?
Does this one suit your purposes:
([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*()?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·(?[-0-9.]*n?)?)?)+
https://www.wikidata.org/wiki/Property:P274
-- John
2016-08-10 8:19 GMT+02:00 John Mark Vandenberg jayvdb@gmail.com:
Does this one suit your purposes:
([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*()?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·(?[-0-9.]*n?)?)?)+
This seems to be for another kind of chemical formula, but I will test it. What is ☐? I never knew that WD stored regexes, but I like it.
On Wed, Aug 10, 2016 at 1:26 PM, Bináris wikiposta@gmail.com wrote:
2016-08-10 8:19 GMT+02:00 John Mark Vandenberg jayvdb@gmail.com:
Does this one suit your purposes:
([αβγδφωλμπ]-)?([([]*[A-Z☐][ub]?[a-z]?[₁₂₃₄₅₆₇₈₉₀]*()?[¹²³⁴⁵⁶⁷⁸⁹⁰]*[⁺⁻]?)?[])|,₁₂₃₄₅₆₇₈₉₀]*(·(?[-0-9.]*n?)?)?)+
This seems to be for another kind of chemical formula, but I will test it. What is ☐?
I don't know; I just copied it from Wikidata. It is curious.
It seems to be https://en.wiktionary.org/wiki/%E2%98%90 , "ballot box" https://en.wikipedia.org/wiki/%E2%98%90
Maybe ask on the WD talk page why it is in the regex, if nobody here can help.
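For what it's worth, the character can also be identified locally with Python's standard library. This small sketch decodes the percent-encoded part of the links above and looks up the code point and Unicode name; `describe` is just an illustrative helper name.

```python
import unicodedata
from urllib.parse import unquote


def describe(char):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


if __name__ == "__main__":
    char = unquote("%E2%98%90")  # the last path component of the links above
    print(describe(char))        # prints "U+2610 BALLOT BOX"
```

This confirms what the character is, but not why it appears in the regex.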