Introduction to SMARTS
SMARTS stands on the shoulders of giants: it was developed by Daylight Chemical Information Systems, the same company that introduced SMILES. The documentation here is heavily inspired by the original Daylight SMARTS theory manual, but it follows the RDKit implementation, which adds extensions such as hybridization queries, heteroatom-neighbor queries, range queries and dative bonds.
What is SMARTS?
SMARTS (SMiles ARbitrary Target Specification) is a language for describing molecular patterns and properties. It extends the SMILES notation to allow expressive queries over chemical structures, making it possible to search, filter, and classify molecules based on substructure patterns. In the SMILES language we have atoms and bonds. The same is true in SMARTS, which is further extended with property filters and logical operators.
The simplest SMARTS patterns match individual atoms, written either as non-bracket or as bracket atoms.
Non-bracket atoms follow SMILES notation. Bracket atoms are specified inside square brackets [ ] and can carry multiple constraints joined by logical operators.
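As a quick orientation, here is a minimal sketch of how a SMARTS query is typically run through RDKit's Python API. It assumes RDKit is installed; the ethanol target and the variable names are chosen purely for illustration.

```python
from rdkit import Chem

# Target molecule parsed from SMILES; atom indices are C0, C1, O2.
ethanol = Chem.MolFromSmiles("CCO")

# The same query written with non-bracket and with bracket atoms.
plain = Chem.MolFromSmarts("CO")
bracketed = Chem.MolFromSmarts("[C][O]")

# HasSubstructMatch answers yes/no; GetSubstructMatches lists atom indices.
has_match = ethanol.HasSubstructMatch(plain)
plain_hits = ethanol.GetSubstructMatches(plain)
bracket_hits = ethanol.GetSubstructMatches(bracketed)
print(has_match, plain_hits, bracket_hits)  # True ((1, 2),) ((1, 2),)
```

Each match is a tuple of target atom indices, listed in the order of the atoms in the query.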
Atomic Identity
An element symbol in brackets matches that element with the given aromaticity: a lower-case symbol
means explicitly aromatic, and a capitalized symbol means explicitly aliphatic (non-aromatic). An atom can
also be specified by atomic number (#<n>), which matches regardless of aromaticity.
A number prefixed to the atom symbol designates the isotope: [35Cl] matches chlorine-35.
Aromatic atom queries work for boron b, carbon c, nitrogen n, oxygen o, phosphorus p, and sulfur s.
[C] - explicit aliphatic (non-aromatic) carbon.
[c] - explicit aromatic carbon.
[#6] - matches both C and c.
[13C] - isotope (atomic mass); matches carbon-13.
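The four patterns above can be tried on a small molecule. This is a sketch under the assumption that RDKit is available; the helper matched_atoms and the toluene example are illustrative names, not part of RDKit.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

toluene = "Cc1ccccc1"  # atom 0 is the methyl carbon, atoms 1-6 form the ring

aliphatic = matched_atoms(toluene, "[C]")      # [0]
aromatic = matched_atoms(toluene, "[c]")       # [1, 2, 3, 4, 5, 6]
any_carbon = matched_atoms(toluene, "[#6]")    # all seven carbons
labelled = matched_atoms("[13CH3]C", "[13C]")  # [0], only the labelled carbon
print(aliphatic, aromatic, any_carbon, labelled)
```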
Wildcards and Aromaticity
Three special tokens match atoms without constraining the element:
[*] - wildcard; matches any atom regardless of element or aromaticity.
[a] - matches any aromatic atom.
[A] - matches any aliphatic (non-aromatic) atom.
The same tokens also work outside brackets: aa is equivalent to [a][a].
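A short sketch of the three tokens on 4-methylpyridine, assuming RDKit is installed. The hits helper is a hypothetical convenience wrapper, not an RDKit function.

```python
from rdkit import Chem

# 4-methylpyridine: atom 0 is the aliphatic methyl carbon, atoms 1-6 are aromatic.
mol = Chem.MolFromSmiles("Cc1ccncc1")

def hits(smarts):
    """Unique matches of a SMARTS pattern on the example molecule."""
    return mol.GetSubstructMatches(Chem.MolFromSmarts(smarts))

any_atom = hits("[*]")    # all 7 atoms
aromatic = hits("[a]")    # the 6 ring atoms, including the nitrogen
aliphatic = hits("[A]")   # ((0,),)
pairs_bare = hits("aa")   # bare tokens
pairs_bracket = hits("[a][a]")
print(len(any_atom), len(aromatic), aliphatic, pairs_bare == pairs_bracket)
```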
Hydrogen Count
Two separate primitives control hydrogen matching. They are distinct and can appear together
inside [...]:
H<n> - total attached hydrogen count, including both explicit hydrogen atoms and implicit hydrogens.
h<n> - implicit hydrogen count. Bare h means "has any implicit hydrogens".
H without an explicit count defaults to 1.
Example: [CH3] matches a carbon with exactly 3 attached hydrogens, and [Ch2] matches a carbon with exactly 2 implicit hydrogens.
In practice, the difference between the two queries matters mostly when you load molecules from MolBlock/SDF format, where hydrogens may be present as explicit atoms.
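The difference between H and h shows up once hydrogens become real atoms. The sketch below assumes RDKit is installed and uses Chem.AddHs to stand in for a file that stores explicit hydrogens; the molecule and variable names are illustrative.

```python
from rdkit import Chem

ethanol = Chem.MolFromSmiles("CCO")   # hydrogens are implicit
ethanol_h = Chem.AddHs(ethanol)       # hydrogens become explicit atoms

total_h3 = Chem.MolFromSmarts("[CH3]")     # three hydrogens of any kind
implicit_h3 = Chem.MolFromSmarts("[Ch3]")  # three implicit hydrogens

results = {
    "H3 on implicit-H mol": ethanol.GetSubstructMatches(total_h3),
    "H3 on explicit-H mol": ethanol_h.GetSubstructMatches(total_h3),
    "h3 on implicit-H mol": ethanol.GetSubstructMatches(implicit_h3),
    "h3 on explicit-H mol": ethanol_h.GetSubstructMatches(implicit_h3),
}
for label, match in results.items():
    print(label, match)
```

The methyl carbon (atom 0) is found by [CH3] in both forms, while [Ch3] stops matching once the hydrogens are explicit.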
Degree and Connectivity
D<n> - number of explicit connections to the atom (implicit hydrogens are not counted).
d<n> - number of heavy-atom (non-hydrogen) neighbors.
X<n> - total number of connections, including implicit hydrogens (total connectivity).
v<n> - sum of the bond orders of all bonds (total valence).
Without an explicit number, D, d, X and v each default to exactly 1.
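To make the difference between D, X and v concrete, here is a small sketch on acetic acid. It assumes RDKit is installed; matched_atoms is an illustrative helper rather than an RDKit function.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

acetic_acid = "CC(=O)O"  # C0 methyl, C1 carboxyl carbon, O2 carbonyl, O3 hydroxyl

results = {
    "[D1]": matched_atoms(acetic_acid, "[D1]"),  # one explicit neighbor
    "[D3]": matched_atoms(acetic_acid, "[D3]"),  # three explicit neighbors
    "[X4]": matched_atoms(acetic_acid, "[X4]"),  # four connections incl. hydrogens
    "[X2]": matched_atoms(acetic_acid, "[X2]"),  # hydroxyl oxygen: C plus H
    "[v4]": matched_atoms(acetic_acid, "[v4]"),  # total valence of four
    "[v2]": matched_atoms(acetic_acid, "[v2]"),  # total valence of two
}
for pattern, atoms in results.items():
    print(pattern, atoms)
```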
Ring Membership
R<n> - number of rings the atom belongs to.
r<n> - size of the smallest ring containing this atom, taken over the smallest set of smallest rings (SSSR).
k<n> - ring membership by exact ring size.
x<n> - number of ring bonds to the atom.
Bare R, r, k and x (no number) each mean "greater than zero", i.e. the atom is in at least one ring,
and all four support range queries (e.g. [k{5-6}]).
[R2] matches an atom that is in exactly 2 rings (e.g. a ring fusion atom). [r5] matches an atom whose smallest containing ring has exactly 5 members. [k5] matches an atom that belongs to a ring of exactly 5 members (unlike r5, which checks the minimum size).
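The R, r and x primitives can be compared on indane, which fuses a five-membered and a six-membered ring. This is a sketch assuming RDKit is installed; matched_atoms is an illustrative helper.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

# Indane: atoms 0, 1, 2, 7, 8 form the 5-ring; atoms 2-7 form the 6-ring.
indane = "C1Cc2ccccc2C1"

fusion = matched_atoms(indane, "[R2]")         # atoms shared by both rings
smallest_five = matched_atoms(indane, "[r5]")  # smallest ring has 5 members
smallest_six = matched_atoms(indane, "[r6]")   # smallest ring has 6 members
three_ring_bonds = matched_atoms(indane, "[x3]")
print(fusion, smallest_five, smallest_six, three_ring_bonds)
```

The fusion atoms 2 and 7 sit in both rings, but r reports only their smallest ring, so they appear under [r5] and not under [r6].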
Want to go deeper into finding rings? https://www.rdkit.org/docs/RDKit_Book.html#ring-finding-and-sssr
Formal Charge
+<n> - positive formal charge.
-<n> - negative formal charge.
Bare + means +1, and ++ is the same as +2.
The same holds for - and --.
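A brief sketch of the charge primitives, assuming RDKit is installed. The glycine zwitterion and magnesium ion are illustrative inputs, and matched_atoms is a hypothetical helper.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

zwitterion = "[NH3+]CC(=O)[O-]"  # N0 carries +1, O4 carries -1

cations = matched_atoms(zwitterion, "[+]")    # [0]
anions = matched_atoms(zwitterion, "[-]")     # [4]
explicit_one = matched_atoms(zwitterion, "[+1]")

# A doubly charged ion: ++ and +2 are interchangeable, bare + means exactly +1.
double_plus = matched_atoms("[Mg+2]", "[++]")
plus_two = matched_atoms("[Mg+2]", "[+2]")
single_plus = matched_atoms("[Mg+2]", "[+]")
print(cations, anions, explicit_one, double_plus, plus_two, single_plus)
```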
Heteroatom Neighbors
These primitives match atoms based on the number of heteroatom neighbors (non-C, non-H).
z<n> - exactly n heteroatom neighbors (aromatic or aliphatic).
Z<n> - exactly n aliphatic heteroatom neighbors.
Bare z and Z mean "has at least one neighbor" of that type.
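The sketch below contrasts z and Z on an aliphatic and an aromatic example. It assumes an RDKit version that supports these extension primitives; matched_atoms is an illustrative helper.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

acetic_acid = "CC(=O)O"  # C1 is bonded to two oxygens
pyridine = "c1ccncc1"    # atoms 2 and 4 flank the aromatic nitrogen (atom 3)

two_hetero = matched_atoms(acetic_acid, "[z2]")  # carboxyl carbon
any_hetero = matched_atoms(acetic_acid, "[z]")   # bare z: at least one

# The ring nitrogen is aromatic, so it counts for z but not for Z.
next_to_n = matched_atoms(pyridine, "[z1]")
next_to_aliphatic_hetero = matched_atoms(pyridine, "[Z1]")
print(two_hetero, any_hetero, next_to_n, next_to_aliphatic_hetero)
```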
Hybridization
The ^ primitive matches atoms by hybridization state. It requires a digit and has no default
value.
[^0] - S
[^1] - SP
[^2] - SP2
[^3] - SP3
[^4] - SP3D
[^5] - SP3D2
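As an illustration, the hybridization codes can be checked on a small enyne that contains SP3, SP2 and SP carbons. This sketch assumes RDKit is installed (hybridization is assigned when the SMILES is sanitized); matched_atoms is an illustrative helper.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

enyne = "CC=CC#C"  # C0 methyl, C1=C2 double bond, C3#C4 triple bond

sp3 = matched_atoms(enyne, "[^3]")  # [0]
sp2 = matched_atoms(enyne, "[^2]")  # [1, 2]
sp = matched_atoms(enyne, "[^1]")   # [3, 4]
print(sp3, sp2, sp)
```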
Logical Operators
Atom and bond primitives can be combined using logical operators to build complex queries. Operator precedence, lowest to highest:
; - low-precedence AND.
, - OR.
& - high-precedence AND (explicit).
! - NOT.
Example: [!C] matches any atom that is not an aliphatic carbon, so it still matches aromatic carbon; use [!#6] to match any non-carbon atom.
No operator between two primitives is equivalent to an implicit &. So [CH3] is the same as [C&H3].
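The effect of operator precedence is easiest to see side by side. The following sketch assumes RDKit is installed; the molecules and the matched_atoms helper are illustrative.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

ethanolamine = "NCCO"  # N0 (two H), C1, C2, O3 (one H)
phenol = "Oc1ccccc1"   # O0 plus six aromatic carbons

n_or_o = matched_atoms(ethanolamine, "[N,O]")      # [0, 3]
# ';' binds loosest: (N or O) and one hydrogen.
low_and = matched_atoms(ethanolamine, "[N,O;H1]")  # [3]
# '&' binds tighter than ',': N or (O and one hydrogen).
high_and = matched_atoms(ethanolamine, "[N,O&H1]") # [0, 3]
# Implicit AND and explicit '&' are equivalent.
implicit = matched_atoms(ethanolamine, "[CH2]")
explicit = matched_atoms(ethanolamine, "[C&H2]")

not_aliphatic_c = matched_atoms(phenol, "[!C]")    # aromatic carbons still match
not_carbon = matched_atoms(phenol, "[!#6]")        # [0]
print(n_or_o, low_and, high_and, implicit, explicit, not_aliphatic_c, not_carbon)
```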
Bond primitives
Bonds between atoms can also be constrained.
- - single bond
= - double bond
# - triple bond
$ - quadruple bond
: - aromatic bond
~ - any bond (wildcard)
@ - any ring bond
/ - directional bond "up" (for E/Z stereo)
\ - directional bond "down" (for E/Z stereo)
An unspecified bond in a SMARTS pattern matches either a single or an aromatic bond. CC behaves the same as C-C, because two aliphatic atoms cannot share an aromatic bond, whereas cc matches both c:c and c-c (for example the single bond joining the two rings of biphenyl). [#6][#6] will match both aromatic and single bonds.
Note: the /? and \? "up-or-unspecified" / "down-or-unspecified" directional
bond tokens appear in the original Daylight specification but are not present in the RDKit implementation.
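The behaviour of unspecified, aromatic, single and ring bonds can be checked by counting matches. This is a sketch assuming RDKit is installed; count_matches is an illustrative helper, and the counts rely on RDKit's default of returning each set of matched atoms once.

```python
from rdkit import Chem

def count_matches(smiles, smarts):
    """Number of unique matches of a SMARTS pattern (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    return len(mol.GetSubstructMatches(Chem.MolFromSmarts(smarts)))

biphenyl = "c1ccccc1-c1ccccc1"
unspecified = count_matches(biphenyl, "cc")  # 13: ring bonds plus the linker
aromatic = count_matches(biphenyl, "c:c")    # 12: ring bonds only
single = count_matches(biphenyl, "c-c")      # 1: the bond joining the rings

methylcyclohexane = "CC1CCCCC1"
ring_bonds = count_matches(methylcyclohexane, "C@C")    # 6
chain_bonds = count_matches(methylcyclohexane, "C!@C")  # 1: the methyl bond
print(unspecified, aromatic, single, ring_bonds, chain_bonds)
```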
Chirality
Tetrahedral chirality can be specified using @ (anticlockwise) and @@ (clockwise), looking from the first neighbour and following the same convention as SMILES. When included
in a SMARTS pattern, chirality acts as a matching constraint, while an atom with unspecified chirality in the query
matches both enantiomers. Note that RDKit substructure searches only apply this constraint when called with useChirality=True.
[C@H] - carbon with anticlockwise tetrahedral configuration.
[C@@H] - carbon with clockwise tetrahedral configuration.
The @? and @@? "unspecified chirality" tokens appear in the original Daylight
specification but are not supported in RDKit and will cause a parse error.
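A minimal sketch of chirality-aware matching, assuming RDKit is installed. To keep the neighbour ordering unambiguous it uses a stereocentre with four explicit heavy-atom neighbours rather than the [C@H] form; the molecule is an illustrative choice.

```python
from rdkit import Chem

target = Chem.MolFromSmiles("C[C@](F)(Cl)Br")

same = Chem.MolFromSmarts("C[C@](F)(Cl)Br")     # same configuration as the target
mirror = Chem.MolFromSmarts("C[C@@](F)(Cl)Br")  # opposite configuration
unspecified = Chem.MolFromSmarts("CC(F)(Cl)Br") # no chirality in the query

# Chirality is only compared when useChirality=True is passed.
same_strict = target.HasSubstructMatch(same, useChirality=True)
mirror_strict = target.HasSubstructMatch(mirror, useChirality=True)
mirror_default = target.HasSubstructMatch(mirror)
unspecified_strict = target.HasSubstructMatch(unspecified, useChirality=True)
print(same_strict, mirror_strict, mirror_default, unspecified_strict)
```

With the default settings the mirror-image query still matches, which is why the useChirality flag matters when stereochemistry is part of the question.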
Stereochemistry is a big topic; read more about it at https://www.rdkit.org/docs/RDKit_Book.html#stereochemistry
Recursive SMARTS
A recursive SMARTS [$(...)] wraps a full SMARTS pattern as a criterion on a single atom: it matches the atom that plays the role of the first atom of the inner pattern.
These expressions behave like atomic primitives and can be combined with other primitives using logical operators. For example:
[$(*C)] - any atom connected to a non-aromatic carbon.
[$(N[CH3]);$(NC[CH3])] - nitrogen atom connected to both a methyl and an ethyl side group.
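The second pattern above can be tried on three simple amines. This sketch assumes RDKit is installed; the amine SMILES strings are illustrative inputs.

```python
from rdkit import Chem

query = Chem.MolFromSmarts("[$(N[CH3]);$(NC[CH3])]")

def nitrogen_hits(smiles):
    """Matches of the recursive query on a molecule given as SMILES."""
    return Chem.MolFromSmiles(smiles).GetSubstructMatches(query)

ethyl_methyl = nitrogen_hits("CCN(C)CC")  # has both methyl and ethyl on N
trimethyl = nitrogen_hits("CN(C)C")       # methyl only
triethyl = nitrogen_hits("CCN(CC)CC")     # ethyl only

# A plain (non-recursive) pattern reports every atom in the pattern instead.
plain = Chem.MolFromSmiles("CCN(C)CC").GetSubstructMatches(
    Chem.MolFromSmarts("N[CH3]"))
print(ethyl_methyl, trimethyl, triethyl, plain)
```

Only the nitrogen itself is returned by the recursive form, even though the inner patterns describe its neighbours.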
Component-level Grouping
A dot (.) in a SMARTS pattern separates disconnected fragments. Each fragment can
match anywhere in the target - there is no constraint on which component it belongs to.
C.O - matches a carbon and an oxygen anywhere in the target. It matches atoms in CCO and CC.OO, but does not match CCC, because there is no oxygen present.
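The dot behaviour, together with one possible fragment-aware post-filter built on Chem.GetMolFrags, can be sketched as follows. It assumes RDKit is installed; fragment_index and matches_across_fragments are hypothetical helpers, and the filter assumes every dot-separated component of the query is a single atom.

```python
from rdkit import Chem

def fragment_index(mol):
    """Map each atom index to the index of the fragment it belongs to."""
    lookup = {}
    for frag_id, atom_ids in enumerate(Chem.GetMolFrags(mol)):
        for atom_id in atom_ids:
            lookup[atom_id] = frag_id
    return lookup

def matches_across_fragments(smiles, smarts):
    """Keep only matches whose atoms all lie in different fragments.

    Assumes each dot-separated query component is a single atom.
    """
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    frag_of = fragment_index(mol)
    return [m for m in mol.GetSubstructMatches(query)
            if len({frag_of[a] for a in m}) == len(m)]

dot_query = Chem.MolFromSmarts("C.O")
in_ethanol = Chem.MolFromSmiles("CCO").HasSubstructMatch(dot_query)  # True
in_propane = Chem.MolFromSmiles("CCC").HasSubstructMatch(dot_query)  # False

# C.C matches inside ethane, but the post-filter rejects same-fragment hits.
plain_ethane = Chem.MolFromSmiles("CC").HasSubstructMatch(Chem.MolFromSmarts("C.C"))
filtered_ethane = matches_across_fragments("CC", "C.C")
filtered_mixture = matches_across_fragments("CC.C", "C.C")
print(in_ethanol, in_propane, plain_ethane, filtered_ethane, len(filtered_mixture))
```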
The Daylight specification also describes component-level grouping with parentheses (e.g.
(C).(C) to require matches in separate components), but this is not supported
in RDKit. In RDKit, parentheses are only used for branching, and (C).(C) is a parse
error. Additionally, . does not enforce matching across different disconnected
fragments, so C.C can match within a single fragment. To correctly handle
fragment-level constraints, split the molecule first (e.g. with Chem.GetMolFrags) and match each fragment separately, or post-filter the results.
Range Queries
Many numeric primitives accept a range in curly braces instead of a fixed value. Supported
primitives: D, d, h, k, r, R, v, x, X, z, Z, +, -.
D{2-4} - between 2 and 4 explicit connections (inclusive).
D{-3} - at most 3 explicit connections.
D{2-} - at least 2 explicit connections.
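The three range forms can be compared on a branched alkane. This sketch assumes an RDKit version with range-query support; matched_atoms is an illustrative helper.

```python
from rdkit import Chem

def matched_atoms(smiles, smarts):
    """Sorted atom indices hit by a single-atom SMARTS (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    query = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(query))

# 2-methylbutane: explicit degrees are C0=1, C1=3, C2=1, C3=2, C4=1.
isopentane = "CC(C)CC"

closed = matched_atoms(isopentane, "[D{2-3}]")    # between 2 and 3
at_most_one = matched_atoms(isopentane, "[D{-1}]")
at_least_two = matched_atoms(isopentane, "[D{2-}]")
at_least_three = matched_atoms(isopentane, "[D{3-}]")
print(closed, at_most_one, at_least_two, at_least_three)
```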
Dative Bonds
Dative bonds <- and -> are covalent bonds in which both electrons
in the shared pair come from the same atom, so the bond is directional.
-> - dative bond pointing right (donor → acceptor)
<- - dative bond pointing left (acceptor ← donor)
The two patterns describe the same bond from opposite ends, so the matched atoms are reported in opposite order. For example, in the SMILES
[Fe]->CC1=O.CN(C1)(C)->[Pt] the nitrogen donates a dative bond to platinum: [#7]->* matches the nitrogen as donor, while *<-[#7] matches the
platinum as acceptor.
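A reduced sketch of the same idea, assuming RDKit is installed. It uses a simpler, illustrative SMILES (an amine nitrogen donating to platinum) and names the acceptor element explicitly in the query.

```python
from rdkit import Chem

# Atoms: C0, N1, C2, C3, Pt4; the nitrogen donates to platinum.
complex_mol = Chem.MolFromSmiles("CN(C)(C)->[Pt]")

bond_type = complex_mol.GetBondBetweenAtoms(1, 4).GetBondType()

# Same bond written from either end; atom order in the match follows the query.
donor_first = complex_mol.GetSubstructMatches(Chem.MolFromSmarts("[#7]->[Pt]"))
acceptor_first = complex_mol.GetSubstructMatches(Chem.MolFromSmarts("[Pt]<-[#7]"))
print(bond_type, donor_first, acceptor_first)
```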
Need more?
There is no better option than reading https://www.rdkit.org/docs/RDKit_Book.html.