Introduction to SMARTS

SMARTS was developed by Daylight Chemical Information Systems, the same company that introduced SMILES, and this documentation is heavily inspired by the original Daylight SMARTS theory. It follows the RDKit implementation, which adds many extensions such as hybridization queries, heteroatom-neighbor queries, range queries and dative bonds.

What is SMARTS?

SMARTS (SMiles ARbitrary Target Specification) is a language for describing molecular patterns and properties. It extends the SMILES notation to allow expressive queries over chemical structures, making it possible to search, filter, and classify molecules based on substructure patterns. In the SMILES language we have atoms and bonds. The same is true in SMARTS, which is further extended with property filters and logical operators.

The simplest SMARTS patterns match individual atoms, written either with or without square brackets. Non-bracket atoms follow SMILES notation. Bracket atoms are specified inside square brackets [ ] and can carry multiple constraints joined by logical operators.

Atomic Identity

An element symbol in brackets matches that element with the given aromaticity: a lower-case symbol means an explicitly aromatic atom, and a capitalized symbol means an explicitly aliphatic (non-aromatic) atom. An atom can also be specified by atomic number (#<n>), which matches regardless of aromaticity.

A number prefixed to the atom symbol designates the isotope. [35Cl] matches chlorine-35.

Aromatic (lower-case) atom queries work for boron b, carbon c, nitrogen n, oxygen o, phosphorus p, and sulfur s.

  • [C] - Explicit aliphatic (non-aromatic) carbon.
  • [c] - Explicit aromatic carbon.
  • [#6] - carbon by atomic number; matches both C and c.
  • [13C] - isotope (atomic mass), matches carbon-13.
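
The difference between these atom queries can be checked directly in RDKit. The sketch below assumes RDKit is installed; `matched_atoms` is a hypothetical helper written for this example, and toluene and carbon-13 labelled methanol are arbitrary test molecules.

```python
from rdkit import Chem


def matched_atoms(mol, smarts):
    """Sorted atom indices hit by a single-atom SMARTS pattern (illustrative helper)."""
    patt = Chem.MolFromSmarts(smarts)
    return sorted(match[0] for match in mol.GetSubstructMatches(patt))


toluene = Chem.MolFromSmiles("Cc1ccccc1")  # atom 0 is the methyl carbon

aliphatic = matched_atoms(toluene, "[C]")    # methyl carbon only
aromatic = matched_atoms(toluene, "[c]")     # the six ring carbons
any_carbon = matched_atoms(toluene, "[#6]")  # all seven carbons

labelled = Chem.MolFromSmiles("[13CH3]O")    # methanol with a carbon-13 label
isotope_hit = matched_atoms(labelled, "[13C]")
isotope_miss = matched_atoms(Chem.MolFromSmiles("CO"), "[13C]")
```

Here `[C]` picks only atom 0, `[c]` picks atoms 1 to 6, and `[#6]` picks all of them; `[13C]` hits the labelled carbon and nothing in unlabelled methanol.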

Wildcards and Aromaticity

Three special tokens match atoms without constraining the element:

  • [*] - wildcard; matches any atom regardless of element or aromaticity.
  • [a] - matches any aromatic atom.
  • [A] - matches any aliphatic (non-aromatic) atom.

The same tokens also work outside atomic brackets: aa is equivalent to [a][a].
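
As a quick check of the wildcard tokens, the following sketch counts matches in 4-methylpyridine, a molecule picked only for illustration. It assumes RDKit is installed; `count` is a hypothetical convenience function.

```python
from rdkit import Chem

picoline = Chem.MolFromSmiles("Cc1ccncc1")  # 4-methylpyridine


def count(smarts):
    """Number of unique matches of a SMARTS pattern in 4-methylpyridine."""
    return len(picoline.GetSubstructMatches(Chem.MolFromSmarts(smarts)))


n_any = count("[*]")               # 7: every heavy atom
n_aromatic = count("[a]")          # 6: ring atoms, including the nitrogen
n_aliphatic = count("[A]")         # 1: the methyl carbon
n_pairs_bare = count("aa")         # 6: one per ring bond
n_pairs_bracket = count("[a][a]")  # 6: same query written with brackets
```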

Hydrogen Count

Two separate primitives control hydrogen matching. They are distinct and can appear together inside [...]:

  • H<n> - total attached hydrogen count, including both explicit hydrogen atoms and implicit hydrogens.
  • h<n> - implicit hydrogen count. Bare h means "has any implicit hydrogens".

H without explicit count defaults to 1.

Example: [CH3] matches a carbon with exactly 3 attached hydrogens, and [Ch2] means exactly 2 implicit hydrogens.

In practice, the distinction between total and implicit hydrogen queries matters when hydrogens are present as explicit atoms, for example when you load molecules from MolBlock/SDF format.
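
A minimal sketch of the H versus h distinction, assuming RDKit is installed. Instead of reading an SDF file, it uses Chem.AddHs to turn the implicit hydrogens of ethane into explicit atoms, which creates the same situation.

```python
from rdkit import Chem

ethane = Chem.MolFromSmiles("CC")  # hydrogens are implicit
ethane_h = Chem.AddHs(ethane)      # hydrogens become real atoms in the graph

total_h3 = Chem.MolFromSmarts("[CH3]")     # total hydrogen count of 3
implicit_h3 = Chem.MolFromSmarts("[Ch3]")  # implicit hydrogen count of 3

total_before = len(ethane.GetSubstructMatches(total_h3))        # 2
implicit_before = len(ethane.GetSubstructMatches(implicit_h3))  # 2
total_after = len(ethane_h.GetSubstructMatches(total_h3))       # 2
implicit_after = len(ethane_h.GetSubstructMatches(implicit_h3)) # 0
```

Once the hydrogens are explicit atoms, the carbons no longer carry implicit hydrogens, so `[Ch3]` stops matching while `[CH3]` still does.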

Degree and Connectivity

  • D<n> - the number of explicit (non-implicit-Hydrogen) bonds connected to atom.
  • d<n> - the number of heavy-atom (non-Hydrogen) neighbors.
  • X<n> - total number of bonds including implicit hydrogens (total Connectivity).
  • v<n> - the sum of bond orders of all bonds (total valence).

Without an explicit number, D, d, X and v default to exactly 1.

Ring Membership

  • R<n> - number of rings the atom belongs to.
  • r<n> - size of the smallest ring containing this atom, taken from the smallest set of smallest rings (SSSR).
  • k<n> - ring membership by exact ring size.
  • x<n> - number of ring bonds to atom.

Without a number, R, r, k and x each mean "greater than zero", i.e. the atom is in at least one ring. All four support range queries (e.g. [k{5-6}]).

[R2] matches an atom that is in exactly 2 rings (e.g. a ring fusion atom). [r5] matches an atom whose smallest containing ring has exactly 5 members. [k5] matches an atom that belongs to a ring of exactly 5 members (unlike r5, which checks the minimum size).

Want to go deeper into finding rings? https://www.rdkit.org/docs/RDKit_Book.html#ring-finding-and-sssr

Formal Charge

  • +<n> - positive formal charge.
  • -<n> - negative formal charge.

Bare + means +1, and ++ is the same as +2. The same applies to - and --.
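
A short illustration of charge queries, assuming RDKit is installed. The glycine zwitterion and the calcium ion are example inputs chosen for this sketch, and `hits` is a hypothetical helper.

```python
from rdkit import Chem


def hits(mol, smarts):
    """Sorted indices of atoms matched by a single-atom SMARTS (illustrative helper)."""
    patt = Chem.MolFromSmarts(smarts)
    return sorted(m[0] for m in mol.GetSubstructMatches(patt))


zwitterion = Chem.MolFromSmiles("[NH3+]CC(=O)[O-]")  # N0, C1, C2, O3, O4

cations = hits(zwitterion, "[+]")   # [0]: the ammonium nitrogen
anions = hits(zwitterion, "[-]")    # [4]: the carboxylate oxygen
neutral = hits(zwitterion, "[+0]")  # [1, 2, 3]

calcium = Chem.MolFromSmiles("[Ca+2]")
double_plus = calcium.HasSubstructMatch(Chem.MolFromSmarts("[++]"))   # True
plus_two = calcium.HasSubstructMatch(Chem.MolFromSmarts("[+2]"))      # True
single_plus = calcium.HasSubstructMatch(Chem.MolFromSmarts("[+]"))    # False: + means exactly +1
```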

Heteroatom Neighbors

Match atoms based on the number of heteroatom neighbors (non-C, non-H).

  • z<n> - exactly n heteroatom neighbors (aromatic or aliphatic).
  • Z<n> - exactly n aliphatic heteroatom neighbors.

Bare z and Z mean "has at least one neighbor" of that type.
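
To see how z and Z differ, the sketch below uses 2-aminopyridine, where one ring carbon sits next to both an aliphatic amine nitrogen and an aromatic ring nitrogen. It assumes RDKit is installed; `hits` is a hypothetical helper.

```python
from rdkit import Chem


def hits(mol, smarts):
    """Sorted indices of atoms matched by a single-atom SMARTS (illustrative helper)."""
    patt = Chem.MolFromSmarts(smarts)
    return sorted(m[0] for m in mol.GetSubstructMatches(patt))


# 2-aminopyridine: N0 (amine), c1..c5 (ring carbons), n6 (ring nitrogen)
aminopyridine = Chem.MolFromSmiles("Nc1ccccn1")

two_hetero = hits(aminopyridine, "[cz2]")         # [1]: next to N0 and n6
one_hetero = hits(aminopyridine, "[cz1]")         # [5]: next to n6 only
one_aliphatic_het = hits(aminopyridine, "[cZ1]")  # [1]: only N0 counts as aliphatic
```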

Hybridization

The ^ matches atoms by hybridization state. It requires a digit and does not have a default value.

  • [^0] - S
  • [^1] - SP
  • [^2] - SP2
  • [^3] - SP3
  • [^4] - SP3D
  • [^5] - SP3D2
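
A minimal sketch of hybridization queries on pent-3-en-1-yne, a molecule chosen because it contains sp3, sp2 and sp carbons. It assumes RDKit is installed and relies on the hybridization that RDKit assigns when it sanitizes the molecule; `hits` is a hypothetical helper.

```python
from rdkit import Chem


def hits(mol, smarts):
    """Sorted indices of atoms matched by a single-atom SMARTS (illustrative helper)."""
    patt = Chem.MolFromSmarts(smarts)
    return sorted(m[0] for m in mol.GetSubstructMatches(patt))


enyne = Chem.MolFromSmiles("CC=CC#C")  # C0 methyl, C1=C2 alkene, C3#C4 alkyne

sp3_atoms = hits(enyne, "[^3]")  # [0]
sp2_atoms = hits(enyne, "[^2]")  # [1, 2]
sp_atoms = hits(enyne, "[^1]")   # [3, 4]
```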

Logical Operators

Atom and bond primitives can be combined using logical operators to build complex queries. The operators are listed here in order of precedence, from lowest to highest:

  • ; - low-precedence AND.
  • , - OR.
  • & - high-precedence AND (explicit).
  • ! - NOT operation.

Example: [!C] matches any atom that is not an aliphatic carbon, so aromatic carbons still match. Use [!#6] to exclude carbon altogether.

No operator between two primitives is equivalent to an implicit &. So [CH3] is the same as [C&H3].
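
Operator precedence changes what a query means. The sketch below, which assumes RDKit is installed, compares the two AND operators on ethanolamine and checks that the implicit AND behaves like an explicit &. The molecule and the `hits` helper are illustrative choices.

```python
from rdkit import Chem


def hits(mol, smarts):
    """Sorted indices of atoms matched by a single-atom SMARTS (illustrative helper)."""
    patt = Chem.MolFromSmarts(smarts)
    return sorted(m[0] for m in mol.GetSubstructMatches(patt))


ethanolamine = Chem.MolFromSmiles("NCCO")  # N0 carries two H, O3 carries one H

# ';' binds more loosely than ',': (N or O) and exactly one H
low_and = hits(ethanolamine, "[N,O;H1]")   # [3]
# '&' binds more tightly than ',': N or (O with exactly one H)
high_and = hits(ethanolamine, "[N,O&H1]")  # [0, 3]

implicit = hits(ethanolamine, "[CH2]")     # [1, 2]
explicit = hits(ethanolamine, "[C&H2]")    # [1, 2]
```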

Bond primitives

Bonds between atoms can also be constrained.

  • - - single bond
  • = - double bond
  • # - triple bond
  • $ - quadruple bond
  • : - aromatic bond
  • ~ - any bond (wildcard)
  • @ - any ring bond
  • / - directional bond "up" (for E/Z stereo)
  • \ - directional bond "down" (for E/Z stereo)

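An unspecified bond in a SMARTS pattern matches either a single or an aromatic bond. CC behaves like C-C, because a bond between two aliphatic atoms cannot be aromatic. cc is broader than c:c: it also matches a single bond joining two aromatic atoms, such as the bond linking the two rings of biphenyl. [#6][#6] will match both aromatic and single bonds.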
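
Biphenyl is a convenient test case for the default bond, because its two aromatic rings are joined by a single bond. The following sketch assumes RDKit is installed; `count` is a hypothetical helper.

```python
from rdkit import Chem

biphenyl = Chem.MolFromSmiles("c1ccccc1-c1ccccc1")


def count(smarts):
    """Number of unique matches of a SMARTS pattern in biphenyl."""
    return len(biphenyl.GetSubstructMatches(Chem.MolFromSmarts(smarts)))


aromatic_bonds = count("c:c")  # 12: six aromatic bonds in each ring
single_bonds = count("c-c")    # 1: the bond linking the two rings
default_bonds = count("cc")    # 13: single or aromatic
```

The unspecified bond in `cc` picks up everything that `c:c` and `c-c` find together.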

Note: the /? and \? "up-or-unspecified" / "down-or-unspecified" directional bond tokens appear in the original Daylight specification but are not present in the RDKit implementation.

Chirality

Tetrahedral chirality can be specified using @ (anticlockwise) and @@ (clockwise), looking from the first neighbour, following the same convention as SMILES. Unspecified chirality in the query matches both enantiomers. Note that RDKit only uses chirality as a matching constraint when the substructure search is run with useChirality=True; by default the chirality tags in the pattern are ignored.

  • [C@H] - carbon with anticlockwise tetrahedral configuration.
  • [C@@H] - carbon with clockwise tetrahedral configuration.

The @? and @@? "unspecified chirality" tokens appear in the original Daylight specification but are not supported in RDKit and will cause a parse error.
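
In RDKit, substructure searches ignore stereochemistry unless useChirality=True is passed. The sketch below illustrates this with a fully substituted stereocentre, so that all four neighbours are explicit in the pattern. It assumes RDKit is installed; the halomethane is an artificial example.

```python
from rdkit import Chem

query = Chem.MolFromSmarts("F[C@](Cl)(Br)I")
same = Chem.MolFromSmiles("F[C@](Cl)(Br)I")     # same handedness as the query
mirror = Chem.MolFromSmiles("F[C@@](Cl)(Br)I")  # opposite handedness

# Default search: chirality tags in the query are not compared.
default_same = same.HasSubstructMatch(query)      # True
default_mirror = mirror.HasSubstructMatch(query)  # True

# Chirality-aware search: only the matching enantiomer is accepted.
chiral_same = same.HasSubstructMatch(query, useChirality=True)      # True
chiral_mirror = mirror.HasSubstructMatch(query, useChirality=True)  # False
```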

Stereochemistry is a big topic; read more about it at https://www.rdkit.org/docs/RDKit_Book.html#stereochemistry

Recursive SMARTS

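A recursive SMARTS, written $(...) inside an atom bracket, defines a sub-query describing the environment of a single atom: the atom matches if it can serve as the first atom of the enclosed pattern.

These expressions behave like atomic primitives and can be combined with other primitives using logical operators. For example: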

  • [$(*C)] - any atom connected to a non-aromatic carbon
  • [$(N[CH3]);$(NC[CH3])] - nitrogen atom attached to both a methyl group and an ethyl-like group (a carbon that itself carries a methyl)
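
The two recursive patterns above can be tried on a few small amines. This sketch assumes RDKit is installed; the test molecules and the `hits` helper are illustrative. Note that a recursive SMARTS reports only the first atom of the environment, so each match contains a single atom index.

```python
from rdkit import Chem


def hits(smiles, smarts):
    """Sorted indices of atoms in `smiles` matched by a single-atom SMARTS."""
    mol = Chem.MolFromSmiles(smiles)
    patt = Chem.MolFromSmarts(smarts)
    return sorted(m[0] for m in mol.GetSubstructMatches(patt))


both = "[$(N[CH3]);$(NC[CH3])]"

ethyl_methyl = hits("CNCC", both)  # [1]: N has a methyl and an ethyl group
dimethyl = hits("CNC", both)       # []: no N-C-CH3 path
diethyl = hits("CCNCC", both)      # []: no methyl directly on N

next_to_carbon = hits("CO", "[$(*C)]")  # [1]: only the oxygen has a carbon neighbour
```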

Component-level Grouping

A dot (.) in a SMARTS pattern separates disconnected fragments. Each fragment can match anywhere in the target - there is no constraint on which component it belongs to.

  • C.O - a carbon and an oxygen anywhere in the target. Matches atoms in both CCO and CC.OO.
  • C.O does not match CCC, because there is no oxygen present.

Note: The Daylight SMARTS syntax defines component-level grouping using zero-level parentheses (e.g. (C).(C) to require matches in separate components), but this is not supported in RDKit. In RDKit, parentheses are only used for branching, and (C).(C) is a parse error. Additionally, . does not enforce matching across different disconnected fragments, so C.C can match within a single molecule. To handle fragment-level constraints correctly, split the molecule first (e.g. with Chem.GetMolFrags) and match each fragment separately, or post-filter the results.
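
One possible way to post-filter matches by fragment is sketched below. It assumes RDKit is installed; `cross_fragment_matches` is a hypothetical helper, not an RDKit function, and as written it only makes sense for patterns such as C.O in which every dot-separated component is a single atom.

```python
from rdkit import Chem

patt = Chem.MolFromSmarts("C.O")

# Plain matching: the dot places no constraint on fragments.
plain = {smi: Chem.MolFromSmiles(smi).HasSubstructMatch(patt)
         for smi in ("CCO", "CC.OO", "CCC")}


def cross_fragment_matches(mol, patt):
    """Keep only matches whose atoms all lie in different fragments.

    Illustrative helper; assumes each pattern component is a single atom.
    """
    fragment_of = {}
    for frag_id, atom_ids in enumerate(Chem.GetMolFrags(mol)):
        for atom_id in atom_ids:
            fragment_of[atom_id] = frag_id
    return [m for m in mol.GetSubstructMatches(patt)
            if len({fragment_of[a] for a in m}) == len(m)]


same_fragment = cross_fragment_matches(Chem.MolFromSmiles("CCO"), patt)     # []
two_fragments = cross_fragment_matches(Chem.MolFromSmiles("CC.OO"), patt)   # 4 C/O pairs
```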

Range Queries

Many numeric primitives accept a range in curly braces instead of a fixed value. Supported primitives: D, d, h, k, r, R, v, x, X, z, Z, +, -.

  • D{2-4} - between 2 and 4 explicit connections (inclusive)
  • D{-3} - at most 3 explicit connections
  • D{2-} - at least 2 explicit connections
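
The three range forms can be compared on 2,2-dimethylbutane, which contains carbons with one, two and four explicit connections. The sketch assumes RDKit is installed; the molecule and the `hits` helper are illustrative.

```python
from rdkit import Chem


def hits(mol, smarts):
    """Sorted indices of atoms matched by a single-atom SMARTS (illustrative helper)."""
    patt = Chem.MolFromSmarts(smarts)
    return sorted(m[0] for m in mol.GetSubstructMatches(patt))


# 2,2-dimethylbutane: C1 is quaternary (D4), C4 is a CH2 (D2), the rest are methyls (D1)
mol = Chem.MolFromSmiles("CC(C)(C)CC")

between_2_and_4 = hits(mol, "[D{2-4}]")  # [1, 4]
at_most_3 = hits(mol, "[D{-3}]")         # [0, 2, 3, 4, 5]
at_least_2 = hits(mol, "[D{2-}]")        # [1, 4]
```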

Dative Bonds

Dative bonds <- and -> are covalent bonds in which both electrons in the shared pair come from the same atom, so the bond is directional.

  • -> - dative bond pointing right (donor → acceptor)
  • <- - dative bond pointing left (acceptor ← donor)

The two patterns describe the same bond but start from different atoms. In the SMILES [Fe]->CC1=O.CN(C1)(C)->[Pt], the tertiary amine nitrogen donates a dative bond to platinum. [#7]->* matches the nitrogen as donor, while *<-[#7] matches the platinum as acceptor.
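
A minimal sketch of dative-bond matching, assuming RDKit is installed. For brevity it uses a simplified trimethylamine-platinum SMILES rather than the longer example in the text; the atom order in each match follows the order of atoms in the pattern.

```python
from rdkit import Chem

# Trimethylamine donating to platinum: C0, N1, C2, C3, Pt4
complex_mol = Chem.MolFromSmiles("CN(C)(C)->[Pt]")

donor_first = complex_mol.GetSubstructMatches(Chem.MolFromSmarts("[#7]->*"))
acceptor_first = complex_mol.GetSubstructMatches(Chem.MolFromSmarts("*<-[#7]"))

# Translate atom indices into element symbols for readability.
donor_symbols = [tuple(complex_mol.GetAtomWithIdx(i).GetSymbol() for i in m)
                 for m in donor_first]
acceptor_symbols = [tuple(complex_mol.GetAtomWithIdx(i).GetSymbol() for i in m)
                    for m in acceptor_first]
```

With this input, `donor_symbols` lists nitrogen before platinum and `acceptor_symbols` lists platinum before nitrogen.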

Need more?

The RDKit Book is the best place to continue: https://www.rdkit.org/docs/RDKit_Book.html