How to write a reaction in SMILES format

THE ZYMVOL BLOG


No matter if you are a chemist or not, if you are interested in chemical notation, we hope this post will make you “smile” 🙂

 

What’s SMILES in chemistry?

The Simplified Molecular-Input Line-Entry System (SMILES) is a user-friendly chemical notation for specifying the structure of molecules and reactions.

It consists of unambiguous, short, linear strings of characters in ASCII format, written in a language made of symbols and simple “grammar” rules.

SMILES was created to facilitate the storage, retrieval and modeling of chemical structures and information in computational chemistry. Thanks to its simple, compact format, it needs little computer memory and is convenient for people to read and write.

Plus, SMILES can be read by molecule editors and converted into two-dimensional and three-dimensional models. This is very useful, for example, when you need to study the structure of proteins such as enzymes!

 

The SMILES notation system was created in the 1980s at the Mid-Continent Ecology Division Laboratory in Duluth, Minnesota, and funded by the U.S. Environmental Protection Agency.

Since then, other organizations have modified and extended SMILES. There is also an open standard called OpenSMILES, developed by the Blue Obelisk open-source chemistry community.

 

Five rules for writing SMILES

Understanding SMILES is quite easy!

First of all, keep in mind that in SMILES, each notation string represents the topological structure of a molecule or a reaction.

Similar to the concept of a graph, in SMILES the atoms of a molecule are considered as nodes, bonds are the edges, parentheses indicate branching points and numeric labels designate ring connection points.

Benzoic acid
C1=CC=C(C=C1)C(=O)O

 

And now let’s learn the five rules to write in SMILES format:

Rule 1: Atoms

In SMILES, atoms are represented by their atomic symbols: O for oxygen, Br for bromine, and so on, using lower case for the second letter in two-character symbols.

For elements in the «organic subset» (B, C, N, O, P, S, F, Cl, Br, and I) and with their lowest normal valence, attached hydrogens usually don’t need to be written. That’s why methane (CH4) can be written simply as C.

However, all other elements, and organic-subset elements with other valences, must be written in brackets as follows:

 

[OH3+]

Inside brackets, any attached hydrogens have to be indicated by an «H», followed by a digit when there is more than one.

 

[Fe+3]

Meanwhile, formal charges must always be specified by the symbol «+» or «-», followed by a digit when the charge is larger than one.
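To make Rule 1 concrete, here is a minimal Python sketch that reads a bracket atom such as [OH3+] or [Fe+3] and returns its element, hydrogen count and formal charge. The function names (parse_bracket_atom, needs_brackets) are made up for this illustration, and the sketch assumes only the features described above: isotopes, chirality and other bracket contents are not handled.

```python
import re

# Elements that may be written without brackets (the "organic subset").
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S", "F", "Cl", "Br", "I"}

# Bracket atom: element symbol, optional H count, optional charge.
# Isotopes, chirality and atom classes are deliberately not handled.
BRACKET_ATOM = re.compile(r"\[([A-Z][a-z]?)(?:H(\d*))?([+-]\d*)?\]")


def needs_brackets(element):
    """True if the element is outside the organic subset."""
    return element not in ORGANIC_SUBSET


def parse_bracket_atom(text):
    """Return (element, hydrogen_count, formal_charge) for a bracket atom."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported bracket atom: {text!r}")
    element, h_digits, charge_text = match.groups()
    # No "H" at all means zero hydrogens; "H" alone means one.
    if h_digits is None:
        hydrogens = 0
    else:
        hydrogens = int(h_digits) if h_digits else 1
    # "+" alone means +1, "+3" means +3, "-" alone means -1.
    if charge_text is None:
        charge = 0
    else:
        sign = 1 if charge_text[0] == "+" else -1
        charge = sign * (int(charge_text[1:]) if charge_text[1:] else 1)
    return element, hydrogens, charge


print(parse_bracket_atom("[OH3+]"))
print(parse_bracket_atom("[Fe+3]"))
```

The two printed tuples correspond to the hydronium and iron(III) examples above.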

 

Rule 2: Bonds

Bonds between atoms are represented by different symbols depending on the type:

 

-    For single bonds
=    For double bonds (C=O, formaldehyde)
#    For triple bonds (C#N, hydrogen cyanide)
:    For aromatic bonds

 

But atoms that are next to each other are assumed to be connected by a single or aromatic bond, so these two may always be omitted, as in ethanol:

 

CCO
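Rules 1 and 2 together tell you how many hydrogens a SMILES string implies. The sketch below is one possible way to work that out for a simple unbranched chain: it adds up the bond orders used by each atom and subtracts them from the lowest normal valence. The function name implicit_hydrogens and the valence table are assumptions made for this example, and the sketch only accepts organic-subset atoms and the «-», «=» and «#» symbols.

```python
import re

# Lowest normal valence of each organic-subset element.
LOWEST_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
                  "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

# Two-letter symbols must be tried before one-letter ones.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def implicit_hydrogens(smiles):
    """Return [(element, implicit_H_count), ...] for an unbranched chain."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    atoms = []      # element symbols in order of appearance
    used = []       # sum of bond orders already used by each atom
    pending = 1     # an omitted bond symbol means a single bond
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if atoms:   # bond the new atom to the previous one
            used[-1] += pending
            used.append(pending)
        else:
            used.append(0)
        atoms.append(token)
        pending = 1
    return [(atom, LOWEST_VALENCE[atom] - count)
            for atom, count in zip(atoms, used)]


for example in ("C", "CCO", "C=O", "C#N"):
    print(example, implicit_hydrogens(example))
```

For ethanol (CCO) this gives three, two and one hydrogens, which matches CH3-CH2-OH.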

 

Rule 3: Branches

For molecules with branches, you just have to know that branches are enclosed in parentheses «( )», and that any bond symbol joining the branch to the “parent chain” goes inside the parentheses, as in C(=O).

Have a look at triethylamine. Its SMILES is CCN(CC)CC, where (CC) refers to the branch that starts from the nitrogen atom:

 

CCN(CC)CC
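If you are curious how a program follows the parentheses, the usual trick is a stack: «(» remembers the atom where the branch starts, and «)» goes back to it. The following is a minimal sketch under simplifying assumptions (single-letter atoms only, every bond treated as single); the names parse_branches and neighbors are invented for this example.

```python
def parse_branches(smiles):
    """Parse single-letter atoms and parentheses into (atoms, bonds).

    Bonds are (index_a, index_b) pairs; all bonds are treated as single.
    """
    atoms, bonds = [], []
    previous = None     # index of the atom the next atom attaches to
    stack = []          # branch points waiting for their ")"
    for char in smiles:
        if char == "(":
            stack.append(previous)
        elif char == ")":
            previous = stack.pop()
        elif char.isalpha() and char.isupper():
            atoms.append(char)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index))
            previous = index
        else:
            raise ValueError(f"unsupported character: {char!r}")
    return atoms, bonds


def neighbors(bonds, index):
    """Return the sorted indices of the atoms bonded to atom `index`."""
    return sorted(b if a == index else a
                  for a, b in bonds if index in (a, b))


atoms, bonds = parse_branches("CCN(CC)CC")
print(atoms)
print(neighbors(bonds, 2))  # the nitrogen is atom number 2
```

The nitrogen ends up with three carbon neighbours, one per ethyl group, as expected for triethylamine.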

 

Rule 4: Cyclic structures

Molecules that are shaped in a ring –like aromatic molecules– are also written linearly.

Ring opening and closure are indicated by a digit that follows the atomic symbol at each opening/closure. See for example cyclohexane:

 

C1CCCCC1

 

Curiously, different notations can represent the same cyclic structure, depending on where the ring starts to be written and where the bond symbols are placed.

For example, cyclohexene can be written as:

 

C1=CCCCC1
C=1CCCCC1
C1CCCCC=1
C=1CCCCC=1

 

 

All four are equally valid. They describe the same ring and differ only in where the double-bond symbol is written.
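A short sketch can show why the four cyclohexene spellings are equivalent. The parser below keeps track of open ring-closure digits and accepts a bond symbol at either end of the closure; it then lists the bond orders found. It is an illustration only, assuming single-letter atoms, single-digit ring labels and no branches, and the names parse_ring and bond_orders are hypothetical.

```python
def parse_ring(smiles):
    """Parse single-letter atoms, bond symbols and ring-closure digits.

    Returns (atoms, bonds) where each bond is (index_a, index_b, order).
    """
    orders = {"-": 1, "=": 2, "#": 3}
    atoms, bonds = [], []
    open_rings = {}     # digit -> (atom index, bond order or None)
    previous = None     # index of the most recent atom
    pending = None      # bond symbol seen but not yet used
    for char in smiles:
        if char in orders:
            pending = orders[char]
        elif char.isdigit():
            if char in open_rings:
                start, start_order = open_rings.pop(char)
                # The bond order may be given at either end of the closure.
                bonds.append((start, previous, pending or start_order or 1))
            else:
                open_rings[char] = (previous, pending)
            pending = None
        elif char.isalpha() and char.isupper():
            atoms.append(char)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, pending or 1))
            previous = index
            pending = None
        else:
            raise ValueError(f"unsupported character: {char!r}")
    if open_rings:
        raise ValueError("ring opened but never closed")
    return atoms, bonds


def bond_orders(smiles):
    """Return the sorted list of bond orders in the parsed structure."""
    return sorted(order for _, _, order in parse_ring(smiles)[1])


for spelling in ("C1=CCCCC1", "C=1CCCCC1", "C1CCCCC=1", "C=1CCCCC=1"):
    print(spelling, bond_orders(spelling))
```

Every spelling yields six ring bonds, exactly one of which is double, while cyclohexane (C1CCCCC1) yields six single bonds.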

Atoms in aromatic rings are written in lower case to set them apart, as in benzene. This SMILES represents a hexagonal ring of six carbons with one hydrogen atom attached to each:

 

c1ccccc1

 

Rule 5: Disconnected structures

SMILES does not only serve for writing single molecules. You can also represent disconnected compounds or, in other words, atoms not bonded to each other.

But how?

Disconnected structures are written as individual structures separated by a period: «.». The period simply means that there is no bond between the parts, and the order in which ions or ligands are listed is arbitrary.

For example, sodium chloride (table salt) is an ionic compound and its SMILES looks like this:

 

[Na+].[Cl-]
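In code, Rule 5 is the easiest one to handle: splitting on the period gives back the individual parts. The sketch below also adds up the formal charges written in brackets, to check that the salt is neutral overall. The function names split_components and net_charge are made up for this example, and the charge reader assumes the charge is the last thing inside the brackets.

```python
import re


def split_components(smiles):
    """Split a SMILES string into its disconnected parts."""
    return smiles.split(".")


def net_charge(smiles):
    """Sum the formal charges written at the end of each bracket atom."""
    total = 0
    for content in re.findall(r"\[([^\]]+)\]", smiles):
        match = re.search(r"([+-])(\d*)$", content)
        if match:
            sign = 1 if match.group(1) == "+" else -1
            # "+" alone counts as 1; "+3" counts as 3.
            total += sign * int(match.group(2) or 1)
    return total


print(split_components("[Na+].[Cl-]"))
print(net_charge("[Na+].[Cl-]"))
```

Writing the ions the other way round, [Cl-].[Na+], gives the same two parts in a different order.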

 

Are you starting to get it?

Thanks to these five rules, chemists can write very complex topological structures and unique strings for every existing molecular structure.

 

How can SMILES represent unique structures?

As you might have guessed from the rules, SMILES strings describe the two-dimensional graphs that chemists normally use to represent molecules.

Three-dimensional structures can also be obtained from SMILES strings with energy-minimization approaches, which predict the shape of the molecule by looking for the arrangement of its atoms and bonds with the lowest energy.

But how can this notation method be so precise when the diversity of molecules is so wide and complex?

As mentioned before, there is more than one SMILES for some molecules.

For example: OCC, C-C-O and C(O)C are all generic SMILES for ethanol. Generic SMILES do not take into account chiral or isotopic information. 

How has this been solved? With canonicalization: algorithms that generate one single specific SMILES among all valid possibilities. 

The unique SMILES can also take chiral and isotopic specifications into account. In the previous example, all the generic SMILES would be converted into the unique SMILES CCO, which a given canonicalization algorithm always produces for that specific chemical structure.
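To see the idea behind canonicalization, the sketch below turns each spelling of ethanol into a graph and then picks one agreed description of that graph. Be aware that this is a brute-force stand-in chosen for clarity, not the algorithm used by real toolkits: it tries every possible numbering of the atoms, which only works for tiny molecules. The names parse and graph_key are hypothetical, and only single-letter atoms, «-» bonds and branches are supported.

```python
from itertools import permutations


def parse(smiles):
    """Parse single-letter atoms, '-' bonds and branches into a graph."""
    atoms, bonds, stack, previous = [], set(), [], None
    for char in smiles:
        if char == "(":
            stack.append(previous)
        elif char == ")":
            previous = stack.pop()
        elif char == "-":
            continue    # explicit single bond, same as omitting it
        elif char.isalpha() and char.isupper():
            atoms.append(char)
            index = len(atoms) - 1
            if previous is not None:
                bonds.add(frozenset((previous, index)))
            previous = index
        else:
            raise ValueError(f"unsupported character: {char!r}")
    return atoms, bonds


def graph_key(smiles):
    """Return a key that is identical for all spellings of one graph.

    Brute force: try every renumbering of the atoms and keep the
    smallest description. Only practical for very small molecules.
    """
    atoms, bonds = parse(smiles)
    best = None
    for order in permutations(range(len(atoms))):
        # order[old] is the new number given to atom `old`.
        labels = [None] * len(atoms)
        for old, new in enumerate(order):
            labels[new] = atoms[old]
        edges = sorted(tuple(sorted(order[i] for i in bond))
                       for bond in bonds)
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


spellings = ["OCC", "C-C-O", "C(O)C", "CCO"]
print(len({graph_key(s) for s in spellings}))   # number of distinct keys
print(graph_key("COC") == graph_key("CCO"))     # dimethyl ether vs ethanol
```

All the ethanol spellings share one key, while dimethyl ether (COC), which has the same atoms connected differently, gets a different one.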

 

Using a SMILES generator

Understanding the SMILES “language” is always going to be useful for those who have to deal with chemical notations in computational format.

But don’t worry, you don’t need to learn SMILES by heart, because there are tools to generate SMILES.

For example, when our company launched ZYMSCAN, we knew we wanted to use a SMILES generator to make the user experience as easy as possible for chemists.

ZYMSCAN was created to help users know if a certain reaction can be performed enzymatically without wasting time going through other methods.

It consists of three simple steps, which start with submitting the substrate and product SMILES of the reaction of interest.

Thanks to SMILES’ unambiguous format, we can be sure we understand the user’s very specific needs correctly, because each notation describes exactly one molecule.
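Once you have the substrate and product SMILES, a whole reaction can be written on one line using the general reaction SMILES convention: reactants, agents and products separated by «>», with an empty middle part when no agents are given. The sketch below shows that convention only; it says nothing about how ZYMSCAN itself takes its input, and the function names are made up for this example.

```python
def build_reaction_smiles(substrates, products, agents=()):
    """Join molecule SMILES into 'reactants>agents>products'."""
    parts = (substrates, agents, products)
    return ">".join(".".join(part) for part in parts)


def split_reaction_smiles(reaction):
    """Split a reaction SMILES into (reactants, agents, products) lists."""
    reactants, agents, products = reaction.split(">")
    return tuple(part.split(".") if part else []
                 for part in (reactants, agents, products))


# Oxidation of ethanol to acetaldehyde, written without agents.
reaction = build_reaction_smiles(["CCO"], ["CC=O"])
print(reaction)
print(split_reaction_smiles(reaction))
```

Several substrates or products are joined with a period, exactly as in Rule 5.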

 

Are there alternatives to SMILES?

SMILES is not the only linear notation. The International Union of Pure and Applied Chemistry (IUPAC) created its own system to standardize identification in chemical databases: InChI.

Like SMILES, it is open source and freely accessible. The two are the most important and commonly used line notations today, and they complement each other.

The main difference is that SMILES is a chemical representation format rather than an identifier. In addition, while InChI is well documented and standardized through IUPAC, there is no up-to-date specification document for SMILES.

The latter is the main reason why the US Environmental Protection Agency, which created SMILES, is working on the interoperability of this format. The aim is to establish a formalized specification that promotes the exchange of scientific information alongside IUPAC’s InChI.