An overview of the RDKit
========================

What is it?
-----------

Open source toolkit for cheminformatics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Business-friendly BSD license
-  Core data structures and algorithms in C++
-  Python (2.x and 3.x) wrapper generated using Boost.Python
-  Java and C# wrappers generated with SWIG
-  2D and 3D molecular operations
-  Descriptor generation for machine learning
-  Molecular database cartridge for PostgreSQL
-  Cheminformatics nodes for KNIME (distributed from the KNIME community
   site: http://tech.knime.org/community/rdkit)

Operational:
~~~~~~~~~~~~

-  http://www.rdkit.org
-  Supports Mac/Windows/Linux
-  Releases every 6 months
-  Web presence:

   -  Homepage: http://www.rdkit.org
      Documentation, links
   -  Github (https://github.com/rdkit)
      Downloads, bug tracker, git repository
   -  Sourceforge (http://sourceforge.net/projects/rdkit)
      Mailing lists

-  Mailing lists at https://sourceforge.net/p/rdkit/mailman/; searchable
   archives are available for
   `rdkit-discuss <http://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/>`__
   and
   `rdkit-devel <http://www.mail-archive.com/rdkit-devel@lists.sourceforge.net/>`__

History:
~~~~~~~~

-  2000-2006: Developed and used at Rational Discovery for building
   predictive models for ADME, Tox, biological activity
-  June 2006: Open-source (BSD license) release of software, Rational
   Discovery shuts down
-  June 2006 to present: Open-source development continues, use within
   Novartis, contributions from Novartis back to open-source version

Functionality overview
----------------------

Basics
~~~~~~

-  Input/Output: SMILES/SMARTS, SDF, TDT, SLN `1 <#footnote1>`__, Corina
   mol2 `1 <#footnote1>`__, PDB, sequence notation, FASTA (peptides
   only), HELM (peptides only)
-  Substructure searching
-  Canonical SMILES
-  Chirality support (i.e. R/S or E/Z labeling)
-  Chemical transformations (e.g. remove matching substructures)
-  Chemical reactions
-  Molecular serialization (e.g. mol <-> text)
-  2D depiction, including constrained depiction
-  Fingerprinting: Daylight-like, atom pairs, topological torsions,
   Morgan algorithm, “MACCS keys”, extended reduced graphs, etc.
-  Similarity/diversity picking
-  Gasteiger-Marsili charges
-  Bemis and Murcko scaffold determination
-  Salt stripping
-  Functional-group filters
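
As a rough illustration of a few of the items above (SMILES input,
canonical SMILES, substructure searching, and Daylight-like
fingerprints), here is a minimal Python sketch. The example molecules
and the SMARTS pattern are arbitrary choices made for this illustration.

.. code-block:: python

    from rdkit import Chem, DataStructs

    # Parse two molecules from SMILES (structures chosen only as examples).
    aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
    salicylic_acid = Chem.MolFromSmiles("OC(=O)c1ccccc1O")

    # Canonical SMILES: two different input strings for ethanol
    # should produce the same output string.
    same_canonical = (Chem.MolToSmiles(Chem.MolFromSmiles("OCC"))
                      == Chem.MolToSmiles(Chem.MolFromSmiles("CCO")))

    # Substructure search using a SMARTS pattern for a carboxylic acid.
    acid_pattern = Chem.MolFromSmarts("C(=O)[OH]")
    has_acid = aspirin.HasSubstructMatch(acid_pattern)

    # Daylight-like (RDKit) fingerprints compared with Tanimoto similarity.
    fp_a = Chem.RDKFingerprint(aspirin)
    fp_b = Chem.RDKFingerprint(salicylic_acid)
    similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)

    print(same_canonical, has_acid, round(similarity, 2))

``Chem.MolFromSmiles`` returns ``None`` when parsing fails, so real code
would normally check for that before using the molecule.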

2D
~~

-  2D pharmacophores `1 <#footnote1>`__
-  Hierarchical subgraph/fragment analysis
-  RECAP and BRICS implementations
-  Multi-molecule maximum common substructure `2 <#footnote2>`__
-  Enumeration of molecular resonance structures
-  Molecular descriptor library:

   -  Topological (κ3, Balaban J, etc.)
   -  Compositional (Number of Rings, Number of Aromatic Heterocycles,
      etc.)
   -  Electrotopological state (Estate)
   -  clogP, MR (Wildman and Crippen approach)
   -  “MOE like” VSA descriptors
   -  MQN `6 <#footnote6>`__

-  Similarity Maps `7 <#footnote7>`__
-  Machine Learning:

   -  Clustering (hierarchical, Butina)
   -  Information theory (Shannon entropy, information gain, etc.)

-  Tight integration with the `Jupyter <http://jupyter.org>`__ notebook
   (formerly the IPython notebook) and
   `Pandas <http://pandas.pydata.org/>`__.
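
To give a feel for the descriptor library, the sketch below computes a
handful of the 2D descriptors listed above from Python. The molecule
(3-ethylpyridine) is an arbitrary example, and the selection of
descriptors is illustrative rather than exhaustive.

.. code-block:: python

    from rdkit import Chem
    from rdkit.Chem import Crippen, Descriptors, rdMolDescriptors

    # 3-ethylpyridine, chosen only as a small example with one heterocycle.
    mol = Chem.MolFromSmiles("CCc1cccnc1")

    # Wildman-Crippen logP and molar refractivity.
    logp = Crippen.MolLogP(mol)
    mr = Crippen.MolMR(mol)

    # Topological descriptors.
    kappa3 = Descriptors.Kappa3(mol)
    balaban_j = Descriptors.BalabanJ(mol)

    # Compositional descriptors.
    num_rings = rdMolDescriptors.CalcNumRings(mol)
    num_aromatic_heterocycles = rdMolDescriptors.CalcNumAromaticHeterocycles(mol)

    # Molecular quantum numbers (MQN): a fixed-length list of integer counts.
    mqn = rdMolDescriptors.MQNs_(mol)

    print(logp, mr, kappa3, balaban_j)
    print(num_rings, num_aromatic_heterocycles, len(mqn))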

3D
~~

-  2D->3D conversion/conformational analysis via distance geometry,
   including optional use of experimental torsion angle potentials.
-  UFF and MMFF94/MMFF94S implementations for cleaning up structures
-  Pharmacophore embedding (generate a pose of a molecule that matches a
   3D pharmacophore) `1 <#footnote1>`__
-  Feature maps
-  Shape-based similarity
-  RMSD-based molecule-molecule alignment
-  Shape-based alignment (subshape alignment `3 <#footnote3>`__)
   `1 <#footnote1>`__
-  Unsupervised molecule-molecule alignment using the Open3DAlign
   algorithm `4 <#footnote4>`__
-  Integration with PyMOL for 3D visualization
-  Molecular descriptor library:

   -  Feature-map vectors `5 <#footnote5>`__

-  Torsion Fingerprint Differences for comparing conformations
   `8 <#footnote8>`__
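
The 2D->3D conversion and force-field cleanup listed above can be
sketched from Python roughly as follows. The molecule (ethanol) and the
random seed are arbitrary choices for this illustration, and running
both UFF and MMFF94 on the same conformer is done here only to show the
two calls side by side.

.. code-block:: python

    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Hydrogens are added explicitly before generating 3D coordinates.
    mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))

    # Distance-geometry embedding; returns the new conformer id, or -1 on failure.
    conf_id = AllChem.EmbedMolecule(mol, randomSeed=42)
    if conf_id < 0:
        raise RuntimeError("embedding failed")

    # Force-field cleanup. A return value of 0 means the minimization
    # converged; 1 means more iterations would be needed.
    uff_status = AllChem.UFFOptimizeMolecule(mol)
    mmff_status = AllChem.MMFFOptimizeMolecule(mol)

    conformer = mol.GetConformer(conf_id)
    print(conformer.Is3D(), uff_status, mmff_status)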

Integration with other open-source projects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  `KNIME <https://tech.knime.org/community/rdkit>`__: Workflow and
   analytics tool
-  `Django <http://django-rdkit.readthedocs.org/en/latest/>`__: "The web
   framework for perfectionists with deadlines"
-  `PostgreSQL <https://github.com/rdkit/rdkit/blob/master/Docs/Book/Cartridge.rst>`__:
   Extensible relational database
-  `Lucene <https://github.com/rdkit/org.rdkit.lucene>`__: Text-search
   engine `1 <#footnote1>`__

The Contrib Directory
---------------------

The Contrib directory, part of the standard RDKit distribution, includes
code that has been contributed by members of the community.

LEF: Local Environment Fingerprints
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Contains Python source code from the publications:

-  A. Vulpetti, U. Hommel, G. Landrum, R. Lewis and C. Dalvit, "Design
   and NMR-based screening of LEF, a library of chemical fragments with
   different Local Environment of Fluorine" *J. Am. Chem. Soc.* **131**
   (2009) 12949-12959. http://dx.doi.org/10.1021/ja905207t
-  A. Vulpetti, G. Landrum, S. Ruedisser, P. Erbel and C. Dalvit, "19F NMR
   Chemical Shift Prediction with Fluorine Fingerprint Descriptor" *J.
   of Fluorine Chemistry* **131** (2010) 570-577.
   http://dx.doi.org/10.1016/j.jfluchem.2009.12.024

Contribution from Anna Vulpetti

M\_Kossner
~~~~~~~~~~

Contains a set of pharmacophoric feature definitions as well as code for
finding molecular frameworks.

Contribution from Markus Kossner

PBF: Plane of best fit
~~~~~~~~~~~~~~~~~~~~~~

Contains C++ source code and sample data from the publication:

N. C. Firth, N. Brown, and J. Blagg, "Plane of Best Fit: A Novel Method to
Characterize the Three-Dimensionality of Molecules" *Journal of Chemical
Information and Modeling* **52** 2516-2525 (2012).
http://pubs.acs.org/doi/abs/10.1021/ci300293f

Contribution from Nicholas Firth

mmpa: Matched molecular pairs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python source and sample data for an implementation of the
matched-molecular pair algorithm described in the publication:

Hussain, J., & Rea, C. "Computationally efficient algorithm to identify
matched molecular pairs (MMPs) in large data sets." *Journal of Chemical
Information and Modeling* **50** 339-348 (2010).
http://dx.doi.org/10.1021/ci900450m

Includes a fragment indexing algorithm from the publication:

Wagener, M., & Lommerse, J. P. "The quest for bioisosteric
replacements." *Journal of Chemical Information and Modeling* **46**
677-685 (2006).

Contribution from Jameed Hussain

SA\_Score: Synthetic accessibility score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python source for an implementation of the SA score algorithm described
in the publication:

Ertl, P. and Schuffenhauer, A. "Estimation of Synthetic Accessibility
Score of Drug-like Molecules based on Molecular Complexity and Fragment
Contributions" *Journal of Cheminformatics* **1:8** (2009)

Contribution from Peter Ertl

fraggle: A fragment-based molecular similarity algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python source for an implementation of the fraggle similarity algorithm
developed at GSK and described in this RDKit UGM presentation:
https://github.com/rdkit/UGM_2013/blob/master/Presentations/Hussain.Fraggle.pdf

Contribution from Jameed Hussain

pzc: Tools for building and validating classifiers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Contribution from Paul Czodrowski

ConformerParser: parser for Amber trajectory files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Contribution from Sereina Riniker

NP\_Score: Natural-product likeness score
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Python source for an implementation of the NP score algorithm described
in the publication:

| "Natural Product Likeness Score and Its Application for Prioritization
  of Compound Libraries"
| Peter Ertl, Silvio Roggo, and Ansgar Schuffenhauer
| *Journal of Chemical Information and Modeling* **48:68-74** (2008)
| http://pubs.acs.org/doi/abs/10.1021/ci700286x

Contribution from Peter Ertl

Footnotes
---------

1: These implementations are functional but are not necessarily the
best, fastest, or most complete.

2: Originally contributed by Andrew Dalke

3: Putta, S., Eksterowicz, J., Lemmen, C. & Stanton, R. "A Novel
Subshape Molecular Descriptor" *Journal of Chemical Information and
Computer Sciences* **43:1623–35** (2003).

4: Tosco, P., Balle, T. & Shiri, F. "Open3DALIGN: an open-source
software aimed at unsupervised ligand alignment." *J Comput Aided Mol
Des* **25:777–83** (2011).

5: Landrum, G., Penzotti, J. & Putta, S. "Feature-map vectors: a new
class of informative descriptors for computational drug discovery"
*Journal of Computer-Aided Molecular Design* **20:751–62** (2006).

6: Nguyen, K. T., Blum, L. C., van Deursen, R. & Reymond, J.-L.
"Classification of Organic Molecules by Molecular Quantum Numbers."
*ChemMedChem* **4:1803–5** (2009).

7: Riniker, S. & Landrum, G. A. "Similarity maps - a visualization
strategy for molecular fingerprints and machine-learning methods."
*Journal of Cheminformatics* **5:43** (2013).

8: Schulz-Gasch, T., Schärfer, C., Guba, W. & Rarey, M. "TFD: Torsion
Fingerprints As a New Measure To Compare Small Molecule Conformations."
*J. Chem. Inf. Model.* **52:1499–1512** (2012).

License
-------

This document is copyright (C) 2013-2015 by Greg Landrum

This work is licensed under the Creative Commons Attribution-ShareAlike
3.0 License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to
Creative Commons, 543 Howard Street, 5th Floor, San Francisco,
California, 94105, USA.

The intent of this license is similar to that of the RDKit itself. In
simple words: “Do whatever you want with it, but please give us some
credit.”
