Description: The QM9 dataset is an enumeration of around 134k stable organic molecules with up to 9 heavy atoms (carbon, oxygen, nitrogen and fluorine). As no filtering is applied, the molecules in this dataset only reflect basic structural constraints.
Statistics:
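| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| QM9 | Molecules | 133,885 | ~9 | ~19 | YES | NO | YES | NO | YES | 3D | NO | YES |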
Acknowledgement: L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.
R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. Von Lilienfeld, “Quantum chemistry structures and properties of 134 kilo molecules,” Scientific data, vol. 1, no. 1, pp. 1–7, 2014.
ZINC
Description: This dataset is a curated set of 250k commercially available drug-like chemical compounds. On average, these molecules are bigger (about 23 heavy atoms) and structurally more complex than the molecules in QM9.
Statistics:
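| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ZINC250K | Molecules | 249,455 | ~23 | ~50 | YES | NO | YES | NO | YES | 3D | NO | YES |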
Acknowledgement: J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman, “ZINC: a free tool to discover chemistry for biology,” Journal of chemical information and modeling, vol. 52, no. 7, pp. 1757–1768, 2012.
Jin, W., Yang, K., Barzilay, R., & Jaakkola, T. (2018). Learning multimodal graph-to-graph translation for molecular optimization. arXiv preprint arXiv:1812.01070.
MOSES
Description: Molecular Sets (MOSES) is a benchmark platform for distribution-learning-based molecule generation. Within this benchmark, MOSES provides a cleaned dataset of molecules that are well suited to optimization. It is processed from the ZINC Clean Leads dataset.
Statistics:
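| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOSES | Molecules | 1,936,963 | ~22 | ~47 | YES | NO | YES | NO | YES | 3D | NO | YES |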
Acknowledgement: Polykovskiy, Daniil, et al. "Molecular sets (MOSES): a benchmarking platform for molecular generation models." Frontiers in pharmacology 11 (2020).
ChEMBL
Description: ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
Statistics:
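| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChEMBL | Molecules | 1,799,433 | ~27 | ~58 | YES | NO | YES | NO | YES | 3D | NO | YES |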
Acknowledgement: Mendez, David, et al. “ChEMBL: towards direct deposition of bioassay data.” Nucleic acids research 47.D1 (2019): D930-D940.
Molecule Optimization
Description: TBA
Statistics:
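| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MolOpt | Molecules | 229,473 | ~24 | ~53 | YES | NO | YES | NO | YES | 3D | NO | YES |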
Download link:
Acknowledgement: Jin, W., Yang, K., Barzilay, R., Jaakkola, T. (2018). Learning multimodal graph-to-graph translation for molecular optimization. arXiv preprint arXiv:1812.01070.
Chemical Reaction
Description: The dataset contains 7,180 pairs of reactant and product molecule graphs. Each molecule has between 9 and 20 nodes (atoms), and the number of atoms for each pair is recorded in the file "Num_nodes.cxv". The folder "mol_edge" stores the adjacency matrices of the input and target graphs. Each adjacency matrix is 20 by 20; graphs with fewer than 20 nodes are padded with zeros. Matrix entries take values in [0,1,2,3,4], representing five bond types (none, single, double, triple, or aromatic). The folder "Mol_nodes" stores the node features of every graph. Each node feature encodes the atom type (82 types) as a one-hot vector with 82 dimensions. In this problem setting, the node features remain unchanged during the translation.
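The encoding described above can be illustrated with a minimal sketch. It is not the dataset's own preprocessing code. It assumes that the bond codes follow the order in which the bond types are listed (0 = none through 4 = aromatic), that the matrix is symmetric because the graphs are undirected, and that unused node rows are left as zeros. The helper names `pad_adjacency` and `one_hot_nodes`, and the atom type indices in the example, are hypothetical.

```python
from typing import List, Tuple

MAX_NODES = 20       # matrices are padded to 20 x 20 (from the description)
NUM_ATOM_TYPES = 82  # width of the one-hot atom-type vector (from the description)

# Assumption: codes follow the order in which the bond types are listed.
BOND_CODES = {"none": 0, "single": 1, "double": 2, "triple": 3, "aromatic": 4}


def pad_adjacency(bonds: List[Tuple[int, int, str]], num_atoms: int) -> List[List[int]]:
    """Build a symmetric MAX_NODES x MAX_NODES bond-type matrix, zero-padded."""
    if not 0 < num_atoms <= MAX_NODES:
        raise ValueError("num_atoms must be between 1 and MAX_NODES")
    adj = [[0] * MAX_NODES for _ in range(MAX_NODES)]
    for i, j, bond in bonds:
        if not (0 <= i < num_atoms and 0 <= j < num_atoms):
            raise ValueError(f"bond ({i}, {j}) refers to an atom outside the molecule")
        # Undirected graph: write the bond code in both directions.
        adj[i][j] = adj[j][i] = BOND_CODES[bond]
    return adj


def one_hot_nodes(atom_type_ids: List[int]) -> List[List[int]]:
    """Return MAX_NODES rows of width NUM_ATOM_TYPES; padding rows stay all zero."""
    if len(atom_type_ids) > MAX_NODES:
        raise ValueError("too many atoms for the padded representation")
    feats = [[0] * NUM_ATOM_TYPES for _ in range(MAX_NODES)]
    for row, type_id in enumerate(atom_type_ids):
        feats[row][type_id] = 1
    return feats


# Toy three-atom molecule; the atom type indices 5 and 7 are made up.
adj = pad_adjacency([(0, 1, "single"), (1, 2, "double")], num_atoms=3)
feats = one_hot_nodes([5, 7, 5])
```

Because the node features stay fixed during the translation, a reactant/product pair in this representation would share one feature matrix and differ only in the adjacency matrix.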
Statistics:
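| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChemReact | Molecules | 7,180 | ~20 | ~16 | YES | NO | YES | NO | YES | 3D | NO | YES |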
Acknowledgement: Guo X, Zhao L, Nowzari C, Rafatirad S, Homayoun H, Dinakarrao SM. Deep Multi-attributed Graph Translation with Node-Edge Co-evolution. In the 19th International Conference on Data Mining (ICDM 2019).
D. Lowe, “Patent reaction extraction: downloads,” 2014.