Description: This dataset contains 918 protein graphs with 100 ≤ \|V\| ≤ 500. Each protein is represented by a graph, where nodes are amino acids and two nodes are connected if they are less than 6 Angstroms apart.
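The distance rule above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' preprocessing code: it assumes one representative 3D coordinate per amino acid (the source does not say which atom is used), and the function name `contact_edges` and the sample coordinates are hypothetical.

```python
from itertools import combinations
from math import dist


def contact_edges(coords, cutoff=6.0):
    """Return undirected edges (i, j), i < j, for residues closer than `cutoff` Angstroms."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(coords), 2)
        if dist(p, q) < cutoff
    ]


# Hypothetical coordinates (in Angstroms) for four residues on a line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
edges = contact_edges(coords)
print(edges)  # [(0, 1), (1, 2)]
```

Only the pairs 3.8 Å apart are linked; the pair at 7.6 Å and every pair involving the distant fourth residue fall outside the 6 Å cutoff.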
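Statistics:

| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Protein | Proteins | 1,113 | ~39 | ~73 | YES | NO | NO | NO | YES | NO | NO | YES |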
Acknowledgement: P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures from non-enzymes without alignments,” Journal of Molecular Biology, vol. 330, no. 4, pp. 771–783, 2003.
Protein Folding
Description: This dataset contains dynamic folding processes of a peptide with sequence AGAAAAGA in 38 steps. The node feature of each graph is the sequence (AGAAAAGA) together with the spatial location of each amino acid, and the edge feature of each graph is an adjacency matrix constructed by connecting all pairs of nodes that are less than 8 Å apart.
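As a rough sketch of how one folding snapshot could be turned into node features and an adjacency matrix, the following Python code pairs each residue with its 3D position and applies the 8 Å cutoff. The one-hot residue encoding, the function names, and the straight-line coordinates are assumptions made for illustration; the dataset's actual feature layout may differ.

```python
from math import dist

SEQUENCE = "AGAAAAGA"
ALPHABET = sorted(set(SEQUENCE))  # ['A', 'G']


def node_features(sequence, coords):
    """One row per residue: one-hot residue type followed by its (x, y, z) position."""
    rows = []
    for residue, xyz in zip(sequence, coords):
        one_hot = [1.0 if residue == letter else 0.0 for letter in ALPHABET]
        rows.append(one_hot + list(xyz))
    return rows


def adjacency(coords, cutoff=8.0):
    """Symmetric 0/1 matrix linking distinct residues closer than `cutoff` Angstroms."""
    n = len(coords)
    return [
        [1 if i != j and dist(coords[i], coords[j]) < cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical snapshot: eight residues spaced 3.8 Angstroms apart on a line.
coords = [(3.8 * k, 0.0, 0.0) for k in range(len(SEQUENCE))]
X = node_features(SEQUENCE, coords)  # 8 rows of 5 values each
A = adjacency(coords)                # 8 x 8 matrix
```

Repeating this for every step of a folding trajectory would give one (features, adjacency) pair per step, which is one way to read the Spatial and Temporal entries in the statistics for this dataset.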
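Statistics:

| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProFold | Proteins | 76,000 | 8 | ~40 | YES | NO | NO | NO | YES | 3D | YES | YES |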
Acknowledgement: X. Guo, Y. Du, and L. Zhao, "Disentangled Deep Generative Model for Spatial Networks", ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021.
Enzyme
Description: This dataset contains protein tertiary structures representing 600 enzymes. Nodes in a graph (protein) represent secondary structure elements, and two nodes are connected if the corresponding elements interact. The node labels indicate the type of secondary structure: helix, turn, or sheet.
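Statistics:

| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Enzymes | Proteins | 600 | ~33 | ~62 | YES | NO | NO | NO | YES | NO | NO | YES |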
Acknowledgement: I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg, “BRENDA, the enzyme database: updates and major new developments,” Nucleic Acids Research, vol. 32, no. suppl 1, pp. D431–D433, 2004.