Data and Code for "Designing the Next Better Catalyst Utilizing Machine Learning with a Key-Intermediate Graph: Differentiating a Methyl from an Ethyl Group"

dc.contributor: Gerbig, Dennis
dc.contributor: Wende, Raffael C.
dc.contributor: Schreiner, Peter R.
dc.contributor.author: Pereira, Oliver
dc.contributor.author: Ruth, Marcel
dc.date.accessioned: 2023-11-17T06:44:29Z
dc.date.available: 2023-11-17T06:44:29Z
dc.date.issued: 2023-11-15
dc.description.abstract [de_DE]:

# Dataset and Scripts Overview

## General Overview

This dataset includes a series of Python scripts and Jupyter notebooks that are primarily focused on the analysis, modeling, and visualization of chemical data. The scripts encompass various aspects of data preprocessing, including graph-based transformations, model definitions for Graph Neural Networks (GNNs) and Feedforward Neural Networks (FFNNs), training utilities like early stopping, and comprehensive workflows for training, evaluating, and visualizing model performance.

## File Descriptions

### Python Scripts

1. **CV_cat-subs.py**: Script for training and evaluating a machine learning model using cross-validation.
2. **LOOCV_CV.py**: Similar to CV_cat-subs.py but employs Leave-One-Out Cross-Validation for model evaluation.
3. **config.py**: Configuration file containing optimal parameters for the models.
4. **eval_mol_representation.py**: Evaluates molecular representations using various machine learning models.
5. **hpo_cbs.py**: Hyperparameter optimization script for tuning Graph Neural Network models.
6. **models.py**: Defines neural network models, including Graph Neural Networks.
7. **preprocessing_data_new.py**: Preprocesses the dataset, preparing it for analysis and modeling.
8. **preprocessing_graph_new.py**: Prepares graph-based data representations, essential for GNNs.
9. **screening_models.py**: Provides additional model definitions for various machine learning tasks.
10. **training.py**: Contains utilities for model training, including an early stopping mechanism.

### Jupyter Notebooks

1. **CV_plots.ipynb**: Focuses on data visualization, particularly for cross-validation results.
2. **main.ipynb**: A comprehensive notebook covering data preprocessing, model training, evaluation, and visualization.

## Dataset Structure: "CBS_10-04-2023.csv"

The dataset "CBS_10-04-2023.csv" is a key component of this collection. It includes various chemical properties and molecular structures relevant to the domain of cheminformatics. The structure of the dataset comprises columns that detail different chemical entities (catalyst, substrate, product), reaction conditions, and results. This dataset is used extensively throughout the scripts and notebooks for preprocessing, analysis, and model training. Understanding its structure is crucial for interpreting the results and for any further modification or analysis.

## Usage Instructions

To use these scripts and notebooks, ensure you have Python installed along with necessary libraries like Pandas, NumPy, Torch, Torch Geometric, and RDKit. Each script can be executed independently, provided the required data files are available in the specified paths. The Jupyter notebooks can be run in a sequence for a complete end-to-end workflow.

## Variable and Function Explanations

- Variables and functions within the scripts and notebooks are named to reflect their purpose in data processing, modeling, or visualization tasks. Specific domain-related variables, such as those handling chemical properties or molecular structures, are used in accordance with standard practices in cheminformatics.

## Additional Notes

- These scripts and notebooks are tailored for chemical data analysis and may require domain-specific understanding for optimal usage and interpretation of results.
dc.identifier.uri: https://jlupub.ub.uni-giessen.de//handle/jlupub/18631
dc.identifier.uri: http://dx.doi.org/10.22029/jlupub-17995
dc.language.iso: en [de_DE]
dc.relation: http://dx.doi.org/10.22029/jlupub-18463
dc.rights: CC0 1.0 Universal [*]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [*]
dc.subject: Machine Learning [de_DE]
dc.subject: Organocatalysis [de_DE]
dc.subject: Chemistry [de_DE]
dc.subject: CBS-Reduction [de_DE]
dc.subject.ddc: ddc:540 [de_DE]
dc.title: Data and Code for "Designing the Next Better Catalyst Utilizing Machine Learning with a Key-Intermediate Graph: Differentiating a Methyl from an Ethyl Group" [de_DE]
dc.type: Dataset [de_DE]
local.affiliation: FB 08 - Biologie und Chemie [de_DE]
local.project: SPP 2363, Schr 597/41-1 [de_DE]

Files

Original bundle

Now showing 1 - 17 of 17
Name: pred_df.csv
Size: 11.1 KB
Format: Unknown data format
Description: DataFrame of all predictions made by our ML model, given in ee and ddG (absolute and relative values) for various catalysts in the CBS reduction of butanone.
Name: screening_models.py
Size: 2.2 KB
Format: Unknown data format
Description: Models used for screening of the molecular representations.
Name: eval_mol_representation.py
Size: 10.98 KB
Format: Unknown data format
Description: Script to screen different molecular representations (graph, fingerprint, ...) for their performance in modeling the CBS reduction.
Name: Net15_11_2023_18_48.pt
Size: 4.82 MB
Format: Unknown data format
Description: Trained ML model for the CBS reduction.
Name: hpo_cbs.py
Size: 6.43 KB
Format: Unknown data format
Description: Script to search for the optimal hyperparameters for our GNN-based ML model.
Name: config.py
Size: 515 B
Format: Unknown data format
Description: Config file containing the best parameters found by the hyperparameter optimization.
Name: models.py
Size: 1.75 KB
Format: Unknown data format
Description: The main model class of our best GNN-based ML model.
Name: CV_cat-subs.py
Size: 8.81 KB
Format: Unknown data format
Description: Script to run the cross-validation with "chemical splitting" of training and test data. Certain molecules are excluded entirely from the training set and are then the only ones on which the model is validated. This is repeated for every molecule in the dataset.
Name: LOOCV_CV.py
Size: 7.97 KB
Format: Unknown data format
Description: Script to perform leave-one-out cross-validation on our GNN-based ML model.
Name: main.ipynb
Size: 1.14 MB
Format: Unknown data format
Description: Notebook to train and test our GNN-based ML model and to predict new catalysts with it.
Name: training.py
Size: 2.06 KB
Format: Unknown data format
Description: Supporting file that contains functions and classes for training our GNN-based ML model.
Name: preprocessing_data_new.py
Size: 10.69 KB
Format: Unknown data format
Description: File that contains the functions used to preprocess the raw data.
Name: preprocessing_graph_new.py
Size: 9.91 KB
Format: Unknown data format
Description: File that contains the functions to generate the "key-intermediate" graph representations used by our GNN-based ML model.
Name: CV_plots.ipynb
Size: 625.81 KB
Format: Unknown data format
Description: Notebook to create the cross-validation plots used to evaluate our model.
Name: CBS_10-04-2023.csv
Size: 13.16 KB
Format: Unknown data format
Description: The dataset used. The selectivity for each reaction is given in ee and ddG.
Name: CV_results.zip
Size: 122.95 KB
Format: Unknown data format
Description: Archive that contains the CSV files generated by the cross-validation scripts. Each CSV has two columns: the predicted and the true value of ddG in kcal/mol.
Name: README.md
Size: 3.27 KB
Format: Unknown data format
Description: Readme file that contains additional information about the data structure and how the scripts and Jupyter notebooks can be used.
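
The file descriptions state that selectivities are given in ee and ddG, and that each cross-validation CSV holds a predicted and a true ddG value in kcal/mol. The following is a minimal sketch of how such values could be handled. It is not taken from the deposited scripts. The function names, the default temperature of 298.15 K, the presence of a header row, and the column order (predicted first, true second) are all assumptions made for illustration. The ee to ddG conversion uses the standard relation ddG = R·T·ln(er) with er = (100 + ee) / (100 − ee).

```python
import csv
import io
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ee_to_ddg(ee_percent, temperature_k=298.15):
    """Convert an enantiomeric excess (in %) to ddG in kcal/mol.

    Uses ddG = R*T*ln(er) with er = (100 + ee) / (100 - ee).
    The default temperature is an assumption, not a value from the dataset.
    """
    er = (100.0 + ee_percent) / (100.0 - ee_percent)
    return R_KCAL * temperature_k * math.log(er)


def cv_errors(csv_text, has_header=True):
    """Return (MAE, RMSE) for a two-column CSV of predicted and true ddG.

    Assumes column 0 is the prediction and column 1 the true value.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header:
        rows = rows[1:]
    diffs = [float(row[0]) - float(row[1]) for row in rows]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Hypothetical values, only to show the call pattern.
example_csv = "pred,true\n1.0,1.5\n0.2,0.0\n"
mae, rmse = cv_errors(example_csv)
print(f"MAE = {mae:.3f} kcal/mol, RMSE = {rmse:.3f} kcal/mol")
print(f"ddG at 90% ee = {ee_to_ddg(90.0):.2f} kcal/mol")
```

To apply this to a file extracted from CV_results.zip, read the file contents into a string first and check whether the actual layout (header, column order, any index column) matches these assumptions.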

License bundle

Now showing 1 - 1 of 1
Name: license.txt
Size: 7.58 KB
Format: Item-specific license agreed upon to submission
Description:

Collections