
# Dataset and Scripts Overview

## General Overview
This collection contains Python scripts and Jupyter notebooks for analyzing, modeling, and visualizing chemical 
data. The scripts cover data preprocessing (including graph-based transformations), model definitions for Graph 
Neural Networks (GNNs) and Feedforward Neural Networks (FFNNs), training utilities such as early stopping, and 
complete workflows for training, evaluating, and visualizing model performance.

## File Descriptions

### Python Scripts
1. **CV_cat-subs.py**: Script for training and evaluating a machine learning model using cross-validation.
2. **LOOCV_CV.py**: Similar to CV_cat-subs.py, but evaluates the model with Leave-One-Out Cross-Validation (LOOCV).
3. **config.py**: Configuration file containing optimal parameters for the models.
4. **eval_mol_representation.py**: Evaluates molecular representations using various machine learning models.
5. **hpo_cbs.py**: Hyperparameter optimization script for tuning Graph Neural Network models.
6. **models.py**: Defines neural network models, including Graph Neural Networks.
7. **preprocessing_data_new.py**: Preprocesses the dataset, preparing it for analysis and modeling.
8. **preprocessing_graph_new.py**: Prepares graph-based data representations, essential for GNNs.
9. **screening_models.py**: Provides additional model definitions for various machine learning tasks.
10. **training.py**: Contains utilities for model training, including an early stopping mechanism.
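
The early stopping mechanism mentioned for training.py can be illustrated with a minimal sketch. This is not the code from training.py: the class name `EarlyStopping`, its `patience` and `min_delta` parameters, and the `step` method are assumptions chosen for illustration, and the loss values are made up.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs.

    Hypothetical helper for illustration; the interface in training.py may differ.
    """

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.counter = 0              # epochs since the last improvement

    def step(self, val_loss):
        """Record one validation loss and return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


# Made-up validation losses standing in for a real training loop.
stopper = EarlyStopping(patience=2)
losses = [0.9, 0.7, 0.71, 0.72, 0.5]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In a real training loop, `step` would be called once per epoch with the validation loss, and the best model weights would typically be saved whenever the loss improves.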

### Jupyter Notebooks
1. **CV_plots.ipynb**: Focuses on data visualization, particularly for cross-validation results.
2. **main.ipynb**: A comprehensive notebook covering data preprocessing, model training, evaluation, and visualization.

## Dataset Structure: "CBS_10-04-2023.csv"
The dataset "CBS_10-04-2023.csv" is the central data file of this collection. It contains chemical properties 
and molecular structures, organized into columns that describe the chemical entities (catalyst, substrate, 
product), the reaction conditions, and the results. The scripts and notebooks use this file for preprocessing, 
analysis, and model training, so understanding its structure is essential for interpreting the results and for 
any further modification or analysis.
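
Before running the scripts, it can help to confirm that the CSV header contains the columns you expect. The sketch below uses only the standard library. The column names in `EXPECTED_COLUMNS` are hypothetical placeholders, not the actual header of "CBS_10-04-2023.csv", and the demo runs on in-memory stand-in data rather than the real file.

```python
import csv
import io

# Hypothetical column names; replace them with the real header of the CSV file.
EXPECTED_COLUMNS = ["catalyst", "substrate", "product"]


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # empty list if the file has no rows
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Demo on made-up, in-memory data (not the real dataset).
sample = io.StringIO("catalyst,substrate,yield\nC1,S1,0.8\n")
demo_missing = missing_columns(sample)
```

For the real file, the same function can be called as `missing_columns(open("CBS_10-04-2023.csv", newline=""))`, assuming the file sits in the working directory.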

## Usage Instructions
To use these scripts and notebooks, install Python along with the required libraries: Pandas, NumPy, PyTorch, 
PyTorch Geometric, and RDKit. Each script can be executed independently, provided the required data files are 
available at the paths the script expects. The Jupyter notebooks can be run in sequence for a complete 
end-to-end workflow.
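
A quick way to check that the libraries listed above are importable is sketched below. The import names (`torch` for PyTorch, `torch_geometric` for PyTorch Geometric, `rdkit` for RDKit) are the usual ones for these packages; the helper `find_missing` is a hypothetical name introduced here, and the check does not verify versions.

```python
from importlib.util import find_spec

# Import names of the libraries listed in the usage instructions.
REQUIRED_MODULES = ["pandas", "numpy", "torch", "torch_geometric", "rdkit"]


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```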

## Variable and Function Explanations
- Variables and functions within the scripts and notebooks are named to reflect their purpose in data processing, 
modeling, or visualization tasks. Specific domain-related variables, such as those handling chemical properties 
or molecular structures, are used in accordance with standard practices in cheminformatics.

## Additional Notes
- These scripts and notebooks are tailored for chemical data analysis and may require domain-specific understanding 
for optimal usage and interpretation of results.
