{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\" Created on November 13, 2023 // Updated on March 20, 2026 // @author: Sarah Shi \"\"\"\n", "\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import mineralML as mm\n", "from sklearn.metrics import classification_report\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'png'" ] }, { "cell_type": "markdown", "metadata": {}, "source": "# mineralML Quickstart for Tabular Data\n\nThis notebook shows **how to load and run your data through mineralML** with an example CSV: `training_hundred.csv`. This is a five-step process: \n1. Load a CSV with `mm.load_df` (or `pd.read_csv` directly), then clean and align columns with `mm.prep_df`.\n2. Run data through the neural network with `mm.predict_class_prob` to derive classifications and prediction scores. \n3. Export prediction scores with `mm.export_predictions_to_excel`.\n4. Examine predictions with `classification_report`, `mm.confusion_matrix_df`, and `mm.pp_matrix`.\n5. Project data into latent space with `mm.plot_latent_space` for visualization. \n\nWe imported the **mineralML** Python package as `mm`. **mineralML** provides trained machine learning models for classifying minerals. This implementation aims to get your electron microprobe or quantitative EDS compositions classified and processed. We remove some degrees of freedom to simplify the process as much as possible. The minerals considered for this study include: Amphibole, Apatite, Biotite, Calcite, Chlorite, Epidote, Feldspar (Alkali Feldspar and Plagioclase), Garnet, Glass, Kalsilite, Leucite, Melilite, Muscovite, Nepheline, Olivine, Oxide (Rhombohedral_Oxides including Hematite-Ilmenite, Spinel_Group including Magnetite-Spinel), Pyroxene (Clinopyroxene, Orthopyroxene, Na-Pyroxene), Quartz, Rutile, Serpentine, Titanite, Tourmaline, and Zircon. 
\n\nYou need one CSV file containing your electron microprobe analyses in oxide weight percent. Find an example [here](https://github.com/sarahshi/mineralML/blob/main/docs/examples/training_hundred.csv). The necessary oxides are SiO$_2$, TiO$_2$, Al$_2$O$_3$, FeO$_t$, MnO, MgO, CaO, Na$_2$O, K$_2$O, Cr$_2$O$_3$, P$_2$O$_5$, and ZrO$_2$ (if you are aiming to classify zircon). For oxides that were not analyzed for a given mineral, the preprocessing fills the NaN values with 0. \n\nWe will apply the neural network method to the dataset." }, { "cell_type": "markdown", "metadata": {}, "source": "## 1. Load and prepare data for analysis\n\nWe will use `mm.load_df` and `mm.prep_df` to do so." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read in your dataframe of mineral data, called training_hundred.csv. \n", "df_load = mm.load_df('TabularData/training_hundred.csv')\n", "\n", "# Prepare the dataframe by removing rows with too many NaNs, and filling in zeros. \n", "df_nn = mm.prep_df(df_load, # dataframe to prepare\n", " renormalize=False, # optionally renormalize rows to sum to 100 wt%\n", " convert_fe=False, # optionally convert disparate input formats of Fe all to FeOt\n", " drop_empty_rows=False, # optionally drop rows with more nan values than the min_oxide_count\n", " min_oxide_count=2, # minimum number of oxides in a row to keep that analysis\n", " verbose=True\n", " )\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Examine the prepared dataframe\n", "\n", "display(df_nn.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": "## 2. Apply the trained neural network (mm.predict_class_prob)\n\nWe will use `mm.predict_class_prob` to do so." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The trained neural network can be applied in just one line. 
It returns predictions in columns called \"Predict_Mineral\", \"Submineral\" (if applicable, for pyroxenes, feldspars, and oxides), \"Predict_Probability\", \"Second_Predict_Mineral\", \"Second_Predict_Probability\".\n", "\n", "df_pred_nn = mm.predict_class_prob(df_nn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Examine the predicted mineral classifications\n", "\n", "display(df_pred_nn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": "There is a good amount of information in this dataframe. The predicted mineral is provided in the `Predict_Mineral` column, along with the prediction score expressed in the `Prediction_Score` column (representing likelihood of prediction) and standard deviation on this prediction in the `Prediction_Score_Sigma` column." }, { "cell_type": "markdown", "metadata": {}, "source": "## 3. Export prediction results\n\nSay you would like to go back to working with Excel now. Use `mm.export_predictions_to_excel` to export the predictions and these values. All the original input data are returned in the first sheet, and data are split into individual mineral phases in all other sheets." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export prediction results to an Excel workbook with one sheet called \"All\" containing all rows, and additional sheets for each predicted mineral.\n", "\n", "mm.export_predictions_to_excel(df_pred_nn, filename='TabularData/prediction_results.xlsx')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Examine prediction results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a classification report to determine the accuracy, precision, f1, etc. This is possible in this case because these are our training data, where we know the classes. 
\n", "\n", "nn_valid_report = classification_report(\n", " df_pred_nn['Mineral'], df_pred_nn['Predict_Mineral'], zero_division=0\n", ")\n", "print(\"Validation Report:\\n\", nn_valid_report)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create and plot a confusion matrix \n", "\n", "# This compares your stated mineral and mineralML's predicted mineral\n", "cm = mm.confusion_matrix_df(df_pred_nn['Mineral'], df_pred_nn['Predict_Mineral'])\n", "# This plots the results in a confusion matrix\n", "mm.pp_matrix(cm, figsize=[8, 8], savefig=None) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": "## 5. Plot in latent space\n\nExcellent, these classifications are quite promising. The most likely predicted minerals are returned along with their prediction scores and uncertainties. We can further visualize these classifications in latent space with `mm.plot_latent_space`." }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mm.plot_latent_space(df_pred_nn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Neat! We can see where these compositions lie in latent space, and whether the predictions line up with our expected mineral phase. The points in the background are from the training and validation datasets." ] } ], "metadata": { "kernelspec": { "display_name": "science", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 2 }