{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "# What is Featuristic?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "![featuristic_logo](_static/logo.png \"Featuristic\")\n",
    "\n",
    "**Featuristic** uses Genetic Algorithms to automate the process of **feature engineering** and **feature selection**, enhancing the performance of machine learning models by optimizing their predictive capabilities.\n",
    "\n",
    "## Understanding Genetic Feature Synthesis\n",
    "\n",
    "Featuristic uses symbolic regression to intelligently derive interpretable mathematical formulas, which are then used to create new features from your dataset.\n",
    "\n",
    "Initially, Featuristic creates a diverse population of formulas using fundamental mathematical operators such as `add`, `subtract`, `sin`, `tan`, `square`, `sqrt`, and more.\n",
    "\n",
    "For instance, a formula generated by Featuristic might look like this: `(square(feature_1) - abs(feature_2)) * feature_3`.\n",
    "\n",
    "Next, Featuristic assesses the importance of these formulas by quantifying how well they correlate with the target variable. Those formulas yielding features with the strongest correlations are then selected and recombined using a genetic algorithm to produce offspring, as illustrated below.\n",
    "\n",
    "![Symbolic Regression Example](_static/symbolic_regression_example.png \"Symbolic Regression Example\")\n",
    "\n",
    "These offspring may also undergo point mutations, which causes alterations to random operators within the formula. This process introduces slight variations to the formulas, enhancing the diversity of the population and potentially leading to the discovery of novel and more effective feature representations.\n",
    "\n",
    "![Mutation Example](_static/mutation_example.png \"Mutation Example\")\n",
    "\n",
    "This iterative process continues across multiple generations, continually refining the population of formulas with the goal of generating features that exhibit strong correlations with the target variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "raw_mimetype": "",
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "## Quickstart"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Below is a simple example of how Featuristic performs automated feature engineering and selection on the widely used `cars` dataset.\n",
    "\n",
    "Featuristic operates in two distinct steps:\n",
    "\n",
    "1. **Genetic Feature Synthesis:** In this initial phase, Featuristic intelligently evolves new features through a form of symbolic regression. This involves the generation of mathematical expressions using genetic algorithms. These expressions are designed to capture complex relationships within the dataset.\n",
    "\n",
    "2. **Genetic Feature Selection:** Following the creation of new features, Featuristic employs a Genetic Feature Selection algorithm. This algorithm searches through the newly formed feature space to identify the most optimal subset of features. The objective is to maximize predictive accuracy while minimizing the number of features required for model training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.0.1\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "import featuristic as ft\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(8888)\n",
    "\n",
    "print(ft.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "### Load the Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>displacement</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>weight</th>\n",
       "      <th>acceleration</th>\n",
       "      <th>model_year</th>\n",
       "      <th>origin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>307.0</td>\n",
       "      <td>8</td>\n",
       "      <td>130.0</td>\n",
       "      <td>3504</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>350.0</td>\n",
       "      <td>8</td>\n",
       "      <td>165.0</td>\n",
       "      <td>3693</td>\n",
       "      <td>11.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>318.0</td>\n",
       "      <td>8</td>\n",
       "      <td>150.0</td>\n",
       "      <td>3436</td>\n",
       "      <td>11.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>304.0</td>\n",
       "      <td>8</td>\n",
       "      <td>150.0</td>\n",
       "      <td>3433</td>\n",
       "      <td>12.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>302.0</td>\n",
       "      <td>8</td>\n",
       "      <td>140.0</td>\n",
       "      <td>3449</td>\n",
       "      <td>10.5</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   displacement  cylinders  horsepower  weight  acceleration  model_year  \\\n",
       "0         307.0          8       130.0    3504          12.0          70   \n",
       "1         350.0          8       165.0    3693          11.5          70   \n",
       "2         318.0          8       150.0    3436          11.0          70   \n",
       "3         304.0          8       150.0    3433          12.0          70   \n",
       "4         302.0          8       140.0    3449          10.5          70   \n",
       "\n",
       "   origin  \n",
       "0       1  \n",
       "1       1  \n",
       "2       1  \n",
       "3       1  \n",
       "4       1  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y = ft.fetch_cars_dataset()\n",
    "\n",
    "X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    18.0\n",
       "1    15.0\n",
       "2    18.0\n",
       "3    16.0\n",
       "4    17.0\n",
       "Name: mpg, dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "### Genetic Feature Synthesis\n",
    "\n",
    "Now, let's dive into the fun part: using Genetic Feature Synthesis to automatically engineer new features from our dataset!\n",
    "\n",
    "Before we proceed, it's important to ensure a robust evaluation of our model's performance. To achieve this, we'll first split our dataset into training and testing sets. The training set will be used to train our model, while the testing set will remain unseen during the training process and will serve as an independent dataset to evaluate the model's performance.\n",
    "\n",
    "Once our data is appropriately split, we'll initiate the Genetic Feature Synthesis process. We've configured the genetic algorithm to  synthesize 5 new features for us. This entails evolving a population consisting of 200 individuals iteratively over 100 generations. To ensure optimal performance, we've set the genetic algorithm to halt early if it fails to improve upon the best feature identified within 25 generations. Additionally, for enhanced computational efficiency, we've designated `n_jobs` as -1, enabling concurrent execution across all available CPUs on our computer.\n",
    "\n",
    "With everything set up, we simply call the `fit` function to generate our new features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Creating new features...:  48%|███████▋        | 48/100 [00:19<00:21,  2.43it/s]\n",
      "Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 519.75it/s]\u001b[A\n",
      "Creating new features...:  48%|███████▋        | 48/100 [00:19<00:21,  2.44it/s]\n"
     ]
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)\n",
    "\n",
    "synth = ft.GeneticFeatureSynthesis(\n",
    "    num_features=5,\n",
    "    population_size=200,\n",
    "    max_generations=100,\n",
    "    early_termination_iters=25,\n",
    "    parsimony_coefficient=0.035,\n",
    "    n_jobs=-1,\n",
    ")\n",
    "synth.fit(X_train, y_train)\n",
    "\n",
    "None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we call the `transform` function to generate a dataframe containing our new features. By default, the `GeneticFeatureSynthesis` class will return both the original features and the newly synthesised features. However, we return just the new features by setting the `return_all_features` argument to False when we create the class. \n",
    "\n",
    "We can also combine both the `fit` and `transform` steps into one step by calling `fit_transform` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>displacement</th>\n",
       "      <th>cylinders</th>\n",
       "      <th>horsepower</th>\n",
       "      <th>weight</th>\n",
       "      <th>acceleration</th>\n",
       "      <th>model_year</th>\n",
       "      <th>origin</th>\n",
       "      <th>feature_0</th>\n",
       "      <th>feature_4</th>\n",
       "      <th>feature_11</th>\n",
       "      <th>feature_1</th>\n",
       "      <th>feature_22</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>89.0</td>\n",
       "      <td>4</td>\n",
       "      <td>62.0</td>\n",
       "      <td>2050</td>\n",
       "      <td>17.3</td>\n",
       "      <td>81</td>\n",
       "      <td>3</td>\n",
       "      <td>-8571.629032</td>\n",
       "      <td>-0.312535</td>\n",
       "      <td>-96.744944</td>\n",
       "      <td>-105.822581</td>\n",
       "      <td>-0.624987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>318.0</td>\n",
       "      <td>8</td>\n",
       "      <td>150.0</td>\n",
       "      <td>4077</td>\n",
       "      <td>14.0</td>\n",
       "      <td>72</td>\n",
       "      <td>1</td>\n",
       "      <td>-2488.320000</td>\n",
       "      <td>-0.786564</td>\n",
       "      <td>-75.169811</td>\n",
       "      <td>-34.560000</td>\n",
       "      <td>-1.573022</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>383.0</td>\n",
       "      <td>8</td>\n",
       "      <td>170.0</td>\n",
       "      <td>3563</td>\n",
       "      <td>10.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>-2017.647059</td>\n",
       "      <td>-0.727317</td>\n",
       "      <td>-71.827676</td>\n",
       "      <td>-28.823529</td>\n",
       "      <td>-1.454460</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>260.0</td>\n",
       "      <td>8</td>\n",
       "      <td>110.0</td>\n",
       "      <td>4060</td>\n",
       "      <td>19.0</td>\n",
       "      <td>77</td>\n",
       "      <td>1</td>\n",
       "      <td>-4150.300000</td>\n",
       "      <td>-0.684937</td>\n",
       "      <td>-82.626923</td>\n",
       "      <td>-53.900000</td>\n",
       "      <td>-1.369706</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>318.0</td>\n",
       "      <td>8</td>\n",
       "      <td>140.0</td>\n",
       "      <td>4080</td>\n",
       "      <td>13.7</td>\n",
       "      <td>78</td>\n",
       "      <td>1</td>\n",
       "      <td>-3389.657143</td>\n",
       "      <td>-0.670713</td>\n",
       "      <td>-81.360377</td>\n",
       "      <td>-43.457143</td>\n",
       "      <td>-1.341324</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   displacement  cylinders  horsepower  weight  acceleration  model_year  \\\n",
       "0          89.0          4        62.0    2050          17.3          81   \n",
       "1         318.0          8       150.0    4077          14.0          72   \n",
       "2         383.0          8       170.0    3563          10.0          70   \n",
       "3         260.0          8       110.0    4060          19.0          77   \n",
       "4         318.0          8       140.0    4080          13.7          78   \n",
       "\n",
       "   origin    feature_0  feature_4  feature_11   feature_1  feature_22  \n",
       "0       3 -8571.629032  -0.312535  -96.744944 -105.822581   -0.624987  \n",
       "1       1 -2488.320000  -0.786564  -75.169811  -34.560000   -1.573022  \n",
       "2       1 -2017.647059  -0.727317  -71.827676  -28.823529   -1.454460  \n",
       "3       1 -4150.300000  -0.684937  -82.626923  -53.900000   -1.369706  \n",
       "4       1 -3389.657143  -0.670713  -81.360377  -43.457143   -1.341324  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generated_features = synth.transform(X_train)\n",
    "\n",
    "generated_features.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our newly engineered features currently have generic names. However, since Featuristic synthesizes these features by the applying mathematical expressions to the data, we can look at the underlying formulas responsible for each feature's creation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'-(abs((cube(model_year) / horsepower)))'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "info = synth.get_feature_info()\n",
    "info[\"formula\"].iloc[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature Selection\n",
    "\n",
    "Following the synthesis of new features, the next step involves using another genetic algorithm for feature selection. This process sifts through the pool of features to identify the subset that optimally contributes to predictive performance while minimizing redundancy. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "### Define the Cost Function\n",
    "\n",
    "We set up a custom objective function that the Genetic Feature Selection algorithm will use to quantify how well the subset of features predicts the target. Please note that the function should return a value to **minimize** so a **smaller value is better**. If you want to maximize a metric, you should multiply the output of your objective_function by -1, as shown in the example below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def objective_function(X, y):\n",
    "    model = LinearRegression()\n",
    "    scores = cross_val_score(model, X, y, cv=3, scoring=\"neg_mean_absolute_error\")\n",
    "    return scores.mean() * -1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we set up the Genetic Feature Selector. We've configured the genetic algorithm to evolve a population consisting of 200 individuals iteratively over 100 generations. To ensure optimal performance, we've set the genetic algorithm to halt early if it fails to improve upon the best feature set identified within 25 generations. Additionally, for enhanced computational efficiency, we've set `n_jobs` as -1, enabling concurrent execution across all available CPUs on our computer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Optimising feature selection...:  27%|██▍      | 27/100 [00:05<00:13,  5.34it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "selector = ft.GeneticFeatureSelector(\n",
    "    objective_function,\n",
    "    population_size=200,\n",
    "    max_generations=100,\n",
    "    early_termination_iters=25,\n",
    "    n_jobs=-1,\n",
    ")\n",
    "\n",
    "selector.fit(generated_features, y_train)\n",
    "\n",
    "selected_features = selector.transform(generated_features)\n",
    "\n",
    "print(len(selected_features.columns))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "### The Selected Features\n",
    "\n",
    "Let's print out the selected features to see what the Genetic Feature Selection algorithm kept. You can see below that featuristic has kept four of the original features (\"weight\", \"acceleration\", \"model_year\" and \"origin\") plus four of the features created via the Genetic Feature Synthesis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight</th>\n",
       "      <th>acceleration</th>\n",
       "      <th>model_year</th>\n",
       "      <th>origin</th>\n",
       "      <th>feature_0</th>\n",
       "      <th>feature_4</th>\n",
       "      <th>feature_11</th>\n",
       "      <th>feature_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2050</td>\n",
       "      <td>17.3</td>\n",
       "      <td>81</td>\n",
       "      <td>3</td>\n",
       "      <td>-8571.629032</td>\n",
       "      <td>-0.312535</td>\n",
       "      <td>-96.744944</td>\n",
       "      <td>-105.822581</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4077</td>\n",
       "      <td>14.0</td>\n",
       "      <td>72</td>\n",
       "      <td>1</td>\n",
       "      <td>-2488.320000</td>\n",
       "      <td>-0.786564</td>\n",
       "      <td>-75.169811</td>\n",
       "      <td>-34.560000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3563</td>\n",
       "      <td>10.0</td>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>-2017.647059</td>\n",
       "      <td>-0.727317</td>\n",
       "      <td>-71.827676</td>\n",
       "      <td>-28.823529</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4060</td>\n",
       "      <td>19.0</td>\n",
       "      <td>77</td>\n",
       "      <td>1</td>\n",
       "      <td>-4150.300000</td>\n",
       "      <td>-0.684937</td>\n",
       "      <td>-82.626923</td>\n",
       "      <td>-53.900000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4080</td>\n",
       "      <td>13.7</td>\n",
       "      <td>78</td>\n",
       "      <td>1</td>\n",
       "      <td>-3389.657143</td>\n",
       "      <td>-0.670713</td>\n",
       "      <td>-81.360377</td>\n",
       "      <td>-43.457143</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   weight  acceleration  model_year  origin    feature_0  feature_4  \\\n",
       "0    2050          17.3          81       3 -8571.629032  -0.312535   \n",
       "1    4077          14.0          72       1 -2488.320000  -0.786564   \n",
       "2    3563          10.0          70       1 -2017.647059  -0.727317   \n",
       "3    4060          19.0          77       1 -4150.300000  -0.684937   \n",
       "4    4080          13.7          78       1 -3389.657143  -0.670713   \n",
       "\n",
       "   feature_11   feature_1  \n",
       "0  -96.744944 -105.822581  \n",
       "1  -75.169811  -34.560000  \n",
       "2  -71.827676  -28.823529  \n",
       "3  -82.626923  -53.900000  \n",
       "4  -81.360377  -43.457143  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "selected_features.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training a Model\n",
    "\n",
    "Now that we've selected our features, let's see whether they actually help our model's predictive performance on our test data set. We'll start off with the original features as a baseline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "editable": true,
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.5888868138669303"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = LinearRegression()\n",
    "model.fit(X_train, y_train)\n",
    "preds = model.predict(X_test)\n",
    "original_mae = mean_absolute_error(y_test, preds)\n",
    "original_mae"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now, let's see how the model performs with our synthesised feature set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.9497667311649802"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = LinearRegression()\n",
    "model.fit(selected_features, y_train)\n",
    "test_features = selector.transform(synth.transform(X_test))\n",
    "preds = model.predict(test_features)\n",
    "featuristic_mae = mean_absolute_error(y_test, preds)\n",
    "featuristic_mae"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original MAE: 2.5888868138669303\n",
      "Featuristic MAE: 1.9497667311649802\n",
      "Improvement: 24.7%\n"
     ]
    }
   ],
   "source": [
    "print(f\"Original MAE: {original_mae}\")\n",
    "print(f\"Featuristic MAE: {featuristic_mae}\")\n",
    "print(f\"Improvement: {round((1 - (featuristic_mae / original_mae))* 100, 1)}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The new features generated / selected by the Genetic Feature Synthesis have successfully reduced our mean absolute error &#128512;\n",
    "\n",
    "<br />\n",
    "<br />"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "editable": true,
    "raw_mimetype": "text/restructuredtext",
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    ".. toctree::\n",
    "   :maxdepth: 1\n",
    "   :caption: Table of Contents\n",
    "\n",
    "   install/index\n",
    "   guides/index  \n",
    "   examples/index\n",
    "\n",
    ".. toctree::\n",
    "   :maxdepth: 1\n",
    "   :caption: Resources & References\n",
    "\n",
    "   release_notes\n",
    "   api_reference"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "editable": true,
    "raw_mimetype": "text/restructuredtext",
    "slideshow": {
     "slide_type": ""
    },
    "tags": []
   },
   "source": [
    "Other Links\n",
    "------------------\n",
    "\n",
    "* :ref:`genindex`\n",
    "* :ref:`search`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}