{ "cells": [ { "cell_type": "markdown", "id": "eb5834d3-94b4-468d-803e-0b008c10ac29", "metadata": {}, "source": [ "# Creating Custom Symbolic Functions for Use in the Genetic Feature Synthesis\n", "\n", "Featuristic allows you to control which symbolic functions are used within the Genetic Feature Synthesis process, and to create your custom functions too.\n", "\n", "Let's take a look at a simple example using the `cars` dataset." ] }, { "cell_type": "code", "execution_count": 1, "id": "bb06d8de-d1e2-4ad0-b386-3c287589a09d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.1.0\n" ] } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import cross_val_score, train_test_split\n", "from sklearn.pipeline import Pipeline\n", "import featuristic as ft\n", "import numpy as np\n", "import pandas as pd\n", "\n", "np.random.seed(8888)\n", "\n", "print(ft.__version__)" ] }, { "cell_type": "markdown", "id": "2cad1914-1776-485b-b1a6-5dc3c78ce993", "metadata": {}, "source": [ "### Load the Data\n", "\n", "Let's start off by downloading the `cars` dataset and splitting it into train and test datasets." ] }, { "cell_type": "code", "execution_count": 2, "id": "e8ddeb97-7128-4bc8-a265-d62bbde070cc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
displacementcylindershorsepowerweightaccelerationmodel_yearorigin
0307.08130.0350412.0701
1350.08165.0369311.5701
2318.08150.0343611.0701
3304.08150.0343312.0701
4302.08140.0344910.5701
\n", "
" ], "text/plain": [ " displacement cylinders horsepower weight acceleration model_year \\\n", "0 307.0 8 130.0 3504 12.0 70 \n", "1 350.0 8 165.0 3693 11.5 70 \n", "2 318.0 8 150.0 3436 11.0 70 \n", "3 304.0 8 150.0 3433 12.0 70 \n", "4 302.0 8 140.0 3449 10.5 70 \n", "\n", " origin \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X, y = ft.fetch_cars_dataset()\n", "\n", "X.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "c1cf8c11-05a1-4e35-aee5-459607855ac1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 18.0\n", "1 15.0\n", "2 18.0\n", "3 16.0\n", "4 17.0\n", "Name: mpg, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "8b4eeb07-c15f-407f-99d6-de7c82c028da", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)" ] }, { "cell_type": "markdown", "id": "b2a8fb83-d872-4919-a370-1e75f1fa756c", "metadata": {}, "source": [ "### Controlling which Symbolic Functions are Used in the Genetic Feature Synthesis\n", "\n", "Now that we've got some data, let's change the symbolic functions used to synthesise our new features from it. We'll start off by listing all the functions already included with Featuristic." ] }, { "cell_type": "code", "execution_count": 5, "id": "0f42d883-0a17-4aa5-a849-649e433f2266", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['add',\n", " 'subtract',\n", " 'mult',\n", " 'div',\n", " 'square',\n", " 'cube',\n", " 'abs',\n", " 'negate',\n", " 'sin',\n", " 'cos',\n", " 'tan',\n", " 'mul_constant',\n", " 'add_constant']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ft.list_symbolic_functions()" ] }, { "cell_type": "markdown", "id": "fc1763d9-4a22-4edd-8422-5b2045658b7a", "metadata": {}, "source": [ "All these functions are used by default, except the `mul_constant` and `add_constant` functions. These multiply or add a constant to create new features and can be useful where there is an offset in the data. However, they can also increase the chance of overfitting.\n", "\n", "For this example, let's limit ourselves to only the `add`, `subtract`, `mult` and `div` symbolic functions." ] }, { "cell_type": "code", "execution_count": 6, "id": "59b7d058-b64e-4610-ba55-8185e24a389f", "metadata": {}, "outputs": [], "source": [ "funcs_to_use = [\"add\", \"subtract\", \"mult\", \"div\"]" ] }, { "cell_type": "markdown", "id": "59655166-39dd-4516-9b35-90d66d12a0f6", "metadata": {}, "source": [ "Next, let's create some custom symbolic functions to use alongside the ones we selected above. \n", "\n", "We will do this by defining two `CustomSymbolicFunction` classes, one that returns the negative of the square of the input and one that returns the `tanh` of the input. \n", "\n", "We will also need to define how many arguments each function takes, its name, and how to render its output to a string." ] }, { "cell_type": "code", "execution_count": 7, "id": "bca7844a-06e6-4a48-b373-8059f8fedc20", "metadata": {}, "outputs": [], "source": [ "func = lambda x: -(x * x)\n", "arg_count = 1\n", "name = \"negative_square\"\n", "format_str = \"negative_square({})\"\n", "\n", "negative_square = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)" ] }, { "cell_type": "code", "execution_count": 8, "id": "51d8866c-3ab4-4cdf-baef-b7550774d8d0", "metadata": {}, "outputs": [], "source": [ "func = np.tanh\n", "arg_count = 1\n", "name = \"tanh\"\n", "format_str = \"tanh({})\"\n", "\n", "tanh = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)" ] }, { "cell_type": "markdown", "id": "eecfcb6c-33c3-4f2e-8993-695f2c7a2288", "metadata": {}, "source": [ "Great, let's check our new symbolic functions works as expected by passing in a column from a sample dataframe" ] }, { "cell_type": "code", "execution_count": 9, "id": "356bd7ad-9df2-4dfe-97fb-d6dafa22ea89", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -1\n", "1 -4\n", "2 -9\n", "Name: a, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_df = pd.DataFrame({\"a\": [1, 2, 3]})\n", "\n", "negative_square(test_df[\"a\"])" ] }, { "cell_type": "code", "execution_count": 10, "id": "9ed44929-30cb-4087-814a-b48f66addd38", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.761594\n", "1 0.964028\n", "2 0.995055\n", "Name: a, dtype: float64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tanh(test_df[\"a\"])" ] }, { "cell_type": "markdown", "id": "add0e65f-8b04-4799-afd4-4947a293a772", "metadata": {}, "source": [ "### Running the Genetic Feature Synthesis\n", "\n", "Now let's run the Genetic Feature Synthesis with our newly defined symbolic functions." ] }, { "cell_type": "code", "execution_count": 11, "id": "8846e305-9321-4387-b1bf-a4fe94b11428", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Creating new features...: 28%|████▍ | 28/100 [00:05<00:13, 5.48it/s]\n", "Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 433.05it/s]\u001b[A\n", "Creating new features...: 28%|████▍ | 28/100 [00:05<00:13, 5.31it/s]\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
displacementcylindershorsepowerweightaccelerationmodel_yearoriginfeature_9feature_17feature_18feature_19feature_20
0122.0486.0222014.0711-202.833136-150.335148-159.745799-162.266509-159.913846
1200.0688.0306017.1811-217.640025-153.080140-166.227119-168.610009-166.418846
2302.08129.0372513.4791-161.086183-109.831489-123.601802-125.521701-123.800412
3302.08140.0429416.0721-110.284540-70.181071-80.067690-81.460172-80.224344
4120.0497.0250614.5723-188.034804-139.124017-147.698501-149.993081-147.849460
\n", "
" ], "text/plain": [ " displacement cylinders horsepower weight acceleration model_year \\\n", "0 122.0 4 86.0 2220 14.0 71 \n", "1 200.0 6 88.0 3060 17.1 81 \n", "2 302.0 8 129.0 3725 13.4 79 \n", "3 302.0 8 140.0 4294 16.0 72 \n", "4 120.0 4 97.0 2506 14.5 72 \n", "\n", " origin feature_9 feature_17 feature_18 feature_19 feature_20 \n", "0 1 -202.833136 -150.335148 -159.745799 -162.266509 -159.913846 \n", "1 1 -217.640025 -153.080140 -166.227119 -168.610009 -166.418846 \n", "2 1 -161.086183 -109.831489 -123.601802 -125.521701 -123.800412 \n", "3 1 -110.284540 -70.181071 -80.067690 -81.460172 -80.224344 \n", "4 3 -188.034804 -139.124017 -147.698501 -149.993081 -147.849460 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = ft.GeneticFeatureSynthesis(\n", " num_features=5,\n", " population_size=200,\n", " max_generations=100,\n", " early_termination_iters=25,\n", " parsimony_coefficient=0.035,\n", " functions=funcs_to_use,\n", " custom_functions=[tanh, negative_square], \n", " n_jobs=1,\n", ")\n", "\n", "features = synth.fit_transform(X_train, y_train)\n", "\n", "features.head()" ] }, { "cell_type": "markdown", "id": "0800b55e-270c-407f-b337-ef919c4fd88f", "metadata": {}, "source": [ "When we look at the formulas selected for our new features, we can see our custom `tanh` symbolic function has been used 😀" ] }, { "cell_type": "code", "execution_count": 12, "id": "ba1ced6a-4c28-455e-8729-4a42cb08bc55", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameformulafitness
0feature_9((((acceleration + (model_year + tanh(displace...-0.863239
1feature_17(((((model_year - cylinders) - cylinders) + mo...-0.862401
2feature_18(((((model_year - cylinders) + model_year) - c...-0.861786
3feature_19(((((model_year - cylinders) + tanh(displaceme...-0.861776
4feature_20(((((model_year - cylinders) + model_year) + t...-0.861775
\n", "
" ], "text/plain": [ " name formula fitness\n", "0 feature_9 ((((acceleration + (model_year + tanh(displace... -0.863239\n", "1 feature_17 (((((model_year - cylinders) - cylinders) + mo... -0.862401\n", "2 feature_18 (((((model_year - cylinders) + model_year) - c... -0.861786\n", "3 feature_19 (((((model_year - cylinders) + tanh(displaceme... -0.861776\n", "4 feature_20 (((((model_year - cylinders) + model_year) + t... -0.861775" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth.get_feature_info()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }