Creating Custom Symbolic Functions for Use in the Genetic Feature Synthesis

Featuristic lets you control which symbolic functions are used within the Genetic Feature Synthesis process, and also lets you create your own custom functions.

Let’s take a look at a simple example using the cars dataset.

[1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
import featuristic as ft
import numpy as np
import pandas as pd

np.random.seed(8888)

print(ft.__version__)
1.1.0

Load the Data

Let’s start off by downloading the cars dataset and splitting it into train and test datasets.

[2]:
X, y = ft.fetch_cars_dataset()

X.head()
[2]:
displacement cylinders horsepower weight acceleration model_year origin
0 307.0 8 130.0 3504 12.0 70 1
1 350.0 8 165.0 3693 11.5 70 1
2 318.0 8 150.0 3436 11.0 70 1
3 304.0 8 150.0 3433 12.0 70 1
4 302.0 8 140.0 3449 10.5 70 1
[3]:
y.head()
[3]:
0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64
[4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Controlling which Symbolic Functions are Used in the Genetic Feature Synthesis

Now that we’ve got some data, let’s change the symbolic functions used to synthesise our new features from it. We’ll start off by listing all the functions already included with Featuristic.

[5]:
ft.list_symbolic_functions()
[5]:
['add',
 'subtract',
 'mult',
 'div',
 'square',
 'cube',
 'abs',
 'negate',
 'sin',
 'cos',
 'tan',
 'mul_constant',
 'add_constant']

All of these functions are used by default except mul_constant and add_constant. Those two multiply by or add a constant to create new features, which can be useful when the data contains an offset. However, they can also increase the chance of overfitting.
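To make the "offset" case concrete, here is a minimal sketch in plain Python that does not use Featuristic at all. The data, the offset of 10 and the `pearson` helper are all made up for illustration; the point is only that a ratio feature can track the target much more closely once a constant is added to one of its inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs: two columns and a target that hides an offset of 10.
a = [float(i) for i in range(1, 41)]
b = [float((7 * i) % 11 + 1) for i in range(1, 41)]  # values in 1..11
offset = 10.0
target = [x / (z + offset) for x, z in zip(a, b)]

# Without a constant, the closest available feature is a plain ratio.
plain_ratio = [x / z for x, z in zip(a, b)]
# With an add_constant-style step, the offset itself can be recovered.
shifted_ratio = [x / (z + offset) for x, z in zip(a, b)]

corr_plain = pearson(plain_ratio, target)
corr_shifted = pearson(shifted_ratio, target)
print(f"plain ratio:   {corr_plain:.3f}")
print(f"shifted ratio: {corr_shifted:.3f}")
```

The flip side is that every constant is one more value the search is free to tune to the training data, which is one way to read the overfitting warning above.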

For this example, let’s limit ourselves to only the add, subtract, mult and div symbolic functions.

[6]:
funcs_to_use = ["add", "subtract", "mult", "div"]

Next, let’s create some custom symbolic functions to use alongside the ones we selected above.

We will do this by creating two CustomSymbolicFunction instances: one that returns the negative of the square of its input, and one that returns the tanh of its input.

We will also need to define how many arguments each function takes, its name, and how to render its output to a string.

[7]:
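func = lambda x: -(x * x)  # negate the square of the input
arg_count = 1  # the function takes a single argument
name = "negative_square"
format_str = "negative_square({})"  # how the function is rendered in a formula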

negative_square = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)
[8]:
func = np.tanh
arg_count = 1
name = "tanh"
format_str = "tanh({})"

tanh = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)
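The format string is what turns a function into readable text in the final formula. As a rough illustration, and assuming the placeholders are filled in with Python's standard `str.format` (an assumption about Featuristic's internals, not something shown on this page), rendering would work as follows. The two-argument `max` format string is hypothetical and is only there to show that a function needs one `{}` per argument.

```python
# Assumption: each "{}" is filled with the rendered text of one argument.
negative_square_fmt = "negative_square({})"
tanh_fmt = "tanh({})"

# A single column name as the argument.
simple = negative_square_fmt.format("weight")
print(simple)  # negative_square(weight)

# Arguments can themselves be rendered sub-formulas, so calls nest.
nested = tanh_fmt.format(negative_square_fmt.format("weight"))
print(nested)  # tanh(negative_square(weight))

# Hypothetical two-argument function: arg_count=2 needs two placeholders.
max_fmt = "max({}, {})"
two_args = max_fmt.format("horsepower", "weight")
print(two_args)  # max(horsepower, weight)
```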

Great, let’s check that our new symbolic functions work as expected by passing in a column from a sample dataframe.

[9]:
test_df = pd.DataFrame({"a": [1, 2, 3]})

negative_square(test_df["a"])
[9]:
0   -1
1   -4
2   -9
Name: a, dtype: int64
[10]:
tanh(test_df["a"])
[10]:
0    0.761594
1    0.964028
2    0.995055
Name: a, dtype: float64
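It can also be worth cross-checking these numbers against an independent implementation, and confirming a few basic properties, before handing a function to the search. The sketch below uses only Python's `math` module rather than Featuristic or pandas; the wider list of inputs is arbitrary.

```python
import math

# Same inputs as the sample dataframe column above.
values = [1, 2, 3]

# Reference results computed without pandas or Featuristic.
expected_negative_square = [-(v * v) for v in values]
expected_tanh = [round(math.tanh(v), 6) for v in values]
print(expected_negative_square)  # [-1, -4, -9]
print(expected_tanh)  # [0.761594, 0.964028, 0.995055]

# Property checks on a wider, arbitrary range of inputs.
wider = [-50.0, -1.5, 0.0, 0.25, 3.0, 50.0]
assert all(-(v * v) <= 0 for v in wider)  # never positive
assert all(-1.0 <= math.tanh(v) <= 1.0 for v in wider)  # bounded
```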

Running the Genetic Feature Synthesis

Now let’s run the Genetic Feature Synthesis with our newly defined symbolic functions.

[11]:
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=200,
    max_generations=100,
    early_termination_iters=25,
    parsimony_coefficient=0.035,
    functions=funcs_to_use,  # the built-in functions selected above
    custom_functions=[tanh, negative_square],  # the custom functions defined above
    n_jobs=1,
)

features = synth.fit_transform(X_train, y_train)

features.head()
Creating new features...:  28%|████▍           | 28/100 [00:05<00:13,  5.48it/s]
Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 433.05it/s]
[11]:
displacement cylinders horsepower weight acceleration model_year origin feature_9 feature_17 feature_18 feature_19 feature_20
0 122.0 4 86.0 2220 14.0 71 1 -202.833136 -150.335148 -159.745799 -162.266509 -159.913846
1 200.0 6 88.0 3060 17.1 81 1 -217.640025 -153.080140 -166.227119 -168.610009 -166.418846
2 302.0 8 129.0 3725 13.4 79 1 -161.086183 -109.831489 -123.601802 -125.521701 -123.800412
3 302.0 8 140.0 4294 16.0 72 1 -110.284540 -70.181071 -80.067690 -81.460172 -80.224344
4 120.0 4 97.0 2506 14.5 72 3 -188.034804 -139.124017 -147.698501 -149.993081 -147.849460

When we look at the formulas selected for our new features, we can see our custom tanh symbolic function has been used 😀

[12]:
synth.get_feature_info()
[12]:
name formula fitness
0 feature_9 ((((acceleration + (model_year + tanh(displace... -0.863239
1 feature_17 (((((model_year - cylinders) - cylinders) + mo... -0.862401
2 feature_18 (((((model_year - cylinders) + model_year) - c... -0.861786
3 feature_19 (((((model_year - cylinders) + tanh(displaceme... -0.861776
4 feature_20 (((((model_year - cylinders) + model_year) + t... -0.861775