Creating Custom Symbolic Functions for Use in the Genetic Feature Synthesis#
Featuristic allows you to control which symbolic functions are used within the Genetic Feature Synthesis process, and to create your own custom functions too.
Let’s take a look at a simple example using the cars dataset.
[1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
import featuristic as ft
import numpy as np
import pandas as pd
np.random.seed(8888)
print(ft.__version__)
1.1.0
Load the Data#
Let’s start off by downloading the cars dataset and splitting it into train and test datasets.
[2]:
X, y = ft.fetch_cars_dataset()
X.head()
[2]:
|   | displacement | cylinders | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 307.0 | 8 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 350.0 | 8 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 318.0 | 8 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 304.0 | 8 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 302.0 | 8 | 140.0 | 3449 | 10.5 | 70 | 1 |
[3]:
y.head()
[3]:
0 18.0
1 15.0
2 18.0
3 16.0
4 17.0
Name: mpg, dtype: float64
[4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Controlling which Symbolic Functions are Used in the Genetic Feature Synthesis#
Now that we’ve got some data, let’s change the symbolic functions used to synthesise our new features from it. We’ll start off by listing all the functions already included with Featuristic.
[5]:
ft.list_symbolic_functions()
[5]:
['add',
'subtract',
'mult',
'div',
'square',
'cube',
'abs',
'negate',
'sin',
'cos',
'tan',
'mul_constant',
'add_constant']
All these functions are used by default, except the mul_constant and add_constant functions. These multiply or add a constant to create new features and can be useful where there is an offset in the data. However, they can also increase the chance of overfitting.
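If you did want to opt back in to them, one option is to pass the full list of built-in names. This is just a sketch, assuming the functions argument of GeneticFeatureSynthesis (used later in this notebook) accepts any of the names returned by ft.list_symbolic_functions():

# Sketch: opt back in to the constant-based functions by passing every
# built-in name, including "mul_constant" and "add_constant".
all_funcs = ft.list_symbolic_functions()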
For this example, let’s limit ourselves to only the add, subtract, mult and div symbolic functions.
[6]:
funcs_to_use = ["add", "subtract", "mult", "div"]
Next, let’s create some custom symbolic functions to use alongside the ones we selected above.
We will do this by creating two instances of the CustomSymbolicFunction class: one that returns the negative of the square of its input and one that returns the tanh of its input.
For each function, we also need to define how many arguments it takes, its name, and how to render its output as a string.
[7]:
func = lambda x: -(x * x)
arg_count = 1
name = "negative_square"
format_str = "negative_square({})"
negative_square = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)
[8]:
func = np.tanh
arg_count = 1
name = "tanh"
format_str = "tanh({})"
tanh = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)
Great, let’s check that our new symbolic functions work as expected by passing in a column from a sample dataframe.
[9]:
test_df = pd.DataFrame({"a": [1, 2, 3]})
negative_square(test_df["a"])
[9]:
0 -1
1 -4
2 -9
Name: a, dtype: int64
[10]:
tanh(test_df["a"])
[10]:
0 0.761594
1 0.964028
2 0.995055
Name: a, dtype: float64
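Custom functions aren’t limited to a single argument. As a hedged sketch (assuming that setting arg_count=2 simply means the callable is passed two columns when formulas are evaluated, following the same calling convention demonstrated above), a hypothetical two-argument function could look like this:

# Hypothetical two-argument custom function: the absolute difference of two columns.
# Assumption: arg_count=2 means func receives two Series at evaluation time.
abs_diff = ft.CustomSymbolicFunction(
    func=lambda x, y: (x - y).abs(),
    arg_count=2,
    name="abs_diff",
    format_str="abs_diff({}, {})",
)
# Quick check using the same calling style as the single-argument functions above.
abs_diff(test_df["a"], pd.Series([3, 2, 1]))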
Running the Genetic Feature Synthesis#
Now let’s run the Genetic Feature Synthesis with our newly defined symbolic functions.
[11]:
synth = ft.GeneticFeatureSynthesis(
num_features=5,
population_size=200,
max_generations=100,
early_termination_iters=25,
parsimony_coefficient=0.035,
functions=funcs_to_use,
custom_functions=[tanh, negative_square],
n_jobs=1,
)
features = synth.fit_transform(X_train, y_train)
features.head()
Creating new features...: 28%|████▍ | 28/100 [00:05<00:13, 5.48it/s]
Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 433.05it/s]
Creating new features...: 28%|████▍ | 28/100 [00:05<00:13, 5.31it/s]
[11]:
|   | displacement | cylinders | horsepower | weight | acceleration | model_year | origin | feature_9 | feature_17 | feature_18 | feature_19 | feature_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 122.0 | 4 | 86.0 | 2220 | 14.0 | 71 | 1 | -202.833136 | -150.335148 | -159.745799 | -162.266509 | -159.913846 |
| 1 | 200.0 | 6 | 88.0 | 3060 | 17.1 | 81 | 1 | -217.640025 | -153.080140 | -166.227119 | -168.610009 | -166.418846 |
| 2 | 302.0 | 8 | 129.0 | 3725 | 13.4 | 79 | 1 | -161.086183 | -109.831489 | -123.601802 | -125.521701 | -123.800412 |
| 3 | 302.0 | 8 | 140.0 | 4294 | 16.0 | 72 | 1 | -110.284540 | -70.181071 | -80.067690 | -81.460172 | -80.224344 |
| 4 | 120.0 | 4 | 97.0 | 2506 | 14.5 | 72 | 3 | -188.034804 | -139.124017 | -147.698501 | -149.993081 | -147.849460 |
When we look at the formulas selected for our new features, we can see our custom tanh symbolic function has been used 😀
[12]:
synth.get_feature_info()
[12]:
|   | name | formula | fitness |
|---|---|---|---|
| 0 | feature_9 | ((((acceleration + (model_year + tanh(displace... | -0.863239 |
| 1 | feature_17 | (((((model_year - cylinders) - cylinders) + mo... | -0.862401 |
| 2 | feature_18 | (((((model_year - cylinders) + model_year) - c... | -0.861786 |
| 3 | feature_19 | (((((model_year - cylinders) + tanh(displaceme... | -0.861776 |
| 4 | feature_20 | (((((model_year - cylinders) + model_year) + t... | -0.861775 |
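As a final sketch, the imports at the top of the notebook (Pipeline, LinearRegression and cross_val_score) suggest a natural next step: plugging the synthesiser into a scikit-learn pipeline and scoring it with cross-validation. This assumes GeneticFeatureSynthesis follows the usual scikit-learn transformer interface, which its fit_transform usage above suggests, but it is not shown in this notebook:

# Sketch: evaluate the synthesised features with a simple linear model.
# Assumption: GeneticFeatureSynthesis implements the scikit-learn fit/transform API.
pipeline = Pipeline(
    steps=[
        (
            "feature_synthesis",
            ft.GeneticFeatureSynthesis(
                num_features=5,
                population_size=200,
                max_generations=100,
                early_termination_iters=25,
                parsimony_coefficient=0.035,
                functions=funcs_to_use,
                custom_functions=[tanh, negative_square],
                n_jobs=1,
            ),
        ),
        ("model", LinearRegression()),
    ]
)
# Mean absolute error (negated, per scikit-learn convention) across 3 folds.
scores = cross_val_score(pipeline, X_train, y_train, cv=3, scoring="neg_mean_absolute_error")
print(scores.mean())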