Creating Custom Symbolic Functions for Use in the Genetic Feature Synthesis#

Featuristic allows you to control which symbolic functions are used within the Genetic Feature Synthesis process, and to create your own custom functions too.

Let’s take a look at a simple example using the cars dataset.

[1]:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
import featuristic as ft
import numpy as np
import pandas as pd

np.random.seed(8888)

print(ft.__version__)

1.1.0

Load the Data#

Let’s start off by downloading the cars dataset and splitting it into train and test datasets.

[2]:

X, y = ft.fetch_cars_dataset()

X.head()

[2]:

	displacement	cylinders	horsepower	weight	acceleration	model_year	origin
0	307.0	8	130.0	3504	12.0	70	1
1	350.0	8	165.0	3693	11.5	70	1
2	318.0	8	150.0	3436	11.0	70	1
3	304.0	8	150.0	3433	12.0	70	1
4	302.0	8	140.0	3449	10.5	70	1

[3]:

y.head()

[3]:

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

[4]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Controlling which Symbolic Functions are Used in the Genetic Feature Synthesis#

Now that we’ve got some data, let’s change the symbolic functions used to synthesise our new features from it. We’ll start off by listing all the functions already included with Featuristic.

[5]:

ft.list_symbolic_functions()

[5]:

['add',
 'subtract',
 'mult',
 'div',
 'square',
 'cube',
 'abs',
 'negate',
 'sin',
 'cos',
 'tan',
 'mul_constant',
 'add_constant']

All of these functions are used by default except mul_constant and add_constant, which multiply by or add a constant to create new features. These two can be useful where there is an offset in the data, but they can also increase the chance of overfitting.

For this example, let’s limit ourselves to only the add, subtract, mult and div symbolic functions.

[6]:

funcs_to_use = ["add", "subtract", "mult", "div"]
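Because the functions are selected by name, a typo in this list would be easy to miss. Below is a minimal sketch of a sanity check. The available names are copied from the output above rather than fetched from the library (in the notebook, the result of ft.list_symbolic_functions() could be passed instead), and validate_function_names is a hypothetical helper, not part of Featuristic.

```python
# Names copied from the ft.list_symbolic_functions() output shown above.
AVAILABLE_FUNCTIONS = [
    "add", "subtract", "mult", "div", "square", "cube", "abs",
    "negate", "sin", "cos", "tan", "mul_constant", "add_constant",
]


def validate_function_names(requested, available=AVAILABLE_FUNCTIONS):
    """Return the requested names, raising if any is not in `available`.

    Hypothetical helper for illustration; not part of Featuristic.
    """
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"Unknown symbolic functions: {unknown}")
    return list(requested)


funcs_to_use = validate_function_names(["add", "subtract", "mult", "div"])
```

A misspelled name such as "multiply" would raise a ValueError here instead of surfacing later during synthesis.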

Next, let’s create some custom symbolic functions to use alongside the ones we selected above.

We will do this by creating two CustomSymbolicFunction instances: one that returns the negative of the square of its input and one that returns the tanh of its input.

We will also need to define how many arguments each function takes, its name, and how to render its output to a string.

[7]:

func = lambda x: -(x * x)
arg_count = 1
name = "negative_square"
format_str = "negative_square({})"

negative_square = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)

[8]:

func = np.tanh
arg_count = 1
name = "tanh"
format_str = "tanh({})"

tanh = ft.CustomSymbolicFunction(func=func, arg_count=arg_count, name=name, format_str=format_str)
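The format_str is what keeps the generated formulas readable. The sketch below assumes each `{}` placeholder is filled with the string form of one argument via Python's str.format. That is an assumption about how Featuristic renders formulas, not something confirmed here, and render is a hypothetical helper.

```python
def render(format_str, *args):
    """Fill each `{}` placeholder with one argument's string form.

    Hypothetical helper; assumes str.format-style substitution.
    """
    return format_str.format(*args)


# One-argument functions, using the format strings defined above.
inner = render("tanh({})", "displacement")
outer = render("negative_square({})", inner)
print(outer)  # negative_square(tanh(displacement))

# A two-argument function would need two placeholders (hypothetical example).
print(render("max({}, {})", "weight", "horsepower"))  # max(weight, horsepower)
```

Under this assumption, the number of placeholders in format_str should match arg_count.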

Great, let’s check that our new symbolic functions work as expected by passing in a column from a sample dataframe.

[9]:

test_df = pd.DataFrame({"a": [1, 2, 3]})

negative_square(test_df["a"])

[9]:

0   -1
1   -4
2   -9
Name: a, dtype: int64

[10]:

tanh(test_df["a"])

[10]:

0    0.761594
1    0.964028
2    0.995055
Name: a, dtype: float64
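The outputs above can be cross-checked without pandas. This sketch recomputes both functions on the same inputs with the standard library's math module, rounding to six decimals to match the displayed precision.

```python
import math

values = [1, 2, 3]

# Same definition as the negative_square lambda above.
negative_squares = [-(x * x) for x in values]

# math.tanh is the scalar counterpart of np.tanh.
tanh_values = [round(math.tanh(x), 6) for x in values]

print(negative_squares)  # [-1, -4, -9]
print(tanh_values)       # [0.761594, 0.964028, 0.995055]
```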

Running the Genetic Feature Synthesis#

Now let’s run the Genetic Feature Synthesis with our newly defined symbolic functions.

[11]:

synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=200,
    max_generations=100,
    early_termination_iters=25,
    parsimony_coefficient=0.035,
    functions=funcs_to_use,
    custom_functions=[tanh, negative_square],
    n_jobs=1,
)

features = synth.fit_transform(X_train, y_train)

features.head()

Creating new features...:  28%|████▍           | 28/100 [00:05<00:13,  5.48it/s]
Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 433.05it/s]

[11]:

	displacement	cylinders	horsepower	weight	acceleration	model_year	origin	feature_9	feature_17	feature_18	feature_19	feature_20
0	122.0	4	86.0	2220	14.0	71	1	-202.833136	-150.335148	-159.745799	-162.266509	-159.913846
1	200.0	6	88.0	3060	17.1	81	1	-217.640025	-153.080140	-166.227119	-168.610009	-166.418846
2	302.0	8	129.0	3725	13.4	79	1	-161.086183	-109.831489	-123.601802	-125.521701	-123.800412
3	302.0	8	140.0	4294	16.0	72	1	-110.284540	-70.181071	-80.067690	-81.460172	-80.224344
4	120.0	4	97.0	2506	14.5	72	3	-188.034804	-139.124017	-147.698501	-149.993081	-147.849460
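The progress bar stops at 28 of 100 generations, which would be consistent with early_termination_iters=25 ending the run once 25 consecutive generations pass without a better best fitness. The sketch below is a generic early-stopping loop under that assumption, not Featuristic's actual implementation. The generations_run helper and the fitness values are hypothetical, and lower fitness is assumed to be better.

```python
def generations_run(best_fitness_per_generation, patience):
    """Count generations executed before early stopping triggers.

    Stops once `patience` consecutive generations fail to improve on the
    best (lowest) fitness seen so far. Generic sketch, not library code.
    """
    best = float("inf")
    stale = 0
    for generation, fitness in enumerate(best_fitness_per_generation, start=1):
        if fitness < best:
            best, stale = fitness, 0
        else:
            stale += 1
        if stale >= patience:
            return generation
    return len(best_fitness_per_generation)


# Hypothetical run: fitness improves for 3 generations, then plateaus.
history = [-0.80, -0.85, -0.86] + [-0.86] * 97
print(generations_run(history, patience=25))  # 28
```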

When we look at the formulas selected for our new features, we can see our custom tanh symbolic function has been used 😀

[12]:

synth.get_feature_info()

[12]:

	name	formula	fitness
0	feature_9	((((acceleration + (model_year + tanh(displace...	-0.863239
1	feature_17	(((((model_year - cylinders) - cylinders) + mo...	-0.862401
2	feature_18	(((((model_year - cylinders) + model_year) - c...	-0.861786
3	feature_19	(((((model_year - cylinders) + tanh(displaceme...	-0.861776
4	feature_20	(((((model_year - cylinders) + model_year) + t...	-0.861775
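The formula column is truncated in the display above. pandas shortens long strings in a DataFrame's repr according to its display.max_colwidth option, so lifting that limit temporarily shows whole cells. The sketch uses a made-up formula in a stand-in frame, since the full formulas are not shown here. In the notebook, the same context manager could wrap the display of synth.get_feature_info().

```python
import pandas as pd

# Stand-in for the feature info table; the formula is invented for illustration.
long_formula = "((model_year - cylinders) + tanh(displacement)) - cylinders"
info = pd.DataFrame({"name": ["feature_0"], "formula": [long_formula]})

# With the default max_colwidth, the long cell is shortened with "...".
truncated = repr(info)

# Temporarily remove the column-width limit to show each cell in full.
with pd.option_context("display.max_colwidth", None):
    full = repr(info)

print(full)
```

Using option_context rather than pd.set_option keeps the change local, so other tables in the notebook still display compactly.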
