Using Featuristic With scikit-learn Pipelines#
Featuristic is compatible with scikit-learn's powerful Pipeline class, which lets you organize and apply a sequence of data processing steps within scikit-learn. With the Pipeline, you can chain together transformers provided by Featuristic or by other scikit-learn-compatible libraries.
These transformers can include feature generation, feature selection, data scaling, and any other preprocessing steps required to prepare your data for modeling.
By using the Pipeline class in conjunction with Featuristic, you can streamline your data preprocessing workflow and ensure consistency and reproducibility. This makes it easy to construct complex data processing pipelines and to combine Featuristic's feature engineering with model development.
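As a reminder of the pattern, a Pipeline is assembled from named (name, estimator) steps, where every step except the last must be a transformer. Here is a minimal sketch using only standard scikit-learn components; a Featuristic transformer occupies the same slot as the scaler below:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Each step is a (name, estimator) tuple; all steps before the last
# must be transformers. A Featuristic transformer fits the same slot
# as the scaler here.
sketch_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ]
)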
Let’s take a look at a simple example using the cars dataset.
[1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
import featuristic as ft
import numpy as np
np.random.seed(8888)
print(ft.__version__)
1.0.1
Load the Data#
[2]:
X, y = ft.fetch_cars_dataset()
X.head()
[2]:
|    | displacement | cylinders | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 307.0 | 8 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 350.0 | 8 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 318.0 | 8 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 304.0 | 8 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 302.0 | 8 | 140.0 | 3449 | 10.5 | 70 | 1 |
[3]:
y.head()
[3]:
0 18.0
1 15.0
2 18.0
3 16.0
4 17.0
Name: mpg, dtype: float64
Split the Data into Train and Test Sets#
[4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Objective Function#
We define a custom objective function to pass into the Genetic Feature Selection algorithm. The algorithm minimizes this function's output to find the optimal subset of features, so we return the (positive) cross-validated mean absolute error.
[5]:
def objective(X, y):
    # Cross-validate a simple linear model on the candidate feature set.
    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    # cross_val_score returns negated MAE, so flip the sign to produce
    # an error value for the genetic algorithm to minimize.
    return -scores.mean()
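As a quick sanity check, the objective can be evaluated directly on the raw training features to get a baseline cross-validated MAE to compare against the full pipeline later (the exact value depends on the random split, so none is shown here):

baseline_mae = objective(X_train, y_train)  # cross-validated MAE on the raw features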
Fit a scikit-learn Pipeline Containing Featuristic#
[6]:
pipe = Pipeline(
    steps=[
        # Step 1: engineer new candidate features with genetic programming.
        (
            "genetic_feature_synthesis",
            ft.GeneticFeatureSynthesis(
                num_features=5,               # number of engineered features to keep
                population_size=200,
                max_generations=100,
                early_termination_iters=25,   # stop early if fitness stops improving
                parsimony_coefficient=0.035,  # penalty discouraging overly complex formulas
                n_jobs=1,
            ),
        ),
        # Step 2: select the feature subset that minimizes the objective.
        (
            "genetic_feature_selector",
            ft.GeneticFeatureSelector(
                objective,                    # custom objective defined above
                population_size=200,
                max_generations=50,
                early_termination_iters=25,
                n_jobs=-1,
            ),
        ),
        # Step 3: fit the final model on the selected features.
        (
            "model",
            LinearRegression(),
        ),
    ]
)
model = pipe.fit(X_train, y_train)
Creating new features...: 74%|███████████▊ | 74/100 [00:15<00:05, 4.75it/s]
Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 498.46it/s]
Optimising feature selection...: 52%|█████▏ | 26/50 [00:05<00:05, 4.39it/s]
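Since the whole workflow is now a single fitted estimator, it can in principle be saved and reloaded like any other scikit-learn model, for example with joblib (a sketch, assuming Featuristic's fitted transformers are picklable):

import joblib

joblib.dump(model, "featuristic_pipeline.joblib")   # persist the fitted pipeline
model = joblib.load("featuristic_pipeline.joblib")  # reload later for inference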
[7]:
preds = model.predict(X_test)
preds[:10]
[7]:
array([13.23930437, 35.40219719, 27.26035025, 26.25423477, 28.24773002,
19.10148319, 36.15024042, 23.33185658, 31.19121816, 22.29169501])
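To quantify the held-out performance, the predictions can be scored with any scikit-learn metric, such as the mean absolute error (the value depends on the random train/test split, so none is shown here):

from sklearn.metrics import mean_absolute_error

test_mae = mean_absolute_error(y_test, preds)  # held-out MAE of the full pipeline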
Accessing Featuristic Inside the Pipeline#
We can still access the individual Featuristic steps via the pipeline's named_steps attribute, for example to inspect the formulas behind the engineered features or to plot the genetic algorithm's fitness history.
[8]:
gfs = pipe.named_steps["genetic_feature_synthesis"]
gfs.get_feature_info().head()
[8]:
|    | name | formula | fitness |
|---|---|---|---|
| 0 | feature_1 | abs(abs(abs((abs(abs(abs(abs(abs(abs(abs(abs(a... | -0.874683 |
| 1 | feature_8 | abs(abs(((cos(((abs(horsepower) / weight) - (s... | -0.847227 |
| 2 | feature_4 | abs(abs((abs(abs(abs(abs(abs(abs(abs(abs(abs(a... | -0.860441 |
| 3 | feature_5 | abs(abs(abs(abs(abs((((cube(horsepower) / (cub... | -0.852704 |
| 4 | feature_0 | abs(abs((model_year / abs(weight)))) | -0.880479 |
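Because GeneticFeatureSynthesis is a scikit-learn-compatible transformer, it can also be applied on its own, for instance to inspect the engineered feature values for the test set (a sketch; this assumes the transformer returns a pandas DataFrame whose columns match the name column above):

engineered = gfs.transform(X_test)  # apply just the synthesis step to new data
engineered.head()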
[9]:
gfs.plot_history()
