Using Featuristic With scikit-learn Pipelines#

Featuristic is compatible with scikit-learn’s powerful Pipeline class. This functionality allows you to organize and apply a sequence of data processing steps effortlessly within scikit-learn. With the Pipeline, you can chain together various transformers provided by Featuristic or other scikit-learn-compatible libraries.

These transformers can include feature generation, feature selection, data scaling, and any other preprocessing steps required to prepare your data for modeling.

By leveraging the Pipeline class in conjunction with Featuristic, you can streamline your data preprocessing workflow, ensuring consistency and reproducibility. This allows you to construct complex data processing pipelines with ease, and combine Featuristic with model development.

Let’s take a look at a simple example using the cars dataset.

[1]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
import featuristic as ft
import numpy as np

np.random.seed(8888)

print(ft.__version__)
1.0.1

Load the Data#

[2]:
X, y = ft.fetch_cars_dataset()

X.head()
[2]:
displacement cylinders horsepower weight acceleration model_year origin
0 307.0 8 130.0 3504 12.0 70 1
1 350.0 8 165.0 3693 11.5 70 1
2 318.0 8 150.0 3436 11.0 70 1
3 304.0 8 150.0 3433 12.0 70 1
4 302.0 8 140.0 3449 10.5 70 1
[3]:
y.head()
[3]:
0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

Split the Data in Train and Test#

[4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Objective Function#

We define a custom objective function to pass into the Genetic Feature Selection algorithm. The output of this will be minimized to find the optimal subset of features.

[5]:
def objective(X, y):
    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    return -scores.mean()

Fit a scikit-learn Pipeline Containing Featuristic#

[6]:
pipe = Pipeline(
    steps=[
        (
            "genetic_feature_synthesis",
            ft.GeneticFeatureSynthesis(
            num_features=5,
            population_size=200,
            max_generations=100,
            early_termination_iters=25,
            parsimony_coefficient=0.035,
            n_jobs=1,
            ),
        ),
        (
            "genetic_feature_selector",
            ft.GeneticFeatureSelector(
                objective,
                population_size=200,
                max_generations=50,
                early_termination_iters=25,
                n_jobs=-1,
            ),
        ),
        (
            "model",
            LinearRegression()
        )
    ]
)

model = pipe.fit(X_train, y_train)
Creating new features...:  74%|███████████▊    | 74/100 [00:15<00:05,  4.75it/s]
Pruning feature space...: 100%|██████████████████| 5/5 [00:00<00:00, 498.46it/s]
Creating new features...:  74%|███████████▊    | 74/100 [00:15<00:05,  4.74it/s]
Optimising feature selection...:  52%|█████▏    | 26/50 [00:05<00:05,  4.39it/s]
[7]:
preds = model.predict(X_test)

preds[:10]
[7]:
array([13.23930437, 35.40219719, 27.26035025, 26.25423477, 28.24773002,
       19.10148319, 36.15024042, 23.33185658, 31.19121816, 22.29169501])

Accessing Featuristic Inside the Pipeline#

We can still access the individual Featuristic steps via the pipeline’s named_steps functionality. For example, to look at the formulas used for the feature engineering or to plot the genetic algorithm’s history.

[8]:
gfs = pipe.named_steps["genetic_feature_synthesis"]

gfs.get_feature_info().head()
[8]:
name formula fitness
0 feature_1 abs(abs(abs((abs(abs(abs(abs(abs(abs(abs(abs(a... -0.874683
1 feature_8 abs(abs(((cos(((abs(horsepower) / weight) - (s... -0.847227
2 feature_4 abs(abs((abs(abs(abs(abs(abs(abs(abs(abs(abs(a... -0.860441
3 feature_5 abs(abs(abs(abs(abs((((cube(horsepower) / (cub... -0.852704
4 feature_0 abs(abs((model_year / abs(weight)))) -0.880479
[9]:
gfs.plot_history()
../_images/guides_sklearn_14_0.png