Tuning the Genetic Feature Synthesis#
There are several parameters that can be used to tune the genetic algorithms in Featuristic, which we’ll explore below with the cars dataset.
[1]:
import featuristic as ft
import numpy as np
np.random.seed(8888)
print(ft.__version__)
X, y = ft.fetch_cars_dataset()
0.1.1
Complexity of the Mathematical Expressions#
The parsimony_coefficient parameter controls the complexity of the mathematical expressions used to generate new features. Larger values penalize larger programs more heavily, thereby encouraging the creation of smaller programs. This reduces bloat, where programs become excessively large and complex without improving their performance. By discouraging overly complex expressions, computational cost is reduced and the interpretability of the features is enhanced.
In the example below, the parsimony_coefficient is set to be very small, leading to larger and more complex features.
[2]:
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.00001,
    return_all_features=False,
    n_jobs=1,
)
features = synth.fit_transform(X, y)
info = synth.get_feature_info()
info.head()["formula"].iloc[0]
Creating new features...: 58%|█████████████████████████████████████████████████▎ | 29/50 [00:03<00:02, 7.21it/s]
Pruning feature space...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 595.60it/s]
Creating new features...: 58%|█████████████████████████████████████████████████▎ | 29/50 [00:03<00:02, 8.91it/s]
[2]:
'((abs((-(-(-((displacement / ((model_year + displacement) + weight))))) + (weight + displacement))) - -(sin(displacement))) + displacement)'
And in the example below, the parsimony_coefficient is increased to keep the features simpler.
[3]:
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=1,
)
features = synth.fit_transform(X, y)
info = synth.get_feature_info()
info.head()["formula"].iloc[0]
Creating new features...: 60%|███████████████████████████████████████████████████ | 30/50 [00:02<00:01, 11.00it/s]
Pruning feature space...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 607.39it/s]
Creating new features...: 60%|███████████████████████████████████████████████████ | 30/50 [00:02<00:01, 11.99it/s]
[3]:
'abs(-(cube(((weight + displacement) - square(model_year)))))'
Max Generations#
The max_generations parameter refers to the maximum number of evolutions or generations the algorithm will undergo before terminating. Each generation represents a cycle of selection, crossover, and mutation operations on the population of candidate solutions.
The max_generations parameter is crucial as it determines how many iterations the genetic algorithm will spend evolving potential solutions. Once the specified maximum number of generations is reached, the algorithm terminates, regardless of whether the optimal solution has been found.
Setting an appropriate max_generations value is important to balance computational resources and the algorithm’s ability to converge to a satisfactory solution. If max_generations is too low, the algorithm may terminate prematurely before reaching an optimal solution. Conversely, if it’s too high, it may lead to unnecessary computational overhead without significant improvements in solution quality.
Finding the right value for max_generations often requires experimentation and problem-specific considerations, such as the complexity of the optimization problem, the computational resources available, and the desired level of solution quality.
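For instance, the sketch below reuses the configuration from the earlier examples but raises the generation budget. The values are illustrative only, not tuned recommendations.
# A minimal sketch: same configuration as above, but with a larger
# generation budget (more evolution cycles, more computation).
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=200,   # illustrative value; raises the upper bound on evolution cycles
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=1,
)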
Early Termination#
The early_termination_iters argument sets the number of generations after which the genetic algorithm will stop if it hasn’t managed to improve the best individual in the population. This prevents unnecessary computation if the algorithm seems to have converged or stalled.
For example, if early_termination_iters is set to 10, the algorithm will terminate after 10 consecutive generations without any improvement in the best individual’s fitness, on the assumption that further iterations are unlikely to yield better results.
This parameter is useful for efficiency, especially on computationally expensive problems, as it avoids wasted work once progress appears to have converged. It pairs well with larger max_generations values: the generous budget gives the search room to improve, while early termination cuts it short once progress plateaus.
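As a rough sketch, the configuration below pairs a generous generation budget with aggressive early termination; the values are illustrative only.
# A minimal sketch: a large generation budget guarded by early termination.
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=500,         # generous upper bound
    early_termination_iters=10,  # stop after 10 generations without improvement
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=1,
)
features = synth.fit_transform(X, y)  # typically stops well before generation 500 if progress stalls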
Population Size#
The population_size parameter refers to the number of individuals (or candidate solutions) present in each generation of the algorithm. Each individual represents a potential solution to the optimization problem being solved.
The population size is an important parameter as it influences the diversity and exploration capability of the algorithm. A larger population size typically allows for more diverse solutions to be explored, potentially leading to a broader exploration of the solution space. However, larger populations also require more computational resources and may lead to slower convergence if not properly managed.
Conversely, a smaller population size may lead to faster convergence but could suffer from premature convergence to local optima due to limited diversity.
Choosing an appropriate population size is therefore a balancing act between exploration (diversity) and exploitation (convergence speed). The right value depends on the specific problem being solved, the computational resources available, and the trade-off you are willing to make, so experimentation and tuning are typically required.
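As a rough sketch, the two configurations below differ only in population_size; the values are illustrative rather than recommendations.
# A minimal sketch: a small population converges quickly but explores less,
# while a large population explores more at a higher cost per generation.
small_pop = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=20,    # fast, but risks premature convergence to local optima
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=1,
)
large_pop = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=500,   # broader exploration, more computation per generation
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=1,
)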
Tournament Size#
In Featuristic’s genetic algorithms, the tournament_size parameter determines the number of individuals (candidate solutions) that participate in each tournament selection round. Tournament selection is a mechanism used to select individuals from the population for reproduction (crossover and mutation) based on their fitness scores (or objective function values).
Here’s how it works:
1. Randomly select a subset of individuals from the population, with a size equal to the tournament size.
2. Evaluate the fitness of each individual in the subset.
3. Select the best individual (usually the one with the highest fitness) from the subset to proceed to the next generation.
4. Repeat the process until a sufficient number of individuals have been selected for reproduction to populate the next generation.
The tournament size parameter influences the selection pressure within the genetic algorithm. A larger tournament size increases the likelihood of selecting the best individuals from each tournament, potentially leading to faster convergence towards optimal solutions. However, it also reduces the diversity of the selected individuals, which may hinder exploration of the solution space.
Conversely, a smaller tournament size introduces more randomness into the selection process, allowing for greater diversity among selected individuals. This may enhance exploration but could also slow down convergence as the algorithm may struggle to consistently select the best individuals.
Choosing an appropriate tournament size often involves trade-offs between exploration and exploitation. Experimentation and tuning are typically necessary to determine the optimal tournament size for a given problem and algorithm configuration.
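The sketch below shows where the parameter would be set, assuming tournament_size is accepted directly by the GeneticFeatureSynthesis constructor; check the signature of your installed version, and treat the remaining values as illustrative.
# A sketch, assuming GeneticFeatureSynthesis accepts tournament_size directly.
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    tournament_size=10,  # larger values increase selection pressure, smaller values preserve diversity
    return_all_features=False,
    n_jobs=1,
)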
Objective Function#
The GeneticFeatureSelector class takes a parameter called objective_function. This function is used by the genetic algorithm to quantify how well a particular candidate solution performs with respect to the problem being optimized.
In the context of feature selection, where the aim is often to identify the most informative subset of features, the objective_function helps gauge how well a particular feature subset performs. By evaluating various feature combinations, the algorithm navigates towards solutions that exhibit superior performance in terms of the specified objective.
Given that feature selection typically precedes more resource-intensive tasks like model selection or hyperparameter optimization, employing a straightforward yet effective objective_function allows rapid identification of the most important features.
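As a minimal sketch, the objective below scores a candidate feature subset with the cross-validated mean absolute error of a linear regression. Everything beyond the objective_function parameter is an assumption here: the other constructor arguments are assumed to mirror GeneticFeatureSynthesis, and the returned value is assumed to be a cost the selector minimises, so verify both against your installed version.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def objective_function(X, y):
    # Cheap proxy: cross-validated mean absolute error of a linear model.
    # Assumes the selector treats the returned value as a cost to minimise;
    # flip the sign if your version expects a score to maximise.
    scores = cross_val_score(
        LinearRegression(), X, y, cv=3, scoring="neg_mean_absolute_error"
    )
    return -scores.mean()

# Constructor arguments other than objective_function are assumed to mirror
# GeneticFeatureSynthesis; check them against the installed version.
selector = ft.GeneticFeatureSelector(
    objective_function=objective_function,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    n_jobs=1,
)
selected = selector.fit_transform(features, y)  # assumes the same fit_transform API as GeneticFeatureSynthesis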