featuristic.synthesis.GeneticFeatureSynthesis

class featuristic.synthesis.GeneticFeatureSynthesis(num_features: int = 10, population_size: int = 100, max_generations: int = 25, tournament_size: int = 10, crossover_proba: float = 0.85, parsimony_coefficient: float = 0.001, early_termination_iters: int = 15, functions: List[str] | None = None, custom_functions: List[CustomSymbolicFunction] | None = None, return_all_features: bool = True, n_jobs: int = -1, pbar: bool = True, verbose: bool = False)

The Genetic Feature Synthesis class uses genetic programming to generate new features, using a technique based on symbolic regression. It first builds a population of naive random formulas (programs) that represent transformations of the input features. The population is then evolved over a number of generations using genetic operators such as mutation and crossover to find the programs that minimize a given fitness function. Finally, a Maximum Relevance Minimum Redundancy (mRMR) algorithm selects the best features: those most correlated with the target variable while being least correlated with each other.
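To make the final selection step concrete, here is a minimal sketch of a greedy mRMR pass in plain Python. It is an illustration only, not featuristic's implementation: the helper names (`pearson`, `mrmr_select`), the use of absolute Pearson correlation, and the "relevance minus mean redundancy" score are all assumptions chosen for clarity.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mrmr_select(features, target, k):
    """Greedily pick k feature names (hypothetical helper, assumed scoring rule).

    Score = |corr(feature, target)| - mean |corr(feature, already selected)|.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "a": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],         # strongly related to the target
    "a_copy": [2.2, 4.0, 5.8, 8.4, 10.0, 12.2],  # exact rescaling of "a" (redundant)
    "b": [1.0, -1.0, 2.0, -2.0, 4.0, -1.0],      # weakly related, not redundant with "a"
}
print(mrmr_select(features, target, k=2))
```

In this toy data the rescaled copy of `a` is passed over for the second slot because it adds no new information, even though it is just as correlated with the target as `a` itself.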

__init__(num_features: int = 10, population_size: int = 100, max_generations: int = 25, tournament_size: int = 10, crossover_proba: float = 0.85, parsimony_coefficient: float = 0.001, early_termination_iters: int = 15, functions: List[str] | None = None, custom_functions: List[CustomSymbolicFunction] | None = None, return_all_features: bool = True, n_jobs: int = -1, pbar: bool = True, verbose: bool = False)

Initialize the Symbolic Feature Generator.

Parameters:
  • num_features (int) – The number of best features to generate. Internally, 3 * num_features programs are generated and the best num_features are selected via Maximum Relevance Minimum Redundancy (mRMR).

  • population_size (int) – The number of programs in each generation. A larger population is more likely to contain a good solution, but takes longer to evolve.

  • max_generations (int) – The maximum number of generations to run. More generations make it more likely that a good solution is found, but take longer to run.

  • tournament_size (int) – The number of programs competing in each selection tournament. A larger tournament is more likely to select the best program, but requires more computation.

  • crossover_proba (float) – The probability of applying crossover to the selected parents in each generation.

  • parsimony_coefficient (float) – The parsimony coefficient. Larger values penalize larger programs more heavily and encourage smaller ones. This helps prevent bloat, where programs become increasingly large and complex without improving the fitness, which increases computational cost and reduces the interpretability of the features.

  • early_termination_iters (int) – If the best score does not improve for this number of generations, then the algorithm will terminate early.

  • functions (list) – The list of functions to use in the programs. If None then all the built-in functions are used. The functions must be the names of the functions returned by the list_symbolic_functions method.

  • custom_functions (list) – A list of custom functions to use in the programs. Each custom function must be an instance of the CustomSymbolicFunction class.

  • return_all_features (bool) – Whether to return all the features generated or just the best features.

  • n_jobs (int) – The number of parallel jobs to run. If -1, all available cores are used; otherwise n_jobs jobs are used. If n_jobs=1, the computation is done serially.

  • pbar (bool) – Whether to show a progress bar.

  • verbose (bool) – Whether to print out additional information.
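The roles of tournament_size, parsimony_coefficient and early_termination_iters can be sketched with three small helpers. These are hypothetical functions written for illustration, not part of featuristic: the additive size penalty and the exact stopping rule are assumptions, and the only thing taken from the description above is that lower fitness is better.

```python
import random


def penalized_fitness(raw_error, program_size, parsimony_coefficient):
    """Lower is better: raw error plus an (assumed) additive penalty on program size."""
    return raw_error + parsimony_coefficient * program_size


def tournament_select(population, fitnesses, tournament_size, rng):
    """Draw tournament_size contestants at random and return the fittest (lowest score)."""
    contestants = rng.sample(range(len(population)), tournament_size)
    winner = min(contestants, key=lambda i: fitnesses[i])
    return population[winner]


def should_stop_early(best_per_generation, early_termination_iters):
    """True when the best score has not improved over the last N generations."""
    if len(best_per_generation) <= early_termination_iters:
        return False
    earlier_best = min(best_per_generation[:-early_termination_iters])
    recent_best = min(best_per_generation[-early_termination_iters:])
    return recent_best >= earlier_best


# Two programs with the same raw error: the smaller one gets the lower (better) score.
small = penalized_fitness(raw_error=0.20, program_size=3, parsimony_coefficient=0.001)
large = penalized_fitness(raw_error=0.20, program_size=40, parsimony_coefficient=0.001)

rng = random.Random(0)
population = ["p0", "p1", "p2", "p3", "p4"]
fitnesses = [0.9, 0.4, 0.7, 0.1, 0.5]
# When the tournament covers the whole population, the overall best always wins.
winner = tournament_select(population, fitnesses, tournament_size=5, rng=rng)

# Best score per generation; nothing improves on 0.35 in the last three entries.
history = [0.50, 0.40, 0.35, 0.35, 0.36, 0.35]
print(small < large, winner, should_stop_early(history, early_termination_iters=3))
```

A smaller tournament gives weaker programs a better chance of being selected, which preserves diversity at the cost of slower convergence; that is the trade-off the tournament_size parameter controls.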

Methods

__init__([num_features, population_size, ...])

Initialize the Symbolic Feature Generator.

fit(X, y)

Fit the symbolic feature generator to the data.

fit_transform(X[, y])

Fit the symbolic feature generator to the data and transform the dataframe of features.

get_feature_info()

Get the information about the best programs found.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

plot_history([ax])

Plot the history of the fitness function.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X[, y])

Transform the dataframe of features using the best programs found.
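A typical workflow with the methods listed above might look like the following sketch. The wrapper function `synthesize_features` is a hypothetical name, and it assumes that X is a pandas DataFrame of numeric columns and y is the matching target; only the import path, constructor arguments and method names are taken from this page.

```python
def synthesize_features(X, y):
    """Fit GeneticFeatureSynthesis on (X, y) and return the new features plus program info.

    Assumes X is a pandas DataFrame of numeric columns and y is the matching target.
    """
    # Imported lazily so this sketch can be defined without featuristic installed.
    from featuristic.synthesis import GeneticFeatureSynthesis

    synth = GeneticFeatureSynthesis(
        num_features=5,
        population_size=100,
        max_generations=25,
        early_termination_iters=15,
        n_jobs=1,    # run serially
        pbar=False,  # no progress bar
    )
    synth.fit(X, y)
    new_features = synth.transform(X)
    info = synth.get_feature_info()  # information about the best programs found
    return new_features, info
```

Calling fit on training data and transform on held-out data separately, rather than fit_transform on everything, keeps the evolved programs from being selected on the rows they are later evaluated on.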