Computational Performance#
There are several parameters that can be used to improve the computational performance of the genetic algorithms in Featuristic, as shown below.
[1]:
import featuristic as ft
import numpy as np
np.random.seed(8888)
print(ft.__version__)
X, y = ft.fetch_cars_dataset()
0.1.1
Parsimony#
The parsimony_coefficient
parameter controls the complexity of the mathematical expressions used to generate new features. When set to larger values, it penalizes larger programs more heavily, thereby encouraging the creation of smaller programs. This reduces bloat, where programs become excessively large and complex without improving their performance.
By discouraging overly complex expressions, the computational cost of evaluating each program is reduced, so the new features can be calculated more quickly.
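Conceptually, parsimony pressure works by adding a size-based penalty to each program's fitness. The sketch below illustrates the idea in plain Python; the function name and the exact penalty form are illustrative assumptions, not Featuristic's internals:

```python
# Illustrative sketch of parsimony pressure (not Featuristic's actual code).
# A program's raw fitness is penalised in proportion to its size, so larger
# expressions must earn their extra complexity.

def penalised_fitness(raw_fitness, program_size, parsimony_coefficient):
    """Lower is better: raw error plus a penalty that grows with program size."""
    return raw_fitness + parsimony_coefficient * program_size

# Two candidate programs with identical raw error but different sizes:
small = penalised_fitness(raw_fitness=0.50, program_size=5, parsimony_coefficient=0.1)
large = penalised_fitness(raw_fitness=0.50, program_size=25, parsimony_coefficient=0.1)

# With parsimony pressure applied, the smaller program is preferred.
print(small)  # 1.0
print(large)  # 3.0
```

A larger `parsimony_coefficient` makes the size term dominate sooner, which is why the runs below produce shorter formulas as the coefficient grows.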
In the example below, the parsimony_coefficient
is set to a very small value, leading to larger and more complex features that take more time to compute.
[2]:
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.00001,
    return_all_features=False,
    n_jobs=1,
)
features = synth.fit_transform(X, y)
info = synth.get_feature_info()
info.head()["formula"].iloc[0]
Creating new features...: 58%|█████████████████████████████████████████████████▎ | 29/50 [00:03<00:03, 6.83it/s]
Pruning feature space...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 679.06it/s]
[2]:
'((abs((-(-(-((displacement / ((model_year + displacement) + weight))))) + (weight + displacement))) - -(sin(displacement))) + displacement)'
In the example below, the parsimony_coefficient
is increased to keep the features simpler, meaning they can be calculated more quickly.
[3]:
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=1,
)
features = synth.fit_transform(X, y)
info = synth.get_feature_info()
info.head()["formula"].iloc[0]
Creating new features...: 60%|███████████████████████████████████████████████████ | 30/50 [00:02<00:01, 10.89it/s]
Pruning feature space...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 601.75it/s]
[3]:
'abs(-(cube(((weight + displacement) - square(model_year)))))'
Parallel Processing#
By default, the GeneticFeatureSynthesis
and GeneticFeatureSelector
classes run on a single CPU of your computer. However, one of the convenient properties of genetic algorithms is that they are embarrassingly parallel: each candidate program's fitness can be evaluated independently of the others.
Both classes take an argument called n_jobs
, which defines how many processes are spawned in parallel for running the genetic algorithms. If n_jobs
is set to 1
then it will continue to run on just one CPU, and if set to -1
it uses one process per CPU of your computer.
There is a small cost associated with spawning new processes, so if your dataset is small it may actually be more efficient to use n_jobs=1
. However, for moderately sized and larger datasets, you will likely see a performance improvement by setting n_jobs
to a value greater than 1
, or to -1
.
It is generally recommended to avoid using significantly more processes than the number of CPUs on a machine, as this uses more resources and can cause the multi-processing to run slowly.
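To illustrate why this workload parallelises so cleanly, the sketch below evaluates a population's fitness serially and then across a small process pool using the standard library. This is illustrative only; the fitness function and population here are made up, and Featuristic manages its own worker processes via n_jobs:

```python
# Illustrative sketch of embarrassingly parallel fitness evaluation
# (not Featuristic's internals; Featuristic handles this via n_jobs).
from multiprocessing import Pool

def evaluate_fitness(individual):
    # Stand-in for a real fitness function: each individual is scored
    # independently, with no shared state between evaluations.
    return sum(x * x for x in individual)

if __name__ == "__main__":
    population = [[1, 2], [3, 4], [5, 6], [7, 8]]

    # Serial evaluation, as with n_jobs=1.
    serial_scores = [evaluate_fitness(ind) for ind in population]

    # Parallel evaluation: the population is split across worker processes,
    # analogous to n_jobs=-1 using one process per CPU.
    with Pool(processes=2) as pool:
        parallel_scores = pool.map(evaluate_fitness, population)

    # The scores are identical; only the wall-clock time differs.
    assert serial_scores == parallel_scores
    print(parallel_scores)  # [5, 25, 61, 113]
```

Because the workers only need the individuals and the fitness function, the main overhead is starting the processes and shipping the data to them, which is why very small datasets may run faster with n_jobs=1.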
[4]:
synth = ft.GeneticFeatureSynthesis(
    num_features=5,
    population_size=100,
    max_generations=50,
    early_termination_iters=25,
    parsimony_coefficient=0.1,
    return_all_features=False,
    n_jobs=-1,
)