{ "cells": [ { "cell_type": "markdown", "id": "eb5834d3-94b4-468d-803e-0b008c10ac29", "metadata": {}, "source": [ "# Computational Performance\n", "\n", "There are several parameters that can be used to improve the computational performance of the genetic algorithms in Featuristic, as shown below." ] }, { "cell_type": "code", "execution_count": 1, "id": "bb06d8de-d1e2-4ad0-b386-3c287589a09d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.1.1\n" ] } ], "source": [ "import featuristic as ft\n", "import numpy as np\n", "\n", "np.random.seed(8888)\n", "\n", "print(ft.__version__)\n", "\n", "X, y = ft.fetch_cars_dataset()" ] }, { "cell_type": "markdown", "id": "2cad1914-1776-485b-b1a6-5dc3c78ce993", "metadata": {}, "source": [ "### Parsimony\n", "\n", "The `parsimony_coefficient` parameter controls the complexity of the mathematical expressions used to generate new features. When set to larger values, it penalizes larger programs more heavily, thereby encouraging the creation of smaller programs. This reduces bloat, where programs become excessively large and complex without improving their performance. \n", "\n", "By discouraging overly complex expressions, the computational complexity is reduced and the new features can be calculated more quickly.\n", "\n", "In the example below, the `parsimony_coefficient` is set to be very small, leading to larger and more complex features that will take more time to compute." ] }, { "cell_type": "code", "execution_count": 2, "id": "e8ddeb97-7128-4bc8-a265-d62bbde070cc", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Creating new features...: 58%|█████████████████████████████████████████████████▎ | 29/50 [00:03<00:03, 6.83it/s]\n", "Pruning feature space...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 679.06it/s]\u001b[A\n", "Creating new features...: 58%|█████████████████████████████████████████████████▎ | 29/50 [00:03<00:02, 8.73it/s]\n" ] }, { "data": { "text/plain": [ "'((abs((-(-(-((displacement / ((model_year + displacement) + weight))))) + (weight + displacement))) - -(sin(displacement))) + displacement)'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = ft.GeneticFeatureSynthesis(\n", " num_features=5,\n", " population_size=100,\n", " max_generations=50,\n", " early_termination_iters=25,\n", " parsimony_coefficient=0.00001,\n", " return_all_features=False,\n", " n_jobs=1,\n", ")\n", "\n", "features = synth.fit_transform(X, y)\n", "\n", "info = synth.get_feature_info()\n", "\n", "info.head()[\"formula\"].iloc[0]" ] }, { "cell_type": "markdown", "id": "1c5fb7ad-3787-4f42-9b93-0561e53042bd", "metadata": {}, "source": [ "And in the example, below the `parsimony_coefficient` is increased to keep the features simpler, meaning they can be calculated more quickly." ] }, { "cell_type": "code", "execution_count": 3, "id": "6704f5e5-9d26-4be6-8ef4-5d2df3068bf0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Creating new features...: 60%|███████████████████████████████████████████████████ | 30/50 [00:02<00:01, 10.89it/s]\n", "Pruning feature space...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 601.75it/s]\u001b[A\n", "Creating new features...: 60%|███████████████████████████████████████████████████ | 30/50 [00:02<00:01, 11.88it/s]\n" ] }, { "data": { "text/plain": [ "'abs(-(cube(((weight + displacement) - square(model_year)))))'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "synth = ft.GeneticFeatureSynthesis(\n", " num_features=5,\n", " population_size=100,\n", " max_generations=50,\n", " early_termination_iters=25,\n", " parsimony_coefficient=0.1,\n", " return_all_features=False,\n", " n_jobs=1,\n", ")\n", "\n", "features = synth.fit_transform(X, y)\n", "\n", "info = synth.get_feature_info()\n", "\n", "info.head()[\"formula\"].iloc[0]" ] }, { "cell_type": "markdown", "id": "4e90e163-5ff3-49ad-89f2-ac37e31d1274", "metadata": {}, "source": [ "### Parallel Processing\n", "\n", "By default, the `GeneticFeatureSynthesis` and `GeneticFeatureSelector` classes run on a single CPU of your computer. However, one of the nice features of genetic algorithms is that they are [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel). \n", "\n", "Both classes take an argument called `n_jobs`, which defines how many processes are spawned in parallel for running the genetic algorithms. If `n_jobs` is set to `1` then it will continue to run on just one CPU, and if set to `-1` it use one process per CPU of your computer.\n", "\n", "There is a small cost associated with spawning new processes, so if your datset is small it may actually be more efficient to use `n_jobs=1`. However, for moderately sized datasets upwards, you will likely see an increase in performance by increasing `n_jobs` to greater than `1`, or setting it to `-1`.\n", "\n", "It is generally recommended to avoid using significantly more processes than the number of CPUs on a machine, as this uses more resources and can cause the multi-processing to run slowly." ] }, { "cell_type": "code", "execution_count": 4, "id": "4234ae88-ae78-46e6-bc81-e7e7e0933f2a", "metadata": {}, "outputs": [], "source": [ "synth = ft.GeneticFeatureSynthesis(\n", " num_features=5,\n", " population_size=100,\n", " max_generations=50,\n", " early_termination_iters=25,\n", " parsimony_coefficient=0.1,\n", " return_all_features=False,\n", " n_jobs=-1,\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }