{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Normalisation and preprocessing\n", "\n", "[sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) and `sklearn.impute` offer many ways to clean and prepare data:\n", "\n", "* Standardisation with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) or [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).\n", "* Centring of kernel matrices with [KernelCenterer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KernelCenterer.html).\n", "* Non-linear transformations with [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html) or [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html).\n", "* Normalisation with [normalize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html).\n", "* Encoding of categorical features with [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) or [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).\n", "* [Discretisation](https://de.wikipedia.org/wiki/Diskretisierung) (also known as quantisation or binning) with [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).\n", "* Binarisation of features with [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html).\n", "* Imputation of missing values with [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) or [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html); the imputed values can be flagged with [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html).\n", "\n", "
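The last item in the list can be sketched in a few lines. This is a minimal illustration, not part of the example that follows: the array `X` is hypothetical toy data, and the combination of `MissingIndicator(features='all')` with a mean-based `SimpleImputer` is just one possible setup.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

# Hypothetical toy data: two numeric features with one gap each
X = np.array([[66.0, 20.0], [np.nan, 20.0], [70.0, np.nan], [68.0, 8.0]])

# Boolean mask marking the entries that are missing before imputation
mask = MissingIndicator(features='all').fit_transform(X)

# Replace each NaN with the mean of the observed values in its column
X_imputed = SimpleImputer(strategy='mean').fit_transform(X)

print(mask)
print(X_imputed)
```

Keeping the mask next to the imputed array makes it possible to tell measured values from filled-in ones later on.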
\n", "\n", "**See also**\n", "\n", "* [statsmodels](https://www.statsmodels.org/stable/index.html)\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Beispiel\n", "\n", "Im folgenden Beispiel füllen wir Mittelwerte auf und nehmen einige Skalierungen vor:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Importe" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn import preprocessing\n", "from sklearn.impute import SimpleImputer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "hvac = pd.read_csv(\"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Datenqualität überprüfen\n", "\n", "Datentypen mit [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) anzeigen" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Date object\n", "Time object\n", "TargetTemp float64\n", "ActualTemp int64\n", "System int64\n", "SystemAge float64\n", "BuildingID int64\n", "10 float64\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dimensionen das DataFrame mit [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) als Tupel zurückgeben" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8000, 8)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Erste *n* Zeilen mit [pandas.DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) zurückgeben" ] }, { "cell_type": 
"code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Date</th><th>Time</th><th>TargetTemp</th><th>ActualTemp</th><th>System</th><th>SystemAge</th><th>BuildingID</th><th>10</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6/1/13</td><td>0:00:01</td><td>66.0</td><td>58</td><td>13</td><td>20.0</td><td>4</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>6/2/13</td><td>1:00:01</td><td>NaN</td><td>68</td><td>3</td><td>20.0</td><td>17</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>6/3/13</td><td>2:00:01</td><td>70.0</td><td>73</td><td>17</td><td>20.0</td><td>18</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>6/4/13</td><td>3:00:01</td><td>67.0</td><td>63</td><td>2</td><td>NaN</td><td>15</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>6/5/13</td><td>4:00:01</td><td>68.0</td><td>74</td><td>16</td><td>9.0</td><td>3</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Date Time TargetTemp ActualTemp System SystemAge BuildingID 10\n", "0 6/1/13 0:00:01 66.0 58 13 20.0 4 NaN\n", "1 6/2/13 1:00:01 NaN 68 3 20.0 17 NaN\n", "2 6/3/13 2:00:01 70.0 73 17 20.0 18 NaN\n", "3 6/4/13 3:00:01 67.0 63 2 NaN 15 NaN\n", "4 6/5/13 4:00:01 68.0 74 16 9.0 3 NaN" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Fehlenden Werten den Mittelwert zuschreiben\n", "\n", "Hierzu verwenden wir die `mean`-Strategie von [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "imp = SimpleImputer(missing_values=np.nan, strategy=\"mean\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "hvac_numeric = hvac[[\"TargetTemp\", \"SystemAge\"]]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "imp = imp.fit(hvac_numeric.loc[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Weiter Infos zu `fit` erhaltet ihr in der [Scikit Learn-Dokumentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit).\n", "\n", "[fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit_transform) wandelt dann die angepassten Daten um:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "transformed = imp.fit_transform(hvac_numeric)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[66. , 20. ],\n", " [67.50773481, 20. ],\n", " [70. , 20. ],\n", " ...,\n", " [67.50773481, 4. ],\n", " [65. , 23. ],\n", " [66. , 21. 
]])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "hvac[\"TargetTemp\"], hvac[\"SystemAge\"] = transformed[:, 0], transformed[:, 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we display the first rows of the updated DataFrame." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Date</th><th>Time</th><th>TargetTemp</th><th>ActualTemp</th><th>System</th><th>SystemAge</th><th>BuildingID</th><th>10</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6/1/13</td><td>0:00:01</td><td>66.000000</td><td>58</td><td>13</td><td>20.000000</td><td>4</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>6/2/13</td><td>1:00:01</td><td>67.507735</td><td>68</td><td>3</td><td>20.000000</td><td>17</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>6/3/13</td><td>2:00:01</td><td>70.000000</td><td>73</td><td>17</td><td>20.000000</td><td>18</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>6/4/13</td><td>3:00:01</td><td>67.000000</td><td>63</td><td>2</td><td>15.386643</td><td>15</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>6/5/13</td><td>4:00:01</td><td>68.000000</td><td>74</td><td>16</td><td>9.000000</td><td>3</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Date Time TargetTemp ActualTemp System SystemAge BuildingID 10\n", "0 6/1/13 0:00:01 66.000000 58 13 20.000000 4 NaN\n", "1 6/2/13 1:00:01 67.507735 68 3 20.000000 17 NaN\n", "2 6/3/13 2:00:01 70.000000 73 17 20.000000 18 NaN\n", "3 6/4/13 3:00:01 67.000000 63 2 15.386643 15 NaN\n", "4 6/5/13 4:00:01 68.000000 74 16 9.000000 3 NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Skalieren\n", "\n", "Zur Standardisierung von Datensätzen, die wie standardnormalverteilte Daten aussehen, können wir [sklearn.preprocessing.scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) verwenden. Damit lassen sich die Faktoren ermitteln, um die ein Wert sich vergrößert oder verkleinert. Dies können wir für die Skalierung der aktuellen Temperatur verwenden." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "hvac[\"ScaledTemp\"] = preprocessing.scale(hvac[\"ActualTemp\"])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 -1.293272\n", "1 0.048732\n", "2 0.719733\n", "3 -0.622270\n", "4 0.853934\n", "Name: ScaledTemp, dtype: float64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac[\"ScaledTemp\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) skaliert die Mermale so, dass sie zwischen einem bestimmten Minimal- und Maximalwert liegen, häufig zwischen Null und Eins. Dies hat den Vorteil, dass die Skalierung robuster gegenüber sehr kleinen Standardabweichungen von Merkmalen wird." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "min_max_scaler = preprocessing.MinMaxScaler()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "temp_minmax = min_max_scaler.fit_transform(hvac[[\"ActualTemp\"]])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.12],\n", " [0.52],\n", " [0.72],\n", " ...,\n", " [0.56],\n", " [0.32],\n", " [0.44]])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp_minmax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we also add `temp_minmax` as a new column:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.12\n", "1 0.52\n", "2 0.72\n", "3 0.32\n", "4 0.76\n", "Name: MinMaxScaledTemp, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac[\"MinMaxScaledTemp\"] = temp_minmax[:, 0]\n", "hvac[\"MinMaxScaledTemp\"].head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Date</th><th>Time</th><th>TargetTemp</th><th>ActualTemp</th><th>System</th><th>SystemAge</th><th>BuildingID</th><th>10</th><th>ScaledTemp</th><th>MinMaxScaledTemp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6/1/13</td><td>0:00:01</td><td>66.000000</td><td>58</td><td>13</td><td>20.000000</td><td>4</td><td>NaN</td><td>-1.293272</td><td>0.12</td></tr>\n",
"    <tr><th>1</th><td>6/2/13</td><td>1:00:01</td><td>67.507735</td><td>68</td><td>3</td><td>20.000000</td><td>17</td><td>NaN</td><td>0.048732</td><td>0.52</td></tr>\n",
"    <tr><th>2</th><td>6/3/13</td><td>2:00:01</td><td>70.000000</td><td>73</td><td>17</td><td>20.000000</td><td>18</td><td>NaN</td><td>0.719733</td><td>0.72</td></tr>\n",
"    <tr><th>3</th><td>6/4/13</td><td>3:00:01</td><td>67.000000</td><td>63</td><td>2</td><td>15.386643</td><td>15</td><td>NaN</td><td>-0.622270</td><td>0.32</td></tr>\n",
"    <tr><th>4</th><td>6/5/13</td><td>4:00:01</td><td>68.000000</td><td>74</td><td>16</td><td>9.000000</td><td>3</td><td>NaN</td><td>0.853934</td><td>0.76</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Date Time TargetTemp ActualTemp System SystemAge BuildingID 10 \\\n", "0 6/1/13 0:00:01 66.000000 58 13 20.000000 4 NaN \n", "1 6/2/13 1:00:01 67.507735 68 3 20.000000 17 NaN \n", "2 6/3/13 2:00:01 70.000000 73 17 20.000000 18 NaN \n", "3 6/4/13 3:00:01 67.000000 63 2 15.386643 15 NaN \n", "4 6/5/13 4:00:01 68.000000 74 16 9.000000 3 NaN \n", "\n", " ScaledTemp MinMaxScaledTemp \n", "0 -1.293272 0.12 \n", "1 0.048732 0.52 \n", "2 0.719733 0.72 \n", "3 -0.622270 0.32 \n", "4 0.853934 0.76 " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hvac.head()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11 Kernel", "language": "python", "name": "python311" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }