{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Normalisierung und Vorverarbeitung\n",
"\n",
"[sklearn.preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html) kann in vielfältiger Form verwendet werden um Daten zu bereinigen:\n",
"\n",
"* Standardisierung mit [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html), [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) oder [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html).\n",
"* Zentrierung von Kernel-Matrizen mit [KernelCenterer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KernelCenterer.html).\n",
"* Nicht-lineare Transformationen mit [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html), [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)\n",
"* Normalisierung mit [normalize](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html).\n",
"* Kodierung kategorischer Merkmale mit [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html), [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).\n",
"* [Diskretisierung](https://de.wikipedia.org/wiki/Diskretisierung) (auch bekannt als Quantisierung oder Binning) mit [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).\n",
"* Binarisierung von Merkmalen mit [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html)\n",
"* Zurechnen (engl.: _Imputation_) fehlender Werte mit [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html) oder [KNNImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html) wobei die zugerechneten Werte mit [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html) markiert werden können.\n",
"\n",
"
\n",
"\n",
"**Siehe auch**\n",
"\n",
"* [statsmodels](https://www.statsmodels.org/stable/index.html)\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Beispiel\n",
"\n",
"Im folgenden Beispiel füllen wir Mittelwerte auf und nehmen einige Skalierungen vor:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Importe"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from sklearn import preprocessing\n",
"from sklearn.impute import SimpleImputer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"hvac = pd.read_csv(\"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/HVAC_with_nulls.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Datenqualität überprüfen\n",
"\n",
"Datentypen mit [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) anzeigen"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Date object\n",
"Time object\n",
"TargetTemp float64\n",
"ActualTemp int64\n",
"System int64\n",
"SystemAge float64\n",
"BuildingID int64\n",
"10 float64\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dimensionen das DataFrame mit [pandas.DataFrame.shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) als Tupel zurückgeben"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(8000, 8)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Erste *n* Zeilen mit [pandas.DataFrame.head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) zurückgeben"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Time | \n",
" TargetTemp | \n",
" ActualTemp | \n",
" System | \n",
" SystemAge | \n",
" BuildingID | \n",
" 10 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 6/1/13 | \n",
" 0:00:01 | \n",
" 66.0 | \n",
" 58 | \n",
" 13 | \n",
" 20.0 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" 6/2/13 | \n",
" 1:00:01 | \n",
" NaN | \n",
" 68 | \n",
" 3 | \n",
" 20.0 | \n",
" 17 | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" 6/3/13 | \n",
" 2:00:01 | \n",
" 70.0 | \n",
" 73 | \n",
" 17 | \n",
" 20.0 | \n",
" 18 | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" 6/4/13 | \n",
" 3:00:01 | \n",
" 67.0 | \n",
" 63 | \n",
" 2 | \n",
" NaN | \n",
" 15 | \n",
" NaN | \n",
"
\n",
" \n",
" | 4 | \n",
" 6/5/13 | \n",
" 4:00:01 | \n",
" 68.0 | \n",
" 74 | \n",
" 16 | \n",
" 9.0 | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Time TargetTemp ActualTemp System SystemAge BuildingID 10\n",
"0 6/1/13 0:00:01 66.0 58 13 20.0 4 NaN\n",
"1 6/2/13 1:00:01 NaN 68 3 20.0 17 NaN\n",
"2 6/3/13 2:00:01 70.0 73 17 20.0 18 NaN\n",
"3 6/4/13 3:00:01 67.0 63 2 NaN 15 NaN\n",
"4 6/5/13 4:00:01 68.0 74 16 9.0 3 NaN"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Fehlenden Werten den Mittelwert zuschreiben\n",
"\n",
"Hierzu verwenden wir die `mean`-Strategie von [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"imp = SimpleImputer(missing_values=np.nan, strategy=\"mean\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"hvac_numeric = hvac[[\"TargetTemp\", \"SystemAge\"]]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"imp = imp.fit(hvac_numeric.loc[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Weiter Infos zu `fit` erhaltet ihr in der [Scikit Learn-Dokumentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit).\n",
"\n",
"[fit_transform](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer.fit_transform) wandelt dann die angepassten Daten um:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"transformed = imp.fit_transform(hvac_numeric)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[66. , 20. ],\n",
" [67.50773481, 20. ],\n",
" [70. , 20. ],\n",
" ...,\n",
" [67.50773481, 4. ],\n",
" [65. , 23. ],\n",
" [66. , 21. ]])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformed"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"hvac[\"TargetTemp\"], hvac[\"SystemAge\"] = transformed[:,0], transformed[:,1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nun lassen wir uns die ersten Zeilen mit den geänderten Datensätzen anzeigen."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Time | \n",
" TargetTemp | \n",
" ActualTemp | \n",
" System | \n",
" SystemAge | \n",
" BuildingID | \n",
" 10 | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 6/1/13 | \n",
" 0:00:01 | \n",
" 66.000000 | \n",
" 58 | \n",
" 13 | \n",
" 20.000000 | \n",
" 4 | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" 6/2/13 | \n",
" 1:00:01 | \n",
" 67.507735 | \n",
" 68 | \n",
" 3 | \n",
" 20.000000 | \n",
" 17 | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" 6/3/13 | \n",
" 2:00:01 | \n",
" 70.000000 | \n",
" 73 | \n",
" 17 | \n",
" 20.000000 | \n",
" 18 | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" 6/4/13 | \n",
" 3:00:01 | \n",
" 67.000000 | \n",
" 63 | \n",
" 2 | \n",
" 15.386643 | \n",
" 15 | \n",
" NaN | \n",
"
\n",
" \n",
" | 4 | \n",
" 6/5/13 | \n",
" 4:00:01 | \n",
" 68.000000 | \n",
" 74 | \n",
" 16 | \n",
" 9.000000 | \n",
" 3 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Time TargetTemp ActualTemp System SystemAge BuildingID 10\n",
"0 6/1/13 0:00:01 66.000000 58 13 20.000000 4 NaN\n",
"1 6/2/13 1:00:01 67.507735 68 3 20.000000 17 NaN\n",
"2 6/3/13 2:00:01 70.000000 73 17 20.000000 18 NaN\n",
"3 6/4/13 3:00:01 67.000000 63 2 15.386643 15 NaN\n",
"4 6/5/13 4:00:01 68.000000 74 16 9.000000 3 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Skalieren\n",
"\n",
"Zur Standardisierung von Datensätzen, die wie standardnormalverteilte Daten aussehen, können wir [sklearn.preprocessing.scale](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html) verwenden. Damit lassen sich die Faktoren ermitteln, um die ein Wert sich vergrößert oder verkleinert. Dies können wir für die Skalierung der aktuellen Temperatur verwenden."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"hvac[\"ScaledTemp\"] = preprocessing.scale(hvac[\"ActualTemp\"])"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 -1.293272\n",
"1 0.048732\n",
"2 0.719733\n",
"3 -0.622270\n",
"4 0.853934\n",
"Name: ScaledTemp, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac[\"ScaledTemp\"].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) skaliert die Mermale so, dass sie zwischen einem bestimmten Minimal- und Maximalwert liegen, häufig zwischen Null und Eins. Dies hat den Vorteil, dass die Skalierung robuster gegenüber sehr kleinen Standardabweichungen von Merkmalen wird."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"min_max_scaler = preprocessing.MinMaxScaler()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"temp_minmax = min_max_scaler.fit_transform(hvac[[\"ActualTemp\"]])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.12],\n",
" [0.52],\n",
" [0.72],\n",
" ...,\n",
" [0.56],\n",
" [0.32],\n",
" [0.44]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp_minmax"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nun fügen wir auch `temp_minmax` noch als neue Spalte ein:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0.12\n",
"1 0.52\n",
"2 0.72\n",
"3 0.32\n",
"4 0.76\n",
"Name: MinMaxScaledTemp, dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac[\"MinMaxScaledTemp\"] = temp_minmax[:,0]\n",
"hvac[\"MinMaxScaledTemp\"].head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Date | \n",
" Time | \n",
" TargetTemp | \n",
" ActualTemp | \n",
" System | \n",
" SystemAge | \n",
" BuildingID | \n",
" 10 | \n",
" ScaledTemp | \n",
" MinMaxScaledTemp | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 6/1/13 | \n",
" 0:00:01 | \n",
" 66.000000 | \n",
" 58 | \n",
" 13 | \n",
" 20.000000 | \n",
" 4 | \n",
" NaN | \n",
" -1.293272 | \n",
" 0.12 | \n",
"
\n",
" \n",
" | 1 | \n",
" 6/2/13 | \n",
" 1:00:01 | \n",
" 67.507735 | \n",
" 68 | \n",
" 3 | \n",
" 20.000000 | \n",
" 17 | \n",
" NaN | \n",
" 0.048732 | \n",
" 0.52 | \n",
"
\n",
" \n",
" | 2 | \n",
" 6/3/13 | \n",
" 2:00:01 | \n",
" 70.000000 | \n",
" 73 | \n",
" 17 | \n",
" 20.000000 | \n",
" 18 | \n",
" NaN | \n",
" 0.719733 | \n",
" 0.72 | \n",
"
\n",
" \n",
" | 3 | \n",
" 6/4/13 | \n",
" 3:00:01 | \n",
" 67.000000 | \n",
" 63 | \n",
" 2 | \n",
" 15.386643 | \n",
" 15 | \n",
" NaN | \n",
" -0.622270 | \n",
" 0.32 | \n",
"
\n",
" \n",
" | 4 | \n",
" 6/5/13 | \n",
" 4:00:01 | \n",
" 68.000000 | \n",
" 74 | \n",
" 16 | \n",
" 9.000000 | \n",
" 3 | \n",
" NaN | \n",
" 0.853934 | \n",
" 0.76 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Date Time TargetTemp ActualTemp System SystemAge BuildingID 10 \\\n",
"0 6/1/13 0:00:01 66.000000 58 13 20.000000 4 NaN \n",
"1 6/2/13 1:00:01 67.507735 68 3 20.000000 17 NaN \n",
"2 6/3/13 2:00:01 70.000000 73 17 20.000000 18 NaN \n",
"3 6/4/13 3:00:01 67.000000 63 2 15.386643 15 NaN \n",
"4 6/5/13 4:00:01 68.000000 74 16 9.000000 3 NaN \n",
"\n",
" ScaledTemp MinMaxScaledTemp \n",
"0 -1.293272 0.12 \n",
"1 0.048732 0.52 \n",
"2 0.719733 0.72 \n",
"3 -0.622270 0.32 \n",
"4 0.853934 0.76 "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hvac.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.11 Kernel",
"language": "python",
"name": "python311"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}