{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "290ea5a6",
   "metadata": {},
   "source": [
    "# Subdividing and categorising data\n",
    "\n",
    "Continuous data is often divided into bins or otherwise grouped for analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "391c3176",
   "metadata": {},
   "source": [
    "Suppose you have data on a group of people in a study that you want to divide into discrete age groups. To do this, we generate a DataFrame with 250 random integer ages from `0` to `98`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "deec7725",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.523247Z",
     "iopub.status.busy": "2026-05-21T14:12:58.523059Z",
     "iopub.status.idle": "2026-05-21T14:12:58.742174Z",
     "shell.execute_reply": "2026-05-21T14:12:58.741883Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.523208Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>60</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <td>66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>247</th>\n",
       "      <td>84</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>248</th>\n",
       "      <td>78</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>33</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>250 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Age\n",
       "0     23\n",
       "1     60\n",
       "2      8\n",
       "3     34\n",
       "4     36\n",
       "..   ...\n",
       "245   41\n",
       "246   66\n",
       "247   84\n",
       "248   78\n",
       "249   33\n",
       "\n",
       "[250 rows x 1 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "rng = np.random.default_rng()\n",
    "# Draw 250 integer ages; the upper bound 99 is exclusive, so the values lie in 0..98\n",
    "ages = rng.integers(0, 99, 250)\n",
    "df = pd.DataFrame({\"Age\": ages})\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c991e35",
   "metadata": {},
   "source": [
    "pandas then offers a simple way to divide the values into ten bins with [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html). To round the displayed bin edges to whole years, we also set `precision=0`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "e9d30508",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.742687Z",
     "iopub.status.busy": "2026-05-21T14:12:58.742561Z",
     "iopub.status.idle": "2026-05-21T14:12:58.746210Z",
     "shell.execute_reply": "2026-05-21T14:12:58.745978Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.742678Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(20.0, 29.0], (59.0, 69.0], (-0.1, 10.0], (29.0, 39.0], (29.0, 39.0], ..., (39.0, 49.0], (59.0, 69.0], (78.0, 88.0], (69.0, 78.0], (29.0, 39.0]]\n",
       "Length: 250\n",
       "Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... (59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats = pd.cut(ages, 10, precision=0)\n",
    "\n",
    "cats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca81953c",
   "metadata": {},
   "source": [
    "Mit [pandas.Categorical.categories](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html) könnt ihr euch die Kategorien anzeigen lassen:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1b8fb7b8",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.746581Z",
     "iopub.status.busy": "2026-05-21T14:12:58.746506Z",
     "iopub.status.idle": "2026-05-21T14:12:58.748707Z",
     "shell.execute_reply": "2026-05-21T14:12:58.748498Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.746574Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0],\n",
       "               (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0],\n",
       "               (78.0, 88.0], (88.0, 98.0]],\n",
       "              dtype='interval[float64, right]')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed9275b5",
   "metadata": {},
   "source": [
    "…oder auch nur eine einzelne Kategorie:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ab486178",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.749152Z",
     "iopub.status.busy": "2026-05-21T14:12:58.749051Z",
     "iopub.status.idle": "2026-05-21T14:12:58.751151Z",
     "shell.execute_reply": "2026-05-21T14:12:58.750822Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.749144Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Interval(-0.1, 10.0, closed='right')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.categories[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "077d8eeb",
   "metadata": {},
   "source": [
    "Mit [pandas.Categorical.codes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.codes.html) könnt ihr euch ein Array anzeigen lassen, in dem für jeden Wert die zugehörige Kategorie angezeigt wird:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "27faf8a0",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.751702Z",
     "iopub.status.busy": "2026-05-21T14:12:58.751610Z",
     "iopub.status.idle": "2026-05-21T14:12:58.753884Z",
     "shell.execute_reply": "2026-05-21T14:12:58.753569Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.751694Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 6, 0, 3, 3, 2, 5, 6, 2, 3, 6, 9, 2, 0, 8, 1, 9, 6, 7, 2, 2, 2,\n",
       "       5, 7, 2, 8, 9, 0, 2, 1, 7, 6, 8, 0, 0, 8, 0, 0, 6, 0, 7, 5, 3, 9,\n",
       "       7, 7, 9, 6, 0, 9, 4, 7, 7, 9, 2, 1, 5, 2, 4, 2, 1, 0, 2, 8, 8, 4,\n",
       "       9, 3, 6, 8, 7, 0, 3, 0, 0, 3, 1, 1, 9, 0, 5, 4, 4, 3, 6, 6, 2, 7,\n",
       "       2, 7, 3, 6, 2, 5, 7, 1, 5, 9, 1, 5, 7, 3, 5, 6, 7, 9, 9, 8, 3, 1,\n",
       "       8, 4, 5, 7, 7, 0, 6, 0, 9, 4, 1, 6, 2, 8, 2, 3, 2, 6, 8, 6, 9, 9,\n",
       "       9, 5, 6, 5, 1, 7, 7, 5, 9, 0, 3, 1, 6, 0, 0, 6, 8, 1, 5, 3, 9, 4,\n",
       "       8, 9, 0, 0, 7, 1, 9, 1, 1, 3, 0, 7, 6, 4, 1, 2, 3, 9, 4, 5, 4, 8,\n",
       "       9, 2, 5, 6, 0, 6, 7, 2, 6, 3, 9, 3, 1, 1, 9, 3, 1, 6, 9, 6, 0, 9,\n",
       "       1, 8, 0, 6, 8, 8, 9, 9, 4, 7, 4, 7, 3, 4, 1, 2, 4, 2, 8, 8, 6, 9,\n",
       "       7, 0, 5, 6, 1, 4, 3, 8, 1, 5, 9, 2, 7, 0, 0, 4, 2, 0, 8, 0, 3, 0,\n",
       "       3, 4, 9, 4, 6, 8, 7, 3], dtype=int8)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cats.codes"
   ]
  },
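  {
   "cell_type": "markdown",
   "id": "c0de5a01",
   "metadata": {},
   "source": [
    "The relationship between `codes` and `categories` can be sketched with a small sample. The ages and bin edges below are hypothetical and chosen only for illustration; they are not part of the study data. A value that lies outside all bins becomes missing and gets the code `-1`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0de5a02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical ages and bin edges, chosen only for illustration\n",
    "sample_ages = [5, 17, 42, 42, 88]\n",
    "sample_edges = [0, 20, 40, 60, 80, 100]\n",
    "\n",
    "binned = pd.cut(sample_ages, sample_edges)\n",
    "\n",
    "# Each code is the position of the value's interval in binned.categories\n",
    "sample_codes = binned.codes.tolist()\n",
    "\n",
    "# Indexing the categories with the codes restores the interval of each value\n",
    "restored = binned.categories[binned.codes]\n",
    "\n",
    "# 150 lies outside all bins, so it is missing and encoded as -1\n",
    "outside_codes = pd.cut([150], sample_edges).codes.tolist()\n",
    "\n",
    "print(sample_codes)\n",
    "print(restored)\n",
    "print(outside_codes)"
   ]
  },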
  {
   "cell_type": "markdown",
   "id": "5629393d",
   "metadata": {},
   "source": [
    "Mit `value_counts` können wir uns nun anschauen, wie sich die Anzahl auf die einzelnen Bereiche verteilt:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b9656f7f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.755605Z",
     "iopub.status.busy": "2026-05-21T14:12:58.755482Z",
     "iopub.status.idle": "2026-05-21T14:12:58.759038Z",
     "shell.execute_reply": "2026-05-21T14:12:58.758718Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.755596Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-0.1, 10.0]    31\n",
       "(88.0, 98.0]    31\n",
       "(59.0, 69.0]    29\n",
       "(20.0, 29.0]    26\n",
       "(69.0, 78.0]    26\n",
       "(10.0, 20.0]    24\n",
       "(29.0, 39.0]    24\n",
       "(78.0, 88.0]    22\n",
       "(39.0, 49.0]    19\n",
       "(49.0, 59.0]    18\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(cats).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3574d82",
   "metadata": {},
   "source": [
    "Auffalend ist, dass die Altersbereiche nicht gleich viele Jahre enthalten, sondern mit `20.0, 29.0` und `69.0, 78.0` zwei Bereiche nur 9 Jahre umfassen. Dies hängt damit zusammen, dass der Altersumfang nur von `0`bis `98` reicht:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "cfefc1cd",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.759612Z",
     "iopub.status.busy": "2026-05-21T14:12:58.759491Z",
     "iopub.status.idle": "2026-05-21T14:12:58.761943Z",
     "shell.execute_reply": "2026-05-21T14:12:58.761696Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.759604Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Age    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "d2f2e234",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.762254Z",
     "iopub.status.busy": "2026-05-21T14:12:58.762186Z",
     "iopub.status.idle": "2026-05-21T14:12:58.764972Z",
     "shell.execute_reply": "2026-05-21T14:12:58.764704Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.762246Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Age    98\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.max()"
   ]
  },
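  {
   "cell_type": "markdown",
   "id": "c0de5a03",
   "metadata": {},
   "source": [
    "How these edges come about can be checked with `retbins=True`, which makes `pandas.cut` also return the unrounded bin edges. The following sketch does not use the random data above but a hypothetical sample that contains every age from `0` to `98` exactly once. All bins should be 9.8 years wide, except the first one, which pandas extends slightly below the minimum so that the smallest value is included:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0de5a04",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical sample: every age from 0 to 98 exactly once\n",
    "sample_ages = np.arange(99)\n",
    "\n",
    "# retbins=True also returns the bin edges before they are rounded for display\n",
    "_, edges = pd.cut(sample_ages, 10, retbins=True)\n",
    "\n",
    "# Distance between neighbouring edges, i.e. the width of each bin\n",
    "widths = np.diff(edges)\n",
    "\n",
    "print(edges)\n",
    "print(widths)"
   ]
  },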
  {
   "cell_type": "markdown",
   "id": "b5ca6122",
   "metadata": {},
   "source": [
    "Mit [pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html) wird die Menge hingegen in Bereiche unterteilt, die annähernd gleich groß sind:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1c7cba64",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.765278Z",
     "iopub.status.busy": "2026-05-21T14:12:58.765217Z",
     "iopub.status.idle": "2026-05-21T14:12:58.768380Z",
     "shell.execute_reply": "2026-05-21T14:12:58.768052Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.765272Z"
    }
   },
   "outputs": [],
   "source": [
    "cats = pd.qcut(ages, 10, precision=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "92f7c0a6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.768805Z",
     "iopub.status.busy": "2026-05-21T14:12:58.768731Z",
     "iopub.status.idle": "2026-05-21T14:12:58.772028Z",
     "shell.execute_reply": "2026-05-21T14:12:58.771761Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.768798Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(-1.0, 8.0]     30\n",
       "(38.0, 50.0]    28\n",
       "(81.0, 91.0]    27\n",
       "(17.0, 26.0]    26\n",
       "(26.0, 38.0]    26\n",
       "(60.0, 70.0]    26\n",
       "(70.0, 81.0]    24\n",
       "(91.0, 98.0]    23\n",
       "(8.0, 17.0]     20\n",
       "(50.0, 60.0]    20\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.Series(cats).value_counts()"
   ]
  },
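  {
   "cell_type": "markdown",
   "id": "c0de5a05",
   "metadata": {},
   "source": [
    "The difference between equal-width bins (`pandas.cut`) and equal-frequency bins (`pandas.qcut`) becomes clearer with skewed data. The following sketch uses hypothetical values, the squares of `0` to `99`, instead of the ages, and passes `labels=False` so that only the bin number of each value is returned:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0de5a06",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical skewed sample: the squares of 0..99\n",
    "values = np.arange(100) ** 2\n",
    "\n",
    "# labels=False returns the integer bin number for each value\n",
    "width_bins = pd.cut(values, 4, labels=False)\n",
    "quantile_bins = pd.qcut(values, 4, labels=False)\n",
    "\n",
    "# Count how many values fall into each of the four bins\n",
    "width_counts = np.bincount(width_bins).tolist()\n",
    "quantile_counts = np.bincount(quantile_bins).tolist()\n",
    "\n",
    "print(width_counts)     # equal-width bins: uneven counts\n",
    "print(quantile_counts)  # quantile bins: the same count in every bin"
   ]
  },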
  {
   "cell_type": "markdown",
   "id": "13fa613d",
   "metadata": {},
   "source": [
    "Wollen wir gewährleisten, dass jede Altersgruppe tatsächlich genau zehn Jahre umfasst, können wir dies mit [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html) direkt angeben:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "fa011ab7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.772494Z",
     "iopub.status.busy": "2026-05-21T14:12:58.772400Z",
     "iopub.status.idle": "2026-05-21T14:12:58.775009Z",
     "shell.execute_reply": "2026-05-21T14:12:58.774723Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.772483Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['0 - 9', '10 - 19', '100 - 109', '20 - 29', '30 - 39', '40 - 49',\n",
       "       '50 - 59', '60 - 69', '70 - 79', '80 - 89', '90 - 99'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "age_groups = [f\"{i} - {i + 9}\" for i in range(0, 109, 10)]\n",
    "cats = pd.Categorical(age_groups)\n",
    "\n",
    "cats.categories"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c620f5a",
   "metadata": {},
   "source": [
    "Für die Gruppierung wird nun [pandas.cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) verwendet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "00c6b05b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-21T14:12:58.775515Z",
     "iopub.status.busy": "2026-05-21T14:12:58.775436Z",
     "iopub.status.idle": "2026-05-21T14:12:58.779602Z",
     "shell.execute_reply": "2026-05-21T14:12:58.779265Z",
     "shell.execute_reply.started": "2026-05-21T14:12:58.775508Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Age</th>\n",
       "      <th>Age group</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>23</td>\n",
       "      <td>20 - 29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>60</td>\n",
       "      <td>60 - 69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>0 - 9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>34</td>\n",
       "      <td>30 - 39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>36</td>\n",
       "      <td>30 - 39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <td>41</td>\n",
       "      <td>40 - 49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <td>66</td>\n",
       "      <td>60 - 69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>247</th>\n",
       "      <td>84</td>\n",
       "      <td>80 - 89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>248</th>\n",
       "      <td>78</td>\n",
       "      <td>70 - 79</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>249</th>\n",
       "      <td>33</td>\n",
       "      <td>30 - 39</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>250 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     Age Age group\n",
       "0     23   20 - 29\n",
       "1     60   60 - 69\n",
       "2      8     0 - 9\n",
       "3     34   30 - 39\n",
       "4     36   30 - 39\n",
       "..   ...       ...\n",
       "245   41   40 - 49\n",
       "246   66   60 - 69\n",
       "247   84   80 - 89\n",
       "248   78   70 - 79\n",
       "249   33   30 - 39\n",
       "\n",
       "[250 rows x 2 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Age group\"] = pd.cut(df.Age, range(0, 111, 10), right=False, labels=cats)\n",
    "\n",
    "df"
   ]
  }
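  ,
  {
   "cell_type": "markdown",
   "id": "c0de5a07",
   "metadata": {},
   "source": [
    "The effect of `right=False` is easiest to see at the bin edges. The following sketch compares both settings for a few hypothetical boundary ages; the edges and ages are made up for illustration. With the default `right=True`, age `0` falls outside the first interval `(0, 10]` and age `10` is still counted in the first bin, whereas `right=False` puts `0` in the first bin and `10` in the second:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0de5a08",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "# Hypothetical edges and ages that sit exactly on or next to an edge\n",
    "sample_edges = [0, 10, 20, 30]\n",
    "boundary_ages = pd.Series([0, 9, 10, 19, 20])\n",
    "\n",
    "# Default right=True: intervals (0, 10], (10, 20], (20, 30]\n",
    "right_closed = pd.cut(boundary_ages, sample_edges)\n",
    "\n",
    "# right=False: intervals [0, 10), [10, 20), [20, 30)\n",
    "left_closed = pd.cut(boundary_ages, sample_edges, right=False)\n",
    "\n",
    "# Codes give the bin number of each age; -1 marks a value outside all bins\n",
    "right_codes = right_closed.cat.codes.tolist()\n",
    "left_codes = left_closed.cat.codes.tolist()\n",
    "\n",
    "print(right_codes)\n",
    "print(left_codes)"
   ]
  }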
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.13 Kernel",
   "language": "python",
   "name": "python313"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.0"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}