{ "cells": [ { "cell_type": "markdown", "id": "290ea5a6", "metadata": {}, "source": [ "# Subdividing and categorising data\n", "\n", "Continuous data is often divided into bins or otherwise grouped for analysis." ] }, { "cell_type": "markdown", "id": "391c3176", "metadata": {}, "source": [ "Suppose you have data about a group of people in a study that you want to divide into discrete age groups. To do this, we generate a DataFrame with 250 random integer ages from `0` to `98` (the upper bound `99` passed to `rng.integers` is exclusive):" ] }, { "cell_type": "code", "execution_count": 1, "id": "deec7725", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.523247Z", "iopub.status.busy": "2026-05-21T14:12:58.523059Z", "iopub.status.idle": "2026-05-21T14:12:58.742174Z", "shell.execute_reply": "2026-05-21T14:12:58.741883Z", "shell.execute_reply.started": "2026-05-21T14:12:58.523208Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Age</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>23</td></tr>\n",
"<tr><th>1</th><td>60</td></tr>\n",
"<tr><th>2</th><td>8</td></tr>\n",
"<tr><th>3</th><td>34</td></tr>\n",
"<tr><th>4</th><td>36</td></tr>\n",
"<tr><th>...</th><td>...</td></tr>\n",
"<tr><th>245</th><td>41</td></tr>\n",
"<tr><th>246</th><td>66</td></tr>\n",
"<tr><th>247</th><td>84</td></tr>\n",
"<tr><th>248</th><td>78</td></tr>\n",
"<tr><th>249</th><td>33</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>250 rows × 1 columns</p>\n",
"</div>
" ], "text/plain": [ " Age\n", "0 23\n", "1 60\n", "2 8\n", "3 34\n", "4 36\n", ".. ...\n", "245 41\n", "246 66\n", "247 84\n", "248 78\n", "249 33\n", "\n", "[250 rows x 1 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "\n", "rng = np.random.default_rng()\n", "# Draw 250 integer ages; the upper bound 99 is exclusive, so values run from 0 to 98\n", "ages = rng.integers(0, 99, 250)\n", "df = pd.DataFrame({\"Age\": ages})\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "2c991e35", "metadata": {}, "source": [ "With [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html), pandas then offers a simple way to divide the values into ten bins of equal width. To round the bin edges to whole years, we additionally set `precision=0`:" ] }, { "cell_type": "code", "execution_count": 2, "id": "e9d30508", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.742687Z", "iopub.status.busy": "2026-05-21T14:12:58.742561Z", "iopub.status.idle": "2026-05-21T14:12:58.746210Z", "shell.execute_reply": "2026-05-21T14:12:58.745978Z", "shell.execute_reply.started": "2026-05-21T14:12:58.742678Z" }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[(20.0, 29.0], (59.0, 69.0], (-0.1, 10.0], (29.0, 39.0], (29.0, 39.0], ..., (39.0, 49.0], (59.0, 69.0], (78.0, 88.0], (69.0, 78.0], (29.0, 39.0]]\n", "Length: 250\n", "Categories (10, interval[float64, right]): [(-0.1, 10.0] < (10.0, 20.0] < (20.0, 29.0] < (29.0, 39.0] ... 
(59.0, 69.0] < (69.0, 78.0] < (78.0, 88.0] < (88.0, 98.0]]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats = pd.cut(ages, 10, precision=0)\n", "\n", "cats" ] }, { "cell_type": "markdown", "id": "ca81953c", "metadata": {}, "source": [ "With [pandas.Categorical.categories](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.categories.html) you can display the categories:" ] }, { "cell_type": "code", "execution_count": 3, "id": "1b8fb7b8", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.746581Z", "iopub.status.busy": "2026-05-21T14:12:58.746506Z", "iopub.status.idle": "2026-05-21T14:12:58.748707Z", "shell.execute_reply": "2026-05-21T14:12:58.748498Z", "shell.execute_reply.started": "2026-05-21T14:12:58.746574Z" } }, "outputs": [ { "data": { "text/plain": [ "IntervalIndex([(-0.1, 10.0], (10.0, 20.0], (20.0, 29.0], (29.0, 39.0],\n", " (39.0, 49.0], (49.0, 59.0], (59.0, 69.0], (69.0, 78.0],\n", " (78.0, 88.0], (88.0, 98.0]],\n", " dtype='interval[float64, right]')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories" ] }, { "cell_type": "markdown", "id": "ed9275b5", "metadata": {}, "source": [ "…or just a single category:" ] }, { "cell_type": "code", "execution_count": 4, "id": "ab486178", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.749152Z", "iopub.status.busy": "2026-05-21T14:12:58.749051Z", "iopub.status.idle": "2026-05-21T14:12:58.751151Z", "shell.execute_reply": "2026-05-21T14:12:58.750822Z", "shell.execute_reply.started": "2026-05-21T14:12:58.749144Z" } }, "outputs": [ { "data": { "text/plain": [ "Interval(-0.1, 10.0, closed='right')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.categories[0]" ] }, { "cell_type": "markdown", "id": "077d8eeb", "metadata": {}, "source": [ "With [pandas.Categorical.codes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.codes.html) you can display an array that holds, for each value, the integer code of its category, that is, its position in `cats.categories`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "27faf8a0", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.751702Z", "iopub.status.busy": "2026-05-21T14:12:58.751610Z", "iopub.status.idle": "2026-05-21T14:12:58.753884Z", "shell.execute_reply": "2026-05-21T14:12:58.753569Z", "shell.execute_reply.started": "2026-05-21T14:12:58.751694Z" } }, "outputs": [ { "data": { "text/plain": [ "array([2, 6, 0, 3, 3, 2, 5, 6, 2, 3, 6, 9, 2, 0, 8, 1, 9, 6, 7, 2, 2, 2,\n", " 5, 7, 2, 8, 9, 0, 2, 1, 7, 6, 8, 0, 0, 8, 0, 0, 6, 0, 7, 5, 3, 9,\n", " 7, 7, 9, 6, 0, 9, 4, 7, 7, 9, 2, 1, 5, 2, 4, 2, 1, 0, 2, 8, 8, 4,\n", " 9, 3, 6, 8, 7, 0, 3, 0, 0, 3, 1, 1, 9, 0, 5, 4, 4, 3, 6, 6, 2, 7,\n", " 2, 7, 3, 6, 2, 5, 7, 1, 5, 9, 1, 5, 7, 3, 5, 6, 7, 9, 9, 8, 3, 1,\n", " 8, 4, 5, 7, 7, 0, 6, 0, 9, 4, 1, 6, 2, 8, 2, 3, 2, 6, 8, 6, 9, 9,\n", " 9, 5, 6, 5, 1, 7, 7, 5, 9, 0, 3, 1, 6, 0, 0, 6, 8, 1, 5, 3, 9, 4,\n", " 8, 9, 0, 0, 7, 1, 9, 1, 1, 3, 0, 7, 6, 4, 1, 2, 3, 9, 4, 5, 4, 8,\n", " 9, 2, 5, 6, 0, 6, 7, 2, 6, 3, 9, 3, 1, 1, 9, 3, 1, 6, 9, 6, 0, 9,\n", " 1, 8, 0, 6, 8, 8, 9, 9, 4, 7, 4, 7, 3, 4, 1, 2, 4, 2, 8, 8, 6, 9,\n", " 7, 0, 5, 6, 1, 4, 3, 8, 1, 5, 9, 2, 7, 0, 0, 4, 2, 0, 8, 0, 3, 0,\n", " 3, 4, 9, 4, 6, 8, 7, 3], dtype=int8)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cats.codes" ] }, { "cell_type": "markdown", "id": "5629393d", "metadata": {}, "source": [ "With `value_counts` we can now see how the values are distributed across the individual bins:" ] }, { "cell_type": "code", "execution_count": 6, "id": "b9656f7f", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.755605Z", "iopub.status.busy": "2026-05-21T14:12:58.755482Z", "iopub.status.idle": 
"2026-05-21T14:12:58.759038Z", "shell.execute_reply": "2026-05-21T14:12:58.758718Z", "shell.execute_reply.started": "2026-05-21T14:12:58.755596Z" } }, "outputs": [ { "data": { "text/plain": [ "(-0.1, 10.0] 31\n", "(88.0, 98.0] 31\n", "(59.0, 69.0] 29\n", "(20.0, 29.0] 26\n", "(69.0, 78.0] 26\n", "(10.0, 20.0] 24\n", "(29.0, 39.0] 24\n", "(78.0, 88.0] 22\n", "(39.0, 49.0] 19\n", "(49.0, 59.0] 18\n", "Name: count, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cats).value_counts()" ] }, { "cell_type": "markdown", "id": "a3574d82", "metadata": {}, "source": [ "Notably, the age ranges do not all appear to span the same number of years: two of them, `(20.0, 29.0]` and `(69.0, 78.0]`, cover only 9 years. This is because the ages only range from `0` to `98`, so each of the ten equal-width bins is 9.8 years wide, and `precision=0` rounds the displayed edges to whole numbers:" ] }, { "cell_type": "code", "execution_count": 7, "id": "cfefc1cd", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.759612Z", "iopub.status.busy": "2026-05-21T14:12:58.759491Z", "iopub.status.idle": "2026-05-21T14:12:58.761943Z", "shell.execute_reply": "2026-05-21T14:12:58.761696Z", "shell.execute_reply.started": "2026-05-21T14:12:58.759604Z" } }, "outputs": [ { "data": { "text/plain": [ "Age 0\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.min()" ] }, { "cell_type": "code", "execution_count": 8, "id": "d2f2e234", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.762254Z", "iopub.status.busy": "2026-05-21T14:12:58.762186Z", "iopub.status.idle": "2026-05-21T14:12:58.764972Z", "shell.execute_reply": "2026-05-21T14:12:58.764704Z", "shell.execute_reply.started": "2026-05-21T14:12:58.762246Z" } }, "outputs": [ { "data": { "text/plain": [ "Age 98\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.max()" ] }, { "cell_type": "markdown", "id": 
"b5ca6122", "metadata": {}, "source": [ "[pandas.qcut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html), by contrast, splits the data at sample quantiles, so that each bin contains approximately the same number of values:" ] }, { "cell_type": "code", "execution_count": 9, "id": "1c7cba64", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.765278Z", "iopub.status.busy": "2026-05-21T14:12:58.765217Z", "iopub.status.idle": "2026-05-21T14:12:58.768380Z", "shell.execute_reply": "2026-05-21T14:12:58.768052Z", "shell.execute_reply.started": "2026-05-21T14:12:58.765272Z" } }, "outputs": [], "source": [ "cats = pd.qcut(ages, 10, precision=0)" ] }, { "cell_type": "code", "execution_count": 10, "id": "92f7c0a6", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.768805Z", "iopub.status.busy": "2026-05-21T14:12:58.768731Z", "iopub.status.idle": "2026-05-21T14:12:58.772028Z", "shell.execute_reply": "2026-05-21T14:12:58.771761Z", "shell.execute_reply.started": "2026-05-21T14:12:58.768798Z" } }, "outputs": [ { "data": { "text/plain": [ "(-1.0, 8.0] 30\n", "(38.0, 50.0] 28\n", "(81.0, 91.0] 27\n", "(17.0, 26.0] 26\n", "(26.0, 38.0] 26\n", "(60.0, 70.0] 26\n", "(70.0, 81.0] 24\n", "(91.0, 98.0] 23\n", "(8.0, 17.0] 20\n", "(50.0, 60.0] 20\n", "Name: count, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(cats).value_counts()" ] }, { "cell_type": "markdown", "id": "13fa613d", "metadata": {}, "source": [ "If we want to guarantee that every age group spans exactly ten years, we can define the groups explicitly with [pandas.Categorical](https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html):" ] }, { "cell_type": "code", "execution_count": 11, "id": "fa011ab7", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.772494Z", "iopub.status.busy": "2026-05-21T14:12:58.772400Z", "iopub.status.idle": "2026-05-21T14:12:58.775009Z", 
"shell.execute_reply": "2026-05-21T14:12:58.774723Z", "shell.execute_reply.started": "2026-05-21T14:12:58.772483Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['0 - 9', '10 - 19', '100 - 109', '20 - 29', '30 - 39', '40 - 49',\n", " '50 - 59', '60 - 69', '70 - 79', '80 - 89', '90 - 99'],\n", " dtype='object')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "age_groups = [f\"{i} - {i + 9}\" for i in range(0, 109, 10)]\n", "cats = pd.Categorical(age_groups)\n", "\n", "cats.categories" ] }, { "cell_type": "markdown", "id": "4c620f5a", "metadata": {}, "source": [ "`cats.categories` lists the labels in lexicographic order, which is why `100 - 109` appears in third position. [pandas.cut](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) is now used for the grouping. With `right=False`, each bin includes its left edge and excludes its right edge, so that, for example, `[0, 10)` receives the label `0 - 9`:" ] }, { "cell_type": "code", "execution_count": 12, "id": "00c6b05b", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T14:12:58.775515Z", "iopub.status.busy": "2026-05-21T14:12:58.775436Z", "iopub.status.idle": "2026-05-21T14:12:58.779602Z", "shell.execute_reply": "2026-05-21T14:12:58.779265Z", "shell.execute_reply.started": "2026-05-21T14:12:58.775508Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Age</th><th>Age group</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>23</td><td>20 - 29</td></tr>\n",
"<tr><th>1</th><td>60</td><td>60 - 69</td></tr>\n",
"<tr><th>2</th><td>8</td><td>0 - 9</td></tr>\n",
"<tr><th>3</th><td>34</td><td>30 - 39</td></tr>\n",
"<tr><th>4</th><td>36</td><td>30 - 39</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td></tr>\n",
"<tr><th>245</th><td>41</td><td>40 - 49</td></tr>\n",
"<tr><th>246</th><td>66</td><td>60 - 69</td></tr>\n",
"<tr><th>247</th><td>84</td><td>80 - 89</td></tr>\n",
"<tr><th>248</th><td>78</td><td>70 - 79</td></tr>\n",
"<tr><th>249</th><td>33</td><td>30 - 39</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>250 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " Age Age group\n", "0 23 20 - 29\n", "1 60 60 - 69\n", "2 8 0 - 9\n", "3 34 30 - 39\n", "4 36 30 - 39\n", ".. ... ...\n", "245 41 40 - 49\n", "246 66 60 - 69\n", "247 84 80 - 89\n", "248 78 70 - 79\n", "249 33 30 - 39\n", "\n", "[250 rows x 2 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Age group\"] = pd.cut(df.Age, range(0, 111, 10), right=False, labels=cats)\n", "\n", "df" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }