{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TDDA: Test-Driven Data Analysis\n",
"\n",
"[TDDA](https://github.com/tdda/tdda) verwendet Dateieingaben (wie NumPy-Arrays oder Pandas DataFrames) und eine Reihe von Einschränkungen (engl.: _constraints_), die als JSON-Datei gespeichert werden.\n",
"\n",
"* [Reference Test](https://tdda.readthedocs.io/en/latest/referencetest.html) unterstützt die Erstellung von Referenztests, die entweder auf `unittest` oder `pytest` basieren.\n",
"* [Constraints](https://tdda.readthedocs.io/en/v1.0.30/constraints.html) wird verwendet, um Constraints aus einem (Pandas)-DataFrame zu ermitteln, sie als JSON auszuschreiben und zu überprüfen, ob Datensätze die Constraints in der Constraints-Datei erfüllen. Es unterstützt auch Tabellen in einer Vielzahl von relationalen Datenbanken.\n",
"* [Rexpy](https://tdda.readthedocs.io/en/v1.0.30/rexpy.html) ist ein Werkzeug zur automatischen Ableitung von regulären Ausdrücken aus einer Spalte in einem Pandas DataFrame oder aus einer (Python)-Liste von Beispielen."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Importe"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:04.077623Z",
"iopub.status.busy": "2026-05-22T13:21:04.077531Z",
"iopub.status.idle": "2026-05-22T13:21:04.420445Z",
"shell.execute_reply": "2026-05-22T13:21:04.419894Z",
"shell.execute_reply.started": "2026-05-22T13:21:04.077613Z"
}
},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"import pandas as pd\n",
"\n",
"from tdda.constraints import detect_df, discover_df, verify_df"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:04.420929Z",
"iopub.status.busy": "2026-05-22T13:21:04.420773Z",
"iopub.status.idle": "2026-05-22T13:21:06.406058Z",
"shell.execute_reply": "2026-05-22T13:21:06.405321Z",
"shell.execute_reply.started": "2026-05-22T13:21:04.420920Z"
}
},
"outputs": [],
"source": [
"df = pd.read_csv(\n",
" \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example.csv\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Daten überprüfen\n",
"\n",
"Mit [pandas.DataFrame.sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) lassen wir uns die ersten zehn Datensätze anzeigen:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.406444Z",
"iopub.status.busy": "2026-05-22T13:21:06.406359Z",
"iopub.status.idle": "2026-05-22T13:21:06.415709Z",
"shell.execute_reply": "2026-05-22T13:21:06.414950Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.406436Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" timestamp | \n",
" username | \n",
" temperature | \n",
" heartrate | \n",
" build | \n",
" latest | \n",
" note | \n",
"
\n",
" \n",
" \n",
" \n",
" | 77827 | \n",
" 2017-02-01T14:54:10 | \n",
" lindsay77 | \n",
" 23 | \n",
" 62 | \n",
" 26317041-1f28-d128-a1af-a44c5dc1eda0 | \n",
" 0 | \n",
" update | \n",
"
\n",
" \n",
" | 65826 | \n",
" 2017-01-27T19:31:16 | \n",
" manderson | \n",
" 10 | \n",
" 70 | \n",
" 5700c2c7-e5e7-5bda-9a17-c3d340344bb4 | \n",
" 0 | \n",
" user | \n",
"
\n",
" \n",
" | 123647 | \n",
" 2017-02-19T21:31:01 | \n",
" robertneal | \n",
" 27 | \n",
" 86 | \n",
" bc9da618-bcfe-6b4c-b7ff-30f6b595c450 | \n",
" 0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 82447 | \n",
" 2017-02-03T11:19:17 | \n",
" davidanderson | \n",
" 28 | \n",
" 60 | \n",
" 81298dd3-1a0f-6ad4-26d2-81907f047b5d | \n",
" 1 | \n",
" sleep | \n",
"
\n",
" \n",
" | 123673 | \n",
" 2017-02-19T21:46:39 | \n",
" ayang | \n",
" 13 | \n",
" 60 | \n",
" 45224df1-bd40-ab37-4a47-afd19c6949f2 | \n",
" 1 | \n",
" update | \n",
"
\n",
" \n",
" | 131661 | \n",
" 2017-02-23T02:32:03 | \n",
" fmedina | \n",
" 16 | \n",
" 81 | \n",
" 0f90ef61-e3ef-4433-2d57-14227bde3aa6 | \n",
" 0 | \n",
" wake | \n",
"
\n",
" \n",
" | 73189 | \n",
" 2017-01-30T18:23:46 | \n",
" jasminfleming | \n",
" 22 | \n",
" 75 | \n",
" 16205322-995b-510f-895d-ca2ecde39901 | \n",
" 0 | \n",
" sleep | \n",
"
\n",
" \n",
" | 92468 | \n",
" 2017-02-07T11:20:06 | \n",
" hudsonangela | \n",
" 9 | \n",
" 61 | \n",
" f7b8f4e2-dfbe-7f8e-fc68-ee37cf6ba4e2 | \n",
" 0 | \n",
" test | \n",
"
\n",
" \n",
" | 78680 | \n",
" 2017-02-01T23:11:35 | \n",
" richardnewton | \n",
" 23 | \n",
" 78 | \n",
" 1ce40a86-f5a8-767c-5afd-396a1dde224f | \n",
" 0 | \n",
" interval | \n",
"
\n",
" \n",
" | 20694 | \n",
" 2017-01-09T18:21:41 | \n",
" zachary74 | \n",
" 22 | \n",
" 83 | \n",
" d1b5c037-8499-893f-6b14-53584b03b050 | \n",
" 0 | \n",
" wake | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" timestamp username temperature heartrate \\\n",
"77827 2017-02-01T14:54:10 lindsay77 23 62 \n",
"65826 2017-01-27T19:31:16 manderson 10 70 \n",
"123647 2017-02-19T21:31:01 robertneal 27 86 \n",
"82447 2017-02-03T11:19:17 davidanderson 28 60 \n",
"123673 2017-02-19T21:46:39 ayang 13 60 \n",
"131661 2017-02-23T02:32:03 fmedina 16 81 \n",
"73189 2017-01-30T18:23:46 jasminfleming 22 75 \n",
"92468 2017-02-07T11:20:06 hudsonangela 9 61 \n",
"78680 2017-02-01T23:11:35 richardnewton 23 78 \n",
"20694 2017-01-09T18:21:41 zachary74 22 83 \n",
"\n",
" build latest note \n",
"77827 26317041-1f28-d128-a1af-a44c5dc1eda0 0 update \n",
"65826 5700c2c7-e5e7-5bda-9a17-c3d340344bb4 0 user \n",
"123647 bc9da618-bcfe-6b4c-b7ff-30f6b595c450 0 NaN \n",
"82447 81298dd3-1a0f-6ad4-26d2-81907f047b5d 1 sleep \n",
"123673 45224df1-bd40-ab37-4a47-afd19c6949f2 1 update \n",
"131661 0f90ef61-e3ef-4433-2d57-14227bde3aa6 0 wake \n",
"73189 16205322-995b-510f-895d-ca2ecde39901 0 sleep \n",
"92468 f7b8f4e2-dfbe-7f8e-fc68-ee37cf6ba4e2 0 test \n",
"78680 1ce40a86-f5a8-767c-5afd-396a1dde224f 0 interval \n",
"20694 d1b5c037-8499-893f-6b14-53584b03b050 0 wake "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Und mit [pandas.DataFrame.dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) lassen wir uns die Datentypen für die einzelnen Spalten anzeigen:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.416107Z",
"iopub.status.busy": "2026-05-22T13:21:06.416018Z",
"iopub.status.idle": "2026-05-22T13:21:06.418708Z",
"shell.execute_reply": "2026-05-22T13:21:06.418176Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.416098Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"timestamp object\n",
"username object\n",
"temperature int64\n",
"heartrate int64\n",
"build object\n",
"latest int64\n",
"note object\n",
"dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Erstellen eines _constraints_-Objekt\n",
"\n",
"Mit `discover_constraints` kann ein Vonstraints-Objekt erzeugt werden."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.419063Z",
"iopub.status.busy": "2026-05-22T13:21:06.418986Z",
"iopub.status.idle": "2026-05-22T13:21:06.629419Z",
"shell.execute_reply": "2026-05-22T13:21:06.628890Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.419055Z"
}
},
"outputs": [],
"source": [
"constraints = discover_df(df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.630309Z",
"iopub.status.busy": "2026-05-22T13:21:06.630178Z",
"iopub.status.idle": "2026-05-22T13:21:06.632807Z",
"shell.execute_reply": "2026-05-22T13:21:06.632391Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.630298Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"constraints"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.633285Z",
"iopub.status.busy": "2026-05-22T13:21:06.633206Z",
"iopub.status.idle": "2026-05-22T13:21:06.635901Z",
"shell.execute_reply": "2026-05-22T13:21:06.635388Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.633277Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Fields([('timestamp', ),\n",
" ('username', ),\n",
" ('temperature',\n",
" ),\n",
" ('heartrate', ),\n",
" ('build', ),\n",
" ('latest', ),\n",
" ('note', )])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"constraints.fields"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Schreiben der _Constraints_ in eine Datei"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.636308Z",
"iopub.status.busy": "2026-05-22T13:21:06.636238Z",
"iopub.status.idle": "2026-05-22T13:21:06.638775Z",
"shell.execute_reply": "2026-05-22T13:21:06.638331Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.636301Z"
}
},
"outputs": [],
"source": [
"with Path.open(\"../../data/iot_example.json\", \"w\") as f:\n",
" f.write(constraints.to_json())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wenn wir uns die Datei genauer betrachten können wir erkennen, dass z.B. für die `timestamp`-Spalte eine Zeichenkette mit 19 Zeichen erwartet wird und `temperature` Integer mit Werten von 5–29 erwartet."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.639151Z",
"iopub.status.busy": "2026-05-22T13:21:06.639088Z",
"iopub.status.idle": "2026-05-22T13:21:06.766514Z",
"shell.execute_reply": "2026-05-22T13:21:06.765668Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.639145Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"creation_metadata\": {\n",
" \"local_time\": \"2026-05-22T15:21:06\",\n",
" \"utc_time\": \"2026-05-22T13:21:06+00:00\",\n",
" \"creator\": \"TDDA 2.2.17\",\n",
" \"host\": \"fay.local\",\n",
" \"user\": \"veit\",\n",
" \"n_records\": 146397,\n",
" \"n_selected\": 146397\n",
" },\n",
" \"fields\": {\n",
" \"timestamp\": {\n",
" \"type\": \"string\",\n",
" \"min_length\": 19,\n",
" \"max_length\": 19,\n",
" \"max_nulls\": 0,\n",
" \"no_duplicates\": true\n",
" },\n",
" \"username\": {\n",
" \"type\": \"string\",\n",
" \"min_length\": 3,\n",
" \"max_length\": 21,\n",
" \"max_nulls\": 0\n",
" },\n",
" \"temperature\": {\n",
" \"type\": \"int\",\n",
" \"min\": 5,\n",
" \"max\": 29,\n",
" \"sign\": \"positive\",\n",
" \"max_nulls\": 0\n",
" },\n",
" \"heartrate\": {\n",
" \"type\": \"int\",\n",
" \"min\": 60,\n",
" \"max\": 89,\n",
" \"sign\": \"positive\",\n",
" \"max_nulls\": 0\n",
" },\n",
" \"build\": {\n",
" \"type\": \"string\",\n",
" \"min_length\": 36,\n",
" \"max_length\": 36,\n",
" \"max_nulls\": 0,\n",
" \"no_duplicates\": true\n",
" },\n",
" \"latest\": {\n",
" \"type\": \"int\",\n",
" \"min\": 0,\n",
" \"max\": 1,\n",
" \"sign\": \"non-negative\",\n",
" \"max_nulls\": 0\n",
" },\n",
" \"note\": {\n",
" \"type\": \"string\",\n",
" \"min_length\": 4,\n",
" \"max_length\": 8,\n",
" \"allowed_values\": [\n",
" \"interval\",\n",
" \"sleep\",\n",
" \"test\",\n",
" \"update\",\n",
" \"user\",\n",
" \"wake\"\n",
" ]\n",
" }\n",
" }\n",
"}\n"
]
}
],
"source": [
"!cat ../../data/iot_example.json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Überprüfen von Dataframes\n",
"\n",
"Hierfür lesen wir zunächst eine neue csv-Datei mit Pandas ein und lassen uns dann zehn Datensätze exemplarisch ausgeben:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:06.767281Z",
"iopub.status.busy": "2026-05-22T13:21:06.767151Z",
"iopub.status.idle": "2026-05-22T13:21:08.925123Z",
"shell.execute_reply": "2026-05-22T13:21:08.924734Z",
"shell.execute_reply.started": "2026-05-22T13:21:06.767273Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" timestamp | \n",
" username | \n",
" temperature | \n",
" heartrate | \n",
" build | \n",
" latest | \n",
" note | \n",
"
\n",
" \n",
" \n",
" \n",
" | 50968 | \n",
" 2017-01-21T20:52:57 | \n",
" cguerra | \n",
" 16.0 | \n",
" 60 | \n",
" d745f0ec-cc32-d105-da12-2b78760c031c | \n",
" 1.0 | \n",
" wake | \n",
"
\n",
" \n",
" | 102289 | \n",
" 2017-02-11T09:20:55 | \n",
" james22 | \n",
" NaN | \n",
" 67 | \n",
" 62650884-e7a8-c655-ec93-68481ba40254 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 3516 | \n",
" 2017-01-02T21:59:41 | \n",
" sandersmatthew | \n",
" NaN | \n",
" 69 | \n",
" 3a1202f8-d362-e7a0-0590-c440dddb7251 | \n",
" NaN | \n",
" test | \n",
"
\n",
" \n",
" | 73137 | \n",
" 2017-01-30T17:50:59 | \n",
" dixonalison | \n",
" 12.0 | \n",
" 81 | \n",
" 44ab03bb-a015-04f5-1547-75984ab022b3 | \n",
" 1.0 | \n",
" test | \n",
"
\n",
" \n",
" | 18967 | \n",
" 2017-01-09T01:43:42 | \n",
" dillon91 | \n",
" NaN | \n",
" 63 | \n",
" 944d5821-2a9a-56f9-60f6-4d329f76da46 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 122212 | \n",
" 2017-02-19T07:54:50 | \n",
" philip23 | \n",
" 8.0 | \n",
" 68 | \n",
" e0f57946-2df2-0e9b-1af7-04b174d6fd13 | \n",
" 1.0 | \n",
" sleep | \n",
"
\n",
" \n",
" | 108919 | \n",
" 2017-02-14T00:50:51 | \n",
" wadebrian | \n",
" 18.0 | \n",
" 63 | \n",
" 1cfe6513-21ef-b87b-f21a-e6b1e63dbe29 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 132834 | \n",
" 2017-02-23T13:46:52 | \n",
" walkerkimberly | \n",
" 5.0 | \n",
" 67 | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 76472 | \n",
" 2017-02-01T01:50:30 | \n",
" raymond34 | \n",
" 24.0 | \n",
" 85 | \n",
" NaN | \n",
" 0.0 | \n",
" wake | \n",
"
\n",
" \n",
" | 44192 | \n",
" 2017-01-19T03:49:03 | \n",
" brenda45 | \n",
" 13.0 | \n",
" 60 | \n",
" c72d9894-95ca-4cb5-7517-637d26dad91f | \n",
" 1.0 | \n",
" test | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" timestamp username temperature heartrate \\\n",
"50968 2017-01-21T20:52:57 cguerra 16.0 60 \n",
"102289 2017-02-11T09:20:55 james22 NaN 67 \n",
"3516 2017-01-02T21:59:41 sandersmatthew NaN 69 \n",
"73137 2017-01-30T17:50:59 dixonalison 12.0 81 \n",
"18967 2017-01-09T01:43:42 dillon91 NaN 63 \n",
"122212 2017-02-19T07:54:50 philip23 8.0 68 \n",
"108919 2017-02-14T00:50:51 wadebrian 18.0 63 \n",
"132834 2017-02-23T13:46:52 walkerkimberly 5.0 67 \n",
"76472 2017-02-01T01:50:30 raymond34 24.0 85 \n",
"44192 2017-01-19T03:49:03 brenda45 13.0 60 \n",
"\n",
" build latest note \n",
"50968 d745f0ec-cc32-d105-da12-2b78760c031c 1.0 wake \n",
"102289 62650884-e7a8-c655-ec93-68481ba40254 0.0 NaN \n",
"3516 3a1202f8-d362-e7a0-0590-c440dddb7251 NaN test \n",
"73137 44ab03bb-a015-04f5-1547-75984ab022b3 1.0 test \n",
"18967 944d5821-2a9a-56f9-60f6-4d329f76da46 0.0 NaN \n",
"122212 e0f57946-2df2-0e9b-1af7-04b174d6fd13 1.0 sleep \n",
"108919 1cfe6513-21ef-b87b-f21a-e6b1e63dbe29 0.0 NaN \n",
"132834 NaN 0.0 NaN \n",
"76472 NaN 0.0 wake \n",
"44192 c72d9894-95ca-4cb5-7517-637d26dad91f 1.0 test "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_df = pd.read_csv(\n",
" \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/iot_example_with_nulls.csv\"\n",
")\n",
"\n",
"new_df.sample(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wir sehen mehrere Felder, die als `NaN` ausgegeben werden. Um dies nun systematisch zu analysieren, wenden wir [verify_df](https://tdda.readthedocs.io/en/v1.0.31/constraints.html#tdda.constraints.verify_df) auf unseren neuen DataFrame an. Dabei gibt `passes` gibt die Anzahl der bestandenen, `failures` die Anzahl der fehlgeschlagenen Constraints zurück."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:08.925614Z",
"iopub.status.busy": "2026-05-22T13:21:08.925522Z",
"iopub.status.idle": "2026-05-22T13:21:09.133268Z",
"shell.execute_reply": "2026-05-22T13:21:09.132794Z",
"shell.execute_reply.started": "2026-05-22T13:21:08.925605Z"
}
},
"outputs": [],
"source": [
"v = verify_df(new_df, \"../../data/iot_example.json\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.133717Z",
"iopub.status.busy": "2026-05-22T13:21:09.133633Z",
"iopub.status.idle": "2026-05-22T13:21:09.137406Z",
"shell.execute_reply": "2026-05-22T13:21:09.135945Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.133708Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.137969Z",
"iopub.status.busy": "2026-05-22T13:21:09.137819Z",
"iopub.status.idle": "2026-05-22T13:21:09.140420Z",
"shell.execute_reply": "2026-05-22T13:21:09.139700Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.137960Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"30"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.passes"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.140807Z",
"iopub.status.busy": "2026-05-22T13:21:09.140738Z",
"iopub.status.idle": "2026-05-22T13:21:09.143540Z",
"shell.execute_reply": "2026-05-22T13:21:09.143088Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.140800Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.failures"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wir können uns auch anzeigen lassen, in welchen Spalten welche Constraints bestanden und fehlgeschlagen sind: "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.143934Z",
"iopub.status.busy": "2026-05-22T13:21:09.143863Z",
"iopub.status.idle": "2026-05-22T13:21:09.146470Z",
"shell.execute_reply": "2026-05-22T13:21:09.145469Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.143927Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"FIELDS:\n",
"\n",
"timestamp: 0 failures 5 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓ no_duplicates ✓\n",
"\n",
"username: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✓\n",
"\n",
"temperature: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗\n",
"\n",
"heartrate: 0 failures 5 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✓\n",
"\n",
"build: 1 failure 4 passes type ✓ min_length ✓ max_length ✓ max_nulls ✗ no_duplicates ✓\n",
"\n",
"latest: 1 failure 4 passes type ✓ min ✓ max ✓ sign ✓ max_nulls ✗\n",
"\n",
"note: 0 failures 4 passes type ✓ min_length ✓ max_length ✓ allowed_values ✓\n",
"\n",
"SUMMARY:\n",
"\n",
"Constraints passing: 30\n",
"Constraints failing: 3\n"
]
}
],
"source": [
"print(str(v))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternativ können wir uns diese Ergebnisse auch tabellarisch anzeigen lassen:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.146849Z",
"iopub.status.busy": "2026-05-22T13:21:09.146769Z",
"iopub.status.idle": "2026-05-22T13:21:09.153570Z",
"shell.execute_reply": "2026-05-22T13:21:09.153002Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.146841Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" field | \n",
" failures | \n",
" passes | \n",
" type | \n",
" min | \n",
" min_length | \n",
" max | \n",
" max_length | \n",
" sign | \n",
" max_nulls | \n",
" no_duplicates | \n",
" allowed_values | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" timestamp | \n",
" 0 | \n",
" 5 | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" True | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" username | \n",
" 0 | \n",
" 4 | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 2 | \n",
" temperature | \n",
" 1 | \n",
" 4 | \n",
" True | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" heartrate | \n",
" 0 | \n",
" 5 | \n",
" True | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" True | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 4 | \n",
" build | \n",
" 1 | \n",
" 4 | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" False | \n",
" True | \n",
" NaN | \n",
"
\n",
" \n",
" | 5 | \n",
" latest | \n",
" 1 | \n",
" 4 | \n",
" True | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 6 | \n",
" note | \n",
" 0 | \n",
" 4 | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" True | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" field failures passes type min min_length max max_length \\\n",
"0 timestamp 0 5 True NaN True NaN True \n",
"1 username 0 4 True NaN True NaN True \n",
"2 temperature 1 4 True True NaN True NaN \n",
"3 heartrate 0 5 True True NaN True NaN \n",
"4 build 1 4 True NaN True NaN True \n",
"5 latest 1 4 True True NaN True NaN \n",
"6 note 0 4 True NaN True NaN True \n",
"\n",
" sign max_nulls no_duplicates allowed_values \n",
"0 NaN True True NaN \n",
"1 NaN True NaN NaN \n",
"2 True False NaN NaN \n",
"3 True True NaN NaN \n",
"4 NaN False True NaN \n",
"5 True False NaN NaN \n",
"6 NaN NaN NaN True "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"v.to_frame()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Finden der fehlerhaften Zeilen\n",
"\n",
"`tdda.constraints.pd.constraints.detect_df()` erkennt Datensätze des pandas DataFrame, die gegen eine der Einschränkungen in der bereitgestellten JSON-Datei verstoßen. Anschließend können wir über dem erstellten `PandasDetection`-Objekt die Funktion `detected()` aufrufen um uns die Zeilen ausgeben zu lassen, die fehlerhaft sind:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.154415Z",
"iopub.status.busy": "2026-05-22T13:21:09.154234Z",
"iopub.status.idle": "2026-05-22T13:21:09.389681Z",
"shell.execute_reply": "2026-05-22T13:21:09.388911Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.154403Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" n_failures | \n",
"
\n",
" \n",
" | Index | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | 3 | \n",
" 1 | \n",
"
\n",
" \n",
" | 4 | \n",
" 1 | \n",
"
\n",
" \n",
" | 7 | \n",
" 1 | \n",
"
\n",
" \n",
" | 10 | \n",
" 2 | \n",
"
\n",
" \n",
" | 12 | \n",
" 1 | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
"
\n",
" \n",
" | 146385 | \n",
" 1 | \n",
"
\n",
" \n",
" | 146387 | \n",
" 2 | \n",
"
\n",
" \n",
" | 146391 | \n",
" 2 | \n",
"
\n",
" \n",
" | 146393 | \n",
" 2 | \n",
"
\n",
" \n",
" | 146394 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
77260 rows × 1 columns
\n",
"
"
],
"text/plain": [
" n_failures\n",
"Index \n",
"3 1\n",
"4 1\n",
"7 1\n",
"10 2\n",
"12 1\n",
"... ...\n",
"146385 1\n",
"146387 2\n",
"146391 2\n",
"146393 2\n",
"146394 1\n",
"\n",
"[77260 rows x 1 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = detect_df(new_df, \"iot_example.json\")\n",
"\n",
"d.detected()"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-15T18:10:58.201847Z",
"iopub.status.busy": "2025-12-15T18:10:58.201200Z",
"iopub.status.idle": "2025-12-15T18:10:58.205904Z",
"shell.execute_reply": "2025-12-15T18:10:58.205159Z",
"shell.execute_reply.started": "2025-12-15T18:10:58.201804Z"
}
},
"source": [
"Wir können uns alle fehlerhaften Datensätze anzeigen lassen, indem wir nur den Teil des Index von `new_df` verwenden, der auch in `d.detected()` vorkommt:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.390181Z",
"iopub.status.busy": "2026-05-22T13:21:09.390054Z",
"iopub.status.idle": "2026-05-22T13:21:09.392493Z",
"shell.execute_reply": "2026-05-22T13:21:09.391996Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.390172Z"
}
},
"outputs": [],
"source": [
"d_index = d.detected().index"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.392817Z",
"iopub.status.busy": "2026-05-22T13:21:09.392742Z",
"iopub.status.idle": "2026-05-22T13:21:09.402264Z",
"shell.execute_reply": "2026-05-22T13:21:09.401889Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.392809Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" timestamp | \n",
" username | \n",
" temperature | \n",
" heartrate | \n",
" build | \n",
" latest | \n",
" note | \n",
"
\n",
" \n",
" \n",
" \n",
" | 3 | \n",
" 2017-01-01T12:02:09 | \n",
" eddierodriguez | \n",
" 28.0 | \n",
" 76 | \n",
" NaN | \n",
" 0.0 | \n",
" update | \n",
"
\n",
" \n",
" | 4 | \n",
" 2017-01-01T12:02:36 | \n",
" kenneth94 | \n",
" 29.0 | \n",
" 62 | \n",
" 122f1c6a-403c-2221-6ed1-b5caa08f11e0 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 7 | \n",
" 2017-01-01T12:04:35 | \n",
" scott28 | \n",
" 16.0 | \n",
" 76 | \n",
" 7a60219f-6621-e548-180e-ca69624f9824 | \n",
" NaN | \n",
" interval | \n",
"
\n",
" \n",
" | 10 | \n",
" 2017-01-01T12:06:21 | \n",
" njohnson | \n",
" NaN | \n",
" 63 | \n",
" e09b6001-125d-51cf-9c3f-9cb686c19d02 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 12 | \n",
" 2017-01-01T12:07:41 | \n",
" jessica48 | \n",
" 22.0 | \n",
" 83 | \n",
" 03e1a07b-3e14-412c-3a69-6b45bc79f81c | \n",
" NaN | \n",
" update | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 146385 | \n",
" 2017-02-28T23:53:59 | \n",
" powelleric | \n",
" 20.0 | \n",
" 86 | \n",
" 152eda10-676a-069c-b664-19443f2c8081 | \n",
" NaN | \n",
" test | \n",
"
\n",
" \n",
" | 146387 | \n",
" 2017-02-28T23:54:50 | \n",
" jthompson | \n",
" NaN | \n",
" 66 | \n",
" 8da10303-fe49-e313-8fda-0d5e79ded054 | \n",
" NaN | \n",
" update | \n",
"
\n",
" \n",
" | 146391 | \n",
" 2017-02-28T23:57:21 | \n",
" aaronbecker | \n",
" NaN | \n",
" 87 | \n",
" 7e52f4a8-345c-5ee0-e515-b8c392213062 | \n",
" NaN | \n",
" sleep | \n",
"
\n",
" \n",
" | 146393 | \n",
" 2017-02-28T23:58:43 | \n",
" joelrusso | \n",
" NaN | \n",
" 89 | \n",
" NaN | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 146394 | \n",
" 2017-02-28T23:59:23 | \n",
" lellis | \n",
" NaN | \n",
" 84 | \n",
" dac87426-e147-9c39-6e4c-790bb11f8fc9 | \n",
" 0.0 | \n",
" update | \n",
"
\n",
" \n",
"
\n",
"
77260 rows × 7 columns
\n",
"
"
],
"text/plain": [
" timestamp username temperature heartrate \\\n",
"3 2017-01-01T12:02:09 eddierodriguez 28.0 76 \n",
"4 2017-01-01T12:02:36 kenneth94 29.0 62 \n",
"7 2017-01-01T12:04:35 scott28 16.0 76 \n",
"10 2017-01-01T12:06:21 njohnson NaN 63 \n",
"12 2017-01-01T12:07:41 jessica48 22.0 83 \n",
"... ... ... ... ... \n",
"146385 2017-02-28T23:53:59 powelleric 20.0 86 \n",
"146387 2017-02-28T23:54:50 jthompson NaN 66 \n",
"146391 2017-02-28T23:57:21 aaronbecker NaN 87 \n",
"146393 2017-02-28T23:58:43 joelrusso NaN 89 \n",
"146394 2017-02-28T23:59:23 lellis NaN 84 \n",
"\n",
" build latest note \n",
"3 NaN 0.0 update \n",
"4 122f1c6a-403c-2221-6ed1-b5caa08f11e0 NaN NaN \n",
"7 7a60219f-6621-e548-180e-ca69624f9824 NaN interval \n",
"10 e09b6001-125d-51cf-9c3f-9cb686c19d02 NaN NaN \n",
"12 03e1a07b-3e14-412c-3a69-6b45bc79f81c NaN update \n",
"... ... ... ... \n",
"146385 152eda10-676a-069c-b664-19443f2c8081 NaN test \n",
"146387 8da10303-fe49-e313-8fda-0d5e79ded054 NaN update \n",
"146391 7e52f4a8-345c-5ee0-e515-b8c392213062 NaN sleep \n",
"146393 NaN 0.0 NaN \n",
"146394 dac87426-e147-9c39-6e4c-790bb11f8fc9 0.0 update \n",
"\n",
"[77260 rows x 7 columns]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_df[new_df.index.isin(d_index)]"
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {
"iopub.execute_input": "2025-12-15T18:10:58.201847Z",
"iopub.status.busy": "2025-12-15T18:10:58.201200Z",
"iopub.status.idle": "2025-12-15T18:10:58.205904Z",
"shell.execute_reply": "2025-12-15T18:10:58.205159Z",
"shell.execute_reply.started": "2025-12-15T18:10:58.201804Z"
}
},
"source": [
"Alternativ können wir uns auch alle fehlerfreien Datensätze anzeigen lassen."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"execution": {
"iopub.execute_input": "2026-05-22T13:21:09.402784Z",
"iopub.status.busy": "2026-05-22T13:21:09.402707Z",
"iopub.status.idle": "2026-05-22T13:21:09.411657Z",
"shell.execute_reply": "2026-05-22T13:21:09.411190Z",
"shell.execute_reply.started": "2026-05-22T13:21:09.402777Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" timestamp | \n",
" username | \n",
" temperature | \n",
" heartrate | \n",
" build | \n",
" latest | \n",
" note | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 2017-01-01T12:00:23 | \n",
" michaelsmith | \n",
" 12.0 | \n",
" 67 | \n",
" 4e6a7805-8faa-2768-6ef6-eb3198b483ac | \n",
" 0.0 | \n",
" interval | \n",
"
\n",
" \n",
" | 1 | \n",
" 2017-01-01T12:01:09 | \n",
" kharrison | \n",
" 6.0 | \n",
" 78 | \n",
" 7256b7b0-e502-f576-62ec-ed73533c9c84 | \n",
" 0.0 | \n",
" wake | \n",
"
\n",
" \n",
" | 2 | \n",
" 2017-01-01T12:01:34 | \n",
" smithadam | \n",
" 5.0 | \n",
" 89 | \n",
" 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 5 | \n",
" 2017-01-01T12:03:04 | \n",
" bryanttodd | \n",
" 13.0 | \n",
" 86 | \n",
" 0897dbe5-9c5b-71ca-73a1-7586959ca198 | \n",
" 0.0 | \n",
" interval | \n",
"
\n",
" \n",
" | 6 | \n",
" 2017-01-01T12:03:51 | \n",
" andrea98 | \n",
" 17.0 | \n",
" 81 | \n",
" 1c07ab9b-5f66-137d-a74f-921a41001f4e | \n",
" 1.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" | 146389 | \n",
" 2017-02-28T23:56:05 | \n",
" kathy63 | \n",
" 5.0 | \n",
" 88 | \n",
" c2f76050-abd4-aee4-7bc0-3498325d0573 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 146390 | \n",
" 2017-02-28T23:56:34 | \n",
" cookallison | \n",
" 16.0 | \n",
" 84 | \n",
" f0b0c1f9-900b-276c-bca9-ac4d4ec4e88e | \n",
" 0.0 | \n",
" user | \n",
"
\n",
" \n",
" | 146392 | \n",
" 2017-02-28T23:58:06 | \n",
" mcontreras | \n",
" 15.0 | \n",
" 63 | \n",
" 69e61a15-d2d0-47a7-1a27-e07b3eeeba10 | \n",
" 0.0 | \n",
" NaN | \n",
"
\n",
" \n",
" | 146395 | \n",
" 2017-02-28T23:59:48 | \n",
" grayjasmin | \n",
" 17.0 | \n",
" 64 | \n",
" 4911a589-3a15-4bbf-1de1-e5a69ab739da | \n",
" 1.0 | \n",
" update | \n",
"
\n",
" \n",
" | 146396 | \n",
" 2017-03-01T00:00:30 | \n",
" jgreene | \n",
" 23.0 | \n",
" 70 | \n",
" 4f95bbca-26a7-29e7-1f19-aaedf1a51741 | \n",
" 0.0 | \n",
" interval | \n",
"
\n",
" \n",
"
\n",
"
69137 rows × 7 columns
\n",
"
"
],
"text/plain": [
" timestamp username temperature heartrate \\\n",
"0 2017-01-01T12:00:23 michaelsmith 12.0 67 \n",
"1 2017-01-01T12:01:09 kharrison 6.0 78 \n",
"2 2017-01-01T12:01:34 smithadam 5.0 89 \n",
"5 2017-01-01T12:03:04 bryanttodd 13.0 86 \n",
"6 2017-01-01T12:03:51 andrea98 17.0 81 \n",
"... ... ... ... ... \n",
"146389 2017-02-28T23:56:05 kathy63 5.0 88 \n",
"146390 2017-02-28T23:56:34 cookallison 16.0 84 \n",
"146392 2017-02-28T23:58:06 mcontreras 15.0 63 \n",
"146395 2017-02-28T23:59:48 grayjasmin 17.0 64 \n",
"146396 2017-03-01T00:00:30 jgreene 23.0 70 \n",
"\n",
" build latest note \n",
"0 4e6a7805-8faa-2768-6ef6-eb3198b483ac 0.0 interval \n",
"1 7256b7b0-e502-f576-62ec-ed73533c9c84 0.0 wake \n",
"2 9226c94b-bb4b-a6c8-8e02-cb42b53e9c90 0.0 NaN \n",
"5 0897dbe5-9c5b-71ca-73a1-7586959ca198 0.0 interval \n",
"6 1c07ab9b-5f66-137d-a74f-921a41001f4e 1.0 NaN \n",
"... ... ... ... \n",
"146389 c2f76050-abd4-aee4-7bc0-3498325d0573 0.0 NaN \n",
"146390 f0b0c1f9-900b-276c-bca9-ac4d4ec4e88e 0.0 user \n",
"146392 69e61a15-d2d0-47a7-1a27-e07b3eeeba10 0.0 NaN \n",
"146395 4911a589-3a15-4bbf-1de1-e5a69ab739da 1.0 update \n",
"146396 4f95bbca-26a7-29e7-1f19-aaedf1a51741 0.0 interval \n",
"\n",
"[69137 rows x 7 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_df[~new_df.index.isin(d_index)]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.14 Kernel",
"language": "python",
"name": "python314"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autoclose": false,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}