{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data validation with Voluptuous (schema definitions)\n", "\n", "In this notebook we use [Voluptuous](https://github.com/alecthomas/voluptuous) to define schemas for our data. We can then run the schema checks at various points in our cleaning process to make sure the data meets our criteria. Finally, we can use the exceptions raised by schema validation to flag, set aside or remove dirty or invalid data.\n", "\n", "
\n", "\n", "**See also**\n", "\n", "* [Validr](https://github.com/guyskk/validr)\n", "* [marshmallow](https://marshmallow.readthedocs.io/en/latest/)\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.052699Z", "iopub.status.busy": "2026-05-22T13:14:38.052524Z", "iopub.status.idle": "2026-05-22T13:14:38.263804Z", "shell.execute_reply": "2026-05-22T13:14:38.263365Z", "shell.execute_reply.started": "2026-05-22T13:14:38.052681Z" } }, "outputs": [], "source": [ "import datetime\n", "import logging\n", "\n", "import pandas as pd\n", "\n", "from voluptuous import ALLOW_EXTRA, All, Range, Required, Schema\n", "from voluptuous.error import Invalid, MultipleInvalid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `Required` marks a schema node as required and optionally specifies a default value; see also [voluptuous.schema_builder.Required](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.schema_builder.Required).\n", "* `Range` limits the value to a range, where either `min` or `max` may be omitted; see also [voluptuous.validators.Range](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.validators.Range).\n", "* `All` requires a value to pass every validator it is given, applying them in order and passing the output of each one on to the next. It can also be used for cross-field validation: a first pass checks the basic structure of the data and only a second pass applies the cross-field checks; see also [voluptuous.validators.All](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.validators.All).\n", "* `ALLOW_EXTRA` allows additional dictionary keys that are not declared in the schema.\n", "* `MultipleInvalid` is a subclass of `Invalid` that collects one or more validation errors; it is the exception raised when a schema is called with invalid data; see also [voluptuous.error.MultipleInvalid](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.error.MultipleInvalid).\n", "* `Invalid` marks data as invalid; see also
[voluptuous.error.Invalid](http://alecthomas.github.io/voluptuous/docs/_build/html/voluptuous.html?highlight=required#voluptuous.error.Invalid)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Logger" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.264371Z", "iopub.status.busy": "2026-05-22T13:14:38.264267Z", "iopub.status.idle": "2026-05-22T13:14:38.266508Z", "shell.execute_reply": "2026-05-22T13:14:38.266059Z", "shell.execute_reply.started": "2026-05-22T13:14:38.264362Z" } }, "outputs": [], "source": [ "logger = logging.getLogger(__name__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Read the sample data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.266930Z", "iopub.status.busy": "2026-05-22T13:14:38.266855Z", "iopub.status.idle": "2026-05-22T13:14:38.345696Z", "shell.execute_reply": "2026-05-22T13:14:38.345321Z", "shell.execute_reply.started": "2026-05-22T13:14:38.266923Z" } }, "outputs": [], "source": [ "sales = pd.read_csv(\n", "    \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/sales_data.csv\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Examine the data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.346151Z", "iopub.status.busy": "2026-05-22T13:14:38.346074Z", "iopub.status.idle": "2026-05-22T13:14:38.353736Z", "shell.execute_reply": "2026-05-22T13:14:38.353460Z", "shell.execute_reply.started": "2026-05-22T13:14:38.346143Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Unnamed: 0</th>\n", "      <th>timestamp</th>\n", "      <th>city</th>\n", "      <th>store_id</th>\n", "      <th>sale_number</th>\n", "      <th>sale_amount</th>\n", "      <th>associate</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>0</th>\n", "      <td>0</td>\n", "      <td>2018-09-10 05:00:45</td>\n", "      <td>Williamburgh</td>\n", "      <td>6</td>\n", "      <td>1530</td>\n", "      <td>1167.0</td>\n", "      <td>Gary Lee</td>\n", "    </tr>\n", "    <tr>\n", "      <th>1</th>\n", "      <td>1</td>\n", "      <td>2018-09-12 10:01:27</td>\n", "      <td>Ibarraberg</td>\n", "      <td>1</td>\n", "      <td>2744</td>\n", "      <td>258.0</td>\n", "      <td>Daniel Davis</td>\n", "    </tr>\n", "    <tr>\n", "      <th>2</th>\n", "      <td>2</td>\n", "      <td>2018-09-13 12:01:48</td>\n", "      <td>Sarachester</td>\n", "      <td>2</td>\n", "      <td>1908</td>\n", "      <td>266.0</td>\n", "      <td>Michael Roth</td>\n", "    </tr>\n", "    <tr>\n", "      <th>3</th>\n", "      <td>3</td>\n", "      <td>2018-09-14 20:02:19</td>\n", "      <td>Caldwellbury</td>\n", "      <td>14</td>\n", "      <td>771</td>\n", "      <td>-108.0</td>\n", "      <td>Michaela Stewart</td>\n", "    </tr>\n", "    <tr>\n", "      <th>4</th>\n", "      <td>4</td>\n", "      <td>2018-09-16 01:03:21</td>\n", "      <td>Erikaland</td>\n", "      <td>11</td>\n", "      <td>1571</td>\n", "      <td>-372.0</td>\n", "      <td>Mark Taylor</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Unnamed: 0 timestamp city store_id sale_number \\\n", "0 0 2018-09-10 05:00:45 Williamburgh 6 1530 \n", "1 1 2018-09-12 10:01:27 Ibarraberg 1 2744 \n", "2 2 2018-09-13 12:01:48 Sarachester 2 1908 \n", "3 3 2018-09-14 20:02:19 Caldwellbury 14 771 \n", "4 4 2018-09-16 01:03:21 Erikaland 11 1571 \n", "\n", " sale_amount associate \n", "0 1167.0 Gary Lee \n", "1 258.0 Daniel Davis \n", "2 266.0 Michael Roth \n", "3 -108.0 Michaela Stewart \n", "4 -372.0 Mark Taylor " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.354331Z", "iopub.status.busy": "2026-05-22T13:14:38.354153Z", "iopub.status.idle": "2026-05-22T13:14:38.356797Z", "shell.execute_reply": "2026-05-22T13:14:38.356489Z", "shell.execute_reply.started": "2026-05-22T13:14:38.354322Z" } }, "outputs": [ { "data": { "text/plain": [ "Unnamed: 0 int64\n", "timestamp object\n", "city object\n", "store_id int64\n", "sale_number int64\n", "sale_amount float64\n", "associate object\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sales.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Define a schema\n", "\n", "In the `sale_amount` column, all values should lie between 2.50 and 1450.99 (both bounds included)."
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.358482Z", "iopub.status.busy": "2026-05-22T13:14:38.358383Z", "iopub.status.idle": "2026-05-22T13:14:38.362043Z", "shell.execute_reply": "2026-05-22T13:14:38.361726Z", "shell.execute_reply.started": "2026-05-22T13:14:38.358475Z" } }, "outputs": [], "source": [ "schema = Schema(\n", " {\n", " Required(\"sale_amount\"): All(float, Range(min=2.50, max=1450.99)),\n", " },\n", " extra=ALLOW_EXTRA,\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.362535Z", "iopub.status.busy": "2026-05-22T13:14:38.362458Z", "iopub.status.idle": "2026-05-22T13:14:38.384467Z", "shell.execute_reply": "2026-05-22T13:14:38.383916Z", "shell.execute_reply.started": "2026-05-22T13:14:38.362529Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "issue with sale: 3 (2018-09-14 20:02:19) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 4 (2018-09-16 01:03:21) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 5 (2018-09-18 03:04:11) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 6 (2018-09-20 12:04:49) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 7 (2018-09-21 15:05:42) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 10 (2018-09-27 04:07:32) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 13 (2018-10-02 17:08:42) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 15 (2018-10-07 06:09:00) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 19 (2018-10-14 08:10:23) - value must be at least 2.5 for dictionary value @ 
data['sale_amount']\n", "issue with sale: 20 (2018-10-16 12:10:55) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 22 (2018-10-20 02:11:21) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 23 (2018-10-22 03:12:13) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 25 (2018-10-24 14:13:28) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 28 (2018-10-29 22:15:38) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 31 (2018-11-03 09:17:33) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 38 (2018-11-18 08:19:35) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 40 (2018-11-21 19:20:45) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 41 (2018-11-23 02:21:43) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 45 (2018-11-30 00:24:21) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 46 (2018-12-01 05:24:30) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 48 (2018-12-05 16:25:18) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 51 (2018-12-10 12:26:57) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 55 (2018-12-16 01:28:14) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 56 (2018-12-18 07:29:18) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 59 (2018-12-22 15:29:58) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 60 (2018-12-24 20:30:45) - value must 
be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 63 (2018-12-29 17:32:50) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 65 (2019-01-01 08:34:21) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 69 (2019-01-08 02:37:14) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 70 (2019-01-10 08:37:44) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 71 (2019-01-11 10:37:50) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 74 (2019-01-15 22:39:00) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 76 (2019-01-20 15:40:17) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 86 (2019-02-06 18:46:26) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 101 (2019-03-03 11:55:51) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 103 (2019-03-05 19:56:30) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 104 (2019-03-06 20:56:50) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 107 (2019-03-13 13:58:38) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 111 (2019-03-19 15:00:18) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 112 (2019-03-20 16:00:25) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 116 (2019-03-28 19:02:35) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 120 (2019-04-03 12:04:38) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue 
with sale: 123 (2019-04-10 00:05:47) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 124 (2019-04-12 07:06:20) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 129 (2019-04-20 17:09:07) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 130 (2019-04-21 23:09:18) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 132 (2019-04-26 10:10:45) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 141 (2019-05-11 19:14:07) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 142 (2019-05-14 02:14:32) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 149 (2019-05-25 18:18:18) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 152 (2019-05-30 13:20:01) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 155 (2019-06-03 03:22:02) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 156 (2019-06-05 07:22:10) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 157 (2019-06-07 08:23:07) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 162 (2019-06-15 08:25:52) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 164 (2019-06-17 19:26:37) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 167 (2019-06-22 21:27:21) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 171 (2019-06-30 01:29:59) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 178 (2019-07-11 07:33:55) - value must be at least 
2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 180 (2019-07-13 14:35:35) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 187 (2019-07-28 21:38:54) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 194 (2019-08-11 09:41:58) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 195 (2019-08-12 12:42:27) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 196 (2019-08-13 16:42:30) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 203 (2019-08-25 01:45:08) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 206 (2019-08-29 16:45:54) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 207 (2019-08-31 00:46:11) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n", "issue with sale: 209 (2019-09-03 12:47:26) - value must be at most 1450.99 for dictionary value @ data['sale_amount']\n", "issue with sale: 211 (2019-09-07 23:48:08) - value must be at least 2.5 for dictionary value @ data['sale_amount']\n" ] } ], "source": [ "error_count = 0\n", "for s_id, sale in sales.T.to_dict().items():\n", "    try:\n", "        schema(sale)\n", "    except MultipleInvalid as e:\n", "        logger.warning(\n", "            f\"issue with sale: {s_id} ({sale['timestamp']}) - {e}\",\n", "        )\n", "        error_count += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To use the elements of one column as keys and the elements of another column as values, we simply make the desired column the index of the DataFrame and transpose it with the `.T` attribute (shorthand for `.transpose()`); see also [pandas.DataFrame.transpose](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html)."
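, "\n", "\n", "A small sketch of this index-then-transpose pattern, assuming a made-up two-row frame: the frame `df` and the variable names are illustrative and not part of the notebook's data. It also shows `to_dict(orient='index')`, which yields the same mapping without the transpose step.\n", "\n", "
```python
import pandas as pd

# Tiny hypothetical frame that mirrors three columns of the sales data.
df = pd.DataFrame(
    {
        'timestamp': ['2018-09-10 05:00:45', '2018-09-12 10:01:27'],
        'store_id': [6, 1],
        'sale_amount': [1167.0, 258.0],
    }
)

# Make the timestamp the index, then transpose so that each timestamp
# becomes a column; to_dict() then yields one inner dict per timestamp.
by_timestamp = df.set_index('timestamp').T.to_dict()

# The same mapping without the transpose step.
by_timestamp_direct = df.set_index('timestamp').to_dict(orient='index')

for key, row in by_timestamp.items():
    print(key, row)
```
"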
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.385150Z", "iopub.status.busy": "2026-05-22T13:14:38.384932Z", "iopub.status.idle": "2026-05-22T13:14:38.387164Z", "shell.execute_reply": "2026-05-22T13:14:38.386927Z", "shell.execute_reply.started": "2026-05-22T13:14:38.385140Z" } }, "outputs": [ { "data": { "text/plain": [ "69" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "error_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, however, we do not yet know whether\n", "\n", "* our schema is defined incorrectly\n", "* the negative values are returns or mislabelled entries\n", "* the higher values are combined purchases or special sales" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Adding a custom validation" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.387652Z", "iopub.status.busy": "2026-05-22T13:14:38.387532Z", "iopub.status.idle": "2026-05-22T13:14:38.390099Z", "shell.execute_reply": "2026-05-22T13:14:38.389793Z", "shell.execute_reply.started": "2026-05-22T13:14:38.387644Z" } }, "outputs": [], "source": [ "def valid_date(fmt=\"%Y-%m-%d %H:%M:%S\"):\n", "    return lambda v: datetime.datetime.strptime(v, fmt).replace(\n", "        tzinfo=datetime.timezone.utc\n", "    )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.390446Z", "iopub.status.busy": "2026-05-22T13:14:38.390369Z", "iopub.status.idle": "2026-05-22T13:14:38.392866Z", "shell.execute_reply": "2026-05-22T13:14:38.392464Z", "shell.execute_reply.started": "2026-05-22T13:14:38.390439Z" } }, "outputs": [], "source": [ "schema = Schema(\n", "    {\n", "        Required(\"timestamp\"): All(valid_date()),\n", "    },\n", "    extra=ALLOW_EXTRA,\n", ")" ] }, { "cell_type": "code",
"execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.393313Z", "iopub.status.busy": "2026-05-22T13:14:38.393198Z", "iopub.status.idle": "2026-05-22T13:14:38.400002Z", "shell.execute_reply": "2026-05-22T13:14:38.399663Z", "shell.execute_reply.started": "2026-05-22T13:14:38.393303Z" } }, "outputs": [], "source": [ "error_count = 0\n", "for s_id, sale in sales.T.to_dict().items():\n", "    try:\n", "        schema(sale)\n", "    except MultipleInvalid as e:\n", "        logger.warning(\n", "            f\"issue with sale: {s_id} ({sale['timestamp']}) - {e}\",\n", "        )\n", "        error_count += 1" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.400551Z", "iopub.status.busy": "2026-05-22T13:14:38.400483Z", "iopub.status.idle": "2026-05-22T13:14:38.403165Z", "shell.execute_reply": "2026-05-22T13:14:38.402862Z", "shell.execute_reply.started": "2026-05-22T13:14:38.400544Z" } }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "error_count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. A valid date format does not guarantee a valid date" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.403571Z", "iopub.status.busy": "2026-05-22T13:14:38.403496Z", "iopub.status.idle": "2026-05-22T13:14:38.405550Z", "shell.execute_reply": "2026-05-22T13:14:38.405359Z", "shell.execute_reply.started": "2026-05-22T13:14:38.403565Z" } }, "outputs": [], "source": [ "def valid_date(fmt=\"%Y-%m-%d %H:%M:%S\"):\n", "    def validation_func(v):\n", "        # strptime raises ValueError for a malformed string, which\n", "        # Voluptuous reports as an invalid value.\n", "        parsed = datetime.datetime.strptime(v, fmt).replace(\n", "            tzinfo=datetime.timezone.utc\n", "        )\n", "        # Explicit check instead of assert, which python -O strips.\n", "        if parsed > datetime.datetime.now(tz=datetime.timezone.utc):\n", "            msg = f\"The date is in the future! {v}\"\n", "            raise Invalid(msg)\n", "        # Return the parsed value so it replaces the raw string.\n", "        return parsed\n", "\n", "    return validation_func" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.405821Z", "iopub.status.busy": "2026-05-22T13:14:38.405766Z", "iopub.status.idle": "2026-05-22T13:14:38.407511Z", "shell.execute_reply": "2026-05-22T13:14:38.407218Z", "shell.execute_reply.started": "2026-05-22T13:14:38.405814Z" } }, "outputs": [], "source": [ "schema = Schema(\n", "    {\n", "        Required(\"timestamp\"): All(valid_date()),\n", "    },\n", "    extra=ALLOW_EXTRA,\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.408354Z", "iopub.status.busy": "2026-05-22T13:14:38.407823Z", "iopub.status.idle": "2026-05-22T13:14:38.414010Z", "shell.execute_reply": "2026-05-22T13:14:38.413714Z", "shell.execute_reply.started": "2026-05-22T13:14:38.408345Z" } }, "outputs": [], "source": [ "error_count = 0\n", "for s_id, sale in sales.T.to_dict().items():\n", "    try:\n", "        schema(sale)\n", "    except MultipleInvalid as e:\n", "        logger.warning(\n", "            f\"issue with sale: {s_id} ({sale['timestamp']}) - {e}\",\n", "        )\n", "        error_count += 1" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2026-05-22T13:14:38.414672Z", "iopub.status.busy": "2026-05-22T13:14:38.414598Z", "iopub.status.idle": "2026-05-22T13:14:38.417892Z", "shell.execute_reply": "2026-05-22T13:14:38.417509Z", "shell.execute_reply.started": "2026-05-22T13:14:38.414665Z" } }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "error_count" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name":
"python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }