{ "cells": [ { "cell_type": "markdown", "id": "6527c868", "metadata": {}, "source": [ "# CSV-Beispiel" ] }, { "cell_type": "code", "execution_count": 1, "id": "cc2a9ad2", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.171825Z", "iopub.status.busy": "2026-05-22T08:32:27.171628Z", "iopub.status.idle": "2026-05-22T08:32:27.388484Z", "shell.execute_reply": "2026-05-22T08:32:27.388106Z", "shell.execute_reply.started": "2026-05-22T08:32:27.171807Z" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "2a297c06", "metadata": {}, "source": [ "Nach dem Import von pandas lesen wir zunächst mit `read_csv` eine CSV-Datei ein:" ] }, { "cell_type": "code", "execution_count": 2, "id": "d709242b", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.389022Z", "iopub.status.busy": "2026-05-22T08:32:27.388887Z", "iopub.status.idle": "2026-05-22T08:32:27.724958Z", "shell.execute_reply": "2026-05-22T08:32:27.724342Z", "shell.execute_reply.started": "2026-05-22T08:32:27.389013Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Python basicsenVeit SchieleBSD-3-Clause2021-10-28
0Jupyter TutorialenVeit SchieleBSD-3-Clause2019-06-27
1Jupyter TutorialdeVeit SchieleBSD-3-Clause2020-10-26
2PyViz TutorialenVeit SchieleBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n", "0 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n", "1 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n", "2 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\"\n", ")" ] }, { "cell_type": "markdown", "id": "3a64230a", "metadata": {}, "source": [ "Wie ihr seht, hat diese Datei keine Kopfzeile. Um dem DataFrame eine Kopfzeile zu geben, habt ihr mehrere Möglichkeiten. Ihr könnt pandas erlauben, Standard-Spaltennamen zuzuweisen, oder ihr könnt die Namen auch selbst festlegen:" ] }, { "cell_type": "code", "execution_count": 3, "id": "206ad065", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.726201Z", "iopub.status.busy": "2026-05-22T08:32:27.725808Z", "iopub.status.idle": "2026-05-22T08:32:27.789372Z", "shell.execute_reply": "2026-05-22T08:32:27.788652Z", "shell.execute_reply.started": "2026-05-22T08:32:27.726182Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
0Python basicsenVeit SchieleBSD-3-Clause2021-10-28
1Jupyter TutorialenVeit SchieleBSD-3-Clause2019-06-27
2Jupyter TutorialdeVeit SchieleBSD-3-Clause2020-10-26
3PyViz TutorialenVeit SchieleBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n", "1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n", "2 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n", "3 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " header=None,\n", ")" ] }, { "cell_type": "code", "execution_count": 4, "id": "c34aa2ba", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.790983Z", "iopub.status.busy": "2026-05-22T08:32:27.790608Z", "iopub.status.idle": "2026-05-22T08:32:27.857562Z", "shell.execute_reply": "2026-05-22T08:32:27.856458Z", "shell.execute_reply.started": "2026-05-22T08:32:27.790953Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheAutor*innenLizenzVeröffentlichungsdatum
0Python basicsenVeit SchieleBSD-3-Clause2021-10-28
1Jupyter TutorialenVeit SchieleBSD-3-Clause2019-06-27
2Jupyter TutorialdeVeit SchieleBSD-3-Clause2020-10-26
3PyViz TutorialenVeit SchieleBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " Titel Sprache Autor*innen Lizenz Veröffentlichungsdatum\n", "0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n", "1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n", "2 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n", "3 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " names=[\n", " \"Titel\",\n", " \"Sprache\",\n", " \"Autor*innen\",\n", " \"Lizenz\",\n", " \"Veröffentlichungsdatum\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "id": "bc632510", "metadata": {}, "source": [ "Angenommen, ihr möchtet, dass die Spalte `Autor*innen` der Index des zurückgegebenen DataFrame ist. Ihr könnt entweder angeben, dass ihr die Spalte bei Index 3 oder mit dem Namen `Autor*innen` haben möchtet, indem ihr das Argument `index_col` verwendet:" ] }, { "cell_type": "code", "execution_count": 5, "id": "b6493950", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.859035Z", "iopub.status.busy": "2026-05-22T08:32:27.858717Z", "iopub.status.idle": "2026-05-22T08:32:27.925375Z", "shell.execute_reply": "2026-05-22T08:32:27.924507Z", "shell.execute_reply.started": "2026-05-22T08:32:27.859008Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheLizenzVeröffentlichungsdatum
Autor*innen
Veit SchielePython basicsenBSD-3-Clause2021-10-28
Veit SchieleJupyter TutorialenBSD-3-Clause2019-06-27
Veit SchieleJupyter TutorialdeBSD-3-Clause2020-10-26
Veit SchielePyViz TutorialenBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " Titel Sprache Lizenz Veröffentlichungsdatum\n", "Autor*innen \n", "Veit Schiele Python basics en BSD-3-Clause 2021-10-28\n", "Veit Schiele Jupyter Tutorial en BSD-3-Clause 2019-06-27\n", "Veit Schiele Jupyter Tutorial de BSD-3-Clause 2020-10-26\n", "Veit Schiele PyViz Tutorial en BSD-3-Clause 2020-04-13" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " index_col=[\"Autor*innen\"],\n", " names=[\n", " \"Titel\",\n", " \"Sprache\",\n", " \"Autor*innen\",\n", " \"Lizenz\",\n", " \"Veröffentlichungsdatum\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "id": "8f30e05a", "metadata": {}, "source": [ "Für den Fall, dass ihr einen hierarchischen Index aus mehreren Spalten bilden wollt, übergebt eine Liste von Spaltennummern oder -namen:" ] }, { "cell_type": "code", "execution_count": 6, "id": "b3f16b11", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.929574Z", "iopub.status.busy": "2026-05-22T08:32:27.929233Z", "iopub.status.idle": "2026-05-22T08:32:27.996694Z", "shell.execute_reply": "2026-05-22T08:32:27.996271Z", "shell.execute_reply.started": "2026-05-22T08:32:27.929546Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SpracheLizenzVeröffentlichungsdatum
Autor*innenTitel
Veit SchielePython basicsenBSD-3-Clause2021-10-28
Jupyter TutorialenBSD-3-Clause2019-06-27
Jupyter TutorialdeBSD-3-Clause2020-10-26
PyViz TutorialenBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " Sprache Lizenz Veröffentlichungsdatum\n", "Autor*innen Titel \n", "Veit Schiele Python basics en BSD-3-Clause 2021-10-28\n", " Jupyter Tutorial en BSD-3-Clause 2019-06-27\n", " Jupyter Tutorial de BSD-3-Clause 2020-10-26\n", " PyViz Tutorial en BSD-3-Clause 2020-04-13" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " index_col=[2, 0],\n", " names=[\n", " \"Titel\",\n", " \"Sprache\",\n", " \"Autor*innen\",\n", " \"Lizenz\",\n", " \"Veröffentlichungsdatum\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "id": "7df4d815", "metadata": {}, "source": [ "In manchen Fällen hat eine Tabelle kein festes Trennzeichen, sondern verwendet mehrere Leerzeichen oder ein anderes Muster zur Trennung von Feldern. Angenommen, eine Datei sieht folgendermaßen aus:" ] }, { "cell_type": "code", "execution_count": 7, "id": "a9100e41", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:27.997654Z", "iopub.status.busy": "2026-05-22T08:32:27.997368Z", "iopub.status.idle": "2026-05-22T08:32:28.001408Z", "shell.execute_reply": "2026-05-22T08:32:28.001023Z", "shell.execute_reply.started": "2026-05-22T08:32:27.997638Z" } }, "outputs": [ { "data": { "text/plain": [ "[' Titel Sprache Autor*innen Lizenz Veröffentlichungsdatum\\n',\n", " '1 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\\n',\n", " '2 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\\n',\n", " '3 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\\n',\n", " '4 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13\\n']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pathlib import Path\n", "\n", "\n", "list(Path.open(\"books.txt\"))" ] }, { "cell_type": "markdown", "id": "982a5db3", "metadata": {}, "source": [ "In solchen Fällen könnt ihr einen regulären Ausdruck als Trennzeichen für `read_csv` übergeben. Dies kann durch den regulären Ausdruck `\\s\\s+` ausgedrückt werden, also haben wir dann:" ] }, { "cell_type": "code", "execution_count": 8, "id": "48e383d8", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.002124Z", "iopub.status.busy": "2026-05-22T08:32:28.001963Z", "iopub.status.idle": "2026-05-22T08:32:28.009364Z", "shell.execute_reply": "2026-05-22T08:32:28.009017Z", "shell.execute_reply.started": "2026-05-22T08:32:28.002110Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheAutor*innenLizenzVeröffentlichungsdatum
1Python basicsenVeit SchieleBSD-3-Clause2021-10-28
2Jupyter TutorialenVeit SchieleBSD-3-Clause2019-06-27
3Jupyter TutorialdeVeit SchieleBSD-3-Clause2020-10-26
4PyViz TutorialenVeit SchieleBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " Titel Sprache Autor*innen Lizenz Veröffentlichungsdatum\n", "1 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n", "2 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n", "3 Jupyter Tutorial de Veit Schiele BSD-3-Clause 2020-10-26\n", "4 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"books.txt\", sep=r\"\\s\\s+\", engine=\"python\")" ] }, { "cell_type": "markdown", "id": "39884dfd", "metadata": {}, "source": [ "Da es einen Spaltennamen weniger als die Anzahl der Datenzeilen gab, folgert `read_csv`, dass in diesem Fall die erste Spalte der Index des DataFrame sein sollte." ] }, { "cell_type": "markdown", "id": "eaa94e3c", "metadata": {}, "source": [ "Die Parser-Funktionen haben viele zusätzliche Argumente, die euch helfen, die große Vielfalt der auftretenden Ausnahmedateiformate zu handhaben. So könnt ihr beispielsweise mit `skiprows` einzelne Zeilen einer Datei überspringen:" ] }, { "cell_type": "code", "execution_count": 9, "id": "8021c303", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.010286Z", "iopub.status.busy": "2026-05-22T08:32:28.010065Z", "iopub.status.idle": "2026-05-22T08:32:28.067675Z", "shell.execute_reply": "2026-05-22T08:32:28.067095Z", "shell.execute_reply.started": "2026-05-22T08:32:28.010270Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheAutor*innenLizenzVeröffentlichungsdatum
0Python basicsenVeit SchieleBSD-3-Clause2021-10-28
1Jupyter TutorialenVeit SchieleBSD-3-Clause2019-06-27
2PyViz TutorialenVeit SchieleBSD-3-Clause2020-04-13
\n", "
" ], "text/plain": [ " Titel Sprache Autor*innen Lizenz Veröffentlichungsdatum\n", "0 Python basics en Veit Schiele BSD-3-Clause 2021-10-28\n", "1 Jupyter Tutorial en Veit Schiele BSD-3-Clause 2019-06-27\n", "2 PyViz Tutorial en Veit Schiele BSD-3-Clause 2020-04-13" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " skiprows=[2],\n", " names=[\n", " \"Titel\",\n", " \"Sprache\",\n", " \"Autor*innen\",\n", " \"Lizenz\",\n", " \"Veröffentlichungsdatum\",\n", " ],\n", ")" ] }, { "cell_type": "markdown", "id": "f6d1ac58", "metadata": {}, "source": [ "Der Umgang mit fehlenden Werten ist ein wichtiger und häufig komplizierter Teil beim Parsen von Daten. Fehlende Daten sind normalerweise entweder nicht vorhanden (leerer String) oder durch einen Platzhalter gekennzeichnet. Standardmäßig verwendet Pandas eine Reihe von häufig vorkommenden Platzhalter, wie `NA` und `NULL`:" ] }, { "cell_type": "code", "execution_count": 10, "id": "ccd28c11", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.068839Z", "iopub.status.busy": "2026-05-22T08:32:28.068588Z", "iopub.status.idle": "2026-05-22T08:32:28.137896Z", "shell.execute_reply": "2026-05-22T08:32:28.137429Z", "shell.execute_reply.started": "2026-05-22T08:32:28.068822Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheAutor*innenLizenzVeröffentlichungsdatumdoi
0Python basicsenVeit SchieleBSD-3-Clause2021-10-28NaN
1Jupyter TutorialenVeit SchieleBSD-3-Clause2019-06-27NaN
2Jupyter TutorialdeVeit SchieleBSD-3-Clause2020-10-26NaN
3PyViz TutorialenVeit SchieleBSD-3-Clause2020-04-13NaN
\n", "
" ], "text/plain": [ " Titel Sprache Autor*innen Lizenz \\\n", "0 Python basics en Veit Schiele BSD-3-Clause \n", "1 Jupyter Tutorial en Veit Schiele BSD-3-Clause \n", "2 Jupyter Tutorial de Veit Schiele BSD-3-Clause \n", "3 PyViz Tutorial en Veit Schiele BSD-3-Clause \n", "\n", " Veröffentlichungsdatum doi \n", "0 2021-10-28 NaN \n", "1 2019-06-27 NaN \n", "2 2020-10-26 NaN \n", "3 2020-04-13 NaN " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " names=[\n", " \"Titel\",\n", " \"Sprache\",\n", " \"Autor*innen\",\n", " \"Lizenz\",\n", " \"Veröffentlichungsdatum\",\n", " \"doi\",\n", " ],\n", ")\n", "\n", "df" ] }, { "cell_type": "code", "execution_count": 11, "id": "7583b131", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.138761Z", "iopub.status.busy": "2026-05-22T08:32:28.138531Z", "iopub.status.idle": "2026-05-22T08:32:28.148287Z", "shell.execute_reply": "2026-05-22T08:32:28.147723Z", "shell.execute_reply.started": "2026-05-22T08:32:28.138744Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheAutor*innenLizenzVeröffentlichungsdatumdoi
0FalseFalseFalseFalseFalseTrue
1FalseFalseFalseFalseFalseTrue
2FalseFalseFalseFalseFalseTrue
3FalseFalseFalseFalseFalseTrue
\n", "
" ], "text/plain": [ " Titel Sprache Autor*innen Lizenz Veröffentlichungsdatum doi\n", "0 False False False False False True\n", "1 False False False False False True\n", "2 False False False False False True\n", "3 False False False False False True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna()" ] }, { "cell_type": "markdown", "id": "4fbd2fb9", "metadata": {}, "source": [ "Die Option `na_values` kann entweder eine Liste oder eine Reihe von Strings annehmen, um fehlende Werte zu berücksichtigen:" ] }, { "cell_type": "code", "execution_count": 12, "id": "72fe5154", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.149414Z", "iopub.status.busy": "2026-05-22T08:32:28.149229Z", "iopub.status.idle": "2026-05-22T08:32:28.210251Z", "shell.execute_reply": "2026-05-22T08:32:28.209721Z", "shell.execute_reply.started": "2026-05-22T08:32:28.149398Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TitelSpracheAutor*innenLizenzVeröffentlichungsdatumdoi
0Python basicsenVeit SchieleNaN2021-10-28NaN
1Jupyter TutorialenVeit SchieleNaN2019-06-27NaN
2Jupyter TutorialdeVeit SchieleNaN2020-10-26NaN
3PyViz TutorialenVeit SchieleNaN2020-04-13NaN
\n", "
" ], "text/plain": [ " Titel Sprache Autor*innen Lizenz Veröffentlichungsdatum doi\n", "0 Python basics en Veit Schiele NaN 2021-10-28 NaN\n", "1 Jupyter Tutorial en Veit Schiele NaN 2019-06-27 NaN\n", "2 Jupyter Tutorial de Veit Schiele NaN 2020-10-26 NaN\n", "3 PyViz Tutorial en Veit Schiele NaN 2020-04-13 NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\n", " \"https://raw.githubusercontent.com/veit/python-basics-tutorial-de/main/docs/save-data/books.csv\",\n", " na_values=[\"BSD-3-Clause\"],\n", " names=[\n", " \"Titel\",\n", " \"Sprache\",\n", " \"Autor*innen\",\n", " \"Lizenz\",\n", " \"Veröffentlichungsdatum\",\n", " \"doi\",\n", " ],\n", ")\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "26c6276b", "metadata": {}, "source": [ "Die häufigsten Argumente der Funktion `read_csv`:\n", "\n", "Argument | Beschreibung\n", ":------- | :-----------\n", "`path` | Zeichenfolge, die den Speicherort im Dateisystem, eine URL oder ein dateiähnliches Objekt angibt\n", "`sep` oder `delimiter` | Zeichenfolge oder regulärer Ausdruck zum Trennen der Felder in jeder Zeile\n", "`header` | Zeilennummer, die als Spaltennamen zu verwenden ist; Standardwert ist `0`, also die erste Zeile, sollte aber `None` sein, wenn es keine Kopfzeile gibt\n", "`index_col` | index_col Zeilennummern oder -namen, die als Zeilenindex im Ergebnis verwendet werden sollen; kann ein einzelner Name/eine einzelne Zahl oder eine Liste von ihnen für einen hierarchischen Index sein\n", "`names` | Liste der Spaltennamen\n", "`skiprows` | Anzahl der zu ignorierenden Zeilen am Anfang der Datei oder Liste der Zeilennummern beginnend bei `0`, die übersprungen werden sollen\n", "`na_values` | Folge von Werten, die durch NA ersetzt werden sollen\n", "`comment` | Zeichen, um Kommentare vom Zeilenende abzutrennen\n", "`parse_dates` | Versuch, Daten mit datetime zu parsen; standardmäßig `False`. Wenn `True`, wird versucht, alle Spalten zu parsen. Andernfalls kann eine Liste von Spaltennummern oder -namen angegeben werden, die analysiert werden sollen. Wenn das Element der Liste ein Tupel oder eine Liste ist, werden mehrere Spalten miteinander kombiniert und in ein Datum umgewandelt, z.B. wenn Datum und Uhrzeit auf zwei Spalten aufgeteilt sind\n", "`keep_date_col` | wenn Spalten zum Parsen des Datums kombiniert werden, werden die kombinierten Spalten beibehalten; Standardeinstellung: `False`.\n", "`converters` | Dict, das die Spaltennummer der Namen enthält, die auf Funktionen abgebildet werden, z.B. würde `{'Titel': f}` die Funktion `f` auf alle Werte in der Spalte `Titel` anwenden\n", "`dayfirst` | beim Parsen potenziell mehrdeutiger Datumsangaben als internationales Format behandeln, z.B. `28/6/2021` → `28. Juni 2021`; standardmäßig `False`\n", "`date_parser` | zu verwendende Funktion zum Parsen von Datumsangaben\n", "`nrows` | Anzahl der zu lesenden Zeilen vom Anfang der Datei\n", "`iterator` | Rückgabe eines `TextFileReader`-Objekts zum stückweisen Einlesen der Datei; dieses Objekt kann auch mit der `with`-Anweisung verwendet werden\n", "`chunksize` | Für die Iteration die Größe der Datenblöcke.\n", "`skip_footer` | Anzahl der Zeilen, die am Ende der Datei ignoriert werden sollen\n", "`verbose` | gibt verschiedene Informationen zur Parser-Ausgabe aus, z.B. die Anzahl der fehlenden Werte in nicht-numerischen Spalten.\n", "`encoding` | Textkodierung für Unicode, z.B. `utf-8` für UTF-8-kodierten Text\n", "`squeeze` | wenn die geparsten Daten nur eine Spalte enthalten, wird eine Serie zurückgegeben\n", "`thousands` | Trennzeichen für Tausender, z.B. `,` oder `.`" ] }, { "cell_type": "markdown", "id": "8d3cd94c", "metadata": {}, "source": [ "## Stückweises Einlesen von Textdateien\n", "\n", "Wenn ihr sehr große Dateien verarbeiten wollt, könnt ihr auch nur einen kleinen Teil einer Datei einlesen oder durch kleinere Teile einer Datei iterieren." ] }, { "cell_type": "markdown", "id": "93250dea", "metadata": {}, "source": [ "Bevor wir uns eine große Datei ansehen, verringern wir mit `options.display.max_rows` die Anzahl der angezeigten Zeilen:" ] }, { "cell_type": "code", "execution_count": 13, "id": "d462fe9e", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.211363Z", "iopub.status.busy": "2026-05-22T08:32:28.211031Z", "iopub.status.idle": "2026-05-22T08:32:28.214088Z", "shell.execute_reply": "2026-05-22T08:32:28.213561Z", "shell.execute_reply.started": "2026-05-22T08:32:28.211336Z" } }, "outputs": [], "source": [ "pd.options.display.max_rows = 10" ] }, { "cell_type": "code", "execution_count": 14, "id": "7ad44d21", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.215288Z", "iopub.status.busy": "2026-05-22T08:32:28.214981Z", "iopub.status.idle": "2026-05-22T08:32:28.237508Z", "shell.execute_reply": "2026-05-22T08:32:28.237084Z", "shell.execute_reply.started": "2026-05-22T08:32:28.215263Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateMon.Tues.Wed.Thurs.Fri.Sat.Sun.
01996-01-010.129453-0.0238361.1214601.698286-0.5985061.042221-0.726412
11996-01-02-0.094021-0.7279420.698641-1.1980401.9275051.147445-1.134103
21996-01-03-0.5608570.145222-0.9902021.2002140.7173391.117095-1.793565
31996-01-04-0.169755-0.677391-1.533519-0.343477-0.1097051.038236-0.799088
41996-01-051.344705-1.8172610.460991-0.8396330.2658140.4776590.636383
...........................
91272020-12-27-0.881800-0.074270-0.3517691.381641-0.0495481.664180-1.032204
91282020-12-28-0.1433860.198217-1.2438611.1965761.338166-0.212333-0.023131
91292020-12-290.398787-0.8487861.791707-1.167592-0.033881-0.285559-0.323477
91302020-12-300.5878460.4115801.1503800.444638-1.0935770.6054561.463345
91312020-12-310.7363500.436292-0.260171-0.066066-0.328324-0.586792-1.204582
\n", "

9132 rows × 8 columns

\n", "
" ], "text/plain": [ " Date Mon. Tues. Wed. Thurs. Fri. Sat. \\\n", "0 1996-01-01 0.129453 -0.023836 1.121460 1.698286 -0.598506 1.042221 \n", "1 1996-01-02 -0.094021 -0.727942 0.698641 -1.198040 1.927505 1.147445 \n", "2 1996-01-03 -0.560857 0.145222 -0.990202 1.200214 0.717339 1.117095 \n", "3 1996-01-04 -0.169755 -0.677391 -1.533519 -0.343477 -0.109705 1.038236 \n", "4 1996-01-05 1.344705 -1.817261 0.460991 -0.839633 0.265814 0.477659 \n", "... ... ... ... ... ... ... ... \n", "9127 2020-12-27 -0.881800 -0.074270 -0.351769 1.381641 -0.049548 1.664180 \n", "9128 2020-12-28 -0.143386 0.198217 -1.243861 1.196576 1.338166 -0.212333 \n", "9129 2020-12-29 0.398787 -0.848786 1.791707 -1.167592 -0.033881 -0.285559 \n", "9130 2020-12-30 0.587846 0.411580 1.150380 0.444638 -1.093577 0.605456 \n", "9131 2020-12-31 0.736350 0.436292 -0.260171 -0.066066 -0.328324 -0.586792 \n", "\n", " Sun. \n", "0 -0.726412 \n", "1 -1.134103 \n", "2 -1.793565 \n", "3 -0.799088 \n", "4 0.636383 \n", "... ... \n", "9127 -1.032204 \n", "9128 -0.023131 \n", "9129 -0.323477 \n", "9130 1.463345 \n", "9131 -1.204582 \n", "\n", "[9132 rows x 8 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"example.csv\")" ] }, { "cell_type": "markdown", "id": "63a23fcb", "metadata": {}, "source": [ "Wenn ihr nur eine kleine Anzahl von Zeilen lesen wollt (ohne die gesamte Datei zu lesen), könnt ihr dies mit `nrows` angeben:" ] }, { "cell_type": "code", "execution_count": 15, "id": "1c2714b7", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.238154Z", "iopub.status.busy": "2026-05-22T08:32:28.238057Z", "iopub.status.idle": "2026-05-22T08:32:28.243882Z", "shell.execute_reply": "2026-05-22T08:32:28.243585Z", "shell.execute_reply.started": "2026-05-22T08:32:28.238145Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DateMon.Tues.Wed.Thurs.Fri.Sat.Sun.
01996-01-010.129453-0.0238361.1214601.698286-0.5985061.042221-0.726412
11996-01-02-0.094021-0.7279420.698641-1.1980401.9275051.147445-1.134103
21996-01-03-0.5608570.145222-0.9902021.2002140.7173391.117095-1.793565
31996-01-04-0.169755-0.677391-1.533519-0.343477-0.1097051.038236-0.799088
41996-01-051.344705-1.8172610.460991-0.8396330.2658140.4776590.636383
51996-01-06-0.354445-0.065182-1.244963-0.5597320.042362-0.3037120.067632
61996-01-071.4609220.1644120.883960-0.8336420.0015821.1384690.561618
\n", "
" ], "text/plain": [ " Date Mon. Tues. Wed. Thurs. Fri. Sat. \\\n", "0 1996-01-01 0.129453 -0.023836 1.121460 1.698286 -0.598506 1.042221 \n", "1 1996-01-02 -0.094021 -0.727942 0.698641 -1.198040 1.927505 1.147445 \n", "2 1996-01-03 -0.560857 0.145222 -0.990202 1.200214 0.717339 1.117095 \n", "3 1996-01-04 -0.169755 -0.677391 -1.533519 -0.343477 -0.109705 1.038236 \n", "4 1996-01-05 1.344705 -1.817261 0.460991 -0.839633 0.265814 0.477659 \n", "5 1996-01-06 -0.354445 -0.065182 -1.244963 -0.559732 0.042362 -0.303712 \n", "6 1996-01-07 1.460922 0.164412 0.883960 -0.833642 0.001582 1.138469 \n", "\n", " Sun. \n", "0 -0.726412 \n", "1 -1.134103 \n", "2 -1.793565 \n", "3 -0.799088 \n", "4 0.636383 \n", "5 0.067632 \n", "6 0.561618 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"example.csv\", nrows=7)" ] }, { "cell_type": "markdown", "id": "e09a0052", "metadata": {}, "source": [ "Um eine Datei stückweise zu lesen, könnt ihr mit `chunksize` die Anzahl der Zeilen angeben:" ] }, { "cell_type": "code", "execution_count": 16, "id": "41640628", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.244276Z", "iopub.status.busy": "2026-05-22T08:32:28.244179Z", "iopub.status.idle": "2026-05-22T08:32:28.247075Z", "shell.execute_reply": "2026-05-22T08:32:28.246826Z", "shell.execute_reply.started": "2026-05-22T08:32:28.244269Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.read_csv(\"example.csv\", chunksize=1000)" ] }, { "cell_type": "markdown", "id": "a85297a5", "metadata": {}, "source": [ "Das von `read_csv` zurückgegebene `TextFileReader`-Objekt ermöglicht die Iteration über Teile der Datei entsprechend der `chunksize`. Zum Beispiel können wir über die `example.csv`-Datei iterieren und die Anzahl der Werte in der Spalte `Date` wie folgt aggregieren:" ] }, { "cell_type": "code", "execution_count": 17, "id": "5a77449b", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.247426Z", "iopub.status.busy": "2026-05-22T08:32:28.247357Z", "iopub.status.idle": "2026-05-22T08:32:28.272027Z", "shell.execute_reply": "2026-05-22T08:32:28.271722Z", "shell.execute_reply.started": "2026-05-22T08:32:28.247418Z" } }, "outputs": [], "source": [ "chunks = pd.read_csv(\"example.csv\", chunksize=1000)\n", "\n", "serie = pd.Series([], dtype=\"float64\")\n", "for chunk in chunks:\n", " values = serie.add(chunk[\"Date\"].value_counts(), fill_value=0)\n", "\n", "sorted_values = values.sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": 18, "id": "a8a46390", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.272428Z", "iopub.status.busy": "2026-05-22T08:32:28.272356Z", "iopub.status.idle": "2026-05-22T08:32:28.275157Z", "shell.execute_reply": "2026-05-22T08:32:28.274910Z", "shell.execute_reply.started": "2026-05-22T08:32:28.272421Z" } }, "outputs": [ { "data": { "text/plain": [ "Date\n", "2020-08-22 1.0\n", "2020-11-13 1.0\n", "2020-11-27 1.0\n", "2020-11-26 1.0\n", "2020-11-25 1.0\n", "2020-11-24 1.0\n", "2020-11-23 1.0\n", "2020-11-22 1.0\n", "2020-11-21 1.0\n", "2020-11-20 1.0\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted_values[:10]" ] }, { "cell_type": "markdown", "id": "899f5c81", "metadata": {}, "source": [ "`TextFileReader` verfügt außerdem über eine `get_chunk`-Methode, mit der ihr Stücke beliebiger Größe lesen könnt." ] }, { "cell_type": "markdown", "id": "588aab40", "metadata": {}, "source": [ "## DataFrame und Series als csv-Datei schreiben\n", "\n", "Daten können auch in ein mit Trennzeichen versehenes Format exportiert werden. Mit der Methode `pandas.DataFrame.to_csv` können wir die Daten in eine kommagetrennte Datei schreiben:" ] }, { "cell_type": "code", "execution_count": 19, "id": "7a3db15a", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.275821Z", "iopub.status.busy": "2026-05-22T08:32:28.275676Z", "iopub.status.idle": "2026-05-22T08:32:28.279074Z", "shell.execute_reply": "2026-05-22T08:32:28.278756Z", "shell.execute_reply.started": "2026-05-22T08:32:28.275799Z" } }, "outputs": [], "source": [ "df.to_csv(\"out.csv\")" ] }, { "cell_type": "markdown", "id": "eb1fba36", "metadata": {}, "source": [ "Natürlich können auch andere Begrenzungszeichen verwendet werden, z.B. zum Schreiben nach `sys.stdout`, so dass das Textergebnis auf der Konsole und nicht in einer Datei ausgegeben wird:" ] }, { "cell_type": "code", "execution_count": 20, "id": "cb9b7b9a", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.279543Z", "iopub.status.busy": "2026-05-22T08:32:28.279467Z", "iopub.status.idle": "2026-05-22T08:32:28.281283Z", "shell.execute_reply": "2026-05-22T08:32:28.280972Z", "shell.execute_reply.started": "2026-05-22T08:32:28.279537Z" } }, "outputs": [], "source": [ "import sys" ] }, { "cell_type": "code", "execution_count": 21, "id": "75ce3220", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.284798Z", "iopub.status.busy": "2026-05-22T08:32:28.284634Z", "iopub.status.idle": "2026-05-22T08:32:28.287314Z", "shell.execute_reply": "2026-05-22T08:32:28.286991Z", "shell.execute_reply.started": "2026-05-22T08:32:28.284757Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|Titel|Sprache|Autor*innen|Lizenz|Veröffentlichungsdatum|doi\n", "0|Python basics|en|Veit Schiele||2021-10-28|\n", "1|Jupyter Tutorial|en|Veit Schiele||2019-06-27|\n", "2|Jupyter Tutorial|de|Veit Schiele||2020-10-26|\n", "3|PyViz Tutorial|en|Veit Schiele||2020-04-13|\n" ] } ], "source": [ "df.to_csv(sys.stdout, sep=\"|\")" ] }, { "cell_type": "markdown", "id": "1036cf82", "metadata": {}, "source": [ "Fehlende Werte erscheinen in der Ausgabe als leere Zeichenketten. Möglicherweise möchtet ihr sie durch einen anderen Platzhalter kennzeichnen:" ] }, { "cell_type": "code", "execution_count": 22, "id": "f43ec7f7", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.287671Z", "iopub.status.busy": "2026-05-22T08:32:28.287581Z", "iopub.status.idle": "2026-05-22T08:32:28.290070Z", "shell.execute_reply": "2026-05-22T08:32:28.289791Z", "shell.execute_reply.started": "2026-05-22T08:32:28.287664Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",Titel,Sprache,Autor*innen,Lizenz,Veröffentlichungsdatum,doi\n", "0,Python basics,en,Veit Schiele,NaN,2021-10-28,NaN\n", "1,Jupyter Tutorial,en,Veit Schiele,NaN,2019-06-27,NaN\n", "2,Jupyter Tutorial,de,Veit Schiele,NaN,2020-10-26,NaN\n", "3,PyViz Tutorial,en,Veit Schiele,NaN,2020-04-13,NaN\n" ] } ], "source": [ "df.to_csv(sys.stdout, na_rep=\"NaN\")" ] }, { "cell_type": "markdown", "id": "268fe3e8", "metadata": {}, "source": [ "Wenn keine anderen Optionen angegeben sind, werden sowohl die Zeilen- als auch die Spaltenbeschriftungen geschrieben. Beides kann deaktiviert werden:" ] }, { "cell_type": "code", "execution_count": 23, "id": "8c8f7e94", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.290611Z", "iopub.status.busy": "2026-05-22T08:32:28.290538Z", "iopub.status.idle": "2026-05-22T08:32:28.292773Z", "shell.execute_reply": "2026-05-22T08:32:28.292540Z", "shell.execute_reply.started": "2026-05-22T08:32:28.290605Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python basics,en,Veit Schiele,,2021-10-28,\n", "Jupyter Tutorial,en,Veit Schiele,,2019-06-27,\n", "Jupyter Tutorial,de,Veit Schiele,,2020-10-26,\n", "PyViz Tutorial,en,Veit Schiele,,2020-04-13,\n" ] } ], "source": [ "df.to_csv(sys.stdout, index=False, header=False)" ] }, { "cell_type": "markdown", "id": "255cf721", "metadata": {}, "source": [ "Ihr könnt auch nur eine Teilmenge der Spalten schreiben, und zwar in einer von euch gewählten Reihenfolge:" ] }, { "cell_type": "code", "execution_count": 24, "id": "a48824f5", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.293266Z", "iopub.status.busy": "2026-05-22T08:32:28.293185Z", "iopub.status.idle": "2026-05-22T08:32:28.295957Z", "shell.execute_reply": "2026-05-22T08:32:28.295688Z", "shell.execute_reply.started": "2026-05-22T08:32:28.293259Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Titel,Sprache,Autor*innen,Veröffentlichungsdatum\n", "Python basics,en,Veit Schiele,2021-10-28\n", "Jupyter Tutorial,en,Veit Schiele,2019-06-27\n", "Jupyter Tutorial,de,Veit Schiele,2020-10-26\n", "PyViz Tutorial,en,Veit Schiele,2020-04-13\n" ] } ], "source": [ "df.to_csv(\n", " sys.stdout,\n", " index=False,\n", " columns=[\"Titel\", \"Sprache\", \"Autor*innen\", \"Veröffentlichungsdatum\"],\n", ")" ] }, { "cell_type": "markdown", "id": "0f9db2a8-7291-4db9-89bc-ef5def432dae", "metadata": {}, "source": [ "## Arbeiten mit dem csv-Modul von Python\n", "\n", "Die meisten Formen von Tabellendaten können mit Funktionen wie `pandas.read_csv` geladen werden. In einigen Fällen kann jedoch auch eine manuelle Bearbeitung erforderlich sein. Es ist nicht ungewöhnlich, eine Datei mit einer oder mehreren fehlerhaften Zeilen zu erhalten, die `read_csv` zum Scheitern bringen. Für jede Datei mit einem einstelligen Begrenzungszeichen könnt ihr das in Python eingebaute [csv](https://docs.python.org/3/library/csv.html)-Modul verwenden. Um es zu verwenden, übergebt eine offene Datei oder ein dateiähnliches Objekt an `csv.reader`:" ] }, { "cell_type": "code", "execution_count": 25, "id": "d4ed9b30-594c-4e83-a5f2-460b36cb6bab", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.296485Z", "iopub.status.busy": "2026-05-22T08:32:28.296396Z", "iopub.status.idle": "2026-05-22T08:32:28.298816Z", "shell.execute_reply": "2026-05-22T08:32:28.298574Z", "shell.execute_reply.started": "2026-05-22T08:32:28.296477Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['', 'Titel', 'Sprache', 'Autor*innen', 'Lizenz', 'Veröffentlichungsdatum', 'doi']\n", "['0', 'Python basics', 'en', 'Veit Schiele', '', '2021-10-28', '']\n", "['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', '', '2019-06-27', '']\n", "['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', '', '2020-10-26', '']\n", "['3', 'PyViz Tutorial', 'en', 'Veit Schiele', '', '2020-04-13', '']\n" ] } ], "source": [ "import csv\n", "\n", "from pathlib import Path\n", "\n", "\n", "f = Path.open(\"out.csv\")\n", "reader = csv.reader(f)\n", "\n", "for line in reader:\n", " print(line)" ] }, { "cell_type": "markdown", "id": "0ed726c4-5e09-4676-bcf0-f78e9f7a10e0", "metadata": {}, "source": [ "Mit [Sniffer.has_header](https://docs.python.org/3/library/csv.html#csv.Sniffer.has_header) wird eure csv-Datei analysiert und gibt ``True`` zurück, wenn die erste Zeile eine Reihe von Spaltenüberschriften zu sein scheint.\n", "\n", "
\n", "\n", "**Bemerkung:**\n", "\n", "Diese Methode ist nur eine grobe Heuristik und kann sowohl falsch-positive als auch falsch-negative Ergebnisse liefern.\n", "
" ] }, { "cell_type": "markdown", "id": "a19c05c1-e947-471b-8089-8e36e65b4268", "metadata": {}, "source": [ "Auch [Sniffer.sniff](https://docs.python.org/3/library/csv.html#csv.Sniffer.sniff) analysiert eure csv-Datei, gibt aber eine der folgenden Dialekt-Unterklassen zurück." ] }, { "cell_type": "code", "execution_count": 26, "id": "263a8cb4-4ae1-46f0-963f-9d2df2de45ed", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.299326Z", "iopub.status.busy": "2026-05-22T08:32:28.299250Z", "iopub.status.idle": "2026-05-22T08:32:28.302449Z", "shell.execute_reply": "2026-05-22T08:32:28.302142Z", "shell.execute_reply.started": "2026-05-22T08:32:28.299319Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['', 'Titel', 'Sprache', 'Autor*innen', 'Lizenz', 'Veröffentlichungsdatum', 'doi']\n", "['0', 'Python basics', 'en', 'Veit Schiele', '', '2021-10-28', '']\n", "['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', '', '2019-06-27', '']\n", "['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', '', '2020-10-26', '']\n", "['3', 'PyViz Tutorial', 'en', 'Veit Schiele', '', '2020-04-13', '']\n" ] } ], "source": [ "with Path.open(\"out.csv\") as f:\n", " dialect = csv.Sniffer().sniff(f.read(1024))\n", " f.seek(0)\n", " reader = csv.reader(f, dialect)\n", "\n", " for line in reader:\n", " print(line)" ] }, { "cell_type": "markdown", "id": "e70392b5", "metadata": {}, "source": [ "### Dialekte\n", "\n", "csv-Dateien gibt es in vielen verschiedenen Varianten. Das Python csv-Modul kommt bereits mit drei verschiedenen Dialekten:\n", "\n", "Parameter | [excel](https://docs.python.org/3/library/csv.html#csv.excel) | [excel-tab](https://docs.python.org/3/library/csv.html#csv.excel_tab) | [unix](https://docs.python.org/3/library/csv.html#csv.unix_dialect)\n", ":--- | :--- | :--- | :--- \n", "`delimiter` | `','` | `'\\t'` | `','` |\n", "`quotechar` | `'\"'` | `'\"'` | ` '\"'` |\n", "`doublequote` | `True` | `True` | `True` |\n", "`skipinitialspace` | `False` | `False` | `False` |\n", "`lineterminator` | `'\\r\\n'` | `'\\r\\n'` | `'\\n'` |\n", "`quoting` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_MINIMAL` | `csv.QUOTE_ALL` |\n", "`escapechar` | `None` | `None` | `None` |\n", "\n", "Ihr könnt damit auch euer eigenes Format definieren mit einem anderen Trennzeichen, einer anderen Zeichenkettenkonvention oder einem anderen Zeilenendezeichen. Hierfür empfiehlt sich die Registrierung eines eigenen Dialekts. Mögliche Optionen und Funktionen von `csv.register_dialect` sind:\n", "\n", "Argument | Beschreibung\n", ":------- | :-----------\n", "`delimiter` | Ein-Zeichen-Zeichenfolge zur Trennung von Feldern; Standardwert ist `,`.\n", "`lineterminator` | Zeilenabschlusszeichen zum Schreiben; Standardwert ist `\\r\\n`. Reader ignoriert dies und erkennt plattformübergreifende Zeilenbegrenzer.\n", "`quotechar` | Anführungszeichen für Felder mit Sonderzeichen (wie ein Trennzeichen); Standardwert ist `\"`.\n", "`quoting` | Zitier-Konvention. Zu den Optionen gehören `csv.QUOTE_ALL` – alle Felder zitieren, `csv.QUOTE_MINIMAL` – nur Felder mit Sonderzeichen wie dem Begrenzungszeichen, `csv.QUOTE_NONNUMERIC` und `csv.QUOTE_NONE` – keine Zitate. Der Standardwert ist `QUOTE_MINIMAL`.\n", "`skipinitialspace` | Leerzeichen nach jedem Begrenzungszeichen ignorieren; Standardwert ist `False`.\n", "`doublequote` | bei `True` werden Anfürhuntszeichen innerhalb eines Feldes verdoppelt.\n", "`escapechar` | Zeichenkette, um das Trennzeichen zu umgehen, wenn `quoting` auf `csv.QUOTE_NONE` gesetzt ist; standardmäßig deaktiviert." ] }, { "cell_type": "code", "execution_count": 27, "id": "8d765adf", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.303021Z", "iopub.status.busy": "2026-05-22T08:32:28.302929Z", "iopub.status.idle": "2026-05-22T08:32:28.304817Z", "shell.execute_reply": "2026-05-22T08:32:28.304517Z", "shell.execute_reply.started": "2026-05-22T08:32:28.303013Z" } }, "outputs": [], "source": [ "csv.register_dialect(\n", " \"my_csv_dialect\",\n", " lineterminator=\"\\n\",\n", " delimiter=\",\",\n", " quotechar=\"'\",\n", " quoting=csv.QUOTE_MINIMAL,\n", ")" ] }, { "cell_type": "markdown", "id": "e9ab0d31", "metadata": {}, "source": [ "Nun kann die CSV-Datei geöffnet werden mit:" ] }, { "cell_type": "code", "execution_count": 28, "id": "69fff7dd", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.305369Z", "iopub.status.busy": "2026-05-22T08:32:28.305234Z", "iopub.status.idle": "2026-05-22T08:32:28.307437Z", "shell.execute_reply": "2026-05-22T08:32:28.307130Z", "shell.execute_reply.started": "2026-05-22T08:32:28.305362Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['', 'Titel', 'Sprache', 'Autor*innen', 'Lizenz', 'Veröffentlichungsdatum', 'doi']\n", "['0', 'Python basics', 'en', 'Veit Schiele', '', '2021-10-28', '']\n", "['1', 'Jupyter Tutorial', 'en', 'Veit Schiele', '', '2019-06-27', '']\n", "['2', 'Jupyter Tutorial', 'de', 'Veit Schiele', '', '2020-10-26', '']\n", "['3', 'PyViz Tutorial', 'en', 'Veit Schiele', '', '2020-04-13', '']\n" ] } ], "source": [ "with Path.open(\"out.csv\") as f:\n", " reader = csv.reader(f, \"my_csv_dialect\")\n", " for line in reader:\n", " print(line)" ] }, { "cell_type": "markdown", "id": "55ac6d15", "metadata": {}, "source": [ "Dann können wir ein Dict mit Datenspalten erstellen, indem wir [Dict Comprehensions](https://peps.python.org/pep-0274/) verwenden und mit [zip](https://docs.python.org/3/library/functions.html#zip) über die Werte aus `Values` iterieren. Beachtet dabei, dass dies bei großen Dateien viel Speicherplatz benötigt, da hierfürdie Zeilen in Spalten umgewandelt werden:" ] }, { "cell_type": "code", "execution_count": 29, "id": "e9c0a9c2", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.307851Z", "iopub.status.busy": "2026-05-22T08:32:28.307790Z", "iopub.status.idle": "2026-05-22T08:32:28.310861Z", "shell.execute_reply": "2026-05-22T08:32:28.310496Z", "shell.execute_reply.started": "2026-05-22T08:32:28.307845Z" } }, "outputs": [ { "data": { "text/plain": [ "{'': ('0', '1', '2', '3'),\n", " 'Titel': ('Python basics',\n", " 'Jupyter Tutorial',\n", " 'Jupyter Tutorial',\n", " 'PyViz Tutorial'),\n", " 'Sprache': ('en', 'en', 'de', 'en'),\n", " 'Autor*innen': ('Veit Schiele',\n", " 'Veit Schiele',\n", " 'Veit Schiele',\n", " 'Veit Schiele'),\n", " 'Lizenz': ('', '', '', ''),\n", " 'Veröffentlichungsdatum': ('2021-10-28',\n", " '2019-06-27',\n", " '2020-10-26',\n", " '2020-04-13'),\n", " 'doi': ('', '', '', '')}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with Path.open(\"out.csv\") as f:\n", " reader = csv.reader(f, \"my_csv_dialect\")\n", " lines = list(reader)\n", " header, values = lines[0], lines[1:]\n", " data_dict = {\n", " h: v for h, v in zip(header, zip(*values, strict=True), strict=True)\n", " }\n", "\n", "data_dict" ] }, { "cell_type": "markdown", "id": "438ba9f5", "metadata": {}, "source": [ "Um Dateien mit Trennzeichen manuell zu schreiben, könnt ihr `csv.writer` verwenden. Er akzeptiert ein offenes, beschreibbares Dateiobjekt und die gleichen Dialekt- und Formatoptionen wie `csv.reader`:" ] }, { "cell_type": "code", "execution_count": 30, "id": "5a43af52", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.311223Z", "iopub.status.busy": "2026-05-22T08:32:28.311157Z", "iopub.status.idle": "2026-05-22T08:32:28.313159Z", "shell.execute_reply": "2026-05-22T08:32:28.312913Z", "shell.execute_reply.started": "2026-05-22T08:32:28.311217Z" } }, "outputs": [], "source": [ "with Path.open(\"new.csv\", \"w\") as f:\n", " writer = csv.writer(f, \"my_csv_dialect\")\n", " writer.writerow((\"\", \"Titel\", \"Sprache\", \"Autor*innen\"))\n", " writer.writerow((\"1\", \"Python basics\", \"en\", \"Veit Schiele\"))\n", " writer.writerow((\"2\", \"Jupyter Tutorial\", \"en\", \"Veit Schiele\"))" ] }, { "cell_type": "code", "execution_count": 31, "id": "a65c4cef", "metadata": { "execution": { "iopub.execute_input": "2026-05-22T08:32:28.313548Z", "iopub.status.busy": "2026-05-22T08:32:28.313470Z", "iopub.status.idle": "2026-05-22T08:32:28.315780Z", "shell.execute_reply": "2026-05-22T08:32:28.315520Z", "shell.execute_reply.started": "2026-05-22T08:32:28.313542Z" } }, "outputs": [ { "data": { "text/plain": [ "[',Titel,Sprache,Autor*innen\\n',\n", " '1,Python basics,en,Veit Schiele\\n',\n", " '2,Jupyter Tutorial,en,Veit Schiele\\n']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(Path.open(\"new.csv\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }