{ "cells": [ { "cell_type": "markdown", "id": "ccf4ea1c", "metadata": {}, "source": [ "# Dask\n", "\n", "Dask serves two distinct purposes:\n", "\n", "1. It optimises dynamic task scheduling, similar to [Airflow](https://airflow.apache.org/), [Luigi](https://github.com/spotify/luigi) or [Celery](https://docs.celeryq.dev/en/stable/).\n", "2. It runs arrays, dataframes and lists in parallel on top of this dynamic task scheduling." ] }, { "cell_type": "markdown", "id": "d6041d52", "metadata": {}, "source": [ "## Scaling from laptops to clusters\n", "\n", "\n", "Dask can be installed on a laptop with uv and extends the size of the datasets you can handle from *fits in memory* to *fits on disk*. Dask can also be scaled out to a cluster with hundreds of machines. Dask is robust, flexible, data-local and has low latency. For more information, see the documentation for the [Distributed Scheduler](https://distributed.dask.org/en/latest/). This easy transition between a single machine and a cluster lets you start simply and grow as needed." ] }, { "cell_type": "markdown", "id": "4e332597", "metadata": {}, "source": [ "## Installing Dask\n", "\n", "\n", "You can install everything that is required for the most common uses of Dask (arrays, dataframes, …). 
This installs Dask together with dependencies such as NumPy and pandas that are needed for different workloads:\n", "\n", "``` bash\n", "$ uv add \"dask[complete]\"\n", "```\n", "\n", "Alternatively, you can install only individual subsets:\n", "\n", "``` bash\n", "$ uv add \"dask[array]\"\n", "$ uv add \"dask[dataframe]\"\n", "$ uv add \"dask[diagnostics]\"\n", "$ uv add \"dask[distributed]\"\n", "```" ] }, { "cell_type": "markdown", "id": "7d944110", "metadata": {}, "source": [ "## Familiar user interface" ] }, { "cell_type": "markdown", "id": "5acd42b5", "metadata": {}, "source": [ "### Dask DataFrame\n", "\n", "… mimics pandas" ] }, { "cell_type": "code", "execution_count": 1, "id": "6a83265c", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:37.872209Z", "iopub.status.busy": "2026-05-21T22:34:37.872030Z", "iopub.status.idle": "2026-05-21T22:34:38.096678Z", "shell.execute_reply": "2026-05-21T22:34:38.096417Z", "shell.execute_reply.started": "2026-05-21T22:34:37.872189Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>2021-12</th><th>2022-01</th><th>2022-02</th></tr>\n",
"    <tr><th>Title</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>Jupyter Tutorial</th><td>0.5</td><td>18103.5</td><td>20505.5</td><td>13099.0</td></tr>\n",
"    <tr><th>PyViz Tutorial</th><td>2.0</td><td>4873.0</td><td>3930.0</td><td>2573.0</td></tr>\n",
"    <tr><th>Python Basics</th><td>4.5</td><td>261.0</td><td>251.0</td><td>341.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Unnamed: 0 2021-12 2022-01 2022-02\n", "Title \n", "Jupyter Tutorial 0.5 18103.5 20505.5 13099.0\n", "PyViz Tutorial 2.0 4873.0 3930.0 2573.0\n", "Python Basics 4.5 261.0 251.0 341.0" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "\n", "df = pd.read_csv(\"tutorials.csv\")\n", "grouped = df.groupby(\"Title\")\n", "grouped.agg(\"mean\")" ] }, { "cell_type": "code", "execution_count": 2, "id": "ebefec46", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:38.097118Z", "iopub.status.busy": "2026-05-21T22:34:38.097014Z", "iopub.status.idle": "2026-05-21T22:34:38.590778Z", "shell.execute_reply": "2026-05-21T22:34:38.590476Z", "shell.execute_reply.started": "2026-05-21T22:34:38.097108Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Unnamed: 0</th><th>2021-12</th><th>2022-01</th><th>2022-02</th></tr>\n",
"    <tr><th>Title</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>Jupyter Tutorial</th><td>0.5</td><td>18103.5</td><td>20505.5</td><td>13099.0</td></tr>\n",
"    <tr><th>PyViz Tutorial</th><td>2.0</td><td>4873.0</td><td>3930.0</td><td>2573.0</td></tr>\n",
"    <tr><th>Python Basics</th><td>4.5</td><td>261.0</td><td>251.0</td><td>341.0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " Unnamed: 0 2021-12 2022-01 2022-02\n", "Title \n", "Jupyter Tutorial 0.5 18103.5 20505.5 13099.0\n", "PyViz Tutorial 2.0 4873.0 3930.0 2573.0\n", "Python Basics 4.5 261.0 251.0 341.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.dataframe as dd\n", "\n", "\n", "df = dd.read_csv(\"tutorials.csv\")\n", "grouped = df.groupby(\"Title\")\n", "grouped.agg(\"mean\").head()" ] }, { "cell_type": "markdown", "id": "c904e314", "metadata": {}, "source": [ "
\n", "\n", "**See also**\n", "\n", "* [Dask DataFrame Docs](https://docs.dask.org/en/latest/dataframe.html)\n", "* [Dask DataFrame Best Practices](https://docs.dask.org/en/latest/dataframe-best-practices.html)\n", "
" ] }, { "cell_type": "markdown", "id": "bd4cdd77", "metadata": {}, "source": [ "### Dask Array\n", "\n", "… mimics NumPy" ] }, { "cell_type": "code", "execution_count": 3, "id": "7c11efbb", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:38.591460Z", "iopub.status.busy": "2026-05-21T22:34:38.591349Z", "iopub.status.idle": "2026-05-21T22:34:38.656401Z", "shell.execute_reply": "2026-05-21T22:34:38.655974Z", "shell.execute_reply.started": "2026-05-21T22:34:38.591452Z" } }, "outputs": [], "source": [ "import h5py\n", "import numpy as np\n", "\n", "\n", "f = h5py.File(\"mydata.h5\")\n", "x = np.array(f[\".\"])" ] }, { "cell_type": "code", "execution_count": 4, "id": "c1b99ffd", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:38.657068Z", "iopub.status.busy": "2026-05-21T22:34:38.656779Z", "iopub.status.idle": "2026-05-21T22:34:38.659505Z", "shell.execute_reply": "2026-05-21T22:34:38.659257Z", "shell.execute_reply.started": "2026-05-21T22:34:38.657059Z" } }, "outputs": [], "source": [ "import dask.array as da\n", "\n", "\n", "f = h5py.File(\"mydata.h5\")\n", "x = da.array(f[\".\"])" ] }, { "cell_type": "markdown", "id": "fd485220", "metadata": {}, "source": [ "
\n", "\n", "**See also**\n", "\n", "* [Dask Array Docs](https://docs.dask.org/en/latest/array.html)\n", "* [Dask Array Best Practices](https://docs.dask.org/en/latest/array-best-practices.html)\n", "
" ] }, { "cell_type": "markdown", "id": "601abdeb", "metadata": {}, "source": [ "### Dask Bag\n", "\n", "… mimics [iterators](https://docs.python.org/3/library/itertools.html), [Toolz](https://toolz.readthedocs.io/en/latest/index.html) and [PySpark](https://spark.apache.org/docs/latest/api/python/)." ] }, { "cell_type": "code", "execution_count": 5, "id": "23c242b9", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:38.659951Z", "iopub.status.busy": "2026-05-21T22:34:38.659875Z", "iopub.status.idle": "2026-05-21T22:34:38.816074Z", "shell.execute_reply": "2026-05-21T22:34:38.815523Z", "shell.execute_reply.started": "2026-05-21T22:34:38.659943Z" } }, "outputs": [ { "data": { "text/plain": [ "[11, 10]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask.bag as db\n", "\n", "\n", "b = db.from_sequence([10, 3, 5, 7, 11, 4])\n", "list(b.topk(2))" ] }, { "cell_type": "markdown", "id": "9b9d965f", "metadata": {}, "source": [ "
\n", "\n", "**See also**\n", "\n", "* [Dask Bag Docs](https://docs.dask.org/en/latest/bag.html)\n", "
" ] }, { "cell_type": "markdown", "id": "b1b1b4d3", "metadata": {}, "source": [ "### Dask Delayed\n", "\n", "… mimics loops and wraps custom code, see also [Creating a delayed pipeline](https://www.python4data.science/de/latest/clean-prep/dask-pipeline.html#5.-Creating-a-delayed-pipeline)." ] }, { "cell_type": "markdown", "id": "23063451", "metadata": {}, "source": [ "
\n", "\n", "**See also**\n", "\n", "* [Dask Delayed Docs](https://docs.dask.org/en/latest/delayed.html)\n", "* [Dask Delayed Best Practices](https://docs.dask.org/en/latest/delayed-best-practices.html)\n", "* [Dask pipeline example: Tracking the International Space Station with Dask](../clean-prep/dask-pipeline.ipynb)\n", "
" ] }, { "cell_type": "markdown", "id": "3f4a6539", "metadata": {}, "source": [ "## ``concurrent.futures``\n", "\n", "This interface allows you to submit custom tasks.\n", "\n", "
\n", "\n", "**Note**\n", "\n", "The following example requires Dask to be installed with the `distributed` option, for example:\n", "\n", "``` bash\n", "$ uv add \"dask[distributed]\"\n", "```\n", "
" ] }, { "cell_type": "code", "execution_count": 6, "id": "582f1a03-9ce7-4c00-8ff9-1db8db9bb440", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:38.818251Z", "iopub.status.busy": "2026-05-21T22:34:38.818082Z", "iopub.status.idle": "2026-05-21T22:34:38.821299Z", "shell.execute_reply": "2026-05-21T22:34:38.820958Z", "shell.execute_reply.started": "2026-05-21T22:34:38.818237Z" } }, "outputs": [], "source": [ "from dask.distributed import Client" ] }, { "cell_type": "code", "execution_count": 7, "id": "0bcef1bd-4c9b-47cc-a9df-19c911b9178d", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:38.821700Z", "iopub.status.busy": "2026-05-21T22:34:38.821585Z", "iopub.status.idle": "2026-05-21T22:34:39.507808Z", "shell.execute_reply": "2026-05-21T22:34:39.507242Z", "shell.execute_reply.started": "2026-05-21T22:34:38.821685Z" } }, "outputs": [], "source": [ "client = Client()" ] }, { "cell_type": "markdown", "id": "4ad650ae-2f96-4fc9-b6b4-827dd37033b8", "metadata": {}, "source": [ "This starts the local workers as processes. 
To run the local workers as threads instead, you can pass `processes=False` as a parameter:" ] }, { "cell_type": "code", "execution_count": 8, "id": "1cf795e8-f0f1-4f4c-b3b5-fb7a56ea603b", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:39.508543Z", "iopub.status.busy": "2026-05-21T22:34:39.508423Z", "iopub.status.idle": "2026-05-21T22:34:39.535730Z", "shell.execute_reply": "2026-05-21T22:34:39.535423Z", "shell.execute_reply.started": "2026-05-21T22:34:39.508531Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/veit/cusy/trn/jupyter-tutorial/uvenvs/py313/.venv/lib/python3.13/site-packages/distributed/node.py:187: UserWarning: Port 8787 is already in use.\n", "Perhaps you already have a cluster running?\n", "Hosting the HTTP server on port 62320 instead\n", " warnings.warn(\n" ] } ], "source": [ "client = Client(processes=False)" ] }, { "cell_type": "markdown", "id": "e581bea6-62db-41a4-be90-7419b1b06786", "metadata": {}, "source": [ "Now you can run your own tasks and chain dependencies using the `submit` method:" ] }, { "cell_type": "code", "execution_count": 9, "id": "c7375faf-a322-413a-a885-4ebfafad8a9f", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:39.536239Z", "iopub.status.busy": "2026-05-21T22:34:39.536142Z", "iopub.status.idle": "2026-05-21T22:34:39.539363Z", "shell.execute_reply": "2026-05-21T22:34:39.539017Z", "shell.execute_reply.started": "2026-05-21T22:34:39.536230Z" } }, "outputs": [], "source": [ "from math import pi\n", "\n", "\n", "def inc(x):\n", "    return x + 1\n", "\n", "\n", "def circumference(x):\n", "    return 2 * pi * x\n", "\n", "\n", "increments = client.submit(inc, 10)\n", "circumferences = client.submit(circumference, increments)" ] }, { "cell_type": "code", "execution_count": 10, "id": "cfa665a0", "metadata": { "execution": { "iopub.execute_input": "2026-05-21T22:34:39.539862Z", "iopub.status.busy": 
"2026-05-21T22:34:39.539790Z", "iopub.status.idle": "2026-05-21T22:34:39.553835Z", "shell.execute_reply": "2026-05-21T22:34:39.553550Z", "shell.execute_reply.started": "2026-05-21T22:34:39.539855Z" } }, "outputs": [ { "data": { "text/plain": [ "69.11503837897544" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "circumferences.result()" ] }, { "cell_type": "markdown", "id": "acf78713", "metadata": {}, "source": [ "
\n", "\n", "**See also**\n", "\n", "* [Dask Futures Docs](https://docs.dask.org/en/latest/futures.html)\n", "* [Dask Futures Quickstart](https://distributed.dask.org/en/latest/quickstart.html)\n", "* [Dask Futures Examples](https://examples.dask.org/futures.html)\n", "
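
The `submit` calls shown earlier follow the interface of the standard library module `concurrent.futures`. As a point of comparison, here is a minimal sketch of the same two-step chain using only `concurrent.futures.ThreadPoolExecutor`, so it runs without a Dask cluster. It is an illustration of the shared interface, not a description of how Dask works internally. The variable names `executor`, `increment` and `circle` are chosen for this sketch only.

```python
from concurrent.futures import ThreadPoolExecutor
from math import pi


def inc(x):
    return x + 1


def circumference(x):
    return 2 * pi * x


# A thread pool stands in for the Dask client in this sketch.
with ThreadPoolExecutor(max_workers=2) as executor:
    increment = executor.submit(inc, 10)
    # Executor.submit does not unwrap futures passed as arguments,
    # so the dependency is resolved explicitly with result().
    circle = executor.submit(circumference, increment.result())
    result = circle.result()

print(result)
```

The printed value is the circumference for a radius of 11, the same quantity the Dask version computes. The visible difference is that the Dask client accepts the future `increments` directly as an argument to the next `submit` call, whereas the standard-library executor needs the intermediate `result()` call.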
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.13 Kernel", "language": "python", "name": "python313" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }