{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# String-Vergleiche\n", "\n", "In diesem Notebook verwenden wir die beliebte Bibliothek für String-Vergleiche [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy). Sie basiert auf der eingabauten Python-Bibliothek [difflib](https://docs.python.org/3/library/difflib.html). Weitere Informationen zu den verschiedenen verfügbaren Methoden und ihren Unterschieden findet ihr im Blogbeitrag [FuzzyWuzzy: Fuzzy String Matching in Python](https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/). \n", "\n", "> **Siehe auch:**\n", "> \n", "> [textacy](https://github.com/chartbeat-labs/textacy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Installation\n", "\n", "Mit [Spack](../productive/envs/spack/index.rst) könnt ihr `fuzzywuzzy` und die optionale `python-levenshtein`-Bibliothek in eurem Kernel bereitstellen:\n", "\n", "```console\n", "$ spack env activate python-311\n", "$ spack install py-fuzzywuzzy+speedup\n", "```\n", "\n", "Alternativ könnt ihr die beiden Bibliotheken auch mit anderen Paketmanagern installieren, z.B.\n", "\n", "```console\n", "$ uv add fuzzywuzzy[speedup]\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Import" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from fuzzywuzzy import fuzz, process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Beispiel" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "berlin = [\"Berlin, Germany\", \"Berlin, Deutschland\", \"Berlin\", \"Berlin, DE\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String-Ähnlichkeit\n", "\n", "Die Übereinstimmung der erstem beiden Strings `'Berlin, Germany'` und `'Berlin, Deutschland'` scheint gering:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "65" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.ratio(berlin[0], berlin[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Partielle String-Ähnlichkeit\n", "\n", "Inkonsistente Teilzeichenfolgen sind ein häufiges Problem. Um dies zu umgehen, verwendet fuzzywuzzy eine Heuristik, die als _best partial_ bezeichnet wird." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "60" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.partial_ratio(berlin[0], berlin[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Token-Sortierung\n", "\n", "Bei der Token-Sortierung wird die betreffende Zeichenfolge mit einem Token versehen, die Token alphabetisch sortiert und anschließend wieder zu einer Zeichenfolge zusammengefügt, beispielsweise:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "48" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.ratio(berlin[1], berlin[2])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzz.token_set_ratio(berlin[1], berlin[2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Weitere Informationen" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "fuzz.ratio?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extrahieren aus einer Liste" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "choices = [\n", " \"Germany\",\n", " \"Deutschland\",\n", " \"France\",\n", " \"United Kingdom\",\n", " \"Great Britain\",\n", " \"United States\",\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Deutschland', 90), ('Germany', 45)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extract(\"DE\", choices, limit=2)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('United Kingdom', 51),\n", " ('United States', 41),\n", " ('Germany', 39),\n", " ('Great Britain', 35),\n", " ('Deutschland', 31)]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extract(\"Vereinigtes Königreich\", choices)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('France', 62)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extractOne(\"frankreich\", choices)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('United States', 86)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "process.extractOne(\"U.S.\", choices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bekannte Ports\n", "\n", "FuzzyWuzzy wird auch in andere Sprachen portiert! Hier einige bekannte Ports:\n", "\n", "* Java: [xpresso](https://github.com/WantedTechnologies/xpresso)\n", "* Java: [xdrop fuzzywuzzy](https://github.com/xdrop/fuzzywuzzy)\n", "* Rust: [fuzzyrusty](https://github.com/logannc/fuzzywuzzy-rs)\n", "* JavaScript: [fuzzball.js](https://github.com/nol13/fuzzball.js)\n", "* C++: [tmplt fuzzywuzzy](https://github.com/Tmplt/fuzzywuzzy)\n", "* C#: [FuzzySharp](https://github.com/BoomTownRoi/BoomTown.FuzzySharp)\n", "* Go: [go-fuzzywuzzy](https://github.com/paul-mannino/go-fuzzywuzzy)\n", "* Pascal: [FuzzyWuzzy.pas](https://github.com/DavidMoraisFerreira/FuzzyWuzzy.pas)\n", "* Kotlin: [FuzzyWuzzy-Kotlin](https://github.com/willowtreeapps/fuzzywuzzy-kotlin)\n", "* R: [fuzzywuzzyR](https://github.com/mlampros/fuzzywuzzyR)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.11 Kernel", "language": "python", "name": "python311" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }