{ "cells": [ { "cell_type": "markdown", "id": "traditional-right", "metadata": {}, "source": [ "```{figure} ../images/banner.png\n", "---\n", "align: center\n", "name: banner\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "contained-yellow", "metadata": {}, "source": [ "# Chapter 5 : String Columns" ] }, { "cell_type": "markdown", "id": "acoustic-factory", "metadata": {}, "source": [ "## Chapter Learning Objectives" ] }, { "cell_type": "markdown", "id": "improving-general", "metadata": {}, "source": [ "- Various data operations on columns containing string. " ] }, { "cell_type": "markdown", "id": "thorough-employment", "metadata": {}, "source": [ "## Chapter Outline\n", "\n", "- [1. Various data operations on columns containing string ](#1)\n", " - [1a. How to split a string?](#2)\n", " - [1b. How to slice a string?](#3)\n", " - [1d. How to convert lowercase to uppercase?](#4)\n", " - [1c. How to convert uppercase to lowercase?](#5)\n", " - [1e. How to extract a specific group matched by a Java regex?](#6)\n", " - [1f. How to replace a specific group matched by a Java regex?](#7)" ] }, { "cell_type": "markdown", "id": "antique-merit", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 77, "id": "increased-israel", "metadata": { "tags": [ "hide_input" ] }, "outputs": [], "source": [ "# import panel as pn\n", "# css = \"\"\"\n", "# div.special_table + table, th, td {\n", "# border: 3px solid orange;\n", "# }\n", "# \"\"\"\n", "# pn.extension(raw_css=[css])\n", "import pyspark\n", "from pyspark.sql import SparkSession\n", "spark = SparkSession \\\n", " .builder \\\n", " .appName(\"Python Spark SQL basic example\") \\\n", " .config(\"spark.some.config.option\", \"some-value\") \\\n", " .getOrCreate()\n", "from IPython.display import display_html\n", "import pandas as pd \n", "import numpy as np\n", "def display_side_by_side(*args):\n", " html_str=''\n", " for df in args:\n", " html_str+=df.to_html(index=False)\n", " html_str+= \"\\xa0\\xa0\\xa0\"*10\n", " display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)\n", "space = \"\\xa0\" * 10" ] }, { "cell_type": "markdown", "id": "polar-somalia", "metadata": {}, "source": [ "## Chapter Outline - Gallery" ] }, { "cell_type": "markdown", "id": "proof-birth", "metadata": {}, "source": [ "click on | any image\n", "---: |:--- \n", "[![alt](img/chapter5/1.png)](#2)| [![alt](img/chapter5/2.png)](#3)\n", "[![alt](img/chapter5/3.png)](#4) | [![alt](img/chapter5/4.png)](#5)\n", "[![alt](img/chapter5/5.png)](#6) | [![alt](img/chapter5/6.png)](#7)\n" ] }, { "cell_type": "markdown", "id": "medieval-henry", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "prime-lying", "metadata": {}, "source": [ "## 1a. How to split a string?\n", "\n", "\n", "\n", "```{figure} img/chapter5/1.png\n", "---\n", "align: center\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "mighty-appraisal", "metadata": {}, "source": [ "Lets first understand the syntax\n", "\n", "```{admonition} Syntax\n", "pyspark.sql.functions.split(str, pattern, limit=-1)\n", "\n", "Splits str around matches of the given pattern.\n", "\n", "Parameters:\n", "\n", "- str : a string expression to split\n", "\n", "- pattern : a string representing a regular expression. The regex string should be a Java regular expression.\n", "\n", "- limit : an integer which controls the number of times pattern is applied.\n", "\n", "limit > 0: The resulting array’s length will not be more than limit, and the\n", "resulting array’s last entry will contain all input beyond the last matched pattern.\n", "\n", "limit <= 0: pattern will be applied as many times as possible, and the resulting\n", "array can be of any size.\n", "'''" ] }, { "cell_type": "markdown", "id": "authentic-guarantee", "metadata": {}, "source": [ "Input: Spark data frame with a column having a string" ] }, { "cell_type": "code", "execution_count": 78, "id": "essential-water", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------+\n", "| string|\n", "+-------------+\n", "|abc__def__ghc|\n", "+-------------+\n", "\n" ] } ], "source": [ "df_string = spark.createDataFrame([('abc__def__ghc',)], ['string',])\n", "df_string.show()" ] }, { "cell_type": "markdown", "id": "detailed-punch", "metadata": {}, "source": [ "Output : Spark data frame with a column with a split string" ] }, { "cell_type": "code", "execution_count": 79, "id": "brown-runner", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------------+\n", "| split_string|\n", "+---------------+\n", "|[abc, def, ghc]|\n", "+---------------+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import split\n", "df_split = df_string.select(split(df_string.string,'__').alias('split_string'))\n", "df_split.show()" ] }, { "cell_type": "markdown", "id": "recognized-integer", "metadata": {}, "source": [ " Summary:" ] }, { "cell_type": "code", "execution_count": 80, "id": "active-nation", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Output\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
string
abc__def__ghc
                              \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
split_string
[abc, def, ghc]
                              " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"Input \", \"Output\")\n", "display_side_by_side(df_string.toPandas(),df_split.toPandas())" ] }, { "cell_type": "code", "execution_count": null, "id": "inclusive-trial", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "other-archive", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "charming-hydrogen", "metadata": {}, "source": [ "## 1b. How to slice a string?\n", "\n", "\n", "```{figure} img/chapter5/4.png\n", "---\n", "align: center\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "assured-sapphire", "metadata": {}, "source": [ "Lets first understand the syntax\n", "\n", "```{admonition} Syntax\n", "\n", "pyspark.sql.functions.slice(x, start, length)\n", "\n", "Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.\n", "\n", "Parameters:\n", "\n", "- x : the array to be sliced\n", "\n", "- start : the starting index\n", "\n", "- length : the length of the slice\n", "'''" ] }, { "cell_type": "markdown", "id": "million-jaguar", "metadata": {}, "source": [ "Input: Spark data frame with a column having a string" ] }, { "cell_type": "code", "execution_count": 81, "id": "collect-oakland", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "| string|\n", "+---------+\n", "|abcdefghi|\n", "+---------+\n", "\n" ] } ], "source": [ "df_string = spark.createDataFrame([('abcdefghi',)], ['string',])\n", "df_string.show()" ] }, { "cell_type": "markdown", "id": "liquid-thomson", "metadata": {}, "source": [ "Output : Spark data frame with a column with a sliced string" ] }, { "cell_type": "code", "execution_count": 82, "id": "addressed-heritage", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|substring|\n", "+---------+\n", "| abcd|\n", "+---------+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import substring\n", "df_sub = df_string.select(substring(df_string.string,1,4).alias('substring'))\n", "df_sub.show()" ] }, { "cell_type": "markdown", "id": "cloudy-setup", "metadata": {}, "source": [ " Summary:" ] }, { "cell_type": "code", "execution_count": 83, "id": "alone-bidding", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Output\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
string
abcdefghi
                              \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
substring
abcd
                              " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"Input \", \"Output\")\n", "display_side_by_side(df_string.toPandas(),df_sub.toPandas())" ] }, { "cell_type": "markdown", "id": "democratic-disclaimer", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "chubby-source", "metadata": {}, "source": [ "## 1c. How to convert lowercase to uppercase?\n", "\n", "\n", "```{figure} img/chapter5/3.png\n", "---\n", "align: center\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "improving-passage", "metadata": {}, "source": [ "Lets first understand the syntax\n", "\n", "Converts a string expression to upper case.\n", "\n", "```{admonition} Syntax\n", "pyspark.sql.functions.upper(col)\n", "\n", "\n", "Converts a string expression to lower case.\n", "\n", "Parameters:\n", "\n", "- col : column\n", "'''" ] }, { "cell_type": "markdown", "id": "hungry-watson", "metadata": {}, "source": [ "Input: Spark data frame with a column having a lowercase string" ] }, { "cell_type": "code", "execution_count": 84, "id": "daily-portable", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|uppercase|\n", "+---------+\n", "|ABCDEFGHI|\n", "+---------+\n", "\n" ] } ], "source": [ "df_upper = spark.createDataFrame([('ABCDEFGHI',)], ['uppercase',])\n", "df_upper.show()" ] }, { "cell_type": "markdown", "id": "planned-terry", "metadata": {}, "source": [ "Output : Spark data frame with a column having a uppercase string" ] }, { "cell_type": "code", "execution_count": 85, "id": "collected-auckland", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|lowercase|\n", "+---------+\n", "|abcdefghi|\n", "+---------+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import lower\n", "df_lower= df_upper.select(lower(df_upper.uppercase).alias('lowercase'))\n", "df_lower.show()" ] }, { "cell_type": "markdown", "id": "large-rally", "metadata": {}, "source": [ " Summary:" ] }, { "cell_type": "code", "execution_count": 86, "id": "judicial-summer", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input output\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
uppercase
ABCDEFGHI
                              \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lowercase
abcdefghi
                              " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"input \", \"output\")\n", "display_side_by_side(df_upper.toPandas(),df_lower.toPandas())" ] }, { "cell_type": "code", "execution_count": null, "id": "established-strategy", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "environmental-complexity", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "amateur-puppy", "metadata": {}, "source": [ "## 1d. How to convert uppercase to lowercase?\n", "\n", "\n", "```{figure} img/chapter5/2.png\n", "---\n", "align: center\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "boring-pennsylvania", "metadata": {}, "source": [ "Lets first understand the syntax\n", "\n", "```{admonition} Syntax\n", "pyspark.sql.functions.lower(col)\n", "\n", "Converts a string expression to lower case.\n", "\n", "Parameters:\n", "\n", "- col : column\n", "'''" ] }, { "cell_type": "markdown", "id": "convinced-government", "metadata": {}, "source": [ "Input: Spark data frame with a column having a string" ] }, { "cell_type": "code", "execution_count": 87, "id": "infrared-roads", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|lowercase|\n", "+---------+\n", "|abcdefghc|\n", "+---------+\n", "\n" ] } ], "source": [ "df_string = spark.createDataFrame([('abcdefghc',)], ['lowercase',])\n", "df_string.show()" ] }, { "cell_type": "markdown", "id": "expanded-chair", "metadata": {}, "source": [ "Output : Spark data frame with a column with a split string" ] }, { "cell_type": "code", "execution_count": 88, "id": "hearing-elephant", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+\n", "|uppercase|\n", "+---------+\n", "|ABCDEFGHC|\n", "+---------+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import upper\n", "df_upper= df_string.select(upper(df_string.lowercase).alias('uppercase'))\n", "df_upper.show()" ] }, { "cell_type": "markdown", "id": "tested-pledge", "metadata": {}, "source": [ " Summary:" ] }, { "cell_type": "code", "execution_count": 89, "id": "local-latter", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Output\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
lowercase
abcdefghc
                              \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
uppercase
ABCDEFGHC
                              " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"Input \", \"Output\")\n", "display_side_by_side(df_string.toPandas(),df_upper.toPandas())" ] }, { "cell_type": "markdown", "id": "corporate-fountain", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "existing-cleanup", "metadata": {}, "source": [ "## 1e. How to extract a specific group matched by a Java regex?\n", "\n", "\n", "\n", "```{figure} img/chapter5/5.png\n", "---\n", "align: center\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "failing-formation", "metadata": {}, "source": [ "Lets first understand the syntax\n", "\n", "```{admonition} Syntax\n", "pyspark.sql.functions.regexp_extract(str, pattern, idx)\n", "\n", "Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.\n", "'''" ] }, { "cell_type": "markdown", "id": "joined-patrol", "metadata": {}, "source": [ "Input: Spark data frame with a column having a string" ] }, { "cell_type": "code", "execution_count": 90, "id": "wanted-friendly", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+\n", "| str|\n", "+-------+\n", "|100-200|\n", "+-------+\n", "\n" ] } ], "source": [ "df = spark.createDataFrame([('100-200',)], ['str'])\n", "df.show()" ] }, { "cell_type": "markdown", "id": "removable-daily", "metadata": {}, "source": [ "Output : Spark data frame with a column with a split string" ] }, { "cell_type": "code", "execution_count": 91, "id": "bibliographic-finland", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----+\n", "|regex|\n", "+-----+\n", "| 100|\n", "+-----+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import regexp_extract\n", "df_regex1 = df.select(regexp_extract('str', r'(\\d+)-(\\d+)', 1).alias('regex'))\n", "df_regex1.show()" ] }, { "cell_type": "markdown", "id": "copyrighted-rapid", "metadata": {}, "source": [ " Summary:" ] }, { "cell_type": "code", "execution_count": 92, "id": "abandoned-punishment", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Output\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
str
100-200
                              \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
regex
100
                              " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"Input \", \"Output\")\n", "display_side_by_side(df.toPandas(),df_regex1.toPandas())" ] }, { "cell_type": "markdown", "id": "processed-network", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "chinese-feedback", "metadata": {}, "source": [ "## 1f. How to replace a specific group matched by a Java regex?\n", "\n", "\n", "\n", "```{figure} img/chapter5/6.png\n", "---\n", "align: center\n", "---\n", "```" ] }, { "cell_type": "markdown", "id": "removed-resistance", "metadata": {}, "source": [ "Lets first understand the syntax\n", "\n", "```{admonition} Syntax\n", "pyspark.sql.functions.regexp_replace(str, pattern, replacement)\n", "\n", "Replace all substrings of the specified string value that match regexp with rep.\n", "'''" ] }, { "cell_type": "markdown", "id": "first-homeless", "metadata": {}, "source": [ "Input: Spark data frame with a column having a string" ] }, { "cell_type": "code", "execution_count": 93, "id": "collective-pregnancy", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+\n", "| string|\n", "+-------+\n", "|100-200|\n", "+-------+\n", "\n" ] } ], "source": [ "df = spark.createDataFrame([('100-200',)], ['string'])\n", "df.show()" ] }, { "cell_type": "markdown", "id": "advanced-default", "metadata": {}, "source": [ "Output : Spark data frame with a column with a regex" ] }, { "cell_type": "code", "execution_count": 94, "id": "vanilla-provider", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+\n", "|replace|\n", "+-------+\n", "| -----|\n", "+-------+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import regexp_replace\n", "df_regex2 = df.select(regexp_replace('string', r'(\\d+)', '--').alias('replace'))\n", "df_regex2.show()" ] }, { "cell_type": "markdown", "id": "sweet-heritage", "metadata": {}, "source": [ " Summary:" ] }, { "cell_type": "code", "execution_count": 95, "id": "polar-shower", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Input Output\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
string
100-200
                              \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
replace
-----
                              " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(\"Input \", \"Output\")\n", "display_side_by_side(df.toPandas(),df_regex2.toPandas())" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 5 }