{
"cells": [
{
"cell_type": "markdown",
"id": "traditional-right",
"metadata": {},
"source": [
"```{figure} ../images/banner.png\n",
"---\n",
"align: center\n",
"name: banner\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "contained-yellow",
"metadata": {},
"source": [
"# Chapter 5 : String Columns"
]
},
{
"cell_type": "markdown",
"id": "acoustic-factory",
"metadata": {},
"source": [
"## Chapter Learning Objectives"
]
},
{
"cell_type": "markdown",
"id": "improving-general",
"metadata": {},
"source": [
"- Various data operations on columns containing string. "
]
},
{
"cell_type": "markdown",
"id": "thorough-employment",
"metadata": {},
"source": [
"## Chapter Outline\n",
"\n",
"- [1. Various data operations on columns containing string ](#1)\n",
" - [1a. How to split a string?](#2)\n",
" - [1b. How to slice a string?](#3)\n",
" - [1d. How to convert lowercase to uppercase?](#4)\n",
" - [1c. How to convert uppercase to lowercase?](#5)\n",
" - [1e. How to extract a specific group matched by a Java regex?](#6)\n",
" - [1f. How to replace a specific group matched by a Java regex?](#7)"
]
},
{
"cell_type": "markdown",
"id": "antique-merit",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "increased-israel",
"metadata": {
"tags": [
"hide_input"
]
},
"outputs": [],
"source": [
"# import panel as pn\n",
"# css = \"\"\"\n",
"# div.special_table + table, th, td {\n",
"# border: 3px solid orange;\n",
"# }\n",
"# \"\"\"\n",
"# pn.extension(raw_css=[css])\n",
"import pyspark\n",
"from pyspark.sql import SparkSession\n",
"spark = SparkSession \\\n",
" .builder \\\n",
" .appName(\"Python Spark SQL basic example\") \\\n",
" .config(\"spark.some.config.option\", \"some-value\") \\\n",
" .getOrCreate()\n",
"from IPython.display import display_html\n",
"import pandas as pd \n",
"import numpy as np\n",
"def display_side_by_side(*args):\n",
" html_str=''\n",
" for df in args:\n",
" html_str+=df.to_html(index=False)\n",
" html_str+= \"\\xa0\\xa0\\xa0\"*10\n",
" display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)\n",
"space = \"\\xa0\" * 10"
]
},
{
"cell_type": "markdown",
"id": "polar-somalia",
"metadata": {},
"source": [
"## Chapter Outline - Gallery"
]
},
{
"cell_type": "markdown",
"id": "proof-birth",
"metadata": {},
"source": [
"click on | any image\n",
"---: |:--- \n",
"[](#2)| [](#3)\n",
"[](#4) | [](#5)\n",
"[](#6) | [](#7)\n"
]
},
{
"cell_type": "markdown",
"id": "medieval-henry",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "prime-lying",
"metadata": {},
"source": [
"## 1a. How to split a string?\n",
"\n",
"\n",
"\n",
"```{figure} img/chapter5/1.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "mighty-appraisal",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.split(str, pattern, limit=-1)\n",
"\n",
"Splits str around matches of the given pattern.\n",
"\n",
"Parameters:\n",
"\n",
"- str : a string expression to split\n",
"\n",
"- pattern : a string representing a regular expression. The regex string should be a Java regular expression.\n",
"\n",
"- limit : an integer which controls the number of times pattern is applied.\n",
"\n",
"limit > 0: The resulting array’s length will not be more than limit, and the\n",
"resulting array’s last entry will contain all input beyond the last matched pattern.\n",
"\n",
"limit <= 0: pattern will be applied as many times as possible, and the resulting\n",
"array can be of any size.\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "authentic-guarantee",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "essential-water",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------+\n",
"| string|\n",
"+-------------+\n",
"|abc__def__ghc|\n",
"+-------------+\n",
"\n"
]
}
],
"source": [
"df_string = spark.createDataFrame([('abc__def__ghc',)], ['string',])\n",
"df_string.show()"
]
},
{
"cell_type": "markdown",
"id": "detailed-punch",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a split string"
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "brown-runner",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+\n",
"| split_string|\n",
"+---------------+\n",
"|[abc, def, ghc]|\n",
"+---------------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import split\n",
"df_split = df_string.select(split(df_string.string,'__').alias('split_string'))\n",
"df_split.show()"
]
},
{
"cell_type": "markdown",
"id": "recognized-integer",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "active-nation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" string | \n",
"
\n",
" \n",
" \n",
" \n",
" abc__def__ghc | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" split_string | \n",
"
\n",
" \n",
" \n",
" \n",
" [abc, def, ghc] | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df_string.toPandas(),df_split.toPandas())"
]
},
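{
"cell_type": "markdown",
"id": "split-limit-note",
"metadata": {},
"source": [
"As a small additional sketch (not part of the original example), the `limit` parameter caps how many times the pattern is applied: with `limit=2` the last array entry keeps everything after the first match (the `limit` argument assumes Spark 3.0+)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "split-limit-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: effect of the `limit` parameter of split() on the same df_string\n",
"from pyspark.sql.functions import split\n",
"df_limit = df_string.select(\n",
"    split(df_string.string, '__', 2).alias('split_limit_2'),   # expected [abc, def__ghc]\n",
"    split(df_string.string, '__', -1).alias('split_no_limit')  # expected [abc, def, ghc]\n",
")\n",
"df_limit.show(truncate=False)"
]
},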
{
"cell_type": "code",
"execution_count": null,
"id": "inclusive-trial",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "other-archive",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "charming-hydrogen",
"metadata": {},
"source": [
"## 1b. How to slice a string?\n",
"\n",
"\n",
"```{figure} img/chapter5/4.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "assured-sapphire",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"\n",
"pyspark.sql.functions.slice(x, start, length)\n",
"\n",
"Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.\n",
"\n",
"Parameters:\n",
"\n",
"- x : the array to be sliced\n",
"\n",
"- start : the starting index\n",
"\n",
"- length : the length of the slice\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "million-jaguar",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 81,
"id": "collect-oakland",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"| string|\n",
"+---------+\n",
"|abcdefghi|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"df_string = spark.createDataFrame([('abcdefghi',)], ['string',])\n",
"df_string.show()"
]
},
{
"cell_type": "markdown",
"id": "liquid-thomson",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a sliced string"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "addressed-heritage",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|substring|\n",
"+---------+\n",
"| abcd|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import substring\n",
"df_sub = df_string.select(substring(df_string.string,1,4).alias('substring'))\n",
"df_sub.show()"
]
},
{
"cell_type": "markdown",
"id": "cloudy-setup",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "alone-bidding",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" string | \n",
"
\n",
" \n",
" \n",
" \n",
" abcdefghi | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" substring | \n",
"
\n",
" \n",
" \n",
" \n",
" abcd | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df_string.toPandas(),df_sub.toPandas())"
]
},
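{
"cell_type": "markdown",
"id": "substring-negative-note",
"metadata": {},
"source": [
"As a brief extra sketch on the same `df_string`: `substring` also accepts a negative start position, which counts back from the end of the string, and `Column.substr(startPos, length)` is an equivalent column-method spelling."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "substring-negative-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a negative start position counts from the end; Column.substr is equivalent\n",
"from pyspark.sql.functions import substring, col\n",
"df_tail = df_string.select(\n",
"    substring(df_string.string, -4, 4).alias('last_four'),  # expected 'fghi'\n",
"    col('string').substr(1, 4).alias('first_four')          # same result as substring(string, 1, 4)\n",
")\n",
"df_tail.show()"
]
},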
{
"cell_type": "markdown",
"id": "democratic-disclaimer",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "chubby-source",
"metadata": {},
"source": [
"## 1c. How to convert lowercase to uppercase?\n",
"\n",
"\n",
"```{figure} img/chapter5/3.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "improving-passage",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"Converts a string expression to upper case.\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.upper(col)\n",
"\n",
"\n",
"Converts a string expression to lower case.\n",
"\n",
"Parameters:\n",
"\n",
"- col : column\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "hungry-watson",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a lowercase string"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "daily-portable",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|uppercase|\n",
"+---------+\n",
"|ABCDEFGHI|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"df_upper = spark.createDataFrame([('ABCDEFGHI',)], ['uppercase',])\n",
"df_upper.show()"
]
},
{
"cell_type": "markdown",
"id": "planned-terry",
"metadata": {},
"source": [
"Output : Spark data frame with a column having a uppercase string"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "collected-auckland",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|lowercase|\n",
"+---------+\n",
"|abcdefghi|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import lower\n",
"df_lower= df_upper.select(lower(df_upper.uppercase).alias('lowercase'))\n",
"df_lower.show()"
]
},
{
"cell_type": "markdown",
"id": "large-rally",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "judicial-summer",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" uppercase | \n",
"
\n",
" \n",
" \n",
" \n",
" ABCDEFGHI | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" lowercase | \n",
"
\n",
" \n",
" \n",
" \n",
" abcdefghi | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"input \", \"output\")\n",
"display_side_by_side(df_upper.toPandas(),df_lower.toPandas())"
]
},
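{
"cell_type": "markdown",
"id": "lower-filter-note",
"metadata": {},
"source": [
"A common use of `lower` is normalising case before a comparison. The sketch below uses a small hypothetical DataFrame (not from the original example) for a case-insensitive filter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lower-filter-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: case-insensitive filtering by normalising with lower() (hypothetical data)\n",
"from pyspark.sql.functions import lower\n",
"df_names = spark.createDataFrame([('Alice',), ('BOB',), ('carol',)], ['name'])\n",
"df_names.filter(lower(df_names.name) == 'bob').show()  # matches the 'BOB' row"
]
},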
{
"cell_type": "code",
"execution_count": null,
"id": "established-strategy",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "environmental-complexity",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "amateur-puppy",
"metadata": {},
"source": [
"## 1d. How to convert uppercase to lowercase?\n",
"\n",
"\n",
"```{figure} img/chapter5/2.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "boring-pennsylvania",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.lower(col)\n",
"\n",
"Converts a string expression to lower case.\n",
"\n",
"Parameters:\n",
"\n",
"- col : column\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "convinced-government",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "infrared-roads",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|lowercase|\n",
"+---------+\n",
"|abcdefghc|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"df_string = spark.createDataFrame([('abcdefghc',)], ['lowercase',])\n",
"df_string.show()"
]
},
{
"cell_type": "markdown",
"id": "expanded-chair",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a split string"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "hearing-elephant",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|uppercase|\n",
"+---------+\n",
"|ABCDEFGHC|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import upper\n",
"df_upper= df_string.select(upper(df_string.lowercase).alias('uppercase'))\n",
"df_upper.show()"
]
},
{
"cell_type": "markdown",
"id": "tested-pledge",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"id": "local-latter",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" lowercase | \n",
"
\n",
" \n",
" \n",
" \n",
" abcdefghc | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" uppercase | \n",
"
\n",
" \n",
" \n",
" \n",
" ABCDEFGHC | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df_string.toPandas(),df_upper.toPandas())"
]
},
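{
"cell_type": "markdown",
"id": "upper-withcolumn-note",
"metadata": {},
"source": [
"As a small follow-up sketch using the `df_string` defined just above: `upper` can also be applied with `withColumn`, keeping the original column alongside the uppercase version."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "upper-withcolumn-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: add an uppercase column while keeping the original lowercase column\n",
"from pyspark.sql.functions import upper\n",
"df_both = df_string.withColumn('uppercase', upper(df_string.lowercase))\n",
"df_both.show()"
]
},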
{
"cell_type": "markdown",
"id": "corporate-fountain",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "existing-cleanup",
"metadata": {},
"source": [
"## 1e. How to extract a specific group matched by a Java regex?\n",
"\n",
"\n",
"\n",
"```{figure} img/chapter5/5.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "failing-formation",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.regexp_extract(str, pattern, idx)\n",
"\n",
"Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "joined-patrol",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "wanted-friendly",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"| str|\n",
"+-------+\n",
"|100-200|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"df = spark.createDataFrame([('100-200',)], ['str'])\n",
"df.show()"
]
},
{
"cell_type": "markdown",
"id": "removable-daily",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a split string"
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "bibliographic-finland",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+\n",
"|regex|\n",
"+-----+\n",
"| 100|\n",
"+-----+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import regexp_extract\n",
"df_regex1 = df.select(regexp_extract('str', r'(\\d+)-(\\d+)', 1).alias('regex'))\n",
"df_regex1.show()"
]
},
{
"cell_type": "markdown",
"id": "copyrighted-rapid",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"id": "abandoned-punishment",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" str | \n",
"
\n",
" \n",
" \n",
" \n",
" 100-200 | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" regex | \n",
"
\n",
" \n",
" \n",
" \n",
" 100 | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df.toPandas(),df_regex1.toPandas())"
]
},
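{
"cell_type": "markdown",
"id": "regexp-extract-idx-note",
"metadata": {},
"source": [
"To make the `idx` argument concrete, here is a short sketch on the same `df`: group 0 is the whole match, group 2 is the second parenthesised group, and a pattern that does not match yields an empty string."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "regexp-extract-idx-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: different group indices with regexp_extract on the same data\n",
"from pyspark.sql.functions import regexp_extract\n",
"df.select(\n",
"    regexp_extract('str', r'(\\d+)-(\\d+)', 0).alias('whole_match'),   # expected '100-200'\n",
"    regexp_extract('str', r'(\\d+)-(\\d+)', 2).alias('second_group'),  # expected '200'\n",
"    regexp_extract('str', r'([a-z]+)', 1).alias('no_match')          # expected '' (no match)\n",
").show()"
]
},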
{
"cell_type": "markdown",
"id": "processed-network",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "chinese-feedback",
"metadata": {},
"source": [
"## 1f. How to replace a specific group matched by a Java regex?\n",
"\n",
"\n",
"\n",
"```{figure} img/chapter5/6.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "removed-resistance",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.regexp_replace(str, pattern, replacement)\n",
"\n",
"Replace all substrings of the specified string value that match regexp with rep.\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "first-homeless",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 93,
"id": "collective-pregnancy",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"| string|\n",
"+-------+\n",
"|100-200|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"df = spark.createDataFrame([('100-200',)], ['string'])\n",
"df.show()"
]
},
{
"cell_type": "markdown",
"id": "advanced-default",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a regex"
]
},
{
"cell_type": "code",
"execution_count": 94,
"id": "vanilla-provider",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"|replace|\n",
"+-------+\n",
"| -----|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import regexp_replace\n",
"df_regex2 = df.select(regexp_replace('string', r'(\\d+)', '--').alias('replace'))\n",
"df_regex2.show()"
]
},
{
"cell_type": "markdown",
"id": "sweet-heritage",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "polar-shower",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" string | \n",
"
\n",
" \n",
" \n",
" \n",
" 100-200 | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" replace | \n",
"
\n",
" \n",
" \n",
" \n",
" ----- | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df.toPandas(),df_regex2.toPandas())"
]
}
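,
{
"cell_type": "markdown",
"id": "regexp-replace-backref-note",
"metadata": {},
"source": [
"As a final sketch on the same `df`: the replacement string may reference capture groups with `$1`, `$2` (Java regex replacement syntax), for example to swap the two numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "regexp-replace-backref-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: capture-group backreferences in the replacement string\n",
"from pyspark.sql.functions import regexp_replace\n",
"df.select(\n",
"    regexp_replace('string', r'(\\d+)-(\\d+)', '$2-$1').alias('swapped')  # expected '200-100'\n",
").show()"
]
}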
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}