{
"cells": [
{
"cell_type": "markdown",
"id": "traditional-right",
"metadata": {},
"source": [
"```{figure} ../images/banner.png\n",
"---\n",
"align: center\n",
"name: banner\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "contained-yellow",
"metadata": {},
"source": [
"# Chapter 5 : String Columns"
]
},
{
"cell_type": "markdown",
"id": "acoustic-factory",
"metadata": {},
"source": [
"## Chapter Learning Objectives"
]
},
{
"cell_type": "markdown",
"id": "improving-general",
"metadata": {},
"source": [
"- Various data operations on columns containing string. "
]
},
{
"cell_type": "markdown",
"id": "thorough-employment",
"metadata": {},
"source": [
"## Chapter Outline\n",
"\n",
"- [1. Various data operations on columns containing string ](#1)\n",
" - [1a. How to split a string?](#2)\n",
" - [1b. How to slice a string?](#3)\n",
" - [1d. How to convert lowercase to uppercase?](#4)\n",
" - [1c. How to convert uppercase to lowercase?](#5)\n",
" - [1e. How to extract a specific group matched by a Java regex?](#6)\n",
" - [1f. How to replace a specific group matched by a Java regex?](#7)"
]
},
{
"cell_type": "markdown",
"id": "antique-merit",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "increased-israel",
"metadata": {
"tags": [
"hide_input"
]
},
"outputs": [],
"source": [
"# import panel as pn\n",
"# css = \"\"\"\n",
"# div.special_table + table, th, td {\n",
"# border: 3px solid orange;\n",
"# }\n",
"# \"\"\"\n",
"# pn.extension(raw_css=[css])\n",
"import pyspark\n",
"from pyspark.sql import SparkSession\n",
"spark = SparkSession \\\n",
" .builder \\\n",
" .appName(\"Python Spark SQL basic example\") \\\n",
" .config(\"spark.some.config.option\", \"some-value\") \\\n",
" .getOrCreate()\n",
"from IPython.display import display_html\n",
"import pandas as pd \n",
"import numpy as np\n",
"def display_side_by_side(*args):\n",
" html_str=''\n",
" for df in args:\n",
" html_str+=df.to_html(index=False)\n",
" html_str+= \"\\xa0\\xa0\\xa0\"*10\n",
" display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)\n",
"space = \"\\xa0\" * 10"
]
},
{
"cell_type": "markdown",
"id": "polar-somalia",
"metadata": {},
"source": [
"## Chapter Outline - Gallery"
]
},
{
"cell_type": "markdown",
"id": "proof-birth",
"metadata": {},
"source": [
"click on | any image\n",
"---: |:--- \n",
"[](#2)| [](#3)\n",
"[](#4) | [](#5)\n",
"[](#6) | [](#7)\n"
]
},
{
"cell_type": "markdown",
"id": "medieval-henry",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "prime-lying",
"metadata": {},
"source": [
"## 1a. How to split a string?\n",
"\n",
"\n",
"\n",
"```{figure} img/chapter5/1.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "mighty-appraisal",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.split(str, pattern, limit=-1)\n",
"\n",
"Splits str around matches of the given pattern.\n",
"\n",
"Parameters:\n",
"\n",
"- str : a string expression to split\n",
"\n",
"- pattern : a string representing a regular expression. The regex string should be a Java regular expression.\n",
"\n",
"- limit : an integer which controls the number of times pattern is applied.\n",
"\n",
"limit > 0: The resulting array’s length will not be more than limit, and the\n",
"resulting array’s last entry will contain all input beyond the last matched pattern.\n",
"\n",
"limit <= 0: pattern will be applied as many times as possible, and the resulting\n",
"array can be of any size.\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "authentic-guarantee",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "essential-water",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------+\n",
"| string|\n",
"+-------------+\n",
"|abc__def__ghc|\n",
"+-------------+\n",
"\n"
]
}
],
"source": [
"df_string = spark.createDataFrame([('abc__def__ghc',)], ['string',])\n",
"df_string.show()"
]
},
{
"cell_type": "markdown",
"id": "detailed-punch",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a split string"
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "brown-runner",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------------+\n",
"| split_string|\n",
"+---------------+\n",
"|[abc, def, ghc]|\n",
"+---------------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import split\n",
"df_split = df_string.select(split(df_string.string,'__').alias('split_string'))\n",
"df_split.show()"
]
},
{
"cell_type": "markdown",
"id": "recognized-integer",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "active-nation",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" string | \n",
"
\n",
" \n",
" \n",
" \n",
" abc__def__ghc | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" split_string | \n",
"
\n",
" \n",
" \n",
" \n",
" [abc, def, ghc] | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df_string.toPandas(),df_split.toPandas())"
]
},
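{
"cell_type": "markdown",
"id": "split-limit-note",
"metadata": {},
"source": [
"As a small additional sketch (not part of the original example), the `limit` parameter caps how many times the pattern is applied: with `limit=2` the last array entry keeps everything after the first match (the `limit` argument assumes Spark 3.0+)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "split-limit-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: effect of the `limit` parameter of split() on the same df_string\n",
"from pyspark.sql.functions import split\n",
"df_limit = df_string.select(\n",
"    split(df_string.string, '__', 2).alias('split_limit_2'),   # expected [abc, def__ghc]\n",
"    split(df_string.string, '__', -1).alias('split_no_limit')  # expected [abc, def, ghc]\n",
")\n",
"df_limit.show(truncate=False)"
]
},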
{
"cell_type": "code",
"execution_count": null,
"id": "inclusive-trial",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "other-archive",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "charming-hydrogen",
"metadata": {},
"source": [
"## 1b. How to slice a string?\n",
"\n",
"\n",
"```{figure} img/chapter5/4.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "assured-sapphire",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"\n",
"pyspark.sql.functions.slice(x, start, length)\n",
"\n",
"Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.\n",
"\n",
"Parameters:\n",
"\n",
"- x : the array to be sliced\n",
"\n",
"- start : the starting index\n",
"\n",
"- length : the length of the slice\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "million-jaguar",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 81,
"id": "collect-oakland",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"| string|\n",
"+---------+\n",
"|abcdefghi|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"df_string = spark.createDataFrame([('abcdefghi',)], ['string',])\n",
"df_string.show()"
]
},
{
"cell_type": "markdown",
"id": "liquid-thomson",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a sliced string"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "addressed-heritage",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|substring|\n",
"+---------+\n",
"| abcd|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import substring\n",
"df_sub = df_string.select(substring(df_string.string,1,4).alias('substring'))\n",
"df_sub.show()"
]
},
{
"cell_type": "markdown",
"id": "cloudy-setup",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "alone-bidding",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" string | \n",
"
\n",
" \n",
" \n",
" \n",
" abcdefghi | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" substring | \n",
"
\n",
" \n",
" \n",
" \n",
" abcd | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df_string.toPandas(),df_sub.toPandas())"
]
},
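{
"cell_type": "markdown",
"id": "substring-negative-note",
"metadata": {},
"source": [
"As a brief extra sketch on the same `df_string`: `substring` also accepts a negative start position, which counts back from the end of the string, and `Column.substr(startPos, length)` is an equivalent column-method spelling."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "substring-negative-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a negative start position counts from the end; Column.substr is equivalent\n",
"from pyspark.sql.functions import substring, col\n",
"df_tail = df_string.select(\n",
"    substring(df_string.string, -4, 4).alias('last_four'),  # expected 'fghi'\n",
"    col('string').substr(1, 4).alias('first_four')          # same result as substring(string, 1, 4)\n",
")\n",
"df_tail.show()"
]
},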
{
"cell_type": "markdown",
"id": "democratic-disclaimer",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "chubby-source",
"metadata": {},
"source": [
"## 1c. How to convert lowercase to uppercase?\n",
"\n",
"\n",
"```{figure} img/chapter5/3.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "improving-passage",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"Converts a string expression to upper case.\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.upper(col)\n",
"\n",
"\n",
"Converts a string expression to lower case.\n",
"\n",
"Parameters:\n",
"\n",
"- col : column\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "hungry-watson",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a lowercase string"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "daily-portable",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|uppercase|\n",
"+---------+\n",
"|ABCDEFGHI|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"df_upper = spark.createDataFrame([('ABCDEFGHI',)], ['uppercase',])\n",
"df_upper.show()"
]
},
{
"cell_type": "markdown",
"id": "planned-terry",
"metadata": {},
"source": [
"Output : Spark data frame with a column having a uppercase string"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "collected-auckland",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|lowercase|\n",
"+---------+\n",
"|abcdefghi|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import lower\n",
"df_lower= df_upper.select(lower(df_upper.uppercase).alias('lowercase'))\n",
"df_lower.show()"
]
},
{
"cell_type": "markdown",
"id": "large-rally",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "judicial-summer",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" uppercase | \n",
"
\n",
" \n",
" \n",
" \n",
" ABCDEFGHI | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" lowercase | \n",
"
\n",
" \n",
" \n",
" \n",
" abcdefghi | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"input \", \"output\")\n",
"display_side_by_side(df_upper.toPandas(),df_lower.toPandas())"
]
},
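{
"cell_type": "markdown",
"id": "lower-filter-note",
"metadata": {},
"source": [
"A common use of `lower` is normalising case before a comparison. The sketch below uses a small hypothetical DataFrame (not from the original example) for a case-insensitive filter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "lower-filter-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: case-insensitive filtering by normalising with lower() (hypothetical data)\n",
"from pyspark.sql.functions import lower\n",
"df_names = spark.createDataFrame([('Alice',), ('BOB',), ('carol',)], ['name'])\n",
"df_names.filter(lower(df_names.name) == 'bob').show()  # matches the 'BOB' row"
]
},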
{
"cell_type": "code",
"execution_count": null,
"id": "established-strategy",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "environmental-complexity",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "amateur-puppy",
"metadata": {},
"source": [
"## 1d. How to convert uppercase to lowercase?\n",
"\n",
"\n",
"```{figure} img/chapter5/2.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "boring-pennsylvania",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.lower(col)\n",
"\n",
"Converts a string expression to lower case.\n",
"\n",
"Parameters:\n",
"\n",
"- col : column\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "convinced-government",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "infrared-roads",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|lowercase|\n",
"+---------+\n",
"|abcdefghc|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"df_string = spark.createDataFrame([('abcdefghc',)], ['lowercase',])\n",
"df_string.show()"
]
},
{
"cell_type": "markdown",
"id": "expanded-chair",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a split string"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "hearing-elephant",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+---------+\n",
"|uppercase|\n",
"+---------+\n",
"|ABCDEFGHC|\n",
"+---------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import upper\n",
"df_upper= df_string.select(upper(df_string.lowercase).alias('uppercase'))\n",
"df_upper.show()"
]
},
{
"cell_type": "markdown",
"id": "tested-pledge",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"id": "local-latter",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" lowercase | \n",
"
\n",
" \n",
" \n",
" \n",
" abcdefghc | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" uppercase | \n",
"
\n",
" \n",
" \n",
" \n",
" ABCDEFGHC | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df_string.toPandas(),df_upper.toPandas())"
]
},
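{
"cell_type": "markdown",
"id": "upper-withcolumn-note",
"metadata": {},
"source": [
"As a small follow-up sketch using the `df_string` defined just above: `upper` can also be applied with `withColumn`, keeping the original column alongside the uppercase version."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "upper-withcolumn-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: add an uppercase column while keeping the original lowercase column\n",
"from pyspark.sql.functions import upper\n",
"df_both = df_string.withColumn('uppercase', upper(df_string.lowercase))\n",
"df_both.show()"
]
},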
{
"cell_type": "markdown",
"id": "corporate-fountain",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "existing-cleanup",
"metadata": {},
"source": [
"## 1e. How to extract a specific group matched by a Java regex?\n",
"\n",
"\n",
"\n",
"```{figure} img/chapter5/5.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "failing-formation",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.regexp_extract(str, pattern, idx)\n",
"\n",
"Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "joined-patrol",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "wanted-friendly",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"| str|\n",
"+-------+\n",
"|100-200|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"df = spark.createDataFrame([('100-200',)], ['str'])\n",
"df.show()"
]
},
{
"cell_type": "markdown",
"id": "removable-daily",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a split string"
]
},
{
"cell_type": "code",
"execution_count": 91,
"id": "bibliographic-finland",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+\n",
"|regex|\n",
"+-----+\n",
"| 100|\n",
"+-----+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import regexp_extract\n",
"df_regex1 = df.select(regexp_extract('str', r'(\\d+)-(\\d+)', 1).alias('regex'))\n",
"df_regex1.show()"
]
},
{
"cell_type": "markdown",
"id": "copyrighted-rapid",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"id": "abandoned-punishment",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" str | \n",
"
\n",
" \n",
" \n",
" \n",
" 100-200 | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" regex | \n",
"
\n",
" \n",
" \n",
" \n",
" 100 | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df.toPandas(),df_regex1.toPandas())"
]
},
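{
"cell_type": "markdown",
"id": "regexp-extract-idx-note",
"metadata": {},
"source": [
"To make the `idx` argument concrete, here is a short sketch on the same `df`: group 0 is the whole match, group 2 is the second parenthesised group, and a pattern that does not match yields an empty string."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "regexp-extract-idx-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: different group indices with regexp_extract on the same data\n",
"from pyspark.sql.functions import regexp_extract\n",
"df.select(\n",
"    regexp_extract('str', r'(\\d+)-(\\d+)', 0).alias('whole_match'),   # expected '100-200'\n",
"    regexp_extract('str', r'(\\d+)-(\\d+)', 2).alias('second_group'),  # expected '200'\n",
"    regexp_extract('str', r'([a-z]+)', 1).alias('no_match')          # expected '' (no match)\n",
").show()"
]
},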
{
"cell_type": "markdown",
"id": "processed-network",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"id": "chinese-feedback",
"metadata": {},
"source": [
"## 1f. How to replace a specific group matched by a Java regex?\n",
"\n",
"\n",
"\n",
"```{figure} img/chapter5/6.png\n",
"---\n",
"align: center\n",
"---\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "removed-resistance",
"metadata": {},
"source": [
"Lets first understand the syntax\n",
"\n",
"```{admonition} Syntax\n",
"pyspark.sql.functions.regexp_replace(str, pattern, replacement)\n",
"\n",
"Replace all substrings of the specified string value that match regexp with rep.\n",
"'''"
]
},
{
"cell_type": "markdown",
"id": "first-homeless",
"metadata": {},
"source": [
"Input: Spark data frame with a column having a string"
]
},
{
"cell_type": "code",
"execution_count": 93,
"id": "collective-pregnancy",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"| string|\n",
"+-------+\n",
"|100-200|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"df = spark.createDataFrame([('100-200',)], ['string'])\n",
"df.show()"
]
},
{
"cell_type": "markdown",
"id": "advanced-default",
"metadata": {},
"source": [
"Output : Spark data frame with a column with a regex"
]
},
{
"cell_type": "code",
"execution_count": 94,
"id": "vanilla-provider",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+\n",
"|replace|\n",
"+-------+\n",
"| -----|\n",
"+-------+\n",
"\n"
]
}
],
"source": [
"from pyspark.sql.functions import regexp_replace\n",
"df_regex2 = df.select(regexp_replace('string', r'(\\d+)', '--').alias('replace'))\n",
"df_regex2.show()"
]
},
{
"cell_type": "markdown",
"id": "sweet-heritage",
"metadata": {},
"source": [
" Summary:"
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "polar-shower",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Input Output\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" string | \n",
"
\n",
" \n",
" \n",
" \n",
" 100-200 | \n",
"
\n",
" \n",
"
\n",
" \n",
" \n",
" replace | \n",
"
\n",
" \n",
" \n",
" \n",
" ----- | \n",
"
\n",
" \n",
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"print(\"Input \", \"Output\")\n",
"display_side_by_side(df.toPandas(),df_regex2.toPandas())"
]
}
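,
{
"cell_type": "markdown",
"id": "regexp-replace-backref-note",
"metadata": {},
"source": [
"As a final sketch on the same `df`: the replacement string may reference capture groups with `$1`, `$2` (Java regex replacement syntax), for example to swap the two numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "regexp-replace-backref-example",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: capture-group backreferences in the replacement string\n",
"from pyspark.sql.functions import regexp_replace\n",
"df.select(\n",
"    regexp_replace('string', r'(\\d+)-(\\d+)', '$2-$1').alias('swapped')  # expected '200-100'\n",
").show()"
]
}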
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}