Chapter 5: String Columns

Chapter Learning Objectives

  • Various data operations on columns containing strings.

Chapter Outline

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession for the examples in this chapter
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

from IPython.display import display_html
import pandas as pd
import numpy as np

# Render pandas data frames side by side in the notebook output
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)

space = "\xa0" * 10

1a. How to split a string?


Let's first understand the syntax.

Syntax

pyspark.sql.functions.split(str, pattern, limit=-1)

Splits str around matches of the given pattern.

Parameters:

  • str : a string expression to split

  • pattern : a string representing a regular expression. The regex string should be a Java regular expression.

  • limit : an integer which controls the number of times pattern is applied.

limit > 0: The resulting array’s length will not be more than limit, and the resulting array’s last entry will contain all input beyond the last matched pattern.

limit <= 0: pattern will be applied as many times as possible, and the resulting array can be of any size.

Input: a Spark data frame with a column having a string

df_string = spark.createDataFrame([('abc__def__ghc',)], ['string',])
df_string.show()
+-------------+
|       string|
+-------------+
|abc__def__ghc|
+-------------+

Output: a Spark data frame with the string split into an array

from pyspark.sql.functions import split
df_split = df_string.select(split(df_string.string,'__').alias('split_string'))
df_split.show()
+---------------+
|   split_string|
+---------------+
|[abc, def, ghc]|
+---------------+
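
The limit argument caps how many times the pattern is applied; a minimal sketch of this behavior (assuming Spark 3.0 or later, where split accepts the optional limit argument):

df_limit = df_string.select(split(df_string.string, '__', 2).alias('limit_2'))
df_limit.show()
# +---------------+
# |        limit_2|
# +---------------+
# |[abc, def__ghc]|
# +---------------+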

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_string.toPandas(),df_split.toPandas())
Input                          Output
string                         split_string
abc__def__ghc                  [abc, def, ghc]

1b. How to slice a string?


Let's first understand the syntax.

Syntax

pyspark.sql.functions.substring(str, pos, len)

Returns the substring of str that starts at position pos and is of length len. String indices start at 1.

Parameters:

  • str : the string column to take the substring of

  • pos : the starting position (1-based)

  • len : the length of the substring

Input: a Spark data frame with a column having a string

df_string = spark.createDataFrame([('abcdefghi',)], ['string',])
df_string.show()
+---------+
|   string|
+---------+
|abcdefghi|
+---------+

Output: a Spark data frame with the sliced substring

from pyspark.sql.functions import substring
df_sub = df_string.select(substring(df_string.string,1,4).alias('substring'))
df_sub.show()
+---------+
|substring|
+---------+
|     abcd|
+---------+
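
PySpark also has a true slice function, pyspark.sql.functions.slice, but it operates on array columns rather than strings; a minimal sketch combining it with split (the sample data is illustrative):

from pyspark.sql.functions import slice, split

df_arr = spark.createDataFrame([('abc__def__ghc',)], ['string'])
# slice takes 2 elements from the array, starting at index 1 (1-based)
df_sliced = df_arr.select(slice(split('string', '__'), 1, 2).alias('sliced'))
df_sliced.show()
# +----------+
# |    sliced|
# +----------+
# |[abc, def]|
# +----------+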

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_string.toPandas(),df_sub.toPandas())
Input                          Output
string                         substring
abcdefghi                      abcd

1c. How to convert uppercase to lowercase?


Let's first understand the syntax.

Syntax

pyspark.sql.functions.lower(col)

Converts a string expression to lower case.

Parameters:

  • col : column

Input: a Spark data frame with a column having an uppercase string

df_upper = spark.createDataFrame([('ABCDEFGHI',)], ['uppercase',])
df_upper.show()
+---------+
|uppercase|
+---------+
|ABCDEFGHI|
+---------+

Output: a Spark data frame with the string converted to lower case

from pyspark.sql.functions import lower
df_lower= df_upper.select(lower(df_upper.uppercase).alias('lowercase'))
df_lower.show()
+---------+
|lowercase|
+---------+
|abcdefghi|
+---------+
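
A common use of lower is case-insensitive filtering; a minimal sketch (the sample data frame is illustrative):

from pyspark.sql.functions import lower

df_names = spark.createDataFrame([('Alice',), ('ALICE',), ('Bob',)], ['name'])
# Normalize case before comparing so 'Alice' and 'ALICE' both match
df_names.filter(lower('name') == 'alice').show()
# +-----+
# | name|
# +-----+
# |Alice|
# |ALICE|
# +-----+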

Summary:

print("input                     ",            "output")
display_side_by_side(df_upper.toPandas(),df_lower.toPandas())
Input                          Output
uppercase                      lowercase
ABCDEFGHI                      abcdefghi

1d. How to convert lowercase to uppercase?


Let's first understand the syntax.

Syntax

pyspark.sql.functions.upper(col)

Converts a string expression to upper case.

Parameters:

  • col : column

Input: a Spark data frame with a column having a lowercase string

df_string = spark.createDataFrame([('abcdefghc',)], ['lowercase',])
df_string.show()
+---------+
|lowercase|
+---------+
|abcdefghc|
+---------+

Output: a Spark data frame with the string converted to upper case

from pyspark.sql.functions import upper
df_upper= df_string.select(upper(df_string.lowercase).alias('uppercase'))
df_upper.show()
+---------+
|uppercase|
+---------+
|ABCDEFGHC|
+---------+
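
upper is often combined with withColumn to add a normalized copy of a column; a minimal sketch (the country codes here are illustrative):

from pyspark.sql.functions import upper

df_codes = spark.createDataFrame([('us',), ('de',)], ['country'])
# Keep the original column and add an upper-cased version next to it
df_codes.withColumn('country_upper', upper('country')).show()
# +-------+-------------+
# |country|country_upper|
# +-------+-------------+
# |     us|           US|
# |     de|           DE|
# +-------+-------------+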

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_string.toPandas(),df_upper.toPandas())
Input                          Output
lowercase                      uppercase
abcdefghc                      ABCDEFGHC

1e. How to extract a specific group matched by a Java regex?


Let's first understand the syntax.

Syntax

pyspark.sql.functions.regexp_extract(str, pattern, idx)

Extract a specific group matched by a Java regex from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.

Parameters:

  • str : the string column to match against

  • pattern : a Java regular expression

  • idx : the index of the group to extract (0 returns the entire match)

Input: a Spark data frame with a column having a string

df = spark.createDataFrame([('100-200',)], ['str'])
df.show()
+-------+
|    str|
+-------+
|100-200|
+-------+

Output: a Spark data frame with the extracted group

from pyspark.sql.functions import regexp_extract
df_regex1 = df.select(regexp_extract('str', r'(\d+)-(\d+)', 1).alias('regex'))
df_regex1.show()
+-----+
|regex|
+-----+
|  100|
+-----+
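
Group index 0 returns the entire match, and a non-matching pattern yields an empty string rather than null; a minimal sketch using the same data frame:

from pyspark.sql.functions import regexp_extract

df.select(
    regexp_extract('str', r'(\d+)-(\d+)', 0).alias('whole'),
    regexp_extract('str', r'(\d+)-(\d+)', 2).alias('group_2'),
    regexp_extract('str', r'([a-z]+)', 1).alias('no_match')  # no match -> ''
).show()
# +-------+-------+--------+
# |  whole|group_2|no_match|
# +-------+-------+--------+
# |100-200|    200|        |
# +-------+-------+--------+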

Summary:

print("Input                     ",            "Output")
display_side_by_side(df.toPandas(),df_regex1.toPandas())
Input                          Output
str                            regex
100-200                        100

1f. How to replace a specific group matched by a Java regex?


Let's first understand the syntax.

Syntax

pyspark.sql.functions.regexp_replace(str, pattern, replacement)

Replace all substrings of the specified string value that match pattern with replacement.

Parameters:

  • str : the string column to match against

  • pattern : a Java regular expression

  • replacement : the string to substitute for each match

Input: a Spark data frame with a column having a string

df = spark.createDataFrame([('100-200',)], ['string'])
df.show()
+-------+
| string|
+-------+
|100-200|
+-------+

Output: a Spark data frame with the matched substrings replaced

from pyspark.sql.functions import regexp_replace
df_regex2 = df.select(regexp_replace('string', r'(\d+)', '--').alias('replace'))
df_regex2.show()
+-------+
|replace|
+-------+
|  -----|
+-------+
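
Because regexp_replace follows Java regex semantics, the replacement string can reference capture groups with $1, $2, and so on; a minimal sketch that swaps the two numbers:

from pyspark.sql.functions import regexp_replace

# $1 and $2 refer to the first and second captured groups
df.select(regexp_replace('string', r'(\d+)-(\d+)', '$2-$1').alias('swapped')).show()
# +-------+
# |swapped|
# +-------+
# |200-100|
# +-------+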

Summary:

print("Input                     ",            "Output")
display_side_by_side(df.toPandas(),df_regex2.toPandas())
Input                          Output
string                         replace
100-200                        -----