Chapter 9: Map Column

Chapter Learning Objectives

  • Various data operations on columns of map type.

Chapter Outline

import pyspark
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

from IPython.display import display_html
import pandas as pd
import numpy as np

# Render several pandas dataframes next to each other in the notebook output
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)

space = "\xa0" * 10

import panel as pn

# Orange borders for tables rendered after a div.special_table element
css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])

1a. How to create a column of map type?

df = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df.show(truncate=False)
print(df.dtypes)
+------------------------+
|data                    |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+

[('data', 'map<string,bigint>')]
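
A map column can also be assembled from ordinary columns. The sketch below is a supplementary example (the dataframe and the column names k1 and k2 are made up for illustration); it uses pyspark.sql.functions.create_map, which takes alternating key and value expressions:

from pyspark.sql.functions import create_map, lit

# Hypothetical dataframe with two ordinary integer columns
df_cols = spark.createDataFrame([(1, 2), (10, 20)], ["k1", "k2"])

# create_map takes alternating key/value expressions
df_from_cols = df_cols.select(
    create_map(lit("k1"), df_cols.k1, lit("k2"), df_cols.k2).alias("data"))
df_from_cols.show(truncate=False)
# Each row should now hold a map roughly like {k1 -> 1, k2 -> 2}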

1b. How to read individual elements of a map column?


Input: Spark dataframe containing a map column

df1 = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df1.show(truncate=False)
+------------------------+
|data                    |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+

Output: Spark dataframe with one column per map key, each holding that key's value

df_map = df1.select(df1.data.a.alias("a"), df1.data.b.alias("b"), df1.data.c.alias("c") )
df_map.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df1.toPandas(),df_map.toPandas())
Input                            Output
data                             a  b  c
{'a': 1, 'b': 2, 'c': 3}         1  2  3

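If the key is only known at runtime, the dot notation above does not help. A small supplementary sketch on the same df1, using two equivalent lookups, Column.getItem and pyspark.sql.functions.element_at (a missing key yields null rather than an error):

from pyspark.sql.functions import element_at

key = "b"  # key held in a variable
df1.select(
    df1.data.getItem(key).alias("via_getItem"),
    element_at(df1.data, key).alias("via_element_at")).show()
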
1c. How to extract the keys from a map column?


Let's first understand the syntax

Syntax

pyspark.sql.functions.map_keys(col)

Returns an unordered array containing the keys of the map.

Parameters:

  • col – name of column or expression


Input: Spark dataframe with a map column. Note that the value for key "b" is given as the string "2"; because the map's value type is inferred as bigint from the other entries, that value cannot be stored and shows up as null.

df2 = spark.createDataFrame([({"a":1,"b":"2","c":3},)],["data"])
df2.show(truncate=False)
+----------------------+
|data                  |
+----------------------+
|[a -> 1, b ->, c -> 3]|
+----------------------+

Output: Spark dataframe with an array column containing the map keys

from pyspark.sql.functions import map_keys
df_keys = df2.select(map_keys(df2.data).alias("keys"))
df_keys.show()
+---------+
|     keys|
+---------+
|[a, b, c]|
+---------+

Summary:

print("input                     ",            "output")
display_side_by_side(df2.toPandas(),df_keys.toPandas())
Input                              Output
data                               keys
{'a': 1, 'b': None, 'c': 3}        [a, b, c]

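If one row per key is needed rather than an array of keys, explode can be applied to the map column directly; each entry becomes a row split into key and value columns. A supplementary sketch on the same df2:

from pyspark.sql.functions import explode

# One output row per map entry, with 'key' and 'value' columns
df2.select(explode(df2.data)).show()
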
1d. How to extract the values from a map column?


Let's first understand the syntax

Syntax

pyspark.sql.functions.map_values(col)

Collection function: Returns an unordered array containing the values of the map.

Parameters

  • col – name of column or expression


Input: Spark dataframe with a map column (the same mixed-type data as in 1c, so the value for key "b" is again null)

df3 = spark.createDataFrame([({"a":1,"b":"2","c":3},)],["data"])
df3.show(truncate=False)
+----------------------+
|data                  |
+----------------------+
|[a -> 1, b ->, c -> 3]|
+----------------------+

Output: Spark dataframe with an array column containing the map values

from pyspark.sql.functions import map_values
df_values = df3.select(map_values(df3.data).alias("values"))
df_values.show()
+-------+
| values|
+-------+
|[1,, 3]|
+-------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df3.toPandas(),df_values.toPandas())
Input                              Output
data                               values
{'a': 1, 'b': None, 'c': 3}        [1, None, 3]

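map_values pairs naturally with the array functions. For example, the supplementary sketch below keeps only the rows whose map contains the value 3, by combining map_values with array_contains:

from pyspark.sql.functions import map_values, array_contains

# Filter rows whose map holds the value 3 under any key
df3.filter(array_contains(map_values(df3.data), 3)).show(truncate=False)
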
1e. How to convert a map column into an array column?


Let's first understand the syntax

Syntax

pyspark.sql.functions.map_entries(col)

Collection function: Returns an unordered array of all entries in the given map.

Parameters

  • col – name of column or expression

Input: Spark dataframe with a map column

df4 = spark.createDataFrame([({"a":1,"b": 2,"c":3},)],["data"])
df4.show(truncate=False)
+------------------------+
|data                    |
+------------------------+
|[a -> 1, b -> 2, c -> 3]|
+------------------------+

Output: Spark dataframe with an array column of key/value structs

from pyspark.sql.functions import map_entries
df_array = df4.select(map_entries(df4.data).alias("array"))
df_array.show(truncate=False)
+------------------------+
|array                   |
+------------------------+
|[[a, 1], [b, 2], [c, 3]]|
+------------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df4.toPandas(),df_array.toPandas())
Input                            Output
data                             array
{'a': 1, 'b': 2, 'c': 3}         [(a, 1), (b, 2), (c, 3)]

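map_entries has a counterpart, map_from_entries (available since Spark 2.4), which rebuilds a map from an array of key/value structs. A minimal sketch that round-trips the df_array produced above:

from pyspark.sql.functions import map_from_entries

# Convert the array of (key, value) structs back into a map column
df_array.select(map_from_entries(df_array.array).alias("data")).show(truncate=False)
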
1f. How to create a map column from multiple array columns?


Let's first understand the syntax

Syntax

pyspark.sql.functions.map_from_arrays(col1, col2)

Creates a new map from two arrays.

Parameters

  • col1 – name of column containing a set of keys. All elements should not be null

  • col2 – name of column containing a set of values


Input: Spark dataframe with two array columns, one of keys and one of values

df5 = spark.createDataFrame([([2, 5], ['a', 'b'])], ['k', 'v'])
df5.show()
+------+------+
|     k|     v|
+------+------+
|[2, 5]|[a, b]|
+------+------+

Output: Spark dataframe with a map column built from the key and value arrays

from pyspark.sql.functions import map_from_arrays
df_map1 = df5.select(map_from_arrays(df5.k, df5.v).alias("map"))
df_map1.show()
+----------------+
|             map|
+----------------+
|[2 -> a, 5 -> b]|
+----------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df5.toPandas(),df_map1.toPandas())
Input                 Output
k       v             map
[2, 5]  [a, b]        {5: 'b', 2: 'a'}

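A common use of map_from_arrays is building one map per group from key/value rows, by first collecting the keys and values with collect_list. The sketch below is illustrative only (the id/k/v columns and data are made up), and it assumes the two collect_list calls see the rows in the same order, which holds when both are computed in the same aggregation but is not a documented guarantee:

from pyspark.sql.functions import collect_list, map_from_arrays

kv = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "c", 30)], ["id", "k", "v"])

# Collect keys and values per id, then zip them into a single map column
kv.groupBy("id").agg(
    map_from_arrays(collect_list("k"), collect_list("v")).alias("data")
).show(truncate=False)
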
1g. How to combine multiple map columns into one?


Let's first understand the syntax

Syntax

pyspark.sql.functions.map_concat(*cols)

Returns the union of all the given maps.

Parameters

  • cols – column names or Column expressions


Input: Spark data frame with multiple map columns

df6 = spark.sql("SELECT map(1, 'a', 2, 'b') as map1, map(3, 'c') as map2")
df6.show()
+----------------+--------+
|            map1|    map2|
+----------------+--------+
|[1 -> a, 2 -> b]|[3 -> c]|
+----------------+--------+

Output: Spark dataframe with a single map column that is the union of the input maps

from pyspark.sql.functions import map_concat
df_com = df6.select(map_concat("map1", "map2").alias("combined_map"))
df_com.show(truncate=False)
+------------------------+
|combined_map            |
+------------------------+
|[1 -> a, 2 -> b, 3 -> c]|
+------------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df6.toPandas(),df_com.toPandas())
Input                                  Output
map1              map2                 combined_map
{1: 'a', 2: 'b'}  {3: 'c'}             {1: 'a', 2: 'b', 3: 'c'}
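
One caveat: if the input maps share a key, the behaviour of map_concat depends on the Spark version and on the spark.sql.mapKeyDedupPolicy setting (in Spark 3.x the default EXCEPTION raises an error, while LAST_WIN keeps the value from the last map). A small sketch, assuming a Spark 3.x session:

from pyspark.sql.functions import map_concat

# Let later maps win on duplicate keys (Spark 3.x setting; assumption)
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")

df_dup = spark.sql("SELECT map(1, 'a', 2, 'b') AS map1, map(2, 'z') AS map2")
df_dup.select(map_concat("map1", "map2").alias("combined_map")).show(truncate=False)
# Key 2 should now carry 'z', the value taken from the last map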