Chapter 12: Null & NaN Columns

Chapter Learning Objectives

  • Various data operations on Null & NaN columns.

Chapter Outline

  • 1a. How to count Null & NaN values in a Spark DataFrame?
  • 1b. How to filter rows that contain NaN & Null?
  • 1c. How to replace Null values in the dataframe?
  • 1d. How to replace NaN values in the dataframe?

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

from IPython.display import display_html
import pandas as pd
import numpy as np

# Render several pandas DataFrames side by side in the notebook output.
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10   # non-breaking spaces as a gutter
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)

space = "\xa0" * 10
import panel as pn

# Notebook-only CSS so the rendered tables stand out.
css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])

1a. How to count Null & NaN values in a Spark DataFrame?

Null represents “no value” or “nothing”. It is not even an empty string or zero; a null indicates the absence of a value.

NaN stands for “Not a Number”. It is usually the result of a mathematical operation that doesn’t make sense, e.g. 0.0/0.0.

Unlike Pandas, PySpark doesn’t consider NaN values to be NULL.
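
Because the two kinds of missingness are distinct, PySpark provides separate tests for them: isnan() matches only NaN, and isNull() matches only null. A minimal sketch (the demo dataframe is ours, for illustration):

from pyspark.sql.functions import col, isnan

demo = spark.createDataFrame([(float("nan"),), (None,), (1.0,)], ["x"])
demo.select(
    col("x"),
    isnan(col("x")).alias("is_nan"),     # true only for NaN
    col("x").isNull().alias("is_null"),  # true only for null
).show()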


Input: Spark dataframe

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+
df.printSchema()
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- c: double (nullable = true)

Note that only column c (double) can hold NaN: NaN exists only for float/double types, whereas any column, regardless of type, can be null.

Count of NaN values in a Spark DataFrame

from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  0|  2|
+---+---+---+

Count of Null values in a Spark DataFrame

df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  0|
+---+---+---+

Count of Null and NaN values in a Spark DataFrame

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  2|
+---+---+---+
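
If this combined check is needed repeatedly, it can be wrapped in a small helper. A minimal sketch (missing_counts is our own name, not a Spark API; it assumes all columns are numeric, since isnan() is only defined for numeric types):

from pyspark.sql.functions import col, count, isnan, when

def missing_counts(sdf):
    # Per-column count of values that are either NaN or null.
    return sdf.select(
        [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in sdf.columns]
    )

missing_counts(df).show()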

1b. How to filter rows that contain NaN & Null?


Input: Spark dataframe containing Null and NaN values

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Learn to build the Boolean expression: reduce folds the per-column conditions together with &, so a row passes only if every column is neither NaN nor null.

from functools import reduce

# AND together a "not NaN" and a "not null" condition for every column,
# producing one filter that keeps only fully populated rows.
filter_remove_nan_null_rows = reduce(
    lambda x, y: x & y,
    [~isnan(col(c)) for c in df.columns] + [col(c).isNotNull() for c in df.columns],
)
filter_remove_nan_null_rows
Column<b'((((((NOT isnan(a)) AND (NOT isnan(b))) AND (NOT isnan(c))) AND (a IS NOT NULL)) AND (b IS NOT NULL)) AND (c IS NOT NULL))'>
df.where(filter_remove_nan_null_rows).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  6|  5|2.0|
| 12| 21|3.0|
+---+---+---+
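
The built-in dropna() offers a shorter route to the same result: it omits rows containing nulls and, for float/double columns, also treats NaN as missing. A minimal sketch, which should keep the same two fully populated rows as the reduce-based filter above:

# Drop every row that has a null (or a NaN in a float/double column).
df.dropna().show()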
df.where(~isnan(col("c")) & ~isnan(col("b"))).show()
+---+----+---+
|  a|   b|  c|
+---+----+---+
|  6|   5|2.0|
|  8|null|2.0|
| 12|  21|3.0|
+---+----+---+
df.where(col("a").isNotNull()).show()
+---+----+---+
|  a|   b|  c|
+---+----+---+
|  6|   5|2.0|
|  5|   5|NaN|
|  8|null|2.0|
| 12|  21|3.0|
+---+----+---+
df.where(col("a").isNull()).show()
+----+---+---+
|   a|  b|  c|
+----+---+---+
|null|  3|NaN|
+----+---+---+
df.where(isnan(col("c"))).show()
+----+---+---+
|   a|  b|  c|
+----+---+---+
|null|  3|NaN|
|   5|  5|NaN|
+----+---+---+
df.where(~isnan(col("c"))).show()
+---+----+---+
|  a|   b|  c|
+---+----+---+
|  6|   5|2.0|
|  8|null|2.0|
| 12|  21|3.0|
+---+----+---+

Output: Spark dataframes with the rows containing NaN or Null filtered out or selected, as shown above

1c. How to replace Null values in the dataframe?


Input: Spark dataframe containing Null and NaN values

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Output: Spark dataframe with Null values replaced

df.fillna(1000).show()
+----+----+------+
|   a|   b|     c|
+----+----+------+
|1000|   3|1000.0|
|   6|   5|   2.0|
|   5|   5|1000.0|
|   8|1000|   2.0|
|  12|  21|   3.0|
+----+----+------+
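
fillna() is shorthand for the na accessor's fill(); the same call can be spelled either way:

# Equivalent to df.fillna(1000).
df.na.fill(1000).show()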

Filling only selected columns

df.fillna(1000,subset=["a"]).show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|1000|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Filling a different value for each column

df.fillna(1000,subset=["a"]).fillna(500,subset=["b"]).show()
+----+---+---+
|   a|  b|  c|
+----+---+---+
|1000|  3|NaN|
|   6|  5|2.0|
|   5|  5|NaN|
|   8|500|2.0|
|  12| 21|3.0|
+----+---+---+
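
The chained calls can also be collapsed into a single fillna() that takes a dict mapping each column to its own replacement value; this sketch should be equivalent to the chained version above:

# {column: value} fills a different value per column in one call.
df.fillna({"a": 1000, "b": 500}).show()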

Summary:

print("Input                     ",            "Output")
print("                         ",   "Filling all columns","        ", "Filling selected columns","   ", "Filling a different value")
print("                         ",   "with a same value","        ", "                               "  "for each column")
display_side_by_side(df.toPandas(),df.fillna(1000).toPandas(), df.fillna(1000,subset=["a"]).toPandas(), 
                     df.fillna(1000,subset=["a"]).fillna(500,subset=["b"]).toPandas())
Input:
    a     b    c
  NaN   3.0  NaN
  6.0   5.0  2.0
  5.0   5.0  NaN
  8.0   NaN  2.0
 12.0  21.0  3.0

Output, filling all columns with the same value:
    a     b       c
 1000     3  1000.0
    6     5     2.0
    5     5  1000.0
    8  1000     2.0
   12    21     3.0

Output, filling selected columns:
    a     b    c
 1000   3.0  NaN
    6   5.0  2.0
    5   5.0  NaN
    8   NaN  2.0
   12  21.0  3.0

Output, filling a different value for each column:
    a    b    c
 1000    3  NaN
    6    5  2.0
    5    5  NaN
    8  500  2.0
   12   21  3.0

1d. How to replace NaN values in the dataframe?


Input: Spark dataframe

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Output: Spark dataframe with NaN values replaced

df.replace(float('nan'),10).show()
+----+----+----+
|   a|   b|   c|
+----+----+----+
|null|   3|10.0|
|   6|   5| 2.0|
|   5|   5|10.0|
|   8|null| 2.0|
|  12|  21| 3.0|
+----+----+----+
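
Like fillna(), replace() accepts a subset argument to restrict the substitution to specific columns. A minimal sketch that touches only column c:

# Replace NaN with 10 in column "c" only; nulls in "a" and "b" stay null.
df.replace(float('nan'), 10, subset=["c"]).show()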