Chapter 12: Null & NaN Columns

Chapter Learning Objectives

  • Various data operations on Null & NaN columns.

Chapter Outline

  • 1a. How to count Null & NaN values in a Spark DataFrame?
  • 1b. How to filter rows that contain NaN & Null?
  • 1c. How to replace Null values in the dataframe?
  • 1d. How to replace NaN values in the dataframe?

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

from IPython.display import display_html
import pandas as pd
import numpy as np

# Render several pandas DataFrames side by side in the notebook output.
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10   # non-breaking spaces as a gutter
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)

space = "\xa0" * 10
import panel as pn

# Notebook-only CSS so the rendered tables stand out.
css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])

1a. How to count Null & NaN values in a Spark DataFrame?

Null represents “no value” or “nothing”. It is not even an empty string or zero; a null indicates the absence of a value.

NaN stands for “Not a Number”. It is usually the result of a mathematical operation that doesn’t make sense, e.g. 0.0/0.0.

Unlike Pandas, PySpark doesn’t consider NaN values to be NULL.
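
Because the two kinds of missingness are distinct, PySpark provides separate tests for them: isnan() matches only NaN, and isNull() matches only null. A minimal sketch (the demo dataframe is ours, for illustration):

from pyspark.sql.functions import col, isnan

demo = spark.createDataFrame([(float("nan"),), (None,), (1.0,)], ["x"])
demo.select(
    col("x"),
    isnan(col("x")).alias("is_nan"),     # true only for NaN
    col("x").isNull().alias("is_null"),  # true only for null
).show()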


Input: Spark dataframe

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+
df.printSchema()
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- c: double (nullable = true)

Note that only column c (double) can hold NaN: NaN exists only for float/double types, whereas any column, regardless of type, can be null.

Count of NaN values in a Spark DataFrame

from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  0|  0|  2|
+---+---+---+

Count of Null values in a Spark DataFrame

df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  0|
+---+---+---+

Count of Null and NaN values in a Spark DataFrame

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  2|
+---+---+---+
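
If this combined check is needed repeatedly, it can be wrapped in a small helper. A minimal sketch (missing_counts is our own name, not a Spark API; it assumes all columns are numeric, since isnan() is only defined for numeric types):

from pyspark.sql.functions import col, count, isnan, when

def missing_counts(sdf):
    # Per-column count of values that are either NaN or null.
    return sdf.select(
        [count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in sdf.columns]
    )

missing_counts(df).show()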

1b. How to filter rows that contain NaN & Null?


Input: Spark dataframe containing Null and NaN values

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Learn to build the Boolean expression: reduce folds the per-column conditions together with &, so a row passes only if every column is neither NaN nor null.

from functools import reduce

# AND together a "not NaN" and a "not null" condition for every column,
# producing one filter that keeps only fully populated rows.
filter_remove_nan_null_rows = reduce(
    lambda x, y: x & y,
    [~isnan(col(c)) for c in df.columns] + [col(c).isNotNull() for c in df.columns],
)
filter_remove_nan_null_rows
Column<b'((((((NOT isnan(a)) AND (NOT isnan(b))) AND (NOT isnan(c))) AND (a IS NOT NULL)) AND (b IS NOT NULL)) AND (c IS NOT NULL))'>
df.where(filter_remove_nan_null_rows).show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  6|  5|2.0|
| 12| 21|3.0|
+---+---+---+
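
The built-in dropna() offers a shorter route to the same result: it omits rows containing nulls and, for float/double columns, also treats NaN as missing. A minimal sketch, which should keep the same two fully populated rows as the reduce-based filter above:

# Drop every row that has a null (or a NaN in a float/double column).
df.dropna().show()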
df.where(~isnan(col("c")) & ~isnan(col("b"))).show()
+---+----+---+
|  a|   b|  c|
+---+----+---+
|  6|   5|2.0|
|  8|null|2.0|
| 12|  21|3.0|
+---+----+---+
df.where(col("a").isNotNull()).show()
+---+----+---+
|  a|   b|  c|
+---+----+---+
|  6|   5|2.0|
|  5|   5|NaN|
|  8|null|2.0|
| 12|  21|3.0|
+---+----+---+
df.where(col("a").isNull()).show()
+----+---+---+
|   a|  b|  c|
+----+---+---+
|null|  3|NaN|
+----+---+---+
df.where(isnan(col("c"))).show()
+----+---+---+
|   a|  b|  c|
+----+---+---+
|null|  3|NaN|
|   5|  5|NaN|
+----+---+---+
df.where(~isnan(col("c"))).show()
+---+----+---+
|  a|   b|  c|
+---+----+---+
|  6|   5|2.0|
|  8|null|2.0|
| 12|  21|3.0|
+---+----+---+

Output: Spark dataframes with the rows containing NaN or Null filtered out or selected, as shown above

1c. How to replace Null values in the dataframe?


Input: Spark dataframe containing Null and NaN values

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Output: Spark dataframe with Null values replaced

df.fillna(1000).show()
+----+----+------+
|   a|   b|     c|
+----+----+------+
|1000|   3|1000.0|
|   6|   5|   2.0|
|   5|   5|1000.0|
|   8|1000|   2.0|
|  12|  21|   3.0|
+----+----+------+
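
fillna() is shorthand for the na accessor's fill(); the same call can be spelled either way:

# Equivalent to df.fillna(1000).
df.na.fill(1000).show()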

Filling only selected columns

df.fillna(1000,subset=["a"]).show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|1000|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Filling a different value for each column

df.fillna(1000,subset=["a"]).fillna(500,subset=["b"]).show()
+----+---+---+
|   a|  b|  c|
+----+---+---+
|1000|  3|NaN|
|   6|  5|2.0|
|   5|  5|NaN|
|   8|500|2.0|
|  12| 21|3.0|
+----+---+---+
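
The chained calls can also be collapsed into a single fillna() that takes a dict mapping each column to its own replacement value; this sketch should be equivalent to the chained version above:

# {column: value} fills a different value per column in one call.
df.fillna({"a": 1000, "b": 500}).show()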

Summary:

print("Input                     ",            "Output")
print("                         ",   "Filling all columns","        ", "Filling selected columns","   ", "Filling a different value")
print("                         ",   "with a same value","        ", "                               "  "for each column")
display_side_by_side(df.toPandas(),df.fillna(1000).toPandas(), df.fillna(1000,subset=["a"]).toPandas(), 
                     df.fillna(1000,subset=["a"]).fillna(500,subset=["b"]).toPandas())
Input:
    a     b    c
  NaN   3.0  NaN
  6.0   5.0  2.0
  5.0   5.0  NaN
  8.0   NaN  2.0
 12.0  21.0  3.0

Output, filling all columns with the same value:
    a     b       c
 1000     3  1000.0
    6     5     2.0
    5     5  1000.0
    8  1000     2.0
   12    21     3.0

Output, filling selected columns:
    a     b    c
 1000   3.0  NaN
    6   5.0  2.0
    5   5.0  NaN
    8   NaN  2.0
   12  21.0  3.0

Output, filling a different value for each column:
    a    b    c
 1000    3  NaN
    6    5  2.0
    5    5  NaN
    8  500  2.0
   12   21  3.0

1d. How to replace NaN values in the dataframe?


Input: Spark dataframe

df = spark.createDataFrame([(None, 3,float('nan')), (6,5, 2.0), (5,5, float("nan")),
                            (8, None, 2.0), (12,21,3.0),], ["a", "b","c"])
df.show()
+----+----+---+
|   a|   b|  c|
+----+----+---+
|null|   3|NaN|
|   6|   5|2.0|
|   5|   5|NaN|
|   8|null|2.0|
|  12|  21|3.0|
+----+----+---+

Output: Spark dataframe with NaN values replaced

df.replace(float('nan'),10).show()
+----+----+----+
|   a|   b|   c|
+----+----+----+
|null|   3|10.0|
|   6|   5| 2.0|
|   5|   5|10.0|
|   8|null| 2.0|
|  12|  21| 3.0|
+----+----+----+
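
Like fillna(), replace() accepts a subset argument to restrict the substitution to specific columns. A minimal sketch that touches only column c:

# Replace NaN with 10 in column "c" only; nulls in "a" and "b" stay null.
df.replace(float('nan'), 10, subset=["c"]).show()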