Chapter 7: Date Columns

Chapter Learning Objectives

  • Various data operations on columns containing date strings, dates, and timestamps.

Chapter Outline

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
from IPython.display import display_html
import pandas as pd 
import numpy as np
def display_side_by_side(*args):
    # Render several pandas data frames next to each other as inline HTML tables
    html_str = ''
    for df in args:
        html_str += df.to_html(index=False)
        html_str += "\xa0\xa0\xa0" * 10  # non-breaking spaces as a gap between tables
    display_html(html_str.replace('table', 'table style="display:inline"'), raw=True)
space = "\xa0" * 10
import panel as pn

css = """
div.special_table + table, th, td {
  border: 3px solid orange;
}
"""
pn.extension(raw_css=[css])
#<div class="special_table"></div>

1a. How to convert a string timestamp to date format?

Let's first understand the syntax

Syntax

pyspark.sql.functions.to_date(col, format=None)

Converts a Column into pyspark.sql.types.DateType using the optionally specified format.

Specify formats according to the datetime pattern. By default, it follows casting rules to pyspark.sql.types.DateType if the format is omitted.

Equivalent to col.cast("date").

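If the string is not in the default yyyy-MM-dd layout, pass an explicit pattern as the second argument. A minimal sketch (the sample value and the ‘dd/MM/yyyy’ pattern are illustrative assumptions):

from pyspark.sql.functions import to_date, lit
# Parse a day-first date string with an explicit pattern
spark.range(1).select(to_date(lit('15/07/2020'), 'dd/MM/yyyy').alias('date')).show()
+----------+
|      date|
+----------+
|2020-07-15|
+----------+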

Input: Spark data frame with a column containing a timestamp string

df_str = spark.createDataFrame([('2020-07-15 16:52:44',)], ['string_timestamp'])
df_str.show()
+-------------------+
|   string_timestamp|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+

Output: Spark data frame with a date column

from pyspark.sql.functions import to_date
df_date = df_str.select(to_date(df_str.string_timestamp).alias('date'))
df_date.show()
+----------+
|      date|
+----------+
|2020-07-15|
+----------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_str.toPandas(),df_date.toPandas())
Input                      Output
string_timestamp           date
2020-07-15 16:52:44        2020-07-15

1b. How to convert a string timestamp to datetime format?

Let's first understand the syntax

Syntax

pyspark.sql.functions.to_timestamp(col, format=None)

Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.

Specify formats according to the datetime pattern. By default, it follows casting rules to pyspark.sql.types.TimestampType if the format is omitted.

Equivalent to col.cast("timestamp").
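
The same applies here: a second argument supplies the parsing pattern. A minimal sketch (the sample value and pattern are illustrative assumptions):

from pyspark.sql.functions import to_timestamp, lit
# Parse a day-first timestamp string with an explicit pattern
spark.range(1).select(to_timestamp(lit('15/07/2020 16:52:44'), 'dd/MM/yyyy HH:mm:ss').alias('datetime')).show()
+-------------------+
|           datetime|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+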

Input: Spark data frame with a column containing a timestamp string

df_string = spark.createDataFrame([('2020-07-15 16:52:44',)], ['string_timestamp'])
df_string.show()
+-------------------+
|   string_timestamp|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+

Output: Spark data frame with a timestamp column

from pyspark.sql.functions import to_timestamp
df_datetime = df_string.select(to_timestamp(df_string.string_timestamp).alias('datetime'))
df_datetime.show()
+-------------------+
|           datetime|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_string.toPandas(),df_datetime.toPandas())
Input                      Output
string_timestamp           datetime
2020-07-15 16:52:44        2020-07-15 16:52:44

1c. How to convert a unix timestamp to a string timestamp?

Let's first understand the syntax

Syntax

pyspark.sql.functions.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')

Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the given format.

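The format argument controls the rendered string. A minimal sketch (the result depends on the session time zone; the value below assumes the same time zone as the other examples in this chapter):

from pyspark.sql.functions import from_unixtime, lit
# Render only the date part of a Unix timestamp; the output is time-zone dependent
spark.range(1).select(from_unixtime(lit(1594846364), 'yyyy-MM-dd').alias('date_string')).show()
+-----------+
|date_string|
+-----------+
| 2020-07-15|
+-----------+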

Input: Spark data frame with a numeric column representing Unix time

df_unix = spark.createDataFrame([(1594846364,)], ['unix_time'])
df_unix.show()
+----------+
| unix_time|
+----------+
|1594846364|
+----------+

Output: Spark data frame with a string timestamp column

from pyspark.sql.functions import from_unixtime
df_time = df_unix.select(from_unixtime(df_unix.unix_time).alias("timestamp_string"))
df_time.show(truncate=False)
+-------------------+
|timestamp_string   |
+-------------------+
|2020-07-15 16:52:44|
+-------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_unix.toPandas(),df_time.toPandas())
Input                      Output
unix_time                  timestamp_string
1594846364                 2020-07-15 16:52:44

1d. How to convert a timestamp to a unix timestamp?

Let's first understand the syntax

Syntax

pyspark.sql.functions.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

Converts a time string with the given pattern ('yyyy-MM-dd HH:mm:ss' by default) to a Unix timestamp (in seconds), using the default time zone and the default locale; returns null on failure.

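unix_timestamp can also parse an ordinary string column with a given pattern. A minimal sketch (the result depends on the session time zone; it is 1594831964 when the session time zone is UTC):

from pyspark.sql.functions import unix_timestamp, lit
# Parse a timestamp string into Unix seconds; the result is time-zone dependent
spark.range(1).select(unix_timestamp(lit('2020-07-15 16:52:44'), 'yyyy-MM-dd HH:mm:ss').alias('unix_time')).show()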

Input: Spark data frame consisting of a timestamp column in UTC

from pyspark.sql.functions import current_timestamp
df = spark.range(1).select(current_timestamp().alias("timestamp"))
df.show(truncate=False)
+----------------------+
|timestamp             |
+----------------------+
|2021-03-13 21:42:38.01|
+----------------------+

Output: Spark data frame consisting of a unix timestamp column

from pyspark.sql.functions import unix_timestamp
df_unix = df.select(unix_timestamp(df.timestamp).alias("unixtimestamp"))
df_unix.show(truncate=False)
+-------------+
|unixtimestamp|
+-------------+
|1615689758   |
+-------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df.toPandas(),df_unix.toPandas())
Input                      Output
timestamp                  unixtimestamp
2021-03-13 21:42:38.659    1615689758

1e. How to convert a timestamp in the UTC time zone to a given time zone?

Let's first understand the syntax

Syntax

pyspark.sql.functions.from_utc_timestamp(timestamp, tz)

This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.

However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.

This function may return a confusing result if the input is a string with a time zone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that Spark first casts the string to a timestamp according to the time zone in the string, and then displays the result by converting the timestamp to a string according to the session-local time zone.

Parameters

  • timestamp – the column that contains timestamps

  • tz – A string detailing the time zone ID that the input should be adjusted to. It should be in the format of
    either region-based zone IDs or zone offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in the format ‘(+|-)HH:mm’, for example ‘-08:00’ or ‘+01:00’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended because they can be ambiguous. A region-based ID is shown in the sketch below.
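
For example, a region-based zone ID with a fixed input (a sketch; America/New_York observes UTC-4 in July):

from pyspark.sql.functions import from_utc_timestamp, lit
# 16:52:44 UTC rendered in America/New_York (EDT, UTC-4)
spark.range(1).select(from_utc_timestamp(lit('2020-07-15 16:52:44'), 'America/New_York').alias('ny_time')).show()
+-------------------+
|            ny_time|
+-------------------+
|2020-07-15 12:52:44|
+-------------------+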

Input: Spark data frame consisting of a timestamp column in UTC time zone

# Please note that the timestamp below is in UTC.
df_utc = spark.range(1).select(current_timestamp().alias("timestamp_UTC"))
df_utc.show(truncate=False)
+-----------------------+
|timestamp_UTC          |
+-----------------------+
|2021-03-13 21:42:38.959|
+-----------------------+

Output: Spark data frame consisting of a timestamp column in a different time zone

from pyspark.sql.functions import from_utc_timestamp
df_tz = df_utc.select(from_utc_timestamp(df_utc.timestamp_UTC,'UTC-4').alias("timestamp in local TZ"))
df_tz.show(truncate=False)
+-----------------------+
|timestamp in local TZ  |
+-----------------------+
|2021-03-13 17:42:39.213|
+-----------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_utc.toPandas(),df_tz.toPandas())
Input                      Output
timestamp_UTC              timestamp in local TZ
2021-03-13 21:42:39.362    2021-03-13 17:42:39.487

1f. How to convert from a given time zone to a UTC timestamp?

Let's first understand the syntax

Syntax

pyspark.sql.functions.to_utc_timestamp(timestamp, tz)

This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, interprets it as a timestamp in the given time zone, and renders that timestamp as a timestamp in UTC.

However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the given time zone to the UTC time zone.

This function may return a confusing result if the input is a string with a time zone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that Spark first casts the string to a timestamp according to the time zone in the string, and then displays the result by converting the timestamp to a string according to the session-local time zone.

Parameters

  • timestamp – the column that contains timestamps

  • tz – A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form ‘area/city’, such as ‘America/Los_Angeles’. Zone offsets must be in the format ‘(+|-)HH:mm’, for example ‘-08:00’ or ‘+01:00’. Also ‘UTC’ and ‘Z’ are supported as aliases of ‘+00:00’. Other short names are not recommended to use because they can be ambiguous.
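
By symmetry, a region-based zone ID works here as well (a sketch with a fixed input; America/New_York observes UTC-4 in July):

from pyspark.sql.functions import to_utc_timestamp, lit
# 12:52:44 New York time (EDT, UTC-4) expressed in UTC
spark.range(1).select(to_utc_timestamp(lit('2020-07-15 12:52:44'), 'America/New_York').alias('utc_time')).show()
+-------------------+
|           utc_time|
+-------------------+
|2020-07-15 16:52:44|
+-------------------+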

Input: Spark data frame consisting of a timestamp column in the local time zone

# Please note that the local time zone is EST.
df_est = spark.range(1).select(current_timestamp().alias("timestamp"))
df_est.show(truncate=False)
+-----------------------+
|timestamp              |
+-----------------------+
|2021-03-13 21:42:39.695|
+-----------------------+

Output: Spark data frame consisting of a timestamp column converted to UTC

from pyspark.sql.functions import to_utc_timestamp
df_utc = df_est.select(to_utc_timestamp(df_est.timestamp,'GMT-4').alias("timestamp in UTC"))
df_utc.show(truncate=False)
+-----------------------+
|timestamp in UTC       |
+-----------------------+
|2021-03-14 01:42:40.042|
+-----------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_est.toPandas(),df_utc.toPandas())
Input                      Output
timestamp                  timestamp in UTC
2021-03-13 21:42:40.310    2021-03-14 01:42:40.500

1g. How to convert the date format?

Let's first understand the syntax

Syntax

pyspark.sql.functions.date_format(date, format)

Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

A pattern could be, for instance, dd.MM.yyyy, and could return a string like ‘18.03.1993’. All pattern letters of the datetime pattern can be used.

Parameters:

  • date – the date/timestamp/string column to format
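
Letter patterns work too, e.g. ‘EEEE’ renders the weekday name. A quick sketch (assuming an English locale; 2020-07-15 fell on a Wednesday):

from pyspark.sql.functions import date_format, lit
# Render the weekday name of a fixed date
spark.range(1).select(date_format(lit('2020-07-15'), 'EEEE').alias('weekday')).show()
+---------+
|  weekday|
+---------+
|Wednesday|
+---------+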

Input: Spark data frame with a string date column

df_format1 = spark.createDataFrame([('2020-07-15',)], ['date_format1'])
df_format1.show()
+------------+
|date_format1|
+------------+
|  2020-07-15|
+------------+

Output: Spark data frame with a formatted date

from pyspark.sql.functions import date_format
df_format2 = df_format1.select(date_format('date_format1', 'MM/dd/yyyy').alias('date_format2'))
df_format2.show()
+------------+
|date_format2|
+------------+
|  07/15/2020|
+------------+

Summary:

print("input                     ",            "output")
display_side_by_side(df_format1.toPandas(),df_format2.toPandas())
input                      output
date_format1               date_format2
2020-07-15                 07/15/2020

2a. How to find the number of days between 2 dates?

Let's first understand the syntax

Syntax

pyspark.sql.functions.datediff(end, start)

Returns the number of days from start to end.

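The argument order matters: swapping end and start flips the sign. A quick sketch:

from pyspark.sql.functions import datediff, lit
# end earlier than start yields a negative day count
spark.range(1).select(datediff(lit('2020-06-10'), lit('2020-07-15')).alias('diff')).show()
+----+
|diff|
+----+
| -35|
+----+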

Input: Spark data frame consisting of date columns

df_dates = spark.createDataFrame([('2020-07-15','2020-06-10')], ['date1', 'date2'])
df_dates.show()
+----------+----------+
|     date1|     date2|
+----------+----------+
|2020-07-15|2020-06-10|
+----------+----------+

Output: Spark data frame with the difference in days

from pyspark.sql.functions import datediff
df_diff = df_dates.select(datediff(df_dates.date1,df_dates.date2).alias('diff'))
df_diff.show()
+----+
|diff|
+----+
|  35|
+----+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_dates.toPandas(),df_diff.toPandas())
Input                      Output
date1       date2          diff
2020-07-15  2020-06-10     35

2b. How to find the number of months between 2 dates?

Let's first understand the syntax

Syntax

pyspark.sql.functions.months_between(date1, date2, roundOff=True)

Returns the number of months between dates date1 and date2. If date1 is later than date2, the result is positive. If date1 and date2 are on the same day of the month, or both are the last day of the month, an integer is returned (the time of day is ignored). The result is rounded off to 8 digits unless roundOff is set to False.

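Passing roundOff=False returns the value at full precision instead of rounding it to 8 digits. A quick sketch (the rounded counterpart appears in the example below):

from pyspark.sql.functions import months_between, lit
# Full-precision month difference between two fixed dates
spark.range(1).select(months_between(lit('2020-07-15 11:32:00'), lit('2020-01-30'), roundOff=False).alias('months')).show(truncate=False)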

Input: Spark data frame with date columns

df_month = spark.createDataFrame([('2020-07-15 11:32:00', '2020-01-30')], ['date1', 'date2'])
df_month.show()
+-------------------+----------+
|              date1|     date2|
+-------------------+----------+
|2020-07-15 11:32:00|2020-01-30|
+-------------------+----------+

Output: Spark data frame with the months between 2 dates

from pyspark.sql.functions import months_between
df_month_diff = df_month.select(months_between(df_month.date1, df_month.date2))
df_month_diff.show()
+----------------------------------+
|months_between(date1, date2, true)|
+----------------------------------+
|                        5.53163082|
+----------------------------------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_month.toPandas(),df_month_diff.toPandas())
Input                                    Output
date1                date2               months_between(date1, date2, true)
2020-07-15 11:32:00  2020-01-30          5.531631

2c. How to add a number of days to a date?

Let's first understand the syntax

Syntax

pyspark.sql.functions.date_add(start, days)

Returns the date that is the given number of days after start.
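
A negative offset steps backwards, making date_add a drop-in alternative to date_sub. A quick sketch:

from pyspark.sql.functions import date_add, lit
# Adding -1 day moves one day into the past
spark.range(1).select(date_add(lit('2020-07-15'), -1).alias('prev_day')).show()
+----------+
|  prev_day|
+----------+
|2020-07-14|
+----------+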

Input: Spark data frame with a string date column

df_today = spark.createDataFrame([('2020-07-15',)], ['date'])
df_today.show()
+----------+
|      date|
+----------+
|2020-07-15|
+----------+

Output: Spark data frame with a new date

from pyspark.sql.functions import date_add
df_tomo = df_today.select(date_add(df_today.date, 1).alias('next_day'))
df_tomo.show()
+----------+
|  next_day|
+----------+
|2020-07-16|
+----------+

Summary:

print("input                     ",            "output")
display_side_by_side(df_today.toPandas(),df_tomo.toPandas())
input                      output
date                       next_day
2020-07-15                 2020-07-16

2d. How to subtract a number of days from a date?

Let's first understand the syntax

Syntax

pyspark.sql.functions.date_sub(start, days)

Returns the date that is the given number of days before start.

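Any day count works, for example stepping back a full week. A quick sketch:

from pyspark.sql.functions import date_sub, lit
# Subtract a week from a fixed date
spark.range(1).select(date_sub(lit('2020-07-15'), 7).alias('week_ago')).show()
+----------+
|  week_ago|
+----------+
|2020-07-08|
+----------+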

Input: Spark data frame with a date column

df_date = spark.createDataFrame([('2020-07-15',)], ['date'])
df_date.show()
+----------+
|      date|
+----------+
|2020-07-15|
+----------+

Output: Spark data frame with a date column

from pyspark.sql.functions import date_sub
df_sub = df_date.select(date_sub(df_date.date, 1).alias('yesterday'))
df_sub.show()
+----------+
| yesterday|
+----------+
|2020-07-14|
+----------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df_date.toPandas(),df_sub.toPandas())
Input                      Output
date                       yesterday
2020-07-15                 2020-07-14

2e. How to add a number of months to a date?

Let's first understand the syntax

Syntax

pyspark.sql.functions.add_months(start, months)

Returns the date that is the given number of months after start.
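
When the target month is shorter, the result is clamped to its last day. A quick sketch (2020 was a leap year):

from pyspark.sql.functions import add_months, lit
# One month after Jan 31, 2020 is clamped to Feb 29
spark.range(1).select(add_months(lit('2020-01-31'), 1).alias('clamped')).show()
+----------+
|   clamped|
+----------+
|2020-02-29|
+----------+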

Input: Spark data frame consisting of a date column

df = spark.createDataFrame([('2020-07-15',)], ['date'])
df.show()
+----------+
|      date|
+----------+
|2020-07-15|
+----------+

Output: Spark data frame consisting of a date column

from pyspark.sql.functions import add_months
df1 = df.select(add_months(df.date, 1).alias('next_month'))
df1.show()
+----------+
|next_month|
+----------+
|2020-08-15|
+----------+

Summary:

print("Input                     ",            "Output")
display_side_by_side(df.toPandas(),df1.toPandas())
Input                      Output
date                       next_month
2020-07-15                 2020-08-15

3. How to extract the year, month, day, hour, minute, second, weekday, and week/quarter of the year from a datetime column?

Input: Spark data frame consisting of a datetime column

df = spark.range(1).select(current_timestamp().alias("timestamp"))
df.show(truncate=False)
+-----------------------+
|timestamp              |
+-----------------------+
|2021-03-13 21:42:44.661|
+-----------------------+

Output: Spark data frame consisting of the extracted parts as separate columns

from pyspark.sql.functions import (year, month, dayofmonth, dayofweek, dayofyear,
                                   hour, minute, second, weekofyear, quarter)
df2 = df.select(df.timestamp,
                year(df.timestamp).alias("year"),
                month(df.timestamp).alias("month"),
                dayofmonth(df.timestamp).alias("dayofmonth"),
                dayofweek(df.timestamp).alias("dayofweek"),
                dayofyear(df.timestamp).alias("dayofyear"),
                hour(df.timestamp).alias("hour"),
                minute(df.timestamp).alias("minute"),
                second(df.timestamp).alias("second"),
                weekofyear(df.timestamp).alias("weekofyear"),
                quarter(df.timestamp).alias("quarter"))
df2.show(truncate=False)
+-----------------------+----+-----+----------+---------+---------+----+------+------+----------+-------+
|timestamp              |year|month|dayofmonth|dayofweek|dayofyear|hour|minute|second|weekofyear|quarter|
+-----------------------+----+-----+----------+---------+---------+----+------+------+----------+-------+
|2021-03-13 21:42:44.997|2021|3    |13        |7        |72       |21  |42    |44    |10        |1      |
+-----------------------+----+-----+----------+---------+---------+----+------+------+----------+-------+

Summary:

print("Input                            ",            "Output")
display_side_by_side(df.toPandas(),df2.toPandas())
Input
timestamp
2021-03-13 21:42:45.180

Output
timestamp                year  month  dayofmonth  dayofweek  dayofyear  hour  minute  second  weekofyear  quarter
2021-03-13 21:42:45.282  2021  3      13          7          72         21    42      45      10          1
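
Since current_timestamp changes on every run, the same extraction on a fixed timestamp gives reproducible results. A quick sketch (dayofweek counts from Sunday = 1, so a Wednesday returns 4):

from pyspark.sql.functions import to_timestamp, year, month, dayofweek, lit
# Extract fields from a fixed timestamp; 2020-07-15 was a Wednesday
df_fixed = spark.range(1).select(to_timestamp(lit('2020-07-15 16:52:44')).alias('ts'))
df_fixed.select(year('ts').alias('year'), month('ts').alias('month'), dayofweek('ts').alias('dayofweek')).show()
+----+-----+---------+
|year|month|dayofweek|
+----+-----+---------+
|2020|    7|        4|
+----+-----+---------+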