PySpark | How to Perform Data Type Casting on Columns in a DataFrame?

When working with data in PySpark, ensuring the correct data type for each column is essential for accurate analysis and processing. Sometimes, the data types of columns do not match your requirements. For example, a column containing numeric data might be stored as a string, or dates may be stored as plain text in an incorrect format.

To handle such situations, PySpark provides a method to cast (or convert) columns to the desired data type. In this article, we will explore how to perform data type casting on PySpark DataFrame columns.

PySpark Data Types:

PySpark supports a variety of data types, including:

  • Primitive Types: IntegerType, StringType, FloatType, DoubleType, BooleanType, DateType, TimestampType.
  • Complex Types: ArrayType, MapType, StructType.
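
All of these classes live in the pyspark.sql.types module. As a quick orientation, here is a minimal sketch (the field names are purely illustrative) of how primitive and complex types are imported and combined into a schema:

from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType,
    FloatType, ArrayType, MapType
)

# Illustrative schema mixing primitive and complex types
example_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("scores", ArrayType(FloatType()), True),
    StructField("attributes", MapType(StringType(), StringType()), True),
])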

Methods for Data Type Casting:

In PySpark, you can cast columns to a different type using:

  • withColumn() and cast()
  • SQL Expressions

Example: Casting Data Types

Step 1: Create a Sample DataFrame: Here’s a simple DataFrame in which numeric and date values are all stored as strings:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, IntegerType, FloatType

# Initialize Spark session
spark = SparkSession.builder.appName("Data Type Casting").getOrCreate()

# Sample data
data = [("1", "25.5", "2023-10-10"),
        ("2", "30.0", "2024-01-01"),
        ("3", "35.7", "2024-05-05")]
columns = ["ID", "Score", "Date"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Show the original DataFrame
df.show()
df.printSchema()

Output:

+---+-----+----------+
| ID|Score|      Date|
+---+-----+----------+
|  1| 25.5|2023-10-10|
|  2| 30.0|2024-01-01|
|  3| 35.7|2024-05-05|
+---+-----+----------+

root
 |-- ID: string (nullable = true)
 |-- Score: string (nullable = true)
 |-- Date: string (nullable = true)

Here, all columns are of type string. Let’s cast them to the appropriate types.

Step 2: Cast Columns Using withColumn and cast: You can use the withColumn() method along with the cast() function to convert data types.

from pyspark.sql.functions import col

# Cast the columns to appropriate data types
casted_df = df.withColumn("ID", col("ID").cast(IntegerType())) \
              .withColumn("Score", col("Score").cast(FloatType())) \
              .withColumn("Date", col("Date").cast("date"))

# Show the updated DataFrame
casted_df.show()
casted_df.printSchema()

Output:

+---+-----+----------+
| ID|Score|      Date|
+---+-----+----------+
|  1| 25.5|2023-10-10|
|  2| 30.0|2024-01-01|
|  3| 35.7|2024-05-05|
+---+-----+----------+

root
 |-- ID: integer (nullable = true)
 |-- Score: float (nullable = true)
 |-- Date: date (nullable = true)

Explanation:

  • The ID column is cast to IntegerType.
  • The Score column is cast to FloatType.
  • The Date column is cast to DateType using the string shorthand "date" (see the note below).
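
Note that cast() accepts either a DataType object (as with IntegerType() and FloatType() above) or the type's SQL name as a string (as with "date"). A minimal sketch of the two equivalent forms, plus a caveat worth remembering:

# Equivalent ways to cast the ID column to an integer
df.withColumn("ID", col("ID").cast(IntegerType()))
df.withColumn("ID", col("ID").cast("int"))

# With Spark's default (non-ANSI) settings, a value that cannot be
# converted (e.g. the string "abc" cast to int) becomes null rather
# than raising an error, so check for nulls after casting.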

Step 3: Cast Columns Using SQL Expressions: You can also use SQL-style expressions for type casting.

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("table")

# Use SQL to cast the columns
sql_casted_df = spark.sql("""
    SELECT 
        CAST(ID AS INT) AS ID, 
        CAST(Score AS FLOAT) AS Score, 
        CAST(Date AS DATE) AS Date
    FROM table
""")

# Show the updated DataFrame
sql_casted_df.show()
sql_casted_df.printSchema()

Output:

+---+-----+----------+
| ID|Score|      Date|
+---+-----+----------+
|  1| 25.5|2023-10-10|
|  2| 30.0|2024-01-01|
|  3| 35.7|2024-05-05|
+---+-----+----------+

root
 |-- ID: integer (nullable = true)
 |-- Score: float (nullable = true)
 |-- Date: date (nullable = true)
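
If you prefer SQL-style casts without registering a temporary view, the same expressions can be passed to selectExpr(). A minimal sketch equivalent to the query above:

# Apply SQL CAST expressions directly on the DataFrame
expr_casted_df = df.selectExpr(
    "CAST(ID AS INT) AS ID",
    "CAST(Score AS FLOAT) AS Score",
    "CAST(Date AS DATE) AS Date"
)

expr_casted_df.printSchema()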

Conclusion: Data type casting is a critical step in cleaning and preparing your data in PySpark. With methods like withColumn and SQL expressions, you can easily convert columns to the desired type for accurate processing and analysis.
