PySpark | How to Split a Single Column into Multiple Columns?

When working with data, you often encounter a single column whose values need to be split into multiple columns for easier analysis or processing. PySpark provides a flexible way to achieve this using the split() function.

In this article, we’ll cover how to split a single column into multiple columns in a PySpark DataFrame with practical examples.

Methods to Split a Column: PySpark’s split() function from the pyspark.sql.functions module is commonly used for this purpose. Below are detailed explanations and examples for splitting columns.

Example DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Create a Spark session
spark = SparkSession.builder.appName("Split Column").getOrCreate()

# Sample data
data = [("John Doe", "john.doe@example.com"),
        ("Jane Smith", "jane.smith@example.com"),
        ("Alice Brown", "alice.brown@example.com")]

# Define columns
columns = ["FullName", "Email"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Display the original DataFrame
print("Original DataFrame:")
df.show(truncate=False)

Output:

+-----------+-----------------------+
|FullName   |Email                  |
+-----------+-----------------------+
|John Doe   |john.doe@example.com   |
|Jane Smith |jane.smith@example.com |
|Alice Brown|alice.brown@example.com|
+-----------+-----------------------+

1: Using the split() function with the DataFrame withColumn() method: The split() function splits a string column into an array column, using a delimiter that is interpreted as a regular expression. You can then extract individual array elements as new columns.

Example 1: Splitting FullName into FirstName and LastName.

# Split FullName into FirstName and LastName
df_split = df.withColumn("FirstName", split(df["FullName"], " ")[0]) \
             .withColumn("LastName", split(df["FullName"], " ")[1])

# Display the result
print("After Splitting FullName:")
df_split.show(truncate=False)

Output:

+-----------+-----------------------+---------+--------+
|FullName   |Email                  |FirstName|LastName|
+-----------+-----------------------+---------+--------+
|John Doe   |john.doe@example.com   |John     |Doe     |
|Jane Smith |jane.smith@example.com |Jane     |Smith   |
|Alice Brown|alice.brown@example.com|Alice    |Brown   |
+-----------+-----------------------+---------+--------+

Instead of indexing with [0] and [1], you can call getItem(0) and getItem(1) on the array column. Both forms produce the same result.

# Split FullName into FirstName and LastName using getItem()
df_split_1 = df.withColumn("FirstName", split(df["FullName"], " ").getItem(0)) \
               .withColumn("LastName", split(df["FullName"], " ").getItem(1))

# Display the result
print("After Splitting FullName:")
df_split_1.show(truncate=False)

Output:

+-----------+-----------------------+---------+--------+
|FullName   |Email                  |FirstName|LastName|
+-----------+-----------------------+---------+--------+
|John Doe   |john.doe@example.com   |John     |Doe     |
|Jane Smith |jane.smith@example.com |Jane     |Smith   |
|Alice Brown|alice.brown@example.com|Alice    |Brown   |
+-----------+-----------------------+---------+--------+

Example 2: Splitting Email into Username and Domain.

# Split Email into Username and Domain
df_email_split = df.withColumn("Username", split(df["Email"], "@")[0]) \
                   .withColumn("Domain", split(df["Email"], "@")[1])

# Display the result
print("After Splitting Email:")
df_email_split.show(truncate=False)

Output:

+-----------+-----------------------+-----------+-----------+
|FullName   |Email                  |Username   |Domain     |
+-----------+-----------------------+-----------+-----------+
|John Doe   |john.doe@example.com   |john.doe   |example.com|
|Jane Smith |jane.smith@example.com |jane.smith |example.com|
|Alice Brown|alice.brown@example.com|alice.brown|example.com|
+-----------+-----------------------+-----------+-----------+

2: Using the split() function with the DataFrame selectExpr() method: selectExpr() accepts SQL expression strings, so with this approach you do not need to import split from the pyspark.sql.functions module.

Example: Splitting FullName into FirstName and LastName.

# Using selectExpr to split columns
df_expr_split = df.selectExpr(
    "FullName", 
    "Email", 
    "split(FullName, ' ')[0] as FirstName", 
    "split(FullName, ' ')[1] as LastName"
)

# Display the result
print("Using selectExpr to Split Columns:")
df_expr_split.show(truncate=False)

Output:

+-----------+-----------------------+---------+--------+
|FullName   |Email                  |FirstName|LastName|
+-----------+-----------------------+---------+--------+
|John Doe   |john.doe@example.com   |John     |Doe     |
|Jane Smith |jane.smith@example.com |Jane     |Smith   |
|Alice Brown|alice.brown@example.com|Alice    |Brown   |
+-----------+-----------------------+---------+--------+

Conclusion: Splitting a column into multiple columns is a common operation, and PySpark’s split() function makes it straightforward, whether you’re splitting names, email addresses, or any other composite column.
