When working with data, you often encounter scenarios where a single column contains values that need to be split into multiple columns for easier analysis or processing. PySpark provides a flexible way to achieve this with its split() function.
In this article, we’ll cover how to split a single column into multiple columns in a PySpark DataFrame with practical examples.
Methods to Split a Column: PySpark’s split() function from the pyspark.sql.functions module is commonly used for this purpose. Below are detailed explanations and examples for splitting columns.
Example DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Create a Spark session
spark = SparkSession.builder.appName("Split Column").getOrCreate()

# Sample data
data = [("John Doe", "john.doe@example.com"),
        ("Jane Smith", "jane.smith@example.com"),
        ("Alice Brown", "alice.brown@example.com")]

# Define columns
columns = ["FullName", "Email"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Display the original DataFrame
print("Original DataFrame:")
df.show(truncate=False)
Output:
+-----------+-----------------------+
|FullName   |Email                  |
+-----------+-----------------------+
|John Doe   |john.doe@example.com   |
|Jane Smith |jane.smith@example.com |
|Alice Brown|alice.brown@example.com|
+-----------+-----------------------+
1: Using the split() function with the withColumn() method: The split() function splits a string column into an array based on a specified delimiter (interpreted as a regular expression). You can then extract array elements as new columns.
Example 1: Splitting FullName into FirstName and LastName.
# Split FullName into FirstName and LastName
df_split = df.withColumn("FirstName", split(df["FullName"], " ")[0]) \
             .withColumn("LastName", split(df["FullName"], " ")[1])

# Display the result
print("After Splitting FullName:")
df_split.show(truncate=False)
Output:
+-----------+-----------------------+---------+--------+
|FullName   |Email                  |FirstName|LastName|
+-----------+-----------------------+---------+--------+
|John Doe   |john.doe@example.com   |John     |Doe     |
|Jane Smith |jane.smith@example.com |Jane     |Smith   |
|Alice Brown|alice.brown@example.com|Alice    |Brown   |
+-----------+-----------------------+---------+--------+
Alternatively, instead of indexing with [0] and [1], you can use the getItem(0) and getItem(1) method of the array column.
# Split FullName into FirstName and LastName using getItem()
df_split_1 = df.withColumn("FirstName", split(df["FullName"], " ").getItem(0)) \
               .withColumn("LastName", split(df["FullName"], " ").getItem(1))

# Display the result
print("After Splitting FullName:")
df_split_1.show(truncate=False)
Output:
+-----------+-----------------------+---------+--------+
|FullName   |Email                  |FirstName|LastName|
+-----------+-----------------------+---------+--------+
|John Doe   |john.doe@example.com   |John     |Doe     |
|Jane Smith |jane.smith@example.com |Jane     |Smith   |
|Alice Brown|alice.brown@example.com|Alice    |Brown   |
+-----------+-----------------------+---------+--------+
Example 2: Extracting Domain from Email.
# Split Email into Username and Domain
df_email_split = df.withColumn("Username", split(df["Email"], "@")[0]) \
                   .withColumn("Domain", split(df["Email"], "@")[1])

# Display the result
print("After Splitting Email:")
df_email_split.show(truncate=False)
Output:
+-----------+-----------------------+-----------+-----------+
|FullName   |Email                  |Username   |Domain     |
+-----------+-----------------------+-----------+-----------+
|John Doe   |john.doe@example.com   |john.doe   |example.com|
|Jane Smith |jane.smith@example.com |jane.smith |example.com|
|Alice Brown|alice.brown@example.com|alice.brown|example.com|
+-----------+-----------------------+-----------+-----------+
2: Using the split function with the selectExpr() method: With this approach you don’t need to import split() from the pyspark.sql.functions module, because the expression is parsed as Spark SQL.
Example: Splitting FullName into FirstName and LastName.
# Using selectExpr to split columns
df_expr_split = df.selectExpr(
    "FullName",
    "Email",
    "split(FullName, ' ')[0] as FirstName",
    "split(FullName, ' ')[1] as LastName"
)

# Display the result
print("Using selectExpr to Split Columns:")
df_expr_split.show(truncate=False)
Output:
+-----------+-----------------------+---------+--------+
|FullName   |Email                  |FirstName|LastName|
+-----------+-----------------------+---------+--------+
|John Doe   |john.doe@example.com   |John     |Doe     |
|Jane Smith |jane.smith@example.com |Jane     |Smith   |
|Alice Brown|alice.brown@example.com|Alice    |Brown   |
+-----------+-----------------------+---------+--------+
Conclusion: Splitting a column into multiple columns is a common operation, and PySpark’s split() function makes it easy. Whether you’re splitting names, email addresses, or any other composite column, split() combined with withColumn() or selectExpr() gets the job done in a few lines of code.