Setting up Apache Spark on a Windows machine can be a straightforward process if you follow the right steps. This guide will walk you through installing Java, configuring environment variables, downloading and setting up Spark, and finally running Spark on your Windows system. Let’s get started!
Step 1: Install Java -> Before installing Spark, you need to have Java installed on your system, since Spark runs on the Java Virtual Machine (JVM). Follow these steps to set up Java:
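a) Download Java: Go to the Java download page (https://www.oracle.com/java/technologies/downloads/) and download a Java Development Kit (JDK) release for Windows that your Spark version supports. Spark 3.5.x runs on Java 8, 11, and 17, so the newest JDK on the page may not be a supported choice.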
b) Install Java: Run the downloaded installer and follow the installation instructions. By default, Java will be installed at C:\Program Files\Java\jdk-<version>.
c) Set Up Environment Variables for Java:
- i) JAVA_HOME: Set this to the installation directory, i.e., C:\Program Files\Java\jdk-<version>
- ii) Path: Add C:\Program Files\Java\jdk-<version>\bin to your system’s PATH environment variable.
- iii) Verify Java Installation: Open a new Command Prompt and type java -version (with a plain hyphen). You should see the Java version information if the installation is correct.
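The version check above can also be scripted. The following is a minimal Python sketch, not part of the original setup: the function names java_major_version and read_java_version_output are hypothetical, and the set of supported Java releases is an assumption based on Spark 3.5.x (adjust it for other Spark versions).

```python
import re
import shutil
import subprocess
from typing import Optional

# Assumption: Java releases supported by Spark 3.5.x; adjust for other Spark versions.
SUPPORTED_JAVA_MAJORS = {8, 11, 17}


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = int(match.group(1)), match.group(2)
    # Java 8 and older report "1.8.0_xxx"; newer releases report e.g. "17.0.12".
    if first == 1 and second is not None:
        return int(second)
    return first


def read_java_version_output() -> Optional[str]:
    """Run `java -version`; return its output, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its report to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return result.stderr
```

To use it, pass the result of read_java_version_output() to java_major_version() and check whether the returned number is in SUPPORTED_JAVA_MAJORS. A result of None from either function suggests that Java is not on PATH or that the output had an unexpected format.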
Step 2: Download and Set Up Apache Spark -> Now that Java is installed, let’s proceed with Spark installation:
a) Download Apache Spark: Visit the Apache Spark downloads page (https://spark.apache.org/downloads.html) and download the latest version of Spark (Spark 3.5.2 at the time of writing). Extract the downloaded file to a directory of your choice, for example, D:\spark_setup\spark-3.5.2.
b) Set Up Environment Variables for Spark:
- i) SPARK_HOME: Set this to your Spark directory, i.e., D:\spark_setup\spark-3.5.2.
- ii) Path: Add %SPARK_HOME%\bin to your system’s PATH environment variable.
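A quick way to confirm that both variables took effect is to compare them in a small script. The sketch below is illustrative only: check_home_on_path is a hypothetical helper, and it assumes Windows-style paths separated by semicolons. It compares strings only and does not touch the file system.

```python
import ntpath
from typing import List, Mapping


def _norm(entry: str) -> str:
    """Normalize a Windows path for comparison (case, slashes, trailing separators)."""
    return ntpath.normcase(ntpath.normpath(entry.strip().strip('"')))


def check_home_on_path(env: Mapping[str, str], home_var: str) -> List[str]:
    """Return a list of problems; an empty list means <home_var>\\bin is on PATH."""
    # Windows variable names are case-insensitive, so look them up in upper case.
    upper = {key.upper(): value for key, value in env.items()}
    home = upper.get(home_var.upper(), "").strip()
    if not home:
        return [f"{home_var} is not set"]
    expected = _norm(ntpath.join(home, "bin"))
    entries = {_norm(e) for e in upper.get("PATH", "").split(";") if e.strip()}
    if expected not in entries:
        return [f"{ntpath.join(home, 'bin')} is not listed in PATH"]
    return []
```

In a freshly opened Command Prompt session, calling check_home_on_path(os.environ, "SPARK_HOME") (after importing os) should return an empty list. The same helper can be reused for any other variable whose bin folder must be on PATH.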
Step 3: Set Up WinUtils -> Spark requires some Hadoop binaries to run on Windows, even though we won’t be using Hadoop itself. winutils.exe is one such binary; it allows Spark to interact with the Windows file system:
a) Download WinUtils: Download winutils.exe from this GitHub repository (https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin/winutils.exe). Place winutils.exe in a directory, for example, D:\spark_setup\hadoop\bin.
b) Set Up Environment Variables for Hadoop:
- i) HADOOP_HOME: Set this to the directory that contains the bin folder holding winutils.exe, i.e., D:\spark_setup\hadoop (not the bin folder itself).
- ii) Path: Add %HADOOP_HOME%\bin to your system’s PATH environment variable.
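A common slip in this step is pointing HADOOP_HOME at the bin folder itself, or dropping winutils.exe one level too high. The following Python sketch checks the layout on disk. The helper name diagnose_hadoop_home and its three return values are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def diagnose_hadoop_home(hadoop_home: str) -> str:
    """Classify the winutils.exe layout under the given HADOOP_HOME directory.

    Returns "ok", "misplaced", or "missing" (labels invented for this sketch).
    """
    home = Path(hadoop_home)
    # Expected layout: <HADOOP_HOME>\bin\winutils.exe
    if (home / "bin" / "winutils.exe").is_file():
        return "ok"
    # winutils.exe sits directly in HADOOP_HOME: either the file is one level
    # too high, or HADOOP_HOME was set to the bin folder instead of its parent.
    if (home / "winutils.exe").is_file():
        return "misplaced"
    return "missing"
```

For the example layout in this guide, diagnose_hadoop_home(r"D:\spark_setup\hadoop") would be expected to return "ok" once winutils.exe is in place.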
Step 4: Verify the Spark Installation -> To verify that Spark has been set up correctly:
a) Open Command Prompt: Open a new Command Prompt window so that it picks up the updated environment variables, then type spark-shell and press Enter. If everything is set up correctly, you should see the Spark shell starting up, with a welcome message indicating the Spark version.
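If spark-shell is reported as not recognized, it helps to check whether the launcher scripts are visible on PATH at all. The sketch below is a rough illustration using Python's standard library. The helper names locate_launcher and spark_version_banner are hypothetical, and the assumption that spark-submit --version prints its banner to stderr should be verified on your own installation.

```python
import shutil
import subprocess
from typing import Optional


def locate_launcher(name: str = "spark-shell",
                    search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of a Spark launcher script, or None if not on PATH.

    On Windows, shutil.which also tries the PATHEXT extensions, so a lookup
    for "spark-shell" can resolve to spark-shell.cmd.
    """
    return shutil.which(name, path=search_path)


def spark_version_banner() -> Optional[str]:
    """Run `spark-submit --version` and return whatever it prints."""
    launcher = locate_launcher("spark-submit")
    if launcher is None:
        return None
    result = subprocess.run([launcher, "--version"], capture_output=True, text=True)
    # Assumption: the banner goes to stderr; fall back to stdout otherwise.
    return result.stderr or result.stdout
```

A None result from locate_launcher() usually points back to the PATH entry from Step 2, while a found launcher with a failing banner suggests a problem with Java or the Spark files themselves.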
Additional Resources:
For a detailed visual guide on setting up Spark, you can refer to this YouTube video (https://www.youtube.com/watch?v=FIXanNPvBXM&t=312s).
For a detailed visual guide on setting up Java, you can refer to this YouTube video (https://www.youtube.com/watch?v=SQykK40fFds).
By following these steps, you should have a fully functional Spark setup on your Windows machine. Now, you can start working on your data processing tasks using Spark!