Resilient Distributed Datasets (RDDs) are the core abstraction in PySpark, offering fault-tolerant, distributed data structures that can be operated on in parallel. Although the DataFrame API is more popular due to its higher-level abstractions, RDDs are still fundamental for certain low-level operations and are the building blocks of PySpark.
In this article, you’ll learn how to create RDDs in PySpark, the different ways to create them, and when you should use RDDs over DataFrames.