Interview Questions on Spark Architecture

Gaurav
Apr 6, 2024

What is PySpark and how does it relate to Apache Spark?

PySpark is the Python API for Apache Spark, a powerful open-source distributed computing system. It enables Python developers to leverage Spark’s capabilities using Python syntax, making it accessible to Python-centric workflows. PySpark facilitates data processing, machine learning, graph processing, and more, on large datasets distributed across a cluster of machines.
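
For illustration, a minimal end-to-end sketch of the Python-style API (the file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session from Python.
spark = SparkSession.builder.appName("pyspark-intro").getOrCreate()

# Hypothetical CSV file and columns, just to show the workflow.
df = spark.read.csv("/tmp/sales.csv", header=True, inferSchema=True)
df.groupBy("region").count().show()

spark.stop()
```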

Explain the architecture of Apache Spark.

Apache Spark architecture comprises several components (a configuration sketch follows the list):

  • Driver Program: Coordinates the execution of the Spark application. It runs the user’s main function, negotiates resources with the cluster manager, and schedules tasks on the executors.
  • Cluster Manager: Manages cluster resources such as memory and CPU cores and allocates them to Spark applications (for example Standalone, YARN, Mesos, or Kubernetes).
  • Worker Nodes: Host the executor processes that run the tasks assigned by the driver and hold data in memory or on disk.
  • Executors: Processes launched on worker nodes to execute tasks. They report status and results back to the driver and manage cached data in memory or on disk.
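
A hedged sketch of how these components show up in application configuration; the resource values are arbitrary examples, and the actual allocation depends on the cluster manager:

```python
from pyspark.sql import SparkSession

# The driver program runs this code. The settings below ask the cluster
# manager for two executors, each with 2 cores and 2 GB of memory
# (values are illustrative only).
spark = (
    SparkSession.builder
    .appName("architecture-demo")
    .config("spark.executor.instances", "2")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

# Tasks for this job are scheduled by the driver and run on the executors.
print(spark.range(1_000_000).selectExpr("sum(id) AS total").collect())
spark.stop()
```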

What is a SparkContext?

SparkContext is the entry point for any Spark functionality in a PySpark application. It represents the connection to the Spark cluster and coordinates the execution of operations. SparkContext is responsible for distributing the code and data across the cluster and managing resources during execution.
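
A small sketch of obtaining and using a SparkContext; in modern PySpark it is usually taken from an existing SparkSession rather than constructed directly:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkcontext-demo").getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext

# Distribute a local collection across the cluster and compute on it.
rdd = sc.parallelize(range(10), numSlices=4)
print(rdd.map(lambda x: x * x).sum())  # 285
```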

What is a DataFrame in PySpark?

DataFrame in PySpark is a distributed collection of data organized into named columns, similar to a table in a relational database or a spreadsheet. It provides a high-level abstraction for data manipulation operations and integrates seamlessly with Python libraries like Pandas and NumPy.
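
A small example of building a DataFrame from local data (the names and values are made up) and handing it off to Pandas:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Named columns, like a table in a relational database.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()

# Interoperability with the Python ecosystem: collect into a Pandas DataFrame.
pdf = df.toPandas()
print(pdf.dtypes)
```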

Explain the difference between DataFrame and RDD in PySpark.

RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark, representing an immutable distributed collection of elements. It is low-level and offers fine-grained control, but leaves optimization largely to the developer. DataFrames provide a higher-level abstraction over named columns that Spark can optimize automatically through the Catalyst optimizer and the Tungsten execution engine, making them more efficient and convenient for most structured-data workloads.
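
The same word count written both ways, to show the difference in abstraction level (the input lines are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
lines = ["spark is fast", "spark is distributed"]

# RDD: low-level, functional transformations over arbitrary Python objects.
rdd_counts = (
    spark.sparkContext.parallelize(lines)
    .flatMap(lambda line: line.split(" "))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(rdd_counts.collect())

# DataFrame: declarative operations on named columns, optimized by Catalyst.
df_counts = (
    spark.createDataFrame([(line,) for line in lines], ["line"])
    .select(F.explode(F.split("line", " ")).alias("word"))
    .groupBy("word")
    .count()
)
df_counts.show()
```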

What are transformations and actions in PySpark?

Transformations in PySpark are operations that produce new DataFrames or RDDs from existing ones. They are lazily evaluated, meaning Spark delays execution until an action is invoked. Actions, on the other hand, trigger the actual computation and return results to the driver program.
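
A short sketch to make the distinction concrete; filter and selectExpr are transformations, count and take are actions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-actions").getOrCreate()
df = spark.range(100)  # DataFrame with a single 'id' column

# Transformations: build new (lazy) DataFrames; nothing runs yet.
evens = df.filter(df.id % 2 == 0)
squares = evens.selectExpr("id * id AS squared")

# Actions: trigger execution and return results to the driver.
print(squares.count())  # 50
print(squares.take(3))  # first few rows
```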

What is lazy evaluation in PySpark?

Lazy evaluation means that transformations in PySpark are not executed immediately upon invocation. Instead, Spark builds up a directed acyclic graph (DAG) representing the computation and delays execution until an action is called. This allows Spark to optimize the execution plan and improve performance by chaining together transformations.
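
One way to see this in practice is to chain transformations and inspect the plan before any action runs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df = (
    spark.range(1_000_000)
    .withColumn("bucket", F.col("id") % 10)  # transformation, not executed yet
    .filter(F.col("bucket") == 3)            # transformation, not executed yet
)

# Nothing has been computed so far; explain() only prints the plan Spark built.
df.explain()

# The action triggers execution of the whole chain.
print(df.count())
```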

Explain the concept of lineage in PySpark.

Lineage in PySpark refers to the logical execution plan of transformations. It represents the sequence of operations applied to an RDD or DataFrame to compute a desired output. Lineage information is crucial for fault tolerance, as it enables Spark to recompute lost partitions in case of failure.
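
On the RDD API the lineage can be inspected directly; a small sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

rdd = (
    sc.parallelize(range(100), 4)
    .map(lambda x: (x % 5, x))
    .reduceByKey(lambda a, b: a + b)
)

# toDebugString() shows the chain of parent RDDs Spark would replay
# to rebuild a lost partition.
print(rdd.toDebugString().decode("utf-8"))
```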

What is a SparkSession?

SparkSession is the entry point for DataFrame and SQL functionality in PySpark. It provides a unified interface for working with structured data and integrates with Spark’s underlying execution engine. SparkSession encapsulates SparkContext, SQLContext, and HiveContext, simplifying the interaction with Spark APIs.
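
A typical way to create one; the application name and config value are placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("my-app")                             # placeholder name
    .config("spark.sql.shuffle.partitions", "64")  # example setting
    .getOrCreate()
)

# DataFrame, SQL, and the underlying SparkContext are all reachable from it.
spark.range(5).createOrReplaceTempView("numbers")
spark.sql("SELECT sum(id) AS total FROM numbers").show()
print(spark.sparkContext.applicationId)
```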

How does PySpark handle fault tolerance?

PySpark achieves fault tolerance primarily through the lineage information stored with each RDD or DataFrame. By tracking the transformations applied to the data, Spark can reconstruct lost partitions by recomputing them from the original source data. In addition, cached data can be replicated across nodes (for example with a replicated storage level such as MEMORY_ONLY_2) so it remains available if a node fails.

Explain the role of the Executor in Spark architecture.

Executors are worker-node processes responsible for executing tasks in a Spark application. They are launched on the worker nodes by the cluster manager and coordinated by the driver through its SparkContext. Executors receive tasks from the driver, execute them within their JVMs, and store intermediate data in memory or on disk. They also exchange data with other executors during shuffle operations.

What is shuffle in PySpark?

Shuffle is the process of redistributing data across partitions during certain operations, such as groupByKey or join, which require data to be reorganized across the cluster. It involves moving data between executors and may incur network and disk I/O overhead. Efficient shuffle operations are crucial for optimizing performance in Spark applications.
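
For example, a groupBy forces a shuffle; the number of shuffle partitions is tunable (the value below is purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Illustrative setting: how many partitions the shuffled data is split into.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(1_000_000).withColumn("key", F.col("id") % 100)

# groupBy needs all rows with the same key in the same partition,
# so Spark shuffles data between executors here.
agg = df.groupBy("key").agg(F.count("*").alias("cnt"))
agg.explain()  # look for an Exchange (shuffle) node in the plan
```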

What is the significance of the DAG (Directed Acyclic Graph) in Spark execution?

The DAG represents the logical execution plan of a Spark job. It consists of stages and tasks, where each stage corresponds to a set of transformations that can be executed in parallel. Spark uses the DAG to optimize the execution plan by scheduling tasks efficiently, minimizing data movement, and maximizing parallelism.

Explain the role of the driver program in PySpark.

The driver program is the main control process in a Spark application. It runs the user’s main function, coordinates the execution of Spark jobs, and interacts with the cluster manager to allocate resources. The driver program also monitors job progress, aggregates results, and handles communication with the SparkContext and executors.

What is the purpose of a broadcast variable in PySpark?

Broadcast variables allow efficient distribution of read-only data to all tasks in a Spark job. They are useful for reducing data transfer overhead in operations like joins, where one dataset is significantly smaller than others and can be efficiently replicated to all worker nodes. Broadcast variables are cached in memory across the cluster for reuse in multiple tasks.
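
Two common forms, both sketched with made-up data: an explicit broadcast variable on the low-level API, and a broadcast join hint on DataFrames:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# 1) Broadcast variable: a small read-only lookup shipped once to each executor.
country_names = sc.broadcast({"IN": "India", "US": "United States"})
codes = sc.parallelize(["IN", "US", "IN"])
print(codes.map(lambda code: country_names.value[code]).collect())

# 2) Broadcast join hint: replicate the small DataFrame to every executor.
orders = spark.createDataFrame([(1, "IN"), (2, "US")], ["order_id", "code"])
countries = spark.createDataFrame(
    [("IN", "India"), ("US", "United States")], ["code", "name"]
)
orders.join(broadcast(countries), "code").show()
```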

How does PySpark handle data skewness during joins?

PySpark employs various techniques to handle data skewness during joins, such as:

  • Adaptive Query Execution (AQE): letting Spark detect skewed shuffle partitions at runtime and split them into smaller tasks (spark.sql.adaptive.skewJoin.enabled in Spark 3.x).
  • Salting: Adding a random prefix or suffix to join keys so that skewed keys are spread evenly across partitions (see the sketch after this list).
  • Broadcasting: Broadcasting smaller tables to all executors to ensure that each executor has a copy of the data, reducing the impact of skewed partitions.
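
A minimal salting sketch, with a hypothetical skewed key, salt range, and data: the skewed side gets a random salt appended to its key, the small side is exploded across every salt value, and the join runs on the salted key:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()
SALT_BUCKETS = 8  # hypothetical number of salt values

# Skewed fact table: most rows share the key "hot".
facts = spark.createDataFrame(
    [("hot", i) for i in range(1000)] + [("cold", 1)], ["key", "value"]
)
dims = spark.createDataFrame([("hot", "A"), ("cold", "B")], ["key", "label"])

# Append a random salt to the skewed side so "hot" spreads over many partitions.
facts_salted = facts.withColumn(
    "salt", (F.rand(seed=42) * SALT_BUCKETS).cast("int").cast("string")
).withColumn("salted_key", F.concat_ws("_", "key", "salt"))

# Explode the small side across all possible salt values so every row matches.
dims_salted = dims.withColumn(
    "salt", F.explode(F.array([F.lit(str(i)) for i in range(SALT_BUCKETS)]))
).withColumn("salted_key", F.concat_ws("_", "key", "salt"))

facts_salted.join(dims_salted, "salted_key").groupBy("label").count().show()
```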

What is checkpointing in PySpark?

Checkpointing is the process of persisting an RDD or DataFrame to reliable storage (typically HDFS or a designated checkpoint directory) to save the intermediate state of a computation. It helps iterative algorithms and jobs with long lineage chains by truncating the lineage, which keeps the execution plan from growing unbounded and reduces memory pressure. It also improves fault tolerance, since less recomputation is needed after a failure.
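
A sketch assuming a local checkpoint directory (the path is a placeholder; production jobs usually point this at HDFS or another reliable store):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

df = spark.range(1_000_000).withColumn("x", F.col("id") % 7)
for _ in range(5):
    # Iterative transformations keep growing the lineage...
    df = df.withColumn("x", F.col("x") * 2 + 1)

# ...so materialize the current state to the checkpoint dir and truncate the lineage.
df = df.checkpoint()
print(df.count())
```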

Explain the difference between local and cluster modes in PySpark.

In local mode, Spark runs on a single machine with one or more CPU cores, typically for development and testing purposes. All Spark components, including the driver program and executors, run within the same JVM. In cluster mode, Spark utilizes a distributed cluster of machines, distributing computation and storage across multiple nodes for scalability and performance.
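
A hedged illustration: local mode can be pinned in code, while cluster deployments usually set the master and deploy mode through spark-submit or the cluster manager rather than in the program (the host and port below are placeholders):

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run in one JVM on this machine,
# with as many worker threads as there are CPU cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-mode-demo")
    .getOrCreate()
)
print(spark.range(10).count())
spark.stop()

# Cluster mode is normally chosen outside the code, e.g. (placeholder host/port):
#   spark-submit --master spark://master-host:7077 --deploy-mode cluster app.py
```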

What is the purpose of the Spark UI?

The Spark UI is a web-based interface that provides real-time monitoring and visualization of Spark applications. It displays information about job progress, resource utilization, task execution, DAG visualization, and executor metrics. The Spark UI helps developers and administrators analyze application performance, identify bottlenecks, and optimize resource usage.

How does PySpark handle data partitioning?

PySpark allows users to control data partitioning through operations like repartition and partitionBy. Data partitioning affects parallelism and resource utilization during computations by determining how data is distributed across the cluster. Users can specify custom partitioning strategies based on data characteristics, workload requirements, and performance considerations.
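
A sketch of both APIs (the path and column names are hypothetical): repartition controls how data is split in memory for parallelism, while partitionBy controls the directory layout of written output:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("country", F.col("id") % 3)

# repartition: shuffle into 8 partitions, hashed on the given column.
repartitioned = df.repartition(8, "country")
print(repartitioned.rdd.getNumPartitions())  # 8

# partitionBy: write one output directory per country value (hypothetical path).
repartitioned.write.mode("overwrite").partitionBy("country").parquet("/tmp/by_country")
```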
