Understanding Deployment Modes in Spark

Gaurav
Mar 30, 2024

PySpark, the Python API for Apache Spark, offers two primary deployment modes: client mode and cluster mode. These modes determine how Spark applications are executed and managed across a distributed computing environment. In this article, we’ll delve into the differences between client and cluster modes in PySpark, providing beginner-friendly explanations and code examples.

1. Client Mode

In client mode, the Spark driver runs within the client process, typically on the machine from which you submit your Spark application. The client communicates directly with the cluster manager (e.g., YARN, Kubernetes, or Spark Standalone) to request executors and schedule tasks.

Key Characteristics:

  • Driver runs on the client machine.
  • Client machine requires access to the Spark cluster’s configuration.
  • Well-suited for interactive or development environments.
from pyspark.sql import SparkSession

# Create SparkSession in client mode
spark = SparkSession.builder \
    .appName("ClientModeExample") \
    .master("spark://<master-node-ip>:<port>") \
    .config("spark.submit.deployMode", "client") \
    .getOrCreate()

# Your PySpark code here

# Stop the SparkSession
spark.stop()

In this example, config("spark.submit.deployMode", "client") declares the deployment mode. In practice, the deploy mode is fixed when the application is launched: running a PySpark script directly already uses client mode, so this setting makes the default explicit rather than changing behavior.
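Equivalently, client mode can be requested on the command line when launching the application with spark-submit, which is the more common approach in practice. The master URL and script name below are placeholders:

```shell
# Submit a PySpark application in client mode (placeholder values).
# --deploy-mode client is also the default when the flag is omitted.
spark-submit \
  --master spark://<master-node-ip>:<port> \
  --deploy-mode client \
  my_app.py
```

Because the driver runs in the spark-submit process itself, its console output appears directly in your terminal, which is convenient for development.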

2. Cluster Mode

In cluster mode, the Spark driver runs on one of the cluster nodes rather than on the client machine. The client submits the Spark application to the cluster manager, which launches the driver within one of its worker nodes. This mode is suitable for production environments and large-scale data processing tasks.

Key Characteristics:

  • Driver runs on one of the cluster’s worker nodes.
  • Client machine does not need to maintain an active connection to the cluster during application execution.
  • Recommended for production deployments.
from pyspark.sql import SparkSession

# Create SparkSession with cluster deploy mode declared.
# Note: the deploy mode is fixed at launch time, so this setting takes
# effect only when the application is started through spark-submit;
# it cannot switch an already-running driver into cluster mode.
spark = SparkSession.builder \
    .appName("ClusterModeExample") \
    .master("spark://<master-node-ip>:<port>") \
    .config("spark.submit.deployMode", "cluster") \
    .getOrCreate()

# Your PySpark code here

# Stop the SparkSession
spark.stop()
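Because the driver is launched on a worker node, cluster mode is normally requested when the application is submitted rather than from inside the code. A sketch using spark-submit (script name is a placeholder; note that Spark's standalone manager does not support cluster mode for Python applications, so this typically targets a manager such as YARN):

```shell
# Submit a PySpark application in cluster mode (placeholder values).
# The cluster manager (here YARN) launches the driver on a worker node,
# so the client machine can disconnect once the application is accepted.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py
```

In this setup, driver logs live on the cluster (e.g., retrievable with YARN's log tools) rather than in your local terminal.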

Key Differences

  • Driver location: the client machine (client mode) vs. a worker node in the cluster (cluster mode).
  • Client connection: must stay up for the life of the application (client mode) vs. can disconnect after submission (cluster mode).
  • Typical use: interactive work and development (client mode) vs. production jobs (cluster mode).

Conclusion

Understanding the differences between client and cluster deployment modes in PySpark is crucial for deploying applications efficiently across distributed computing environments. While client mode is suitable for interactive and development use cases, cluster mode is recommended for production deployments due to its scalability and fault tolerance. By following the explanations and examples provided in this article, beginners can easily grasp the concepts and choose the appropriate deployment mode for their PySpark applications.
