Apache Spark has come a long way since the 2.x era. What started as a fast, general-purpose cluster computing system has matured into a robust, deeply optimized unified analytics engine. This post walks through the major milestones — from Spark 2.0's introduction of the Dataset API and Structured Streaming, all the way to the adaptive query execution and GPU acceleration of Spark 3.x.
Spark 2.x — The Foundation
Unified DataFrame / Dataset API
Spark 2.0 (released in 2016) was a landmark release. The biggest architectural shift was the unification of the DataFrame and Dataset APIs. Previously, RDDs (Resilient Distributed Datasets) were the primary abstraction, but they lacked the optimizer benefits that came with a schema-aware model.
With Spark 2.0:
DataFramebecame an alias forDataset[Row]
- Strongly typed operations (via
Dataset[T]) provided compile-time type safety in Scala/Java
- The Catalyst optimizer and Tungsten execution engine could now apply aggressive optimizations across both APIs
This unification was critical: it meant that the optimizer could inspect query plans regardless of whether you used the typed or untyped API.
Catalyst Optimizer — Phase Recap
Catalyst operates in four phases:
- Analysis — resolves attribute references and relations against the catalog
- Logical Optimization — applies rule-based rewrites (predicate pushdown, constant folding, etc.)
- Physical Planning — selects execution strategies (e.g., BroadcastHashJoin vs SortMergeJoin)
- Code Generation — uses Janino to compile query fragments into JVM bytecode
Spark 2.x dramatically improved the code generation step via Whole-Stage Code Generation (WSCG), collapsing multiple operators into a single tight loop — eliminating virtual function calls and improving CPU cache utilization.
Structured Streaming
Perhaps the most impactful addition of Spark 2.x was Structured Streaming — a stream processing model built on top of Spark SQL.
The key insight: treat a stream as an unbounded table. New data arriving is like new rows being appended. You write the same SQL/DataFrame transformations as in batch, and Spark handles incrementalization internally.
Three output modes were introduced:
Append— only new rows emitted
Complete— entire result table re-emitted (useful for aggregations)
Update— only changed rows emitted
Under the hood, Structured Streaming maintains state stores (backed by RocksDB or in-memory HDFSBackedStateStore) for stateful operations like windowed aggregations and stream-stream joins (added in Spark 2.3).
Critical limitations in 2.x that were later addressed:
- No support for arbitrary stateful operations with full user control → addressed by
mapGroupsWithStateand laterflatMapGroupsWithState
- No native support for exactly-once end-to-end without careful sink design
Spark 3.0 — The Performance Revolution
Released in 2020, Spark 3.0 was the largest release in the project's history, with over 3,400 patches.
Adaptive Query Execution (AQE)
This is the headline feature of Spark 3.0. Prior to AQE, Spark's physical plan was fixed at compile time, based on statistics that were often stale or unavailable.
AQE introduces runtime re-planning: Spark now re-optimizes the query plan at shuffle boundaries using real statistics gathered during execution.
Three key AQE optimizations:
- Dynamically coalescing shuffle partitions — instead of setting
spark.sql.shuffle.partitionsstatically (default 200, often wrong), Spark merges small partitions post-shuffle based on actual data sizes.
- Dynamically switching join strategies — if runtime stats show that a table is small enough, Spark can switch from a planned
SortMergeJointo aBroadcastHashJoinon the fly.
- Dynamically optimizing skew joins — detects skewed partitions and splits them, processing each sub-partition independently to avoid stragglers.
AQE is controlled by
spark.sql.adaptive.enabled (default true since Spark 3.2).Dynamic Partition Pruning (DPP)
Another major 3.0 optimization. In star-schema queries (common in data warehouses), a filter on a dimension table can be used to prune partitions of the fact table — even though the optimizer can't know which partitions to prune until the dimension scan runs.
DPP solves this by injecting a subquery as a dynamic filter:
sqlSELECT * FROM orders o JOIN dates d ON o.date_id = d.id WHERE d.year = 2024
Spark will broadcast the filtered
dates result and use it to prune orders partitions at runtime — drastically reducing I/O for partitioned tables.Accelerator-Aware Scheduling & GPU Support
Spark 3.0 introduced the RAPIDS Accelerator for Apache Spark, enabling GPU-accelerated execution of Spark SQL operations. This is relevant for ML-heavy pipelines where data preparation (filtering, joining, feature engineering) is done in Spark before training.
The scheduler became resource-aware, allowing tasks to request specific hardware resources (GPUs, custom accelerators) via the
spark.task.resource.* configuration namespace.Python Type Hints & Pandas UDFs v2
Spark 3.0 overhauled Python UDFs significantly:
- Pandas UDFs (formerly called Vectorized UDFs) were re-categorized into clearer types:
SCALAR,SCALAR_ITER,GROUPED_MAP,GROUPED_AGG
- Python type hints became the primary way to declare UDF semantics, replacing the explicit
PandasUDFTypeenum
- Iterator-based Pandas UDFs allowed amortizing expensive setup costs (e.g., loading an ML model once per partition)
pythonfrom pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("double") def predict(features: pd.Series) -> pd.Series: model = load_model() # called once per partition in SCALAR_ITER return model.predict(features)
Spark 3.1, 3.2, 3.3 — Steady Maturation
Spark 3.1 — Stability & SQL Completeness
mapInPandasandmapInArrow— new Pandas-on-Spark functions for efficient, arbitrary transformations on groups of rows using Arrow-backed buffers
- Node Decommissioning — graceful removal of executor nodes, enabling elastic scaling without data loss
- SQL standard compliance improvements:
EXCEPT ALL,INTERSECT ALL, lateral subqueries
Spark 3.2 — Pandas API on Spark (koalas mainlined)
The Pandas API on Spark (formerly the Koalas project) was officially merged into Spark core as
pyspark.pandas. This was a significant usability milestone: data scientists could now use familiar pandas syntax on distributed data without rewriting code.Important nuances:
- Not all pandas operations map 1:1 (operations requiring global ordering are expensive)
pyspark.pandasDataFrames are lazily evaluated and backed by Spark's optimizer
spark.sql.execution.arrow.pyspark.enabled=trueis recommended for performance
Also in 3.2: AQE enabled by default, and ANSI SQL mode (strict type coercion) became more complete.
Spark 3.3 — Improved Observability
- Spark Connect (preview) — a new client-server architecture decoupling the driver from the cluster, enabling thin clients, better language support, and IDE integration
- CHAR/VARCHAR type support — long-awaited for compatibility with Hive and other SQL engines
- Error message improvements — structured error classes with error codes, much friendlier for debugging
- Histograms in cost-based optimizer — richer statistics for better join strategy decisions
Spark 3.4 & 3.5 — Spark Connect GA & Arrow-Native Execution
Spark Connect Goes GA (3.4)
Spark Connect introduced a gRPC-based protocol allowing clients to connect to Spark remotely without embedding the SparkSession logic in the driver. This enables:
- Lightweight Python/Scala/Rust clients
- Stable API versioning between client and server
- Running notebooks or IDEs against a remote Spark cluster without a fat driver dependency
The plan representation is based on Protocol Buffers, making it language-agnostic by design.
Arrow-Based Execution
Apache Arrow has become increasingly central to Spark's execution model:
- ArrowEvalPython operator replaces the old
BatchEvalPythonfor Pandas UDFs
spark.sql.execution.arrow.maxRecordsPerBatchcontrols Arrow batch sizes
- Zero-copy data sharing between JVM and Python processes via Arrow IPC format
This dramatically reduces serialization overhead, which historically was one of the biggest bottlenecks in PySpark workloads.
What Changed Under the Hood: A Summary
Area | Spark 2.x | Spark 3.x |
Query Planning | Static, compile-time | Adaptive (AQE), runtime re-planning |
Partition Tuning | Manual ( shuffle.partitions) | Dynamic coalescing via AQE |
Join Strategy | Fixed at plan time | Dynamically switched at runtime |
Python Integration | Basic Pandas UDFs | Arrow-native, iterator UDFs, Pandas API |
Streaming State | HDFSBackedStateStore | RocksDB state store (3.2+), async checkpointing |
Client Model | Embedded driver | Spark Connect (gRPC, thin client) |
Hardware | CPU only | GPU-aware scheduling (RAPIDS) |
SQL Compliance | Partial ANSI | Near-full ANSI SQL mode |
Key Takeaways for Practitioners
Enable AQE and DPP — if you're on Spark 3.x and haven't explicitly enabled these, you're leaving performance on the table. On 3.2+ they're on by default, but verify your configs:
spark.sql.adaptive.enabled=true spark.sql.adaptive.coalescePartitions.enabled=true spark.sql.optimizer.dynamicPartitionPruning.enabled=true
Migrate from RDDs to DataFrames/Datasets — if you still have RDD-heavy code, the optimizer can't help you. Catalyst and Tungsten only apply to the structured APIs.
Use RocksDB state store for large stateful streaming jobs — the default in-memory state store doesn't scale to high-cardinality keys. Switch with:
spark.sql.streaming.stateStore.providerClass=org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
Adopt Spark Connect for new tooling — if you're building internal data platforms or developer tools on top of Spark, design against the Connect API rather than embedding SparkSession.
Conclusion
The journey from Spark 2.0 to the current 3.x series reflects a maturing ecosystem: early versions established the programming model, while later releases focused on making that model fast by default, easier to use from Python, and operable at enterprise scale. The introduction of AQE alone represents a fundamental shift in how Spark thinks about query execution — from a static compiler to a dynamic, feedback-driven optimizer.
For teams running production Spark workloads, upgrading from 2.x to 3.x is not just about new features — it's about getting correctness improvements, significant performance gains, and a much more honest relationship between your query and the cluster resources it consumes.