The Evolution of Apache Spark Since Version 2: A Technical Deep Dive

Apache Spark has come a long way since the 2.x era. What started as a fast, general-purpose cluster computing system has matured into a robust, deeply optimized unified analytics engine. This post walks through the major milestones — from Spark 2.0's introduction of the Dataset API and Structured Streaming, all the way to the adaptive query execution and GPU acceleration of Spark 3.x.

Spark 2.x — The Foundation

Unified DataFrame / Dataset API

Spark 2.0 (released in 2016) was a landmark release. The biggest architectural shift was the unification of the DataFrame and Dataset APIs. Previously, RDDs (Resilient Distributed Datasets) were the primary abstraction, but they lacked the optimizer benefits that came with a schema-aware model.

With Spark 2.0:

DataFrame became an alias for Dataset[Row]

Strongly typed operations (via Dataset[T]) provided compile-time type safety in Scala/Java

The Catalyst optimizer and Tungsten execution engine could now apply aggressive optimizations across both APIs

This unification was critical: it meant that the optimizer could inspect query plans regardless of whether you used the typed or untyped API.

Catalyst Optimizer — Phase Recap

Catalyst operates in four phases:

Analysis — resolves attribute references and relations against the catalog

Logical Optimization — applies rule-based rewrites (predicate pushdown, constant folding, etc.)

Physical Planning — selects execution strategies (e.g., BroadcastHashJoin vs SortMergeJoin)

Code Generation — uses Janino to compile query fragments into JVM bytecode

Spark 2.x dramatically improved the code generation step via Whole-Stage Code Generation (WSCG), collapsing multiple operators into a single tight loop — eliminating virtual function calls and improving CPU cache utilization.

Structured Streaming

Perhaps the most impactful addition of Spark 2.x was Structured Streaming — a stream processing model built on top of Spark SQL.

The key insight: treat a stream as an unbounded table. New data arriving is like new rows being appended. You write the same SQL/DataFrame transformations as in batch, and Spark handles incrementalization internally.

Three output modes were introduced:

Append — only new rows emitted

Complete — entire result table re-emitted (useful for aggregations)

Update — only changed rows emitted

Under the hood, Structured Streaming maintains state stores (backed by RocksDB or in-memory HDFSBackedStateStore) for stateful operations like windowed aggregations and stream-stream joins (added in Spark 2.3).

Critical limitations in 2.x that were later addressed:

No support for arbitrary stateful operations with full user control → addressed by mapGroupsWithState and later flatMapGroupsWithState

No native support for exactly-once end-to-end without careful sink design

Spark 3.0 — The Performance Revolution

Released in 2020, Spark 3.0 was the largest release in the project's history, with over 3,400 patches.

Adaptive Query Execution (AQE)

This is the headline feature of Spark 3.0. Prior to AQE, Spark's physical plan was fixed at compile time, based on statistics that were often stale or unavailable.

AQE introduces runtime re-planning: Spark now re-optimizes the query plan at shuffle boundaries using real statistics gathered during execution.

Three key AQE optimizations:

Dynamically coalescing shuffle partitions — instead of setting spark.sql.shuffle.partitions statically (default 200, often wrong), Spark merges small partitions post-shuffle based on actual data sizes.

Dynamically switching join strategies — if runtime stats show that a table is small enough, Spark can switch from a planned SortMergeJoin to a BroadcastHashJoin on the fly.

Dynamically optimizing skew joins — detects skewed partitions and splits them, processing each sub-partition independently to avoid stragglers.

AQE is controlled by spark.sql.adaptive.enabled (default true since Spark 3.2).

Dynamic Partition Pruning (DPP)

Another major 3.0 optimization. In star-schema queries (common in data warehouses), a filter on a dimension table can be used to prune partitions of the fact table — even though the optimizer can't know which partitions to prune until the dimension scan runs.

DPP solves this by injecting a subquery as a dynamic filter:

sql
SELECT * FROM orders o
JOIN dates d ON o.date_id = d.id
WHERE d.year = 2024

Spark will broadcast the filtered dates result and use it to prune orders partitions at runtime — drastically reducing I/O for partitioned tables.

Accelerator-Aware Scheduling & GPU Support

Spark 3.0 introduced the RAPIDS Accelerator for Apache Spark, enabling GPU-accelerated execution of Spark SQL operations. This is relevant for ML-heavy pipelines where data preparation (filtering, joining, feature engineering) is done in Spark before training.

The scheduler became resource-aware, allowing tasks to request specific hardware resources (GPUs, custom accelerators) via the spark.task.resource.* configuration namespace.

Python Type Hints & Pandas UDFs v2

Spark 3.0 overhauled Python UDFs significantly:

Pandas UDFs (formerly called Vectorized UDFs) were re-categorized into clearer types: SCALAR, SCALAR_ITER, GROUPED_MAP, GROUPED_AGG

Python type hints became the primary way to declare UDF semantics, replacing the explicit PandasUDFType enum

Iterator-based Pandas UDFs allowed amortizing expensive setup costs (e.g., loading an ML model once per partition)

python
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("double")
def predict(features: pd.Series) -> pd.Series:
    model = load_model()  # called once per partition in SCALAR_ITER
    return model.predict(features)

Spark 3.1, 3.2, 3.3 — Steady Maturation

Spark 3.1 — Stability & SQL Completeness

mapInPandas and mapInArrow — new Pandas-on-Spark functions for efficient, arbitrary transformations on groups of rows using Arrow-backed buffers

Node Decommissioning — graceful removal of executor nodes, enabling elastic scaling without data loss

SQL standard compliance improvements: EXCEPT ALL, INTERSECT ALL, lateral subqueries

Spark 3.2 — Pandas API on Spark (koalas mainlined)

The Pandas API on Spark (formerly the Koalas project) was officially merged into Spark core as pyspark.pandas. This was a significant usability milestone: data scientists could now use familiar pandas syntax on distributed data without rewriting code.

Important nuances:

Not all pandas operations map 1:1 (operations requiring global ordering are expensive)

pyspark.pandas DataFrames are lazily evaluated and backed by Spark's optimizer

spark.sql.execution.arrow.pyspark.enabled=true is recommended for performance

Also in 3.2: AQE enabled by default, and ANSI SQL mode (strict type coercion) became more complete.

Spark 3.3 — Improved Observability

Spark Connect (preview) — a new client-server architecture decoupling the driver from the cluster, enabling thin clients, better language support, and IDE integration

CHAR/VARCHAR type support — long-awaited for compatibility with Hive and other SQL engines

Error message improvements — structured error classes with error codes, much friendlier for debugging

Histograms in cost-based optimizer — richer statistics for better join strategy decisions

Spark 3.4 & 3.5 — Spark Connect GA & Arrow-Native Execution

Spark Connect Goes GA (3.4)

Spark Connect introduced a gRPC-based protocol allowing clients to connect to Spark remotely without embedding the SparkSession logic in the driver. This enables:

Lightweight Python/Scala/Rust clients

Stable API versioning between client and server

Running notebooks or IDEs against a remote Spark cluster without a fat driver dependency

The plan representation is based on Protocol Buffers, making it language-agnostic by design.

Arrow-Based Execution

Apache Arrow has become increasingly central to Spark's execution model:

ArrowEvalPython operator replaces the old BatchEvalPython for Pandas UDFs

spark.sql.execution.arrow.maxRecordsPerBatch controls Arrow batch sizes

Zero-copy data sharing between JVM and Python processes via Arrow IPC format

This dramatically reduces serialization overhead, which historically was one of the biggest bottlenecks in PySpark workloads.

What Changed Under the Hood: A Summary

Area	Spark 2.x	Spark 3.x
Query Planning	Static, compile-time	Adaptive (AQE), runtime re-planning
Partition Tuning	Manual (`shuffle.partitions`)	Dynamic coalescing via AQE
Join Strategy	Fixed at plan time	Dynamically switched at runtime
Python Integration	Basic Pandas UDFs	Arrow-native, iterator UDFs, Pandas API
Streaming State	HDFSBackedStateStore	RocksDB state store (3.2+), async checkpointing
Client Model	Embedded driver	Spark Connect (gRPC, thin client)
Hardware	CPU only	GPU-aware scheduling (RAPIDS)
SQL Compliance	Partial ANSI	Near-full ANSI SQL mode

Key Takeaways for Practitioners

Enable AQE and DPP — if you're on Spark 3.x and haven't explicitly enabled these, you're leaving performance on the table. On 3.2+ they're on by default, but verify your configs:


spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.optimizer.dynamicPartitionPruning.enabled=true

Migrate from RDDs to DataFrames/Datasets — if you still have RDD-heavy code, the optimizer can't help you. Catalyst and Tungsten only apply to the structured APIs.

Use RocksDB state store for large stateful streaming jobs — the default in-memory state store doesn't scale to high-cardinality keys. Switch with:


spark.sql.streaming.stateStore.providerClass=org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider

Adopt Spark Connect for new tooling — if you're building internal data platforms or developer tools on top of Spark, design against the Connect API rather than embedding SparkSession.

Conclusion

The journey from Spark 2.0 to the current 3.x series reflects a maturing ecosystem: early versions established the programming model, while later releases focused on making that model fast by default, easier to use from Python, and operable at enterprise scale. The introduction of AQE alone represents a fundamental shift in how Spark thinks about query execution — from a static compiler to a dynamic, feedback-driven optimizer.

For teams running production Spark workloads, upgrading from 2.x to 3.x is not just about new features — it's about getting correctness improvements, significant performance gains, and a much more honest relationship between your query and the cluster resources it consumes.