Reinforcement Learning vs Classical Search in Connect 4: A Comparative Analysis of DQN and Alpha-Beta Pruning

Abstract

This paper presents a comparative study of two fundamentally different approaches to game-playing AI in Connect 4: a Deep Q-Network (DQN) agent trained via reinforcement learning, and a classical Alpha-Beta pruning agent driven by heuristic evaluation. We trained the DQN for 10,000 episodes against a uniformly random opponent, reaching a peak win rate of 89% against random play. Systematic evaluation, however, revealed critical failure modes: catastrophic forgetting, defensive blind spots caused by training against a weak opponent, and the counterintuitive finding that the best-performing checkpoint emerged only 20% of the way through training. The Alpha-Beta agent, which requires no training data, played strongly using a depth-6 minimax search with a hand-tuned evaluation function. Head-to-head results consistently favored the classical approach, raising the question of when learned heuristics outperform explicit search.

1. Introduction

Connect 4 is a solved game — Victor Allis proved in 1988 that the first player can always force a win with optimal play. Despite this, it remains a compelling testbed for AI research due to its manageable state space (~4.5 × 10¹² positions) and the strategic depth required to play well.

We implemented two agents:

DQN Agent — A Convolutional Neural Network trained via Deep Q-Learning to approximate the optimal action-value function Q(s, a).
Alpha-Beta Agent — A recursive minimax search with alpha-beta pruning and a domain-specific heuristic evaluation function.

Our primary research questions:

How does a learned value function compare to a hand-crafted one?
What are the failure modes of DQN training against a random opponent?
Does more training always produce stronger agents?

2. DQN Agent

2.1 Architecture

The DQN follows the approach introduced by Mnih et al. (2015), adapted for the 6×7 board:

Layer	Type	Output Shape
Input	—	(6, 7, 1)
Conv2D	32 filters, 3×3, ReLU	(6, 7, 32)
Conv2D	64 filters, 3×3, ReLU	(6, 7, 64)
Conv2D	64 filters, 3×3, ReLU	(6, 7, 64)
Flatten	—	2688
Dense	512 units, ReLU	512
Output	7 units (linear)	7

The three convolutional layers detect spatial patterns — horizontal sequences, diagonal threats, center control — while the dense layers map these features to column-level Q-values.
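
The shapes in the table can be checked with a little arithmetic. The sketch below is framework-independent and assumes stride-1 convolutions with "same" padding (implied by the unchanged 6×7 spatial size) and a bias term in every layer; the helper names `conv_same` and `dense` are illustrative only and do not come from the original implementation.

```python
def conv_same(shape, filters, kernel=3):
    """Output shape and parameter count of a stride-1, 'same'-padded conv layer."""
    height, width, channels_in = shape
    params = kernel * kernel * channels_in * filters + filters  # weights + biases
    return (height, width, filters), params


def dense(units_in, units_out):
    """Parameter count of a fully connected layer with biases."""
    return units_in * units_out + units_out


shape = (6, 7, 1)  # one-channel board encoding, as in the table
conv_params = []
for filters in (32, 64, 64):
    shape, params = conv_same(shape, filters)
    conv_params.append(params)

flat_units = shape[0] * shape[1] * shape[2]
dense_params = [dense(flat_units, 512), dense(512, 7)]
total_params = sum(conv_params) + sum(dense_params)

print(flat_units)    # 2688
print(total_params)  # 1436103
```

Under these assumptions, almost all of the roughly 1.4 million parameters sit in the 2688 → 512 dense layer rather than in the convolutions.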

2.2 Training Protocol

Training was conducted against a uniformly random opponent over 10,000 episodes using the following hyperparameters:

Parameter	Value
Learning rate (α)	0.001
Discount factor (γ)	0.99
Initial ε	1.0
ε decay	0.9995
ε minimum	0.05
Replay buffer capacity	50,000
Batch size	64
Target network sync	Every 500 steps
Optimizer	Adam
Loss function	Mean Squared Error

The agent used experience replay (Lin, 1992) and a separate target network (Mnih et al., 2015) for training stability.
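
To make the role of the target network concrete, the following minimal sketch computes the regression target for a single transition and the mean squared error used as the loss. It is plain Python rather than the training code itself; the function names are hypothetical, and the reward values are passed in because the reward scheme is not specified here.

```python
GAMMA = 0.99  # discount factor from the hyperparameter table


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Regression target y for one transition (s, a, r, s').

    next_q_values are the target network's Q-values for s' (one per column);
    done marks a terminal transition.
    """
    if done:
        return reward  # no bootstrapping past the end of the game
    return reward + gamma * max(next_q_values)


def mse_loss(predictions, targets):
    """Mean squared error between predicted Q(s, a) values and their targets."""
    errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)


# Non-terminal step with zero reward: the target is the discounted best next value.
print(td_target(0.0, [0.1, 0.5, -0.2], done=False))
# Terminal step: the target is just the final reward.
print(td_target(1.0, [0.0] * 7, done=True))
```

Because `next_q_values` comes from the periodically synchronized target network rather than the network being trained, the target stays fixed between synchronizations.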

2.3 Training Observations

2.3.1 The Performance Dip Phenomenon

At approximately episode 4,500, the win rate stalled and dipped slightly, from 74.7% to 74.6%, even though ε had already reached its floor of 0.05. We attribute this to three factors:

Target network refresh instability. Every 500 steps, the target network is synchronized with the policy network, which shifts the regression target. The policy network briefly chases a moving objective.
Opponent stochasticity. A random opponent occasionally stumbles into strong moves by chance, creating noise in the evaluation signal.
Exploration floor effects. At ε = 0.05, the agent relies almost entirely on its learned policy. Blind spots in the value function (e.g., specific diagonal threats) become persistent failure modes until the replay buffer supplies sufficient counter-examples.

Key insight: Training is not monotonic. The agent must occasionally unlearn suboptimal policies before converging to better ones.
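
The exploration floor follows directly from the ε schedule in Section 2.2. The sketch below assumes a multiplicative decay applied once per update; whether an update corresponds to an episode or to a training step is not stated in the hyperparameter table, so the result is expressed in updates. The names are illustrative.

```python
import math

EPS_START, EPS_DECAY, EPS_MIN = 1.0, 0.9995, 0.05


def epsilon_after(n_updates):
    """Exploration rate after n multiplicative decay updates, clipped at the floor."""
    return max(EPS_MIN, EPS_START * EPS_DECAY ** n_updates)


# Smallest n with EPS_START * EPS_DECAY**n <= EPS_MIN
updates_to_floor = math.ceil(math.log(EPS_MIN / EPS_START) / math.log(EPS_DECAY))

print(updates_to_floor)
print(epsilon_after(updates_to_floor))
```

With these values the floor is reached after roughly 5,990 decay updates, after which 95% of the agent's moves come from its learned policy.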

2.3.2 Catastrophic Forgetting

After episode ~5,000, win rate declined from 76% to 71%. Root cause analysis revealed a replay buffer homogeneity problem: the 50,000-entry buffer became saturated with low-diversity experiences from ε = 0.05 play. The varied, exploratory transitions from high-ε episodes were evicted, causing the agent to "forget" how to handle unusual board positions.

This aligns with findings by Schaul et al. (2016) on prioritized experience replay — uniform sampling from a homogeneous buffer degrades learning quality. A larger buffer (>100K) or prioritized sampling would mitigate this effect.
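
The eviction mechanism is easy to reproduce at toy scale. This is a minimal sketch of a fixed-capacity FIFO buffer with uniform sampling, assuming a `collections.deque` as the backing store; the class and the capacity of 5 are illustrative stand-ins for the 50,000-entry buffer.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform sampling."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest entries drop off the left

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


# Toy run: tag each transition with the epsilon in force when it was collected.
buffer = ReplayBuffer(capacity=5)
for eps in (1.0, 0.8, 0.6):   # early, exploratory phase
    buffer.push({"epsilon": eps})
for _ in range(5):            # long stretch at the exploration floor
    buffer.push({"epsilon": 0.05})

remaining = {t["epsilon"] for t in buffer.memory}
print(remaining)  # {0.05}
```

Once enough low-ε transitions arrive, none of the exploratory experience is left to sample, which is the homogeneity problem described above.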

2.4 Checkpoint Evaluation

We evaluated five checkpoints against a random opponent (500 games each, ε = 0):

Checkpoint	Wins	Losses	Draws	Win Rate
Episode 2,000	445	55	0	89.0%
Episode 4,000	343	156	1	68.6%
Episode 6,000	368	132	0	73.6%
Episode 8,000	288	211	1	57.6%
Episode 10,000	380	119	1	76.0%

The best model was the episode-2,000 checkpoint, only 20% of the way through training. The non-monotonic performance curve (89.0% → 57.6% → 76.0%) illustrates DQN's instability in this setup and supports selecting a model by checkpoint evaluation rather than deploying the final episode.
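
Checkpoint-based selection amounts to a few lines once the evaluation results are recorded. The sketch below uses the win/loss/draw counts from the table; the variable and function names are illustrative.

```python
# (wins, losses, draws) per checkpoint, copied from the table above
results = {
    2000: (445, 55, 0),
    4000: (343, 156, 1),
    6000: (368, 132, 0),
    8000: (288, 211, 1),
    10000: (380, 119, 1),
}


def win_rate(record):
    wins, losses, draws = record
    return wins / (wins + losses + draws)


rates = {episode: win_rate(record) for episode, record in results.items()}
best_episode = max(rates, key=rates.get)

for episode, rate in rates.items():
    print(f"episode {episode:>6}: {rate:.1%}")
print("selected checkpoint:", best_episode)  # 2000
```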

2.5 The Defensive Blind Spot

A critical finding: the DQN, despite achieving 89% against random opponents, could be defeated by a human using a trivial strategy — stacking pieces in a single column.

Root cause: Training against a random opponent teaches offense but not defense. The random opponent never builds coherent threats, so the agent never learns to block them. The analogy: training a boxer by fighting mannequins teaches punching but not dodging.

This is in line with the results of Silver et al. (2017), where AlphaGo Zero reached superhuman strength purely through self-play against an opponent that improved alongside it.

2.5.1 Hybrid Safety Layer

Rather than retraining (computationally expensive), we implemented a lightweight rule-based safety layer:

Priority 1: Can the AI win this turn?    → Take the winning move
Priority 2: Can the opponent win next?   → Block the threat
Priority 3: No immediate threats         → Defer to DQN policy

The blocking check simulates each valid move on a cloned board and tests for 4-in-a-row. With at most seven columns to try per check, this adds negligible latency but eliminates the most egregious failure mode.
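
The three priorities can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes the board is a 6×7 list of lists with row 0 at the top, 0 for empty cells, and 1 and 2 for the players, and that `dqn_policy` is a callable returning a column. All function names are hypothetical.

```python
ROWS, COLS = 6, 7
EMPTY = 0


def valid_moves(board):
    """Columns whose top cell is still empty (row 0 is the top row)."""
    return [c for c in range(COLS) if board[0][c] == EMPTY]


def drop(board, col, player):
    """Return a copy of the board with player's piece dropped into col."""
    clone = [row[:] for row in board]
    for r in range(ROWS - 1, -1, -1):  # scan from the bottom row upward
        if clone[r][col] == EMPTY:
            clone[r][col] = player
            return clone
    raise ValueError("column is full")


def has_four(board, player):
    """True if player has four in a row in any direction."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == player
                       for rr, cc in cells):
                    return True
    return False


def choose_move(board, me, opponent, dqn_policy):
    moves = valid_moves(board)
    for col in moves:  # Priority 1: win now
        if has_four(drop(board, col, me), me):
            return col
    for col in moves:  # Priority 2: block the opponent's immediate win
        if has_four(drop(board, col, opponent), opponent):
            return col
    return dqn_policy(board)  # Priority 3: defer to the learned policy
```

Against the single-column stacking strategy from Section 2.5, the second loop returns the stacked column as soon as it holds three opposing pieces.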

This hybrid approach, which combines learned heuristics with rule-based safety checks, mirrors patterns in strong game engines. AlphaGo combines policy and value networks with Monte Carlo tree search; recent versions of Stockfish pair a neural network evaluation with alpha-beta search.

3. Alpha-Beta Agent

3.1 Algorithm

The Alpha-Beta agent implements the minimax algorithm with alpha-beta pruning (Knuth & Moore, 1975). At each node:

Maximizing player selects the move maximizing the evaluation.
Minimizing player selects the move minimizing the evaluation.
Pruning eliminates subtrees that cannot influence the final decision (α ≥ β).

3.2 Evaluation Function

The heuristic evaluation function scores board positions using the following components:

1. Center Column Control

score += count(center_column == player) × 3

The center column (column 3, zero-indexed) participates in more potential 4-in-a-rows than any other column. This simple heuristic alone significantly improves play quality.
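
The claim about the center column can be verified by enumeration. The sketch below lists every in-bounds run of four cells on a 6×7 board and counts, for each column, how many of those runs touch it; the names are illustrative.

```python
ROWS, COLS = 6, 7
DIRECTIONS = ((0, 1), (1, 0), (1, 1), (1, -1))  # horizontal, vertical, two diagonals


def all_windows():
    """Yield every in-bounds run of 4 cells as a list of (row, col) pairs."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in DIRECTIONS:
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS for rr, cc in cells):
                    yield cells


windows = list(all_windows())
touching = [0] * COLS
for cells in windows:
    for col in {cc for _, cc in cells}:  # count each window once per column
        touching[col] += 1

print(len(windows))  # 69
print(touching)      # [15, 27, 39, 51, 39, 27, 15]
```

Of the 69 possible winning lines, 51 pass through column 3, compared with 15 for either edge column.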

2. Window Evaluation

Every possible window of 4 cells is scanned across all four directions (horizontal, vertical, both diagonals):

Pattern	Score
4 in a row (own)	+100
3 + 1 empty (own)	+5
2 + 2 empty (own)	+2
3 + 1 empty (opponent)	−4

The asymmetry (own 3-in-a-row = +5, opponent 3-in-a-row = −4) makes the agent slightly more offensive than defensive — a deliberate design choice, since the first mover advantage in Connect 4 favors aggression.
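
Putting the two components together gives an evaluation function along the following lines. This is a minimal sketch under stated assumptions: the board is a 6×7 list of lists with 0 for empty cells, patterns not listed in the table score 0, and the function names are hypothetical rather than taken from the original code.

```python
ROWS, COLS, EMPTY = 6, 7, 0
CENTER_COL = COLS // 2  # column 3, zero-indexed


def score_window(window, player, opponent):
    """Score one run of 4 cells using the pattern table above."""
    own = window.count(player)
    opp = window.count(opponent)
    empty = window.count(EMPTY)
    if own == 4:
        return 100
    if own == 3 and empty == 1:
        return 5
    if own == 2 and empty == 2:
        return 2
    if opp == 3 and empty == 1:
        return -4
    return 0  # assumption: unlisted patterns are neutral


def evaluate(board, player, opponent):
    """Center-column bonus plus the sum of all window scores."""
    score = 3 * sum(1 for r in range(ROWS) if board[r][CENTER_COL] == player)
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS for rr, cc in cells):
                    window = [board[rr][cc] for rr, cc in cells]
                    score += score_window(window, player, opponent)
    return score
```

For example, a lone piece in the bottom center cell scores 3 (the center bonus only), and adding a neighbor in column 2 raises the score to 9, because three horizontal windows now contain two own pieces and two empty cells.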

3.3 Move Ordering

A critical optimization is to sort the valid moves by their distance from the center column before expanding them.

Center moves are explored first, producing better alpha-beta bounds earlier, which prunes more branches. This roughly doubles search efficiency without any change in move quality.
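
A center-first ordering can be written as a one-line sort. The sketch below assumes zero-indexed columns and is an illustration of the idea rather than the original listing; `order_moves` is a hypothetical name.

```python
CENTER_COL = 3  # zero-indexed center of a 7-column board


def order_moves(valid_columns):
    """Sort columns so the ones closest to the center are searched first."""
    return sorted(valid_columns, key=lambda col: abs(col - CENTER_COL))


print(order_moves(range(7)))      # [3, 2, 4, 1, 5, 0, 6]
print(order_moves([0, 1, 5, 6]))  # [1, 5, 0, 6]
```

Python's sort is stable, so columns at equal distance from the center keep their original left-to-right order.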

3.4 Depth as Difficulty

Difficulty	Search Depth	Effective Lookahead
Easy	2	1 turn per player
Medium	4	2 turns per player
Hard	6	3 turns per player

At depth 6, the agent is functionally unbeatable by casual players. It identifies threats 3 full turns before they materialize and blocks preemptively.
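
The difficulty setting only changes the depth passed to the search. The sketch below shows a depth-limited alpha-beta search over Connect 4 boards under simplifying assumptions: a 6×7 list-of-lists board with players encoded as 1 and 2, a fixed center-first move order, and a placeholder score of 0 at the depth cutoff where the heuristic from Section 3.2 would normally be called. All names are hypothetical.

```python
import math

ROWS, COLS, EMPTY = 6, 7, 0
DEPTH_BY_DIFFICULTY = {"easy": 2, "medium": 4, "hard": 6}  # plies, from the table


def valid_moves(board):
    return [c for c in (3, 2, 4, 1, 5, 0, 6) if board[0][c] == EMPTY]  # center first


def drop(board, col, player):
    clone = [row[:] for row in board]
    row = max(r for r in range(ROWS) if clone[r][col] == EMPTY)  # lowest empty cell
    clone[row][col] = player
    return clone


def has_four(board, player):
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                cells = [(r + i * dr, c + i * dc) for i in range(4)]
                if all(0 <= rr < ROWS and 0 <= cc < COLS and board[rr][cc] == player
                       for rr, cc in cells):
                    return True
    return False


def search(board, depth, alpha, beta, to_move, me):
    """Depth-limited minimax value of board from `me`'s point of view."""
    other = 3 - to_move  # the player who made the previous move
    if has_four(board, other):  # the previous move ended the game
        return (-1 if to_move == me else 1) * (1000 + depth)
    moves = valid_moves(board)
    if depth == 0 or not moves:
        return 0  # placeholder: a heuristic evaluation would go here
    best = -math.inf if to_move == me else math.inf
    for col in moves:
        value = search(drop(board, col, to_move), depth - 1, alpha, beta, other, me)
        if to_move == me:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break
    return best


def best_move(board, me, difficulty):
    depth = DEPTH_BY_DIFFICULTY[difficulty]

    def score(col):
        return search(drop(board, col, me), depth - 1, -math.inf, math.inf, 3 - me, me)

    return max(valid_moves(board), key=score)
```

Even at the "easy" depth of 2 plies, this search blocks a three-piece stack, since it sees the opponent's winning reply one ply ahead. Adding `depth` to the terminal score makes the search prefer faster wins and slower losses.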

4. Head-to-Head: DQN vs Alpha-Beta

4.1 Experimental Setup

We ran automated battles between the DQN (hard mode, no safety layer) and Alpha-Beta (depth 6). To prevent deterministic repetition, the first move for each agent was randomized from valid columns. All subsequent moves used full-strength policies.

4.2 Results

Alpha-Beta wins consistently. The result is decisive and expected: a depth-6 search with a well-tuned heuristic systematically exploits the DQN's inability to look ahead. The DQN agent, playing without its safety layer, walks into traps that the minimax algorithm identified 3 moves prior.

4.3 Analysis

The outcome illustrates a fundamental asymmetry between the two approaches:

Dimension	DQN	Alpha-Beta
Decision basis	Learned pattern matching	Depth-limited tree search
Lookahead	None (single-step Q-value)	6 plies (3 full turns)
Training required	10,000 episodes	None
Compute at inference	Single forward pass (~1ms)	Tree traversal (~200ms at depth 6)
Adaptability	Generalizes to unseen states	Fixed heuristic
Determinism	Stochastic in training (ε-greedy), greedy at evaluation	Deterministic
Defensive capability	Poor without safety layer	Built-in via search

The DQN approximates strategic understanding through pattern recognition. The Alpha-Beta agent calculates its moves by enumerating continuations to a fixed depth. For a game with Connect 4's state space, calculation wins.

However, this conclusion does not generalize. In games with larger state spaces, such as Go or Chess, where searching to the end of the game is intractable, learned heuristics become essential. AlphaZero's victory over Stockfish showed that a learned neural evaluation combined with search can surpass hand-crafted heuristics (Silver et al., 2018).

5. Discussion

5.1 More Training ≠ Better Model

The most counterintuitive finding: the strongest DQN checkpoint was at episode 2,000, not episode 10,000. This challenges the common assumption that longer training produces better agents and underscores the importance of periodic evaluation during training.

The 89.0% → 57.6% → 76.0% trajectory suggests the agent cycles through strategy discovery, catastrophic forgetting, and partial recovery. This pattern is consistent with the instability of off-policy learning with bootstrapping and function approximation (the "deadly triad" described by Sutton & Barto, 2018).

5.2 Opponent Quality Determines Agent Quality

Training against a random opponent produced an agent that excels at offense but fails at defense. The safety layer patches this symptom, but the root cause is the training distribution. Self-play (Silver et al., 2017) would address this by ensuring the agent faces an opponent that improves in tandem, naturally introducing the defensive pressure missing from random-opponent training.

5.3 Hybrid Systems Are Pragmatic

The safety layer demonstrates a practical pattern: combine fast-but-imperfect learned policies with slow-but-reliable rule-based checks. This is not a failure of the learning approach — it's an engineering solution that leverages the strengths of both paradigms. Production systems from game engines to autonomous vehicles use similar architectures.

5.4 Stanford Comparison

Our results can be placed in the context of existing work. Guo et al. at Stanford reported Q-learning/Sarsa peaking at ~73.5% with tabular methods over 1,000 games. Our CNN-based DQN reached 89% at its best checkpoint (episode 2,000 of a 10,000-episode run). The key differentiator is function approximation: convolutional layers enable generalization to unseen board states, whereas tabular methods require explicit visitation of each state.

6. Conclusion

We built, trained, and evaluated two Connect 4 agents representing opposite ends of the AI methodology spectrum. The DQN agent demonstrates both the promise and the fragility of reinforcement learning: it achieved strong offensive play through training against a random opponent while exhibiting critical defensive blind spots. The Alpha-Beta agent demonstrates that for sufficiently small game trees, classical search with domain-specific heuristics remains hard to beat in reliability and performance.

The head-to-head battle favors Alpha-Beta decisively. But the deeper lesson is not that one approach is universally superior — it's that each has a domain of dominance. For Connect 4's ~4.5 × 10¹² states, a depth-6 search is tractable and sufficient. For Go's ~10¹⁷⁰ states, it is not. The right tool depends on the size of the tree.

Future work includes training the DQN via full self-play (agent vs itself), implementing Double DQN with Huber loss for training stability, and combining both approaches: using the DQN's value network as the evaluation function within an Alpha-Beta search framework.

References

Allis, V. (1988). A Knowledge-based Approach of Connect-Four. Vrije Universiteit Amsterdam.
Knuth, D. & Moore, R. (1975). An analysis of alpha-beta pruning. Artificial Intelligence, 6(4), 293-326.
Lin, L. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4), 293-321.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
Schaul, T. et al. (2016). Prioritized Experience Replay. ICLR 2016.
Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354-359.
Silver, D. et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140-1144.
Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction. 2nd Edition. MIT Press.
