Abstract
This paper presents a comparative study of two fundamentally different approaches to game-playing AI in Connect 4: a Deep Q-Network (DQN) agent trained via reinforcement learning, and a classical Alpha-Beta pruning agent powered by heuristic evaluation. We trained the DQN across 10,000 self-play episodes, achieving a peak win rate of 89% against random opponents. However, systematic evaluation revealed critical failure modes — including catastrophic forgetting, defensive blind spots from weak opponent training, and the counterintuitive finding that the best-performing model emerged at only 20% through training. The Alpha-Beta agent, requiring zero training data, achieved near-perfect play through a depth-6 minimax search with a hand-tuned evaluation function. Head-to-head results consistently favored the classical approach, raising important questions about when learned heuristics outperform calculated reasoning.
1. Introduction
Connect 4 is a solved game — Victor Allis proved in 1988 that the first player can always force a win with optimal play. Despite this, it remains a compelling testbed for AI research due to its manageable state space (~4.5 × 10¹² positions) and the strategic depth required to play well.
We implemented two agents:
- DQN Agent — A Convolutional Neural Network trained via Deep Q-Learning to approximate the optimal action-value function Q(s, a).
- Alpha-Beta Agent — A recursive minimax search with alpha-beta pruning and a domain-specific heuristic evaluation function.
Our primary research questions:
- How does a learned value function compare to a hand-crafted one?
- What are the failure modes of DQN training in a self-play regime?
- Does more training always produce stronger agents?
2. DQN Agent
2.1 Architecture
The DQN follows the approach introduced by Mnih et al. (2015), adapted for the 6×7 board:
| Layer | Type | Output Shape |
|---|---|---|
| Input | — | (6, 7, 1) |
| Conv2D | 32 filters, 3×3, ReLU | (6, 7, 32) |
| Conv2D | 64 filters, 3×3, ReLU | (6, 7, 64) |
| Conv2D | 64 filters, 3×3, ReLU | (6, 7, 64) |
| Flatten | — | 2688 |
| Dense | 512 units, ReLU | 512 |
| Output | 7 units (linear) | 7 |
The three convolutional layers detect spatial patterns — horizontal sequences, diagonal threats, center control — while the dense layers map these features to column-level Q-values.
2.2 Training Protocol
Training was conducted against a uniformly random opponent over 10,000 episodes using the following hyperparameters:
| Parameter | Value |
|---|---|
| Learning rate (α) | 0.001 |
| Discount factor (γ) | 0.99 |
| Initial ε | 1.0 |
| ε decay | 0.9995 |
| ε minimum | 0.05 |
| Replay buffer capacity | 50,000 |
| Batch size | 64 |
| Target network sync | Every 500 steps |
| Optimizer | Adam |
| Loss function | Mean Squared Error |
The agent used experience replay (Lin, 1992) and a separate target network (Mnih et al., 2015) for training stability.
2.3 Training Observations
2.3.1 The Performance Dip Phenomenon
At approximately episode 4,500, win rate declined from 74.7% to 74.6% despite ε reaching its floor at 0.050. We attribute this to three factors:
- Target network refresh instability. Every 500 steps, the target network updates shift the regression target. The policy network briefly chases a moving objective.
- Opponent stochasticity. A random opponent occasionally stumbles into strong moves by chance, creating noise in the evaluation signal.
- Exploration floor effects. At ε = 0.05, the agent relies almost entirely on its learned policy. Blind spots in the value function (e.g., specific diagonal threats) become persistent failure modes until the replay buffer supplies sufficient counter-examples.
Key insight: Training is not monotonic. The agent must occasionally unlearn suboptimal policies before converging to better ones.
2.3.2 Catastrophic Forgetting
After episode ~5,000, win rate declined from 76% to 71%. Root cause analysis revealed a replay buffer homogeneity problem: the 50,000-entry buffer became saturated with low-diversity experiences from ε = 0.05 play. The varied, exploratory transitions from high-ε episodes were evicted, causing the agent to "forget" how to handle unusual board positions.
This aligns with findings by Schaul et al. (2016) on prioritized experience replay — uniform sampling from a homogeneous buffer degrades learning quality. A larger buffer (>100K) or prioritized sampling would mitigate this effect.
2.4 Checkpoint Evaluation
We evaluated five checkpoints against a random opponent (500 games each, ε = 0):
| Checkpoint | Wins | Losses | Draws | Win Rate |
|---|---|---|---|---|
| Episode 2,000 | 445 | 55 | 0 | 89.0% |
| Episode 4,000 | 343 | 156 | 1 | 68.6% |
| Episode 6,000 | 368 | 132 | 0 | 73.6% |
| Episode 8,000 | 288 | 211 | 1 | 57.6% |
| Episode 10,000 | 380 | 119 | 1 | 76.0% |
The best model was at episode 2,000 — only 20% through training. The non-monotonic performance curve (89% → 57% → 76%) demonstrates DQN's inherent instability and validates the practice of checkpoint-based model selection rather than final-episode deployment.
2.5 The Defensive Blind Spot
A critical finding: the DQN, despite achieving 89% against random opponents, could be defeated by a human using a trivial strategy — stacking pieces in a single column.
Root cause: Training against a random opponent teaches offense but not defense. The random opponent never builds coherent threats, so the agent never learns to block them. The analogy: training a boxer by fighting mannequins teaches punching but not dodging.
This is consistent with the observation by Silver et al. (2017) in AlphaGo Zero that self-play produces more robust agents than play against fixed opponents.
2.5.1 Hybrid Safety Layer
Rather than retraining (computationally expensive), we implemented a lightweight rule-based safety layer:
Priority 1: Can the AI win this turn? → Take the winning move
Priority 2: Can the opponent win next? → Block the threat
Priority 3: No immediate threats → Defer to DQN policy
The blocking check simulates each valid move on a cloned board and tests for 4-in-a-row. This O(7) operation adds negligible latency but eliminates the most egregious failure mode.
This hybrid approach — combining learned heuristics with rule-based safety checks — mirrors patterns in production game engines. AlphaGo combines a value network with Monte Carlo tree search; Stockfish blends a neural network evaluation with alpha-beta search.
3. Alpha-Beta Agent
3.1 Algorithm
The Alpha-Beta agent implements the minimax algorithm with alpha-beta pruning (Knuth & Moore, 1975). At each node:
- Maximizing player selects the move maximizing the evaluation.
- Minimizing player selects the move minimizing the evaluation.
- Pruning eliminates subtrees that cannot influence the final decision (α ≥ β).
3.2 Evaluation Function
The heuristic evaluation function scores board positions using four components:
1. Center Column Control
score += count(center_column == player) × 3
The center column (column 3) participates in the most potential 4-in-a-rows of any column position. This simple heuristic alone significantly improves play quality.
2. Window Evaluation Every possible window of 4 cells is scanned across all four directions (horizontal, vertical, both diagonals):
| Pattern | Score |
|---|---|
| 4 in a row (own) | +100 |
| 3 + 1 empty (own) | +5 |
| 2 + 2 empty (own) | +2 |
| 3 + 1 empty (opponent) | −4 |
The asymmetry (own 3-in-a-row = +5, opponent 3-in-a-row = −4) makes the agent slightly more offensive than defensive — a deliberate design choice, since the first mover advantage in Connect 4 favors aggression.
3.3 Move Ordering
A critical optimization: sorting valid moves by distance from center before expanding:
Center moves are explored first, producing better alpha-beta bounds earlier, which prunes more branches. This roughly doubles search efficiency without any change in move quality.
3.4 Depth as Difficulty
| Difficulty | Search Depth | Effective Lookahead |
|---|---|---|
| Easy | 2 | 1 turn per player |
| Medium | 4 | 2 turns per player |
| Hard | 6 | 3 turns per player |
At depth 6, the agent is functionally unbeatable by casual players. It identifies threats 3 full turns before they materialize and blocks preemptively.
4. Head-to-Head: DQN vs Alpha-Beta
4.1 Experimental Setup
We ran automated battles between the DQN (hard mode, no safety layer) and Alpha-Beta (depth 6). To prevent deterministic repetition, the first move for each agent was randomized from valid columns. All subsequent moves used full-strength policies.
4.2 Results
Alpha-Beta wins consistently. The result is decisive and expected: a depth-6 search with a well-tuned heuristic systematically exploits the DQN's inability to look ahead. The DQN agent, playing without its safety layer, walks into traps that the minimax algorithm identified 3 moves prior.
4.3 Analysis
The outcome illustrates a fundamental asymmetry between the two approaches:
| Dimension | DQN | Alpha-Beta |
|---|---|---|
| Decision basis | Learned pattern matching | Exhaustive tree search |
| Lookahead | None (single-step Q-value) | 6 plies (3 full turns) |
| Training required | 10,000 episodes | None |
| Compute at inference | Single forward pass (~1ms) | Tree traversal (~200ms at depth 6) |
| Adaptability | Generalizes to unseen states | Fixed heuristic |
| Determinism | Stochastic (ε-greedy) | Deterministic |
| Defensive capability | Poor without safety layer | Built-in via search |
The DQN approximates strategic understanding through pattern recognition. The Alpha-Beta agent calculates optimal play through enumeration. For a game with Connect 4's state space, calculation wins.
However, this conclusion does not generalize. In games with larger state spaces (Go, Chess with full-depth search intractable), learned heuristics become essential. AlphaZero's triumph over Stockfish demonstrated that neural evaluation can surpass hand-crafted heuristics when combined with search (Silver et al., 2018).
5. Discussion
5.1 More Training ≠ Better Model
The most counterintuitive finding: the strongest DQN checkpoint was at episode 2,000, not episode 10,000. This challenges the common assumption that longer training produces better agents and underscores the importance of periodic evaluation during training.
The 89% → 57% → 76% trajectory suggests the agent cycles through strategy discovery, catastrophic forgetting, and partial recovery — a pattern consistent with the instability of off-policy learning with function approximation (the "deadly triad" described by Sutton & Barto, 2018).
5.2 Opponent Quality Determines Agent Quality
Training against a random opponent produced an agent that excels at offense but fails at defense. The safety layer patches this symptom, but the root cause is the training distribution. Self-play (Silver et al., 2017) would address this by ensuring the agent faces an opponent that improves in tandem, naturally introducing the defensive pressure missing from random-opponent training.
5.3 Hybrid Systems Are Pragmatic
The safety layer demonstrates a practical pattern: combine fast-but-imperfect learned policies with slow-but-reliable rule-based checks. This is not a failure of the learning approach — it's an engineering solution that leverages the strengths of both paradigms. Production systems from game engines to autonomous vehicles use similar architectures.
5.4 Stanford Comparison
Our results contextualize against existing work. Guo et al. at Stanford reported Q-learning/Sarsa peaking at ~73.5% with tabular methods over 1,000 games. Our CNN-based DQN achieved 89% with 10,000 games. The key differentiator: function approximation via convolutional layers enables generalization to unseen board states, whereas tabular methods require explicit visitation of each state.
6. Conclusion
We built, trained, and evaluated two Connect 4 agents representing opposite ends of the AI methodology spectrum. The DQN agent demonstrates the promise and fragility of reinforcement learning — achieving strong offensive play through self-supervised training while exhibiting critical defensive blind spots. The Alpha-Beta agent demonstrates that for sufficiently small game trees, classical search with domain-specific heuristics remains unmatched in reliability and performance.
The head-to-head battle favors Alpha-Beta decisively. But the deeper lesson is not that one approach is universally superior — it's that each has a domain of dominance. For Connect 4's ~4.5 × 10¹² states, a depth-6 search is tractable and sufficient. For Go's ~10¹⁷⁰ states, it is not. The right tool depends on the size of the tree.
Future work includes training the DQN via full self-play (agent vs itself), implementing Double DQN with Huber loss for training stability, and combining both approaches: using the DQN's value network as the evaluation function within an Alpha-Beta search framework.
References
- Allis, V. (1988). A Knowledge-based Approach of Connect-Four. Vrije Universiteit Amsterdam.
- Knuth, D. & Moore, R. (1975). An analysis of alpha-beta pruning. Artificial Intelligence, 6(4), 293-326.
- Lin, L. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4), 293-321.
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
- Schaul, T. et al. (2016). Prioritized Experience Replay. ICLR 2016.
- Silver, D. et al. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354-359.
- Silver, D. et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140-1144.
- Sutton, R. & Barto, A. (2018). Reinforcement Learning: An Introduction. 2nd Edition. MIT Press.

