The simulator currently provided by ‣ consists of three components: a memory model (FSRS-6), a retention-based scheduler, and a basic cost model. While this is sufficient for estimating workload at different desired retention levels, the architecture suffers from four structural defects (caricatured in the sketch after the list):
- Coupled Memory Model: The simulation is hardcoded to FSRS-6 and cannot support alternative memory models.
- Information Leakage: The scheduler has direct access to the "true" memory state (Stability/Difficulty) generated by the memory model.
- State-Independent Cost: The time cost of a review is static and does not account for the state of memory (retrievability).
- Static User Behavior: The simulator assumes a user with fixed, unwavering learning habits.
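The following is a deliberately simplified caricature of that current design, not the actual simulator code; all class names, update rules, and constants here are illustrative stand-ins. It shows the three defects in one place: the forgetting curve and update rule are hardcoded, the scheduling decision reads the true memory state directly, and every review costs the same fixed number of seconds.

```python
import random
from dataclasses import dataclass

FIXED_REVIEW_COST = 8.0  # seconds per review, independent of memory state

@dataclass
class Card:
    stability: float
    difficulty: float
    elapsed_days: float

def true_retrievability(card: Card) -> float:
    # Simplified power-law forgetting curve standing in for FSRS-6.
    return (1 + card.elapsed_days / (9 * card.stability)) ** -1

def simulate_day(cards: list[Card], desired_retention: float) -> float:
    total_cost = 0.0
    for card in cards:
        r = true_retrievability(card)       # information leakage: the scheduling
        if r <= desired_retention:          # decision sees the true state
            recalled = random.random() < r
            # Hardcoded FSRS-style update (placeholder numbers).
            card.stability *= 2.0 if recalled else 0.5
            card.elapsed_days = 0.0
            total_cost += FIXED_REVIEW_COST  # state-independent cost
    return total_cost
```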
These architectural limitations create significant blind spots in our analysis:
- Inability to Benchmark Against "Ground Truth": We cannot test the scheduler against high-fidelity memory models like RWKV-P. Ideally, the simulator should utilize the most accurate memory model available—even one too computationally expensive for client-side use—to serve as the ground truth for evaluating lightweight schedulers.
- Failure to Test Robustness (Overfitting Risk): While methods like ‣ can find the optimal policy for a specific memory model, they risk overfitting to it. If we introduce noise or parameter mismatch, performance may regress dramatically (see the sketch after this list). A robust scheduler must still perform well when its internal model describes a slightly different reality than the user's actual memory.
- Unrealistic "God Mode" Assumptions: In the real world, a scheduler faces partial observability: it never knows the true memory state, only the review history. It must estimate the state. By allowing the simulated scheduler to peek at the true values, we fail to analyze how well it recovers from estimation errors.
- Inaccurate Workload Optimization: Real-world recall time is correlated with retrievability (R): when R is low, users hesitate longer before answering. By decoupling cost from retrievability, the simulator underestimates the time burden of low-retention strategies (which generate many low-R reviews).
- Ignoring User Behavior & Adherence: Real users are stochastic: they skip days, fatigue, lazy-grade ("slap Good on everything"), or churn completely. A superior scheduler shouldn't just optimize theoretical workload; it should maximize user adherence and resilience.
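As a minimal, self-contained illustration of the parameter-mismatch point, the sketch below uses a toy exponential forgetting curve (not FSRS): the scheduler picks intervals that would hit 90% retention under its assumed decay rate, but the outcomes are drawn from a faster true decay. Measured retention drifts below target even for modest mismatches; the current simulator cannot express this experiment because ground truth and scheduler share one model.

```python
import math, random

random.seed(0)

def retention_after(interval: float, stability: float, decay: float) -> float:
    # Toy exponential forgetting curve: R = exp(-decay * t / S).
    return math.exp(-decay * interval / stability)

def measured_retention(true_decay, assumed_decay, stability=10.0, target=0.9, n=10_000):
    # The scheduler picks the interval that hits `target` under its ASSUMED decay...
    interval = -stability * math.log(target) / assumed_decay
    # ...but recall outcomes are drawn from the TRUE forgetting curve.
    hits = sum(random.random() < retention_after(interval, stability, true_decay)
               for _ in range(n))
    return hits / n

for mismatch in (1.0, 1.2, 1.5):  # true decay 0%, 20%, 50% faster than assumed
    print(mismatch, round(measured_retention(true_decay=mismatch, assumed_decay=1.0), 3))
```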
To build a truly robust scheduler, we must first upgrade the simulator to expose these real-world frictions.
So, what would a general spaced-repetition simulator consist of? Ideally, it comprises four distinct modules (sketched as interfaces after the list):
- The Generative Memory Model (The Environment): This serves as the "Ground Truth." It must be model-agnostic—capable of utilizing FSRS-6, DASH or recurrent neural networks (LSTM, GRU, RWKV)—to output the precise, hidden Retrievability of a card at any given timestamp.
- The Stochastic User Model (The Behavior): This simulates human variability. It probabilistically transforms the true R into discrete outcomes (Ratings) and governs adherence, determining if and when a review actually occurs regardless of the schedule.
- The Dynamic Cost Model (The Workload): This computes the specific resource consumption of a review. Unlike static estimates, it calculates temporal and cognitive costs dynamically based on the memory state (e.g., accounting for increased latency when R is low).
- The Scheduler (The Agent): This is the black-box system being benchmarked. It operates under a strict information barrier, seeing only the review history. While it may contain an internal memory model (like FSRS), it is not required to; this allows the benchmarking of heuristic algorithms (SM-2, Leitner System, Fixed-Interval) that lack explicit memory assumptions.
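A minimal interface sketch of this decomposition, assuming Python; the class and method names below are illustrative choices, not an existing API. The key property is that only `CardHistory` (observable reviews) ever crosses into the `Scheduler`, while `retrievability` stays on the environment side.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Review:
    day: float       # when the review happened
    rating: int      # 1=Again, 2=Hard, 3=Good, 4=Easy
    duration: float  # seconds spent on the review

@dataclass
class CardHistory:
    card_id: int
    reviews: list[Review] = field(default_factory=list)

class MemoryModel(ABC):
    """Ground truth: FSRS-6, DASH, or an RNN (LSTM/GRU/RWKV) behind one interface."""
    @abstractmethod
    def update(self, history: CardHistory) -> None:
        """Absorb the latest review into the hidden memory state."""
    @abstractmethod
    def retrievability(self, card_id: int, day: float) -> float:
        """True probability of recall at `day`; never exposed to the scheduler."""

class UserModel(ABC):
    """Stochastic behavior: ratings, hesitation, skipped days, churn."""
    @abstractmethod
    def rating(self, true_r: float) -> int:
        """Sample a discrete rating given the true retrievability."""
    @abstractmethod
    def reviews_today(self, day: float, due: list[int]) -> list[int]:
        """Which of the due cards the user actually reviews today (adherence)."""

class CostModel(ABC):
    """Dynamic workload: review time depends on memory state and rating."""
    @abstractmethod
    def duration(self, true_r: float, rating: int) -> float:
        """Seconds spent, e.g. longer hesitation when true_r is low."""

class Scheduler(ABC):
    """The agent under test: sees only the review history, never the true state."""
    @abstractmethod
    def next_interval(self, history: CardHistory) -> float:
        """Days until the next review, decided from observable history alone."""
```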
This architecture transforms the tool from a calculator (verifying FSRS math against itself) into a stress-test laboratory (verifying FSRS robustness against complex, noisy, real-world realities).
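To make the information barrier concrete, here is a hedged sketch of how one simulated day could wire the modules together, reusing the illustrative types from the interface sketch above; none of these names are an existing API.

```python
def simulate_one_day(day, due_card_ids, histories, memory, user, cost, scheduler):
    """Run one simulated day; returns (total_seconds, next_due_days_by_card)."""
    total_seconds = 0.0
    next_due = {}
    for card_id in user.reviews_today(day, due_card_ids):    # adherence
        true_r = memory.retrievability(card_id, day)          # hidden ground truth
        rating = user.rating(true_r)                          # stochastic outcome
        seconds = cost.duration(true_r, rating)               # state-dependent cost
        histories[card_id].reviews.append(Review(day, rating, seconds))
        memory.update(histories[card_id])                     # evolve ground truth
        # Information barrier: the scheduler sees only the observable history.
        next_due[card_id] = day + scheduler.next_interval(histories[card_id])
        total_seconds += seconds
    return total_seconds, next_due
```

Because the loop only ever hands `histories[card_id]` to the scheduler, swapping in a heavier ground-truth model, a noisier user model, or a purely heuristic scheduler requires no changes to the loop itself.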