Mindexer

Picking the right indexes for a document database is still largely a manual job: an experienced engineer looks at the query patterns, weighs up cardinalities, and guesses. MongoDB’s Atlas Advisor helps on the cloud side, but there’s no open-source equivalent you can point at a self-hosted deployment or study as a research artefact.

Mindexer is that artefact. It reads the query workload directly from MongoDB’s intrinsic system.profile collection, draws a small random sample of actual data, enumerates composite-index candidates (up to 3 fields), and scores each one against a simple cost model built from the sampled cardinalities. A greedy selector ranks candidates by cumulative benefit and returns the top recommendations. The scope is deliberately narrow — equality, ranges, $in, $exists, $regex, $size, and negations — which keeps the algorithm tractable and the behaviour easy to reason about.

Because the cost model, the sample ratio, and the candidate-generation strategy are all exposed as tunable parameters, Mindexer is a natural vehicle for research into ML-for-systems questions: what does the advisor miss? can a learned cost model do better than hand-rolled constants? how does performance scale when the workload shifts? Two Honours theses at the University of Sydney, co-supervised with Prof. Alan Fekete, have already used the tool as their experimental harness:

Avinash Thirukumaran (2023) evaluated the recommender against real production-style workloads, stress-testing the assumptions in the original prototype.
Yan Rong (2024) pushed the candidate-generation and cost-scoring logic, measuring sensitivity to cost-model constants and workload shape.

The recent addition of a Yanex-based experiment harness means every benchmark run, parameter sweep, and baseline comparison is tracked and reproducible — the next student or agent can pick up exactly where the last one left off.

This project sits under the lab’s broader thread of ML & data systems, alongside ORiGAMi — both treat database internals as a substrate for learning and measurement rather than a black box.