MDX-03

Mindexer

active 2022-06 → present

An experimental index advisor for MongoDB — workload-driven, cost-based, and open-source.

databases · ml-for-systems · mongodb · indexing · honours-thesis
overview
Hypothesis

Index recommendation for document databases should be grounded in actual query workloads and real data distributions, not synthetic statistics. A greedy search over composite-index candidates, scored against sampled cardinalities and a simple cost model, produces practical recommendations without requiring manual tuning expertise.

Picking the right indexes for a document database is still largely a manual job: an experienced engineer looks at the query patterns, weighs up cardinalities, and guesses. MongoDB Atlas’s Performance Advisor helps on the cloud side, but there’s no open-source equivalent you can point at a self-hosted deployment or study as a research artefact.

Mindexer is that artefact. It reads the query workload directly from MongoDB’s built-in system.profile collection (populated when the database profiler is enabled), draws a small random sample of the actual data, enumerates composite-index candidates of up to three fields, and scores each one against a simple cost model built from the sampled cardinalities. A greedy selector ranks candidates by cumulative benefit and returns the top recommendations. The scope is deliberately narrow: the supported predicates are equality, ranges, $in, $exists, $regex, $size, and negations, which keeps the algorithm tractable and the behaviour easy to reason about.
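
The pipeline above can be illustrated with a minimal Python sketch. It is not Mindexer's actual code: the function names, the in-memory SAMPLE and WORKLOAD stand-ins, and the toy cost model (number of sampled documents examined) are all assumptions made for illustration, and only equality predicates are handled.

```python
from itertools import permutations

# Hypothetical stand-ins: a small sample of documents and a workload of
# equality-only query filters (a real run would read system.profile instead).
SAMPLE = [
    {"status": "A", "city": "Sydney", "age": 30},
    {"status": "A", "city": "Perth", "age": 41},
    {"status": "B", "city": "Sydney", "age": 30},
    {"status": "B", "city": "Perth", "age": 25},
    {"status": "A", "city": "Sydney", "age": 52},
    {"status": "C", "city": "Hobart", "age": 30},
]
WORKLOAD = [
    {"status": "A", "city": "Sydney"},
    {"status": "A"},
    {"age": 30},
]


def candidates(workload, max_fields=3):
    """Enumerate ordered field tuples (composite-index candidates) per query."""
    out = set()
    for query in workload:
        fields = sorted(query)
        for k in range(1, min(max_fields, len(fields)) + 1):
            out.update(permutations(fields, k))
    return out


def query_cost(query, index, sample):
    """Toy cost: sampled documents examined when using the index's usable prefix."""
    prefix = []
    for field in index:
        if field not in query:
            break
        prefix.append(field)
    if not prefix:
        return len(sample)  # index is unusable: fall back to a full scan
    return sum(all(doc.get(f) == query[f] for f in prefix) for doc in sample)


def workload_cost(workload, indexes, sample):
    """Sum, over all queries, of the cheapest option (full scan or any index)."""
    return sum(
        min([len(sample)] + [query_cost(q, ix, sample) for ix in indexes])
        for q in workload
    )


def greedy_select(workload, sample, k=2, max_fields=3):
    """Repeatedly add the candidate that lowers total workload cost the most."""
    chosen = []
    pool = candidates(workload, max_fields)
    current = workload_cost(workload, chosen, sample)
    for _ in range(k):
        best, best_cost = None, current
        for cand in sorted(pool - set(chosen)):
            cost = workload_cost(workload, chosen + [cand], sample)
            if cost < best_cost:
                best, best_cost = cand, cost
        if best is None:  # no remaining candidate improves the workload
            break
        chosen.append(best)
        current = best_cost
    return chosen


print(greedy_select(WORKLOAD, SAMPLE))  # [('status', 'city'), ('age',)]
```

On this toy data the compound candidate ('status', 'city') is picked first because its prefix also serves the status-only query, which is the kind of cumulative benefit a greedy selector is meant to capture. A realistic cost model would also need to account for ranges, sort order, and index maintenance cost, none of which are modelled here.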

Because the cost model, the sample ratio, and the candidate-generation strategy are all exposed as tunable parameters, Mindexer is a natural vehicle for research into ML-for-systems questions: What does the advisor miss? Can a learned cost model do better than hand-rolled constants? How does performance scale when the workload shifts? Two Honours theses at the University of Sydney, co-supervised with Prof. Alan Fekete, have already used the tool as their experimental harness:

  • Avinash Thirukumaran (2023) evaluated the recommender against real production-style workloads, stress-testing the assumptions in the original prototype.
  • Yan Rong (2024) pushed the candidate-generation and cost-scoring logic, measuring sensitivity to cost-model constants and workload shape.

The recent addition of a Yanex-based experiment harness means every benchmark run, parameter sweep, and baseline comparison is tracked and reproducible — the next student or agent can pick up exactly where the last one left off.

This project sits under the lab’s broader thread of ML & data systems, alongside ORiGAMi — both treat database internals as a substrate for learning and measurement rather than a black box.

collaborators
  1. Thomas Rückstieß · MongoDB / Relaxed Constraints · Advisor
  2. Prof. Alan Fekete · University of Sydney · Advisor
  3. Dr. Michael Cahill · MongoDB / University of Sydney · Advisor
  4. Yan Rong · University of Sydney · Honours Project, 2024
  5. Avinash Thirukumaran · University of Sydney · Honours Project, 2023
papers
[1] Improving Index Recommendation for MongoDB
Yan Rong (supervised by A. Fekete and T. Rückstieß) · University of Sydney · Honours Thesis · 2024
[2] Evaluating an Index Recommender on Real Workloads
Avinash Thirukumaran (supervised by A. Fekete, M. Cahill, and T. Rückstieß) · University of Sydney · Honours Thesis · 2023
code
changelog
  1. ongoing
    Yan Rong, PhD project: Query optimisation for document databases, with focus on index and schema recommendation.
  2. Dec 6, 2024
    Yan Rong's Honours thesis: improvements to candidate generation and cost model
  3. Feb 22, 2024
    Cost model refinements — collscan cost and scoring tweaks
  4. 2023
    Avinash Thirukumaran's Honours thesis: evaluation on real workloads
  5. Jun 21, 2022
    Initial prototype — workload extraction from system.profile

(c) 2026 Relaxed Constraints.

contact@relcon.ai