ORG-01

ORiGAMi

active 2024-09 → present

Object Representation via Generative Autoregressive Modelling

ML · semi-structured · density-estimation
overview
Hypothesis

Neural architectures can operate natively on nested JSON with the right tokenised representation, unlocking new capabilities in density estimation, query optimisation, synthetic data generation, and predictive modelling for semi-structured data and document databases.

Most machine learning models, tools, and libraries that handle mixed-type data (e.g. pandas, scikit-learn) operate on flat tables. Real-world application-layer data is often nested JSON with optional fields and variable-length arrays. This mismatch forces practitioners to flatten the data into tabular form, which is lossy and laborious, and scales poorly because it produces very wide, sparse tables.
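To make the flattening problem concrete, here is a minimal sketch in plain Python. The `flatten` helper, the dotted-path column naming, and the three sample documents are all hypothetical and chosen only for illustration; they are not taken from the project.

```python
from typing import Any


def flatten(doc: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into one level, keyed by dotted paths.

    Hypothetical helper for illustration only.
    """
    if isinstance(doc, dict):
        out: dict[str, Any] = {}
        for key, value in doc.items():
            out.update(flatten(value, f"{prefix}.{key}" if prefix else key))
        return out
    if isinstance(doc, list):
        out = {}
        for i, value in enumerate(doc):
            out.update(flatten(value, f"{prefix}.{i}" if prefix else str(i)))
        return out
    return {prefix: doc}


# Three small documents with optional fields and variable-length arrays.
docs = [
    {"name": "a", "tags": ["x", "y", "z"]},
    {"name": "b", "address": {"city": "Sydney"}},
    {"name": "c", "tags": ["x"], "address": {"city": "Perth", "zip": "6000"}},
]

rows = [flatten(d) for d in docs]

# The table needs one column for every path seen in any document.
columns = sorted({col for row in rows for col in row})
table = [[row.get(col) for col in columns] for row in rows]

# Count the cells that had to be padded with None.
missing = sum(cell is None for row in table for cell in row)
sparsity = missing / (len(table) * len(columns))

print(columns)
print(f"{missing} of {len(table) * len(columns)} cells are empty")
```

Even in this toy case, three documents produce six columns with 8 of 18 cells empty, and every additional array position or optional field adds another column. The flattening in this sketch also drops information: an empty array produces no columns at all, so it becomes indistinguishable from a missing field.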

The project began at MongoDB Research as an attempt to make predictions from JSON data without lossy flattening into tabular form. It grew into a general architecture for semi-structured density estimation, with applications in cardinality estimation, learned indexes, and privacy-preserving synthetic data.

The key insights are:

  1. Density estimation is a general primitive on which many downstream tasks can be built: sampling, conditional generation, outlier detection, imputation, and more. By focusing on density estimation rather than a specific task, we could design a more general architecture.
  2. Transformers are general-purpose sequence models which can operate on any tokenised inputs. By designing a tokenisation scheme for nested JSON, we could leverage the power of transformers without flattening the data.
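The second insight can be illustrated with a minimal sketch of one possible tokenisation. This is not ORiGAMi's actual scheme, which is described in the papers; the token names (`<OBJ>`, `</OBJ>`, `<ARR>`, `</ARR>`, `KEY:`, `VAL:`) and the `tokenise` / `detokenise` functions are assumptions made up for this example. The sketch shows that a nested document can be turned into a flat token sequence and recovered exactly, with no flattening into columns.

```python
import json
from typing import Any, Iterator


def tokenise(value: Any) -> Iterator[str]:
    """Yield a flat token sequence for a JSON value (hypothetical scheme)."""
    if isinstance(value, dict):
        yield "<OBJ>"
        for key, child in value.items():
            yield f"KEY:{key}"
            yield from tokenise(child)
        yield "</OBJ>"
    elif isinstance(value, list):
        yield "<ARR>"
        for child in value:
            yield from tokenise(child)
        yield "</ARR>"
    else:
        # Primitive values are JSON-encoded so that types survive the round trip.
        yield f"VAL:{json.dumps(value)}"


def detokenise(tokens: list[str]) -> Any:
    """Rebuild the JSON value from the token sequence produced by tokenise."""
    it = iter(tokens)

    def parse(tok: str) -> Any:
        if tok == "<OBJ>":
            obj: dict[str, Any] = {}
            for t in it:
                if t == "</OBJ>":
                    return obj
                # Inside an object, every other token is a key followed by its value.
                obj[t[len("KEY:"):]] = parse(next(it))
            raise ValueError("unterminated object")
        if tok == "<ARR>":
            arr: list[Any] = []
            for t in it:
                if t == "</ARR>":
                    return arr
                arr.append(parse(t))
            raise ValueError("unterminated array")
        return json.loads(tok[len("VAL:"):])

    return parse(next(it))


doc = {
    "name": "a",
    "tags": ["x", "y"],
    "address": {"city": "Sydney"},
    "age": None,
    "scores": [],
}

tokens = list(tokenise(doc))
restored = detokenise(tokens)

print(tokens)
print(restored == doc)
```

Unlike a flattened table, the sequence keeps optional fields, empty arrays, and nesting depth explicit. A sequence model trained on such tokens could, in principle, factorise the probability of a document autoregressively as the product of each token's probability given the tokens before it, which is the sense in which density estimation becomes available as a primitive.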
collaborators
  1. Thomas Rückstieß · Relaxed Constraints · Lead
  2. Robin Vujanic · MongoDB · Contributor
  3. Alana Huang · MongoDB · Contributor
  4. William Hadden · MongoDB · Intern
papers
[1]
ORiGAMi: Object representation via generative autoregressive modelling
T. Rückstieß, A. Huang, R. Vujanic · Preprint · Dec 2024
[2]
Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data
T. Rückstieß, R. Vujanic · Submitted to PVLDB 2027 · Mar 2026
blog
slides
code
changelog
  1. Apr 30, 2026
    Submitted synthesis paper to VLDB
  2. Apr 28, 2026
    Blog post: "Breaking through tabular constraints"
  3. Apr 1, 2026
    Presentation at USYD - Database Reading Group
  4. Mar 2, 2026
    Preprint: "Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data" on arXiv
  5. Dec 12, 2024
    Preprint: "ORiGAMi: Object representation via generative autoregressive modelling" on arXiv

(c) 2026 Relaxed Constraints.

contact@relcon.ai