ORG-01

ORiGAMi

● active · 2024-09 → present

Object Representation via Generative Autoregressive Modelling

ML · semi-structured · density-estimation

◢overview

Hypothesis

Neural architectures can operate natively on nested JSON given the right tokenised representation, unlocking new capabilities in density estimation, query optimisation, synthetic data generation, and predictive modelling for semi-structured data and document databases.

Most machine learning models, tools and libraries for mixed-type data (e.g. pandas, scikit-learn) operate on flat tables. Real-world application-layer data is often nested JSON with optional fields and variable-length arrays. This mismatch forces practitioners to flatten the data into tabular form, which is lossy and laborious, and scales poorly because it produces very wide, sparse tables.
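
To make the flattening problem concrete, here is a minimal sketch in plain Python. The `flatten` helper, the dotted column names and the three example documents are all hypothetical and chosen only for illustration; they are not taken from the ORiGAMi code base.

```python
from typing import Any


def flatten(doc: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts/lists into one level with dotted column names.

    Hypothetical helper for illustration only.
    """
    if isinstance(doc, dict):
        items = list(doc.items())
    elif isinstance(doc, list):
        # Each array position becomes its own column: tags.0, tags.1, ...
        items = [(str(i), v) for i, v in enumerate(doc)]
    else:
        return {prefix: doc}
    flat: dict[str, Any] = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Made-up documents with an optional field and a variable-length array.
docs = [
    {"name": "Ada", "tags": ["a", "b", "c"]},
    {"name": "Bob", "address": {"city": "Sydney"}},
    {"name": "Cy", "tags": []},
]

rows = [flatten(d) for d in docs]
columns = sorted({col for row in rows for col in row})
table = [[row.get(col) for col in columns] for row in rows]
missing = sum(cell is None for row in table for cell in row)

print(columns)
print(f"{missing} of {len(table) * len(columns)} cells are empty")
```

In this toy case, three small documents already need five columns, and more than half of the cells are empty. The empty `tags` array of the third document also disappears entirely, so it can no longer be told apart from a document that has no `tags` field at all.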

The project began at MongoDB Research as an attempt to make predictions from JSON data without lossy flattening into tabular form. It grew into a general architecture for semi-structured density estimation, with applications in cardinality estimation, learned indexes, and privacy-preserving synthetic data.

The key insights are:

Density estimation is a general primitive on which many downstream tasks can be built: sampling, conditional generation, outlier detection, imputation, and more. By focusing on density estimation rather than a specific task, we could design a more general architecture.
Transformers are general-purpose sequence models that can operate on any tokenised input. By designing a tokenisation scheme for nested JSON, we could leverage the power of transformers without flattening the data.
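
The second insight can be illustrated with a toy tokeniser. The sketch below is an assumption made for illustration, not the tokenisation scheme used by ORiGAMi: the token names (`<obj>`, `</obj>`, `<arr>`, `</arr>`, `key:…`, `val:…`) and both functions are hypothetical. It walks a document depth-first, emits a flat token sequence that a sequence model could consume, and shows that the sequence can be decoded back into the original document.

```python
import json
from typing import Any, Iterator


def tokenise(value: Any) -> Iterator[str]:
    """Depth-first walk emitting structural, key and value tokens (toy scheme)."""
    if isinstance(value, dict):
        yield "<obj>"
        for key, child in value.items():
            yield f"key:{key}"
            yield from tokenise(child)
        yield "</obj>"
    elif isinstance(value, list):
        yield "<arr>"
        for child in value:
            yield from tokenise(child)
        yield "</arr>"
    else:
        # Primitive values are JSON-encoded so they can be decoded exactly.
        yield f"val:{json.dumps(value)}"


def detokenise(tokens: list[str]) -> Any:
    """Rebuild the document from a token sequence produced by tokenise()."""
    pos = 0

    def parse() -> Any:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "<obj>":
            obj = {}
            while tokens[pos] != "</obj>":
                key = tokens[pos][len("key:"):]
                pos += 1
                obj[key] = parse()
            pos += 1  # consume "</obj>"
            return obj
        if tok == "<arr>":
            arr = []
            while tokens[pos] != "</arr>":
                arr.append(parse())
            pos += 1  # consume "</arr>"
            return arr
        return json.loads(tok[len("val:"):])

    return parse()


doc = {"name": "Bob", "address": {"city": "Sydney"}, "tags": []}
tokens = list(tokenise(doc))
print(tokens)
print(detokenise(tokens) == doc)
```

Unlike a flattened table row, the token sequence keeps nesting, optional fields and empty arrays explicit, and its length grows with the size of each document rather than with the union of all fields in the collection.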

◢collaborators

Thomas Rückstieß¹, Robin Vujanic², Alana Huang³, William Hadden⁴

¹Relaxed Constraints · Lead
²MongoDB · Contributor
³MongoDB · Contributor
⁴MongoDB · Intern

◢papers

[1]

ORiGAMi: Object representation via generative autoregressive modelling

T. Rückstieß, A. Huang, R. Vujanic · Preprint · Dec 2024

PDF arXiv

[2]

Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

T. Rückstieß, R. Vujanic · Submitted to PVLDB 2027 · Mar 2026

PDF arXiv

◢blog

Breaking Through Tabular Constraints for Synthetic Data Generation

Most synthetic data generation tools assume flat tables. Real-world application data is often nested JSON with optional fields and variable-length arrays. The ORiGAMi architecture handles semi-structured data directly.

◢slides

Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

DBRG Seminar · Apr 2026

◢code

▸ rueckstiess/origami (GitHub)
▸ rueckstiess/origami-jsynth (GitHub)

◢changelog

Apr 30, 2026 ● Submitted synthesis paper to VLDB

Apr 28, 2026 ○ Blog post: "Breaking through tabular constraints"

Apr 1, 2026 ○ Presentation at USYD - Database Reading Group

Mar 2, 2026 ○ Preprint "Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data" on arXiv

Dec 12, 2024 ○ Preprint "ORiGAMi: Object representation via generative autoregressive modelling" on arXiv

(c) 2026 Relaxed Constraints.

contact@relcon.ai