ORiGAMi
Object Representation via Generative Autoregressive Modelling
Neural architectures can operate natively on nested JSON given the right tokenised representation, unlocking new capabilities in density estimation, query optimisation, synthetic data generation, and predictive modelling for semi-structured data and document databases.
Most machine learning models, tools, and algorithms for mixed-type data (e.g. pandas, scikit-learn) operate on flat tables. Real-world application-layer data is often nested JSON with optional fields and variable-length arrays. This mismatch forces practitioners to flatten the data into tabular form, which is lossy and laborious, and scales poorly because it produces very wide, sparse tables.
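To make the flattening problem concrete, here is a minimal sketch in plain Python. The sample documents and the `flatten` helper are hypothetical and written only for this illustration; they are not part of the project. Each nested path becomes its own column, so optional fields and arrays of different lengths leave many cells empty.

```python
# Hypothetical sample documents, invented for this illustration.
docs = [
    {"name": "Ada", "tags": ["a", "b", "c"], "address": {"city": "Sydney"}},
    {"name": "Bob", "tags": ["x"]},
    {"name": "Cy", "address": {"city": "Perth", "zip": "6000"}},
]


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into one {dotted.path: scalar} row."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # array positions become column names
    else:
        return {prefix: value}
    row = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        row.update(flatten(child, path))
    return row


rows = [flatten(d) for d in docs]

# The table needs the union of all paths seen in any document.
columns = sorted({c for r in rows for c in r})
table = [[r.get(c) for c in columns] for r in rows]

# Count cells that had to be filled with a placeholder (None).
missing = sum(cell is None for line in table for cell in line)
sparsity = missing / (len(rows) * len(columns))

print(columns)
print(f"{missing} of {len(rows) * len(columns)} cells are empty")
```

Even with three small documents, nearly half of the cells are placeholders, and every additional array position or optional field adds another column. The flat form also cannot tell an absent field apart from an explicit null.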
The project began at MongoDB Research as an attempt to make predictions from JSON data without lossy flattening into tabular form. It grew into a general architecture for semi-structured density estimation, with applications in cardinality estimation, learned indexes, and privacy-preserving synthetic data.
The key insights are:
- Density estimation is a general primitive on which many downstream tasks can be built: sampling, conditional generation, outlier detection, imputation, and more. By focusing on density estimation rather than a specific task, we could design a more general architecture.
- Transformers are general-purpose sequence models which can operate on any tokenised inputs. By designing a tokenisation scheme for nested JSON, we could leverage the power of transformers without flattening the data.
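As an illustration of the second insight, the sketch below shows one way a nested JSON document could be turned into a flat token sequence and recovered again without loss. It is a simplified assumption for exposition, not the project's actual tokenisation scheme: the marker names (`<obj>`, `<arr>`), the `("key", ...)` and `("val", ...)` token shapes, and the functions `tokenise` and `detokenise` are all hypothetical.

```python
# Structural marker tokens (hypothetical names, chosen for this sketch only).
OBJ_START, OBJ_END = "<obj>", "</obj>"
ARR_START, ARR_END = "<arr>", "</arr>"


def tokenise(value):
    """Depth-first walk that turns a JSON value into a flat token list."""
    if isinstance(value, dict):
        tokens = [OBJ_START]
        for key, child in value.items():
            tokens.append(("key", key))
            tokens.extend(tokenise(child))
        tokens.append(OBJ_END)
        return tokens
    if isinstance(value, list):
        tokens = [ARR_START]
        for child in value:
            tokens.extend(tokenise(child))
        tokens.append(ARR_END)
        return tokens
    # Scalars are wrapped so they can never collide with marker tokens.
    return [("val", value)]


def detokenise(tokens):
    """Inverse of tokenise: rebuild the JSON value from the token list."""
    def parse(i):
        tok = tokens[i]
        if tok == OBJ_START:
            obj, i = {}, i + 1
            while tokens[i] != OBJ_END:
                _, key = tokens[i]
                obj[key], i = parse(i + 1)
            return obj, i + 1
        if tok == ARR_START:
            arr, i = [], i + 1
            while tokens[i] != ARR_END:
                item, i = parse(i)
                arr.append(item)
            return arr, i + 1
        return tok[1], i + 1

    value, _ = parse(0)
    return value


doc = {"name": "Ada", "tags": ["a", "b"], "address": {"city": "Sydney"}}
tokens = tokenise(doc)

# Map each distinct token to an integer ID, as a sequence model would expect.
vocab = {tok: idx for idx, tok in enumerate(dict.fromkeys(tokens))}
ids = [vocab[tok] for tok in tokens]

assert detokenise(tokens) == doc  # the round trip is lossless
```

Because optional fields simply emit no tokens and arrays emit as many tokens as they have elements, the sequence length tracks the document's actual content rather than a fixed column count. An autoregressive model over such sequences can then assign a probability to a whole document by predicting each token from the ones before it.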
- 1 · Relaxed Constraints · Lead
- 2 · MongoDB · Contributor
- 3 · MongoDB · Contributor
- 4 · MongoDB · Intern
- Apr 30, 2026: Submitted synthesis paper to VLDB
- Apr 28, 2026: Blog post: "Breaking through tabular constraints"
- Apr 1, 2026: Presentation at USYD - Database Reading Group
- Mar 2, 2026: Preprint "Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data" on arXiv
- Dec 12, 2024: Preprint "ORiGAMi: Object representation via generative autoregressive modelling" on arXiv