This repository contains the generation system for the PixMo-Docs, CoSyn-400K, and CoSyn-point datasets. PixMo-Docs was used to train the Molmo model; the CoSyn datasets are an expanded version that uses an improved pipeline and covers more types of documents.
After cloning the repo, you can install the required dependencies using the following commands:
conda create --name pixmo-doc python=3.10
conda activate pixmo-doc
pip install -r requirements.txt
Then export your API keys as environment variables:
export OPENAI_API_KEY=your-api-key
export ANTHROPIC_API_KEY=your-api-key
export HF_TOKEN=your-api-key # only if you want to upload the dataset to the Hugging Face Hub
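Before starting a long generation run, it can help to confirm that the variables exported above are visible to Python. The following is a minimal sketch, not part of the repository: the helper name `missing_keys` is hypothetical, and it assumes both provider keys are needed, whereas in practice you may only need the key for the provider you actually call.

```python
import os

# Variable names taken from the export commands above.
# Assumption: both provider keys are treated as required here.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
# HF_TOKEN is only needed when uploading to the Hugging Face Hub.
OPTIONAL_KEYS = ("HF_TOKEN",)


def missing_keys(env=None, names=REQUIRED_KEYS):
    """Return the names in `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```

Passing `names=OPTIONAL_KEYS` performs the same check for the upload token.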
You need to install the following packages to use some of the pipelines:
- LaTeX: the installation depends on your operating system; refer to the official LaTeX website for details.
- Mermaid: install the Mermaid CLI using npm:
npm install -g @mermaid-js/mermaid-cli
- HTML: install playwright with:
pip install playwright
playwright install
- mplfinance:
pip install "mpl_finance<=0.10.1" "mplfinance<=0.12.10b0"
- cairosvg:
pip install "cairosvg<=2.7.1"
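Since these dependencies are only needed by some pipelines, a quick check of what is already installed can save a failed run. The sketch below is illustrative and not part of the repository. It assumes the executables are named `pdflatex` (a common LaTeX binary), `mmdc` (the command installed by the Mermaid CLI package), and `playwright`; your LaTeX distribution may expose a different binary.

```python
import importlib.util
import shutil

# Assumed executable names; adjust them to match your own installation.
EXECUTABLES = {"LaTeX": "pdflatex", "Mermaid": "mmdc", "Playwright": "playwright"}
# Import names of the optional Python packages listed above.
MODULES = {"playwright": "playwright", "mplfinance": "mplfinance", "cairosvg": "cairosvg"}


def check_optional_dependencies(executables=EXECUTABLES, modules=MODULES):
    """Map each dependency label to True if it is found, else False."""
    status = {}
    for label, exe in executables.items():
        # shutil.which returns None when the executable is not on PATH.
        status[f"{label} ({exe})"] = shutil.which(exe) is not None
    for label, module in modules.items():
        # find_spec locates a top-level module without importing it.
        status[f"{label} (Python module)"] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_optional_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```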
The main.py script is the entry point for the generation of the dataset. You can use the following main arguments to control the generation process:
python main.py -p {PIPELINE} \
-t {TYPE_OF_DATA_YOU_WANT_TO_GENERATE} \
-n {NUMBER_OF_SAMPLES} \
-m {NAME_OF_DATASET}
For example, python main.py -p "MatplotlibChartPipeline" -n 5 -m "matplotlib_test" -t "bar chart" will generate 5 bar charts using the MatplotlibChartPipeline and save them with the name "matplotlib_test".
You can use comma-separated values for the -p and -t arguments to generate multiple types of data using different pipelines at the same time.
Please refer to the main.py script for more details on the available arguments and their usage.
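If you launch generation from another script, the four flags described above can be assembled programmatically. This is a minimal sketch under stated assumptions: the helpers `build_command` and `run_generation` are hypothetical, only the `-p`, `-t`, `-n`, and `-m` flags from this section are used, and values are joined with commas and no surrounding spaces. How main.py combines multiple pipelines with multiple types is not covered here, so check the script before relying on a particular combination.

```python
import subprocess
import sys


def build_command(pipelines, types, num_samples, dataset_name):
    """Build the argv list for main.py from the four documented flags."""
    return [
        sys.executable, "main.py",
        "-p", ",".join(pipelines),   # comma-separated pipeline names
        "-t", ",".join(types),       # comma-separated data types
        "-n", str(num_samples),
        "-m", dataset_name,
    ]


def run_generation(pipelines, types, num_samples, dataset_name):
    """Run main.py from the current directory (assumed to be the repo root)."""
    subprocess.run(
        build_command(pipelines, types, num_samples, dataset_name),
        check=True,  # raise CalledProcessError if main.py exits with an error
    )
```

For instance, `run_generation(["MatplotlibChartPipeline"], ["bar chart"], 5, "matplotlib_test")` mirrors the example command above. Passing the arguments as a list avoids shell quoting problems with values that contain spaces, such as "bar chart".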
We released 25 pipelines to generate eight main categories of text-rich images: charts, tables, documents, diagrams, circuits, specialized graphics, web screens, and pointing. Each pipeline uses one renderer/programming language to generate the images.
- Chart:
  - MatplotlibChartPipeline: uses Matplotlib to generate charts such as bar charts and line charts. You can check the Matplotlib gallery for possible charts.
  - PlotlyChartPipeline: uses Plotly to generate charts. You can check the Plotly gallery for possible charts.
  - VegaLiteChartPipeline: uses Vega-Lite to generate charts. You can check the Vega-Lite gallery for possible usage.
  - LaTeXChartPipeline: uses TikZ to generate charts. This pipeline only works for simple charts such as bar charts and line charts.
  - HTMLChartPipeline: uses HTML and CSS to generate charts. This pipeline only works for simple charts such as bar charts and line charts.
- Table:
  - LaTeXTablePipeline: best for tables with complex structures.
  - MatplotlibTablePipeline: uses Matplotlib to render tables within figures.
  - PlotlyTablePipeline: only works for simple tables like single-header tables.
  - HTMLTablePipeline: only works for simple tables like single-header tables.
- Document:
  - LaTeXDocumentPipeline: works for diverse types of documents like reports, articles, etc.
  - HTMLDocumentPipeline: can create documents with complex styles and structures.
  - DOCXDocumentPipeline: generates Microsoft Word-compatible .docx documents.
- Diagram:
- Circuit:
  - SchemDrawCircuitPipeline: uses SchemDraw to generate electrical circuit diagrams.
  - LaTeXCircuitPipeline: uses TikZ circuit libraries to generate circuit diagrams.
- Specialized Graphics:
  - DALLEImagePipeline: generates images using DALL·E models.
  - RdkitChemicalPipeline: renders chemical structure diagrams using RDKit.
  - LaTeXMathPipeline: generates mathematical expressions using LaTeX.
  - LilyPondMusicPipeline: generates sheet music using LilyPond.
  - SVGGraphicPipeline: creates vector graphics using SVG format.
  - AsymptoteGraphicPipeline: uses Asymptote to generate mathematical and technical graphics.
- Web Screens:
  - HTMLScreenPipeline: creates HTML-based screen layouts rendered with Playwright and Google Chrome / Chromium.
- Pointing:
  - HTMLDocumentPointPipeline: generates HTML documents with structured points.
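To generate data for a whole category at once, the pipeline names listed above can be grouped and joined into a single value for the -p argument. The sketch below is illustrative only: the `PIPELINES` mapping and the `pipeline_argument` helper are hypothetical and not part of the repository, and the Diagram entry is left empty because its pipelines are not named in this section.

```python
# Pipeline names copied from the category list in this section.
PIPELINES = {
    "Chart": [
        "MatplotlibChartPipeline", "PlotlyChartPipeline", "VegaLiteChartPipeline",
        "LaTeXChartPipeline", "HTMLChartPipeline",
    ],
    "Table": [
        "LaTeXTablePipeline", "MatplotlibTablePipeline",
        "PlotlyTablePipeline", "HTMLTablePipeline",
    ],
    "Document": ["LaTeXDocumentPipeline", "HTMLDocumentPipeline", "DOCXDocumentPipeline"],
    "Diagram": [],  # not listed in this section; fill in from the repository
    "Circuit": ["SchemDrawCircuitPipeline", "LaTeXCircuitPipeline"],
    "Specialized Graphics": [
        "DALLEImagePipeline", "RdkitChemicalPipeline", "LaTeXMathPipeline",
        "LilyPondMusicPipeline", "SVGGraphicPipeline", "AsymptoteGraphicPipeline",
    ],
    "Web Screens": ["HTMLScreenPipeline"],
    "Pointing": ["HTMLDocumentPointPipeline"],
}


def pipeline_argument(*categories):
    """Join every pipeline in the given categories into one -p value."""
    names = [name for category in categories for name in PIPELINES[category]]
    return ",".join(names)
```

For example, `pipeline_argument("Circuit")` produces a comma-separated value that can be passed to -p directly.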
Please cite the following papers if you use this code in your work.
@article{yang2025scaling,
title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation},
author={Yang, Yue and Patel, Ajay and Deitke, Matt and Gupta, Tanmay and Weihs, Luca and Head, Andrew and Yatskar, Mark and Callison-Burch, Chris and Krishna, Ranjay and Kembhavi, Aniruddha and others},
journal={arXiv preprint arXiv:2502.14846},
year={2025}
}
@article{deitke2024molmo,
title={Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models},
author={Deitke, Matt and Clark, Christopher and Lee, Sangho and Tripathi, Rohun and Yang, Yue and Park, Jae Sung and Salehi, Mohammadreza and Muennighoff, Niklas and Lo, Kyle and Soldaini, Luca and others},
journal={arXiv preprint arXiv:2409.17146},
year={2024}
}