CUDA OOM with large PDFs #50
Comments
Hey, thanks for the report, a few follow-up questions:
Nvm, I see it's a 3090. That's strange, I have tested on 4090s. Please share your PDF if you can.
The PDF is not confidential, but it is copyrighted. I will send you a link by mail.
EDIT: This only happens when SGLang is started standalone; when the pipeline starts the server on its own, everything is fine.
Ahh, I know why. I've noticed that on devices with less than 40GB or so, you need to specify a lower mem fraction static for sglang:

```python
# Check GPU memory, lower mem devices need a bit less KV cache space because the VLM takes additional memory
gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # Convert to GB
mem_fraction_arg = ["--mem-fraction-static", "0.80"] if gpu_memory < 60 else []
```

So try passing `--mem-fraction-static 0.80` when you start the SGLang server standalone.
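For reference, a minimal sketch of applying that suggestion when launching the server yourself. This is not the project's official launch command; the model path and port are assumptions, and the memory heuristic mirrors the snippet above:

```python
import subprocess
import sys

import torch

# Mirror the pipeline's heuristic: GPUs with less than ~60 GB of VRAM get a
# lower static memory fraction so the VLM has headroom next to the KV cache.
gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
mem_fraction_arg = ["--mem-fraction-static", "0.80"] if gpu_memory_gb < 60 else []

MODEL_PATH = "allenai/olmOCR-7B-0225-preview"  # assumption: substitute the model the pipeline expects
PORT = "30000"                                 # assumption: sglang's default port

# Blocks while the server runs, like starting it from a terminal.
subprocess.run(
    [sys.executable, "-m", "sglang.launch_server",
     "--model-path", MODEL_PATH,
     "--port", PORT,
     *mem_fraction_arg],
    check=True,
)
```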
🐛 Describe the bug
Component: Pipeline (pipeline.py)
Version: Latest as of February 27, 2025 (assumed; installed from a git clone)
Environment:
OS: Ubuntu
Python: 3.12
GPU: NVIDIA RTX 3090 (24 GB VRAM)
CUDA: 12.4
SGLang: 0.4.2
PyTorch: Installed via olmocr dependencies
Description:
When processing large PDF files (e.g., tests/2.pdf with 4180 pages), the olmocr.pipeline module excessively allocates CUDA memory (VRAM) during the preparation phase, leading to a torch.OutOfMemoryError. This occurs even when parameters like --pages_per_group and --workers are set to limit the number of pages processed simultaneously. The issue appears to stem from the pipeline loading or rendering all pages into CUDA memory at once, rather than respecting the batch size defined by --pages_per_group.
Steps to Reproduce:
1. Set up the olmocr environment.
2. Start the SGLang server.
3. Run the pipeline with a large PDF (e.g., 4180 pages).
4. Monitor VRAM usage (see the monitoring sketch after this list).
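One way to carry out the monitoring step: a small polling sketch using pynvml (provided by the nvidia-ml-py package already in this environment). It is not part of olmocr; run it alongside the pipeline to watch total VRAM usage on GPU 0:

```python
import time

import pynvml  # ships with the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 (the RTX 3090)

try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"VRAM used: {info.used / 1024**3:.2f} GiB / {info.total / 1024**3:.2f} GiB")
        time.sleep(2)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```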
Actual Result:
The pipeline process allocates excessive CUDA memory (~3 GiB observed, growing with PDF size) before sending data to the server.
Logs show `Got 4180 pages to do for tests/2.pdf in worker 0`, indicating all pages are prepared at once, ignoring `--pages_per_group 10`. VRAM fills up (e.g., 20.45 GiB by the server + 3.18 GiB by the pipeline), triggering a torch.OutOfMemoryError.
Expected Result:
The pipeline should respect --pages_per_group 10, preparing and processing only 10 pages at a time in CUDA memory. VRAM usage by the pipeline should remain minimal (e.g., <1-2 GiB for 10 pages), allowing large PDFs (1000+ pages) to be processed without OOM errors.
Additional Information:
Pipeline VRAM usage spikes during PDF preparation, before server inference begins (server logs show no significant activity beyond initial requests).
Setting CUDA_VISIBLE_DEVICES="" prevents CUDA usage but crashes the pipeline with RuntimeError: No CUDA GPUs are available, indicating a hard dependency on CUDA.
Suggested Fix:
Modify pipeline.py to incrementally load and render PDF pages in batches defined by --pages_per_group, avoiding loading all pages into CUDA memory at once.
Optionally, allow CPU-only rendering as a fallback (remove the GPU check in check.py:38, or make it optional). A sketch of both ideas follows this list.
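To illustrate the suggested behavior, a minimal sketch, not the pipeline's actual code. `render_page_group` is a hypothetical stand-in for the real per-page rendering/inference; pypdf (already in the environment) supplies the page count, and the GPU check degrades to CPU instead of raising:

```python
import torch
from pypdf import PdfReader

PAGES_PER_GROUP = 10  # value of --pages_per_group


def render_page_group(pdf_path: str, page_numbers: list[int], device: str) -> None:
    """Hypothetical stand-in for the pipeline's per-group rendering/inference."""
    ...


def process_pdf(pdf_path: str) -> None:
    num_pages = len(PdfReader(pdf_path).pages)
    # Soft GPU check: fall back to CPU rendering instead of raising, per the suggested fix.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    for start in range(0, num_pages, PAGES_PER_GROUP):
        group = list(range(start, min(start + PAGES_PER_GROUP, num_pages)))
        # Only this group's pages are held in memory at any one time.
        render_page_group(pdf_path, group, device=device)
        if device == "cuda":
            torch.cuda.empty_cache()  # release the group's buffers before the next group
```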
Logs:
2025-02-27 01:41:39,217 - main - INFO - Got 4180 pages to do for tests/2.pdf in worker 0
[...]
Versions
aiohappyeyeballs==2.4.6
aiohttp==3.11.13
aiosignal==1.3.2
annotated-types==0.7.0
anthropic==0.47.2
anyio==4.8.0
asttokens==3.0.0
attrs==25.1.0
beaker-py==1.34.1
bitsandbytes==0.45.3
bleach==6.2.0
boto3==1.37.2
botocore==1.37.2
cached_path==1.6.7
cachetools==5.5.2
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.8.0
cryptography==44.0.1
cuda-bindings==12.8.0
cuda-python==12.8.0
datasets==3.3.2
decorator==5.2.1
decord==0.6.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
docker==7.1.0
einops==0.8.1
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer==0.1.6+cu124torch2.4
frozenlist==1.5.0
fsspec==2024.12.0
ftfy==6.3.1
fuzzysearch==0.7.3
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
google-cloud-core==2.4.2
google-cloud-storage==2.19.0
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.68.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.6.1
interegular==0.3.3
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
lingua-language-detector==2.0.2
litellm==1.61.17
llvmlite==0.44.0
lm-format-enforcer==0.10.11
markdown-it-py==3.0.0
markdown2==2.5.3
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mdurl==0.1.2
mistral_common==1.5.3
modelscope==1.23.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
-e git+https://github.com/allenai/olmocr.git@bd08fdb4761538c96224ace9e951e5d956589790#egg=olmocr
openai==1.64.0
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.0.46
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pexpect==4.9.0
pillow==11.1.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.3.0
proto-plus==1.26.0
protobuf==5.29.3
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==19.0.1
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycountry==24.6.1
pycparser==2.22
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pypdf==5.3.0
pypdfium2==4.30.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
RapidFuzz==3.12.1
ray==2.42.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.23.1
rsa==4.9
s3transfer==0.11.3
safetensors==0.5.3
sentencepiece==0.2.0
sequence_align==0.2.0
setproctitle==1.3.5
setuptools==75.8.2
sgl-kernel==0.0.3.post1
sglang==0.4.2
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.9.0
tokenizers==0.21.0
torch==2.5.1
torchao==0.8.0
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.49.0
triton==3.1.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.6.4.post1
watchfiles==1.0.4
wcwidth==0.2.13
webencodings==0.5.1
websockets==15.0
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.14
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0