CUDA OOM with large PDFs #50
Comments
Hey, thanks for the report, a few follow-up questions:
Nvm, I see it's a 3090. That's strange, I have tested on 4090s. Please share your PDF if you can.
The PDF is not confidential, but it is copyrighted. I will send you a link by mail.
EDIT: This only happens when SGLang is started standalone; when the pipeline starts the server on its own, everything is fine.
Ahh, I know why. I've noticed that on devices with less than 40GB or so, you need to specify a lower mem fraction static for sglang:

```python
# Check GPU memory, lower mem devices need a bit less KV cache space because the VLM takes additional memory
gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # Convert to GB
mem_fraction_arg = ["--mem-fraction-static", "0.80"] if gpu_memory < 60 else []
```

So try passing `--mem-fraction-static 0.80` when you start the SGLang server standalone.
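For reference, a minimal sketch of applying that suggestion when launching the server yourself. This is not the project's official launch command; the model path and port are assumptions, and the memory heuristic mirrors the snippet above:

```python
import subprocess
import sys

import torch

# Mirror the pipeline's heuristic: GPUs with less than ~60 GB of VRAM get a
# lower static memory fraction so the VLM has headroom next to the KV cache.
gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
mem_fraction_arg = ["--mem-fraction-static", "0.80"] if gpu_memory_gb < 60 else []

MODEL_PATH = "allenai/olmOCR-7B-0225-preview"  # assumption: substitute the model the pipeline expects
PORT = "30000"                                 # assumption: sglang's default port

# Blocks while the server runs, like starting it from a terminal.
subprocess.run(
    [sys.executable, "-m", "sglang.launch_server",
     "--model-path", MODEL_PATH,
     "--port", PORT,
     *mem_fraction_arg],
    check=True,
)
```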
🐛 Describe the bug
Component: Pipeline (pipeline.py)
Version: Latest as of February 27, 2025 (assumed; installed from a git clone)
Environment:
OS: Ubuntu
Python: 3.12
GPU: NVIDIA RTX 3090 (24 GB VRAM)
CUDA: 12.4
SGLang: 0.4.2
PyTorch: Installed via olmocr dependencies
Description:
When processing large PDF files (e.g., tests/2.pdf with 4180 pages), the olmocr.pipeline module excessively allocates CUDA memory (VRAM) during the preparation phase, leading to a torch.OutOfMemoryError. This occurs even when parameters like --pages_per_group and --workers are set to limit the number of pages processed simultaneously. The issue appears to stem from the pipeline loading or rendering all pages into CUDA memory at once, rather than respecting the batch size defined by --pages_per_group.
Steps to Reproduce:
1. Set up the olmocr environment.
2. Start the SGLang server.
3. Run the pipeline with a large PDF (e.g., 4180 pages).
4. Monitor VRAM usage (see the monitoring sketch after this list).
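One way to carry out the monitoring step: a small polling sketch using pynvml (provided by the nvidia-ml-py package already in this environment). It is not part of olmocr; run it alongside the pipeline to watch total VRAM usage on GPU 0:

```python
import time

import pynvml  # ships with the nvidia-ml-py package

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 (the RTX 3090)

try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"VRAM used: {info.used / 1024**3:.2f} GiB / {info.total / 1024**3:.2f} GiB")
        time.sleep(2)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```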
Actual Result:
The pipeline process allocates excessive CUDA memory (~3 GiB observed, growing with PDF size) before sending data to the server.
Logs show `Got 4180 pages to do for tests/2.pdf in worker 0`, indicating all pages are prepared at once, ignoring `--pages_per_group 10`. VRAM fills up (e.g., 20.45 GiB by the server + 3.18 GiB by the pipeline), triggering a torch.OutOfMemoryError.
Expected Result:
The pipeline should respect --pages_per_group 10, preparing and processing only 10 pages at a time in CUDA memory. VRAM usage by the pipeline should remain minimal (e.g., <1-2 GiB for 10 pages), allowing large PDFs (1000+ pages) to be processed without OOM errors.
Additional Information:
Pipeline VRAM usage spikes during PDF preparation, before server inference begins (server logs show no significant activity beyond initial requests).
Setting CUDA_VISIBLE_DEVICES="" prevents CUDA usage but crashes the pipeline with RuntimeError: No CUDA GPUs are available, indicating a hard dependency on CUDA.
Suggested Fix:
Modify pipeline.py to incrementally load and render PDF pages in batches defined by --pages_per_group, avoiding loading all pages into CUDA memory at once.
Optionally, allow CPU-only rendering as a fallback (remove the GPU check in check.py:38, or make it optional). A sketch of both ideas follows this list.
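To illustrate the suggested behavior, a minimal sketch, not the pipeline's actual code. `render_page_group` is a hypothetical stand-in for the real per-page rendering/inference; pypdf (already in the environment) supplies the page count, and the GPU check degrades to CPU instead of raising:

```python
import torch
from pypdf import PdfReader

PAGES_PER_GROUP = 10  # value of --pages_per_group


def render_page_group(pdf_path: str, page_numbers: list[int], device: str) -> None:
    """Hypothetical stand-in for the pipeline's per-group rendering/inference."""
    ...


def process_pdf(pdf_path: str) -> None:
    num_pages = len(PdfReader(pdf_path).pages)
    # Soft GPU check: fall back to CPU rendering instead of raising, per the suggested fix.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    for start in range(0, num_pages, PAGES_PER_GROUP):
        group = list(range(start, min(start + PAGES_PER_GROUP, num_pages)))
        # Only this group's pages are held in memory at any one time.
        render_page_group(pdf_path, group, device=device)
        if device == "cuda":
            torch.cuda.empty_cache()  # release the group's buffers before the next group
```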
Logs:
2025-02-27 01:41:39,217 - main - INFO - Got 4180 pages to do for tests/2.pdf in worker 0
[...]
Versions
aiohappyeyeballs==2.4.6
aiohttp==3.11.13
aiosignal==1.3.2
annotated-types==0.7.0
anthropic==0.47.2
anyio==4.8.0
asttokens==3.0.0
attrs==25.1.0
beaker-py==1.34.1
bitsandbytes==0.45.3
bleach==6.2.0
boto3==1.37.2
botocore==1.37.2
cached_path==1.6.7
cachetools==5.5.2
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
cloudpickle==3.1.1
compressed-tensors==0.8.0
cryptography==44.0.1
cuda-bindings==12.8.0
cuda-python==12.8.0
datasets==3.3.2
decorator==5.2.1
decord==0.6.0
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
docker==7.1.0
einops==0.8.1
executing==2.2.0
fastapi==0.115.8
filelock==3.17.0
flashinfer==0.1.6+cu124torch2.4
frozenlist==1.5.0
fsspec==2024.12.0
ftfy==6.3.1
fuzzysearch==0.7.3
gguf==0.10.0
google-api-core==2.24.1
google-auth==2.38.0
google-cloud-core==2.4.2
google-cloud-storage==2.19.0
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.68.0
h11==0.14.0
hf_transfer==0.1.9
httpcore==1.0.7
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.27.1
idna==3.10
importlib_metadata==8.6.1
interegular==0.3.3
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jiter==0.8.2
jmespath==1.0.1
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
lark==1.2.2
lingua-language-detector==2.0.2
litellm==1.61.17
llvmlite==0.44.0
lm-format-enforcer==0.10.11
markdown-it-py==3.0.0
markdown2==2.5.3
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mdurl==0.1.2
mistral_common==1.5.3
modelscope==1.23.1
mpmath==1.3.0
msgpack==1.1.0
msgspec==0.19.0
multidict==6.1.0
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numba==0.61.0
numpy==1.26.4
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-cusparselt-cu12==0.6.2
nvidia-ml-py==12.570.86
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
-e git+https://github.com/allenai/olmocr.git@bd08fdb4761538c96224ace9e951e5d956589790#egg=olmocr
openai==1.64.0
opencv-python-headless==4.11.0.86
orjson==3.10.15
outlines==0.0.46
packaging==24.2
pandas==2.2.3
parso==0.8.4
partial-json-parser==0.2.1.1.post5
pexpect==4.9.0
pillow==11.1.0
prometheus-fastapi-instrumentator==7.0.2
prometheus_client==0.21.1
prompt_toolkit==3.0.50
propcache==0.3.0
proto-plus==1.26.0
protobuf==5.29.3
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyairports==2.1.1
pyarrow==19.0.1
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycountry==24.6.1
pycparser==2.22
pydantic==2.10.6
pydantic_core==2.27.2
Pygments==2.19.1
pypdf==5.3.0
pypdfium2==4.30.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.20
pytz==2025.1
PyYAML==6.0.2
pyzmq==26.2.1
RapidFuzz==3.12.1
ray==2.42.1
referencing==0.36.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
rpds-py==0.23.1
rsa==4.9
s3transfer==0.11.3
safetensors==0.5.3
sentencepiece==0.2.0
sequence_align==0.2.0
setproctitle==1.3.5
setuptools==75.8.2
sgl-kernel==0.0.3.post1
sglang==0.4.2
six==1.17.0
smart-open==7.1.0
sniffio==1.3.1
stack-data==0.6.3
starlette==0.45.3
sympy==1.13.1
tiktoken==0.9.0
tokenizers==0.21.0
torch==2.5.1
torchao==0.8.0
torchvision==0.20.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.49.0
triton==3.1.0
typing_extensions==4.12.2
tzdata==2025.1
urllib3==2.3.0
uvicorn==0.34.0
uvloop==0.21.0
vllm==0.6.4.post1
watchfiles==1.0.4
wcwidth==0.2.13
webencodings==0.5.1
websockets==15.0
wrapt==1.17.2
xformers==0.0.28.post3
xgrammar==0.1.14
xxhash==3.5.0
yarl==1.18.3
zipp==3.21.0
zstandard==0.23.0