Commit

viewer fix
jakep-allenai committed Jan 29, 2025
1 parent 4c35105 commit dbf6477
Showing 2 changed files with 7 additions and 8 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -69,10 +69,12 @@ cat localworkspace/results/output_*.jsonl

You can view your documents side-by-side with the original PDF renders using the `dolmaviewer` command.

```python

```bash
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
```

Now open `./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html` in your favorite browser.
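
As a small convenience for the step above, the generated previews can also be located and opened from Python. This is only a sketch: it assumes the viewer writes its HTML files into `./dolma_previews` (the directory named in the README text), and `list_previews` and `open_first_preview` are hypothetical helper names, not part of olmocr.

```python
from pathlib import Path
import webbrowser


def list_previews(preview_dir="dolma_previews"):
    """Return the HTML preview files in preview_dir, sorted by name."""
    return sorted(Path(preview_dir).glob("*.html"))


def open_first_preview(preview_dir="dolma_previews"):
    """Open the first preview in the default browser, if any exist."""
    previews = list_previews(preview_dir)
    if not previews:
        return None
    # webbrowser.open expects a URL, so convert the path to a file:// URI.
    webbrowser.open(previews[0].resolve().as_uri())
    return previews[0]
```

Calling `open_first_preview()` from the directory where the viewer was run should open one preview page; `list_previews()` alone is enough to check which files were produced.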


### Multi-node / Cluster Usage

9 changes: 3 additions & 6 deletions olmocr/viewer/dolmaviewer.py
@@ -44,12 +44,9 @@ def process_document(data, s3_client, template, output_dir):
     source_file = metadata.get('Source-File')
 
     # Generate base64 image of the corresponding PDF page
-    if source_file and source_file.startswith('s3://'):
-        local_pdf = tempfile.NamedTemporaryFile("wb+", suffix=".pdf")
-        local_pdf.write(get_s3_bytes(s3_client, source_file))
-        local_pdf.flush()
-    else:
-        raise ValueError("Expecting s3 files only")
+    local_pdf = tempfile.NamedTemporaryFile("wb+", suffix=".pdf")
+    local_pdf.write(get_s3_bytes(s3_client, source_file))
+    local_pdf.flush()
 
     pages = []
     for span in pdf_page_numbers:
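
The hunk above drops the `s3://` guard, so every `Source-File` value, including a missing one, is now passed straight to `get_s3_bytes`. How `get_s3_bytes` treats non-S3 paths is not shown in this diff. The sketch below illustrates one way the same staging step could handle both cases explicitly; `read_source_bytes`, `stage_pdf`, and the `fetch_s3` callback are hypothetical names introduced for illustration, not the project's API.

```python
import tempfile


def read_source_bytes(source_file, fetch_s3):
    """Hypothetical helper: return PDF bytes for an s3:// URI or a local path.

    fetch_s3 is a caller-supplied function (an assumption of this sketch)
    that takes an s3:// URI and returns the object's bytes.
    """
    if not source_file:
        raise ValueError("Missing Source-File in document metadata")
    if source_file.startswith("s3://"):
        return fetch_s3(source_file)
    # Anything else is treated as a path on the local filesystem.
    with open(source_file, "rb") as f:
        return f.read()


def stage_pdf(source_file, fetch_s3):
    """Copy the source PDF into a temporary .pdf file and return the handle."""
    local_pdf = tempfile.NamedTemporaryFile("wb+", suffix=".pdf")
    local_pdf.write(read_source_bytes(source_file, fetch_s3))
    local_pdf.flush()
    return local_pdf
```

The temporary file is deleted when the returned handle is closed, so the caller needs to keep it open while rendering pages from it.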