fix(document_extractor): pptx file type and missing metadata_filename UnstructuredIO #11364


@hgbdev hgbdev commented Dec 4, 2024

Summary

The issue occurs when both UnstructuredIO environment variables, UNSTRUCTURED_API_URL and UNSTRUCTURED_API_KEY, are set.

After uploading a PPTX file and processing it through the Doc Extractor (UnstructuredIO), an error message appears: "If file is specified in partition_via_api, metadata_filename must be specified as well."

On inspection, the file type does not appear to be set correctly for the partition_via_api method when it handles .pptx files. This PR also passes the previously missing metadata_filename argument, taking its value from the file name.
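For reviewers, here is a minimal sketch of the kind of call the description refers to. It is not the PR's actual diff. It assumes the bytes are written to a temporary file with a `.pptx` suffix so the file type can be inferred, and that the temporary file's name is passed as `metadata_filename`. The helper name `extract_pptx_text` and the injectable `partition` parameter are hypothetical, added only so the sketch can be run without the API; when `partition` is omitted, it falls back to `partition_via_api` from the `unstructured` package.

```python
import os
import tempfile
from typing import Any, Callable, Iterable, Optional


def extract_pptx_text(
    file_content: bytes,
    api_url: str,
    api_key: str,
    partition: Optional[Callable[..., Iterable[Any]]] = None,  # hypothetical hook for testing
) -> str:
    """Send PPTX bytes to a partition function, always supplying metadata_filename."""
    if partition is None:
        # Imported lazily so the sketch can run without the unstructured package.
        from unstructured.partition.api import partition_via_api

        partition = partition_via_api

    # The .pptx suffix gives the API a file name from which to infer the type.
    with tempfile.NamedTemporaryFile(suffix=".pptx", delete=False) as tmp:
        tmp.write(file_content)
        tmp_path = tmp.name

    try:
        with open(tmp_path, "rb") as fh:
            elements = partition(
                file=fh,
                metadata_filename=tmp_path,  # required whenever `file` is given
                api_url=api_url,
                api_key=api_key,
            )
            # Collect text while the file handle is still open.
            texts = [getattr(el, "text", "") for el in elements]
    finally:
        os.unlink(tmp_path)  # clean up the temporary file

    return "\n".join(t for t in texts if t)
```

Passing a stub as `partition` makes it easy to check that `metadata_filename` ends in `.pptx` without calling the real service.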

References from the unstructured package:
Line 54
Line 107

Screenshots

Before and after screenshots (four images, not reproduced here).

Checklist

Important

Please review the checklist below before submitting your pull request.

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran dev/reformat(backend) and cd web && npx lint-staged(frontend) to appease the lint gods

@dosubot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and 🐞 bug (Something isn't working) labels on Dec 4, 2024
@dosubot added the size:S (This PR changes 10-29 lines, ignoring generated files.) label and removed the size:M label on Dec 5, 2024
@hgbdev force-pushed the fix/pptx-file-type-and-missing-metadata-filename-UnstructuredIO branch from 93c78d7 to 4fb92cc on December 5, 2024 08:05
@dosubot added the lgtm (This PR has been approved by a maintainer) label on Dec 5, 2024
@JohnJyong JohnJyong merged commit 9277156 into langgenius:main Dec 6, 2024
5 checks passed
iamjoel pushed a commit that referenced this pull request Dec 16, 2024