Bug: doesn't seem to perform OCR on image pdfs #6

@gasparschott

This is one of the very few PDF-to-Markdown solutions I've found that preserves italics and bold from text PDFs, which is great. However, when I attempt to convert an image-only PDF, the only output is an .md file containing a link to the extracted image; no OCR text is produced. All dependencies are installed, including opencv and pytesseract.

This is the terminal output:

2025-10-25 20:36:04,152 - main - INFO - Image captioning model set up successfully.
2025-10-25 20:36:04,185 - main - INFO - Extracted 0 tables from the PDF.
2025-10-25 20:36:04,185 - main - INFO - Processing page 1
2025-10-25 20:36:04,497 - main - INFO - Extracted 0 links from the page.
2025-10-25 20:36:05,672 - main - INFO - Markdown content saved successfully.
2025-10-25 20:36:05,672 - main - INFO - Markdown content has been saved to pdf_to_MD_output/Test-italics.md
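
For reference, the behaviour I expected is roughly the fallback sketched below: if a page has extracted images but essentially no text layer, run OCR on those images instead of only linking them. This is only a sketch under assumptions. `PageContent`, `needs_ocr`, `page_to_markdown`, and the `min_chars` threshold are hypothetical names for illustration and are not taken from this project's code; the OCR step is passed in as a callable so the logic can be read without pytesseract installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PageContent:
    """Hypothetical container for what was extracted from one PDF page."""
    text: str                # text layer; empty for scanned pages
    image_paths: List[str]   # paths of images extracted from the page


def needs_ocr(page: PageContent, min_chars: int = 20) -> bool:
    """Treat a page as image-only when it has images but almost no text layer."""
    return bool(page.image_paths) and len(page.text.strip()) < min_chars


def page_to_markdown(page: PageContent, ocr: Callable[[str], str]) -> str:
    """Return the text layer, or OCR output when the page is image-only."""
    if not needs_ocr(page):
        return page.text
    # `ocr` is injected (assumption); in practice it could wrap something like
    # pytesseract.image_to_string(PIL.Image.open(path)).
    recognized = [ocr(path).strip() for path in page.image_paths]
    return "\n\n".join(text for text in recognized if text)
```

With a stand-in OCR function, `page_to_markdown(PageContent("", ["page1.png"]), fake_ocr)` returns whatever `fake_ocr` recognizes for `page1.png`, while a page that already has a text layer is returned unchanged.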
