Text-based PDFs convert instantly. Scanned or image-only PDFs have no text layer, so iLoveMD reads them with OCR that runs entirely in your browser. The first time you convert a scanned PDF, the OCR engine downloads once (about 9 MB); after that, OCR works offline.
PDF to Markdown Docs
iLoveMD has two paths for PDFs. Text-based PDFs (the ones where you can select and copy text with your mouse) extract instantly via PDF.js. Scanned PDFs — photographs, screenshots saved as PDF, many government forms — have no text layer, so the converter automatically falls back to in-browser OCR powered by Tesseract.js. Both paths run entirely on your device.
Text-layer extraction
This is the fast path. Most PDFs produced by Word, LaTeX, browsers, or any other tool that writes real text have an embedded text layer. PDF.js parses it, and the converter rebuilds the document structure as Markdown: headings get # prefixes, list items become bullets, and link annotations carry through.
How to tell if your PDF has a text layer: open it in any reader and try to select a sentence with your cursor. If the selection follows the text word by word, you have a text layer. If the whole page selects as one image, you're on the scanned path.
OCR fallback for scanned PDFs
When PDF.js returns little or no text (the heuristic threshold is ~10 non-whitespace characters in total across the document), the converter automatically switches to OCR. The Tesseract engine and the English language model download once (~9 MB in total) and stay cached for subsequent scans. After that first download, OCR works offline.
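The fallback decision described above can be sketched as a small function. This is a minimal illustration, not the converter's actual code: the names needsOcr, countNonWhitespace, and MIN_TEXT_CHARS are hypothetical, and treating a total of exactly 10 characters as "enough text" is an assumption, since the threshold is only given as approximate.

```typescript
// Assumed cutoff, taken from the "~10 non-whitespace characters" figure.
const MIN_TEXT_CHARS = 10;

// Count the characters in a string that are not whitespace.
function countNonWhitespace(text: string): number {
  return text.replace(/\s/g, "").length;
}

// pageTexts holds the text recovered from each page by the text-layer pass.
// Returns true when the whole document yielded too little text, which is
// the signal to switch to OCR.
function needsOcr(pageTexts: string[], minChars: number = MIN_TEXT_CHARS): boolean {
  let total = 0;
  for (const text of pageTexts) {
    total += countNonWhitespace(text);
    if (total >= minChars) {
      return false; // enough real text: stay on the fast path
    }
  }
  return true;
}
```

Counting across the whole document rather than per page means a mostly scanned PDF with one real text page stays on the fast path under this sketch.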
OCR is slower than text extraction: pages are processed one at a time, with progress updates shown in the preview pane. Best results come from clear, high-resolution scans of printed text. Handwriting, very low resolution (below ~150 DPI), and complex layouts (multi-column with images) reduce accuracy.
Tip: if a scan reads as gibberish, the source is usually too low-resolution. Re-scan at 300 DPI in grayscale and re-run.
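To check a scan's resolution before re-scanning, divide the page image's pixel width by the page width in inches. The sketch below assumes the scanned page is a single full-page image and that the page width is given in PDF points (72 per inch). The function names and the quality labels are hypothetical; only the ~150 and 300 DPI figures come from the text above.

```typescript
// PDF page sizes are expressed in points; 72 points make one inch.
const POINTS_PER_INCH = 72;

// Effective resolution of a full-page scan, in dots per inch.
function effectiveDpi(imageWidthPx: number, pageWidthPt: number): number {
  if (imageWidthPx <= 0 || pageWidthPt <= 0) {
    throw new RangeError("widths must be positive");
  }
  const pageWidthInches = pageWidthPt / POINTS_PER_INCH;
  return imageWidthPx / pageWidthInches;
}

// Rough classification using the thresholds mentioned in this section.
function scanQuality(dpi: number): "too low" | "usable" | "recommended" {
  if (dpi < 150) return "too low";
  if (dpi < 300) return "usable";
  return "recommended";
}
```

For example, a US Letter page is 612 points (8.5 inches) wide, so a 1275-pixel-wide scan works out to 150 DPI and a 2550-pixel-wide scan to 300 DPI.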
What carries over
Headings, paragraphs, ordered and unordered lists, hyperlinks (from text-layer PDFs), and basic table-like structures. The output is a readable Markdown document, not a lossless round trip: the conversion is one-way and lossy by design, so even a PDF you produced from Markdown will not extract back to its original source without loss.
What does NOT carry over: images, signatures, form fields, footnotes (extracted as inline text), multi-column layouts (often collapsed to a single column), and embedded fonts (only their characters survive). Page numbers and headers/footers usually appear as repeated lines, which are sometimes useful and sometimes noise.
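If the repeated header and footer lines are noise for your use case, they can be filtered out after conversion. The sketch below is one possible post-processing step, not a feature of the converter: it assumes you have the extracted text as one array of lines per page, and the names lineKey, stripRepeatedLines, and the 60% cutoff are hypothetical choices.

```typescript
// Normalize a line so "Page 3 of 10" and "Page 4 of 10" compare equal.
function lineKey(line: string): string {
  return line.trim().replace(/\d+/g, "#");
}

// Drop lines whose normalized form appears on at least minShare of the
// pages (and on at least two pages). Blank lines are always kept.
function stripRepeatedLines(pages: string[][], minShare: number = 0.6): string[][] {
  // For each normalized line, count the number of pages it appears on.
  const pagesWithKey: Record<string, number> = Object.create(null);
  for (const lines of pages) {
    const seenOnPage: Record<string, boolean> = Object.create(null);
    for (const line of lines) {
      const key = lineKey(line);
      if (key === "" || seenOnPage[key]) continue;
      seenOnPage[key] = true;
      pagesWithKey[key] = (pagesWithKey[key] || 0) + 1;
    }
  }
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((lines) =>
    lines.filter((line) => {
      const key = lineKey(line);
      return key === "" || (pagesWithKey[key] || 0) < threshold;
    })
  );
}
```

Because digits are normalized, this also removes body lines that differ only by a number and recur on most pages, so review the result before discarding the original.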
Privacy posture for OCR
OCR is the operation people are most nervous about: typical online services upload the scan to their servers and bill per page. iLoveMD does not. The Tesseract engine runs as WebAssembly in your browser; the only network traffic is the one-time download of the engine and language model on your first scan. The scan itself never leaves your device.
This is why the site's Content Security Policy (CSP) includes wasm-unsafe-eval: Tesseract requires it to compile its WASM module. The token is the minimum relaxation required; it permits WebAssembly compilation only, not arbitrary JS eval. Without it, OCR on scanned PDFs would fail silently.
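As an illustration of where the token sits, the sketch below assembles a Content-Security-Policy header value. The directive set shown is an assumption for the example, not iLoveMD's actual policy, and buildCsp is a hypothetical helper. The 'wasm-unsafe-eval' source expression itself is a standard CSP keyword for script-src.

```typescript
// Join a map of directive names to source lists into a CSP header value.
function buildCsp(directives: Record<string, string[]>): string {
  return Object.keys(directives)
    .map((directive) => [directive].concat(directives[directive]).join(" "))
    .join("; ");
}

// Example policy (assumed): same-origin scripts plus WebAssembly compilation.
// Note that 'unsafe-eval', which would allow JS eval(), is not present.
const csp = buildCsp({
  "default-src": ["'self'"],
  "script-src": ["'self'", "'wasm-unsafe-eval'"],
});
```

A server would send the resulting string as the value of the Content-Security-Policy response header.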
Mermaid in source PDFs
If your PDF was produced from Markdown that contained ```mermaid blocks (for example, exported from our Markdown → PDF converter), the diagrams in the PDF are rasterized images. The reverse converter extracts the surrounding text but not the embedded images, so the diagrams will be missing from the output Markdown.
If you have the original Mermaid source, paste it back into the output via the Mermaid Editor to round-trip. If you only have the PDF and need the diagram, treat the image as documentation in its own right.