
This guide walks you from zero to working: you’ll learn what OCR is (and why PDFs can be tricky), how to turn any PDF—including those with screenshots of tables—into text, and how to let an LLM do the heavy lifting to clean OCR noise, reconstruct tables, and summarize the document. We’ll use DeepInfra’s OpenAI-compatible API and the Kimi K2 model.
You’ll also get a beginner-friendly explanation of a complete, working script that you can paste into a notebook or .py file and run immediately.
A PDF isn’t a “document” in the word-processor sense—it’s a set of drawing instructions. On one page, you might have vector text (actual characters placed at x/y coordinates), while on another, you only have embedded images (such as scans, screenshots, or photos). In many real files, you get hybrids where both appear on the same page. Vector text is easy: libraries can read the characters directly. Images are not: the text is just pixels and must be recognized. Even when text exists, extraction can be brittle due to ligatures (“fi”), kerning tricks, unusual encodings, multi-column layouts, or invisible clipping layers. That’s why a naïve “copy text from PDF” often returns jumbled paragraphs, missing numbers, or headers mixed into the body.
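If you want to know which case you are dealing with before reaching for OCR, a quick probe with a PDF library helps. The sketch below is a minimal example assuming the pypdf package is installed; "sample.pdf" is a placeholder path. Pages that return almost no extractable text are likely image-only and will need OCR.
# Minimal probe (assumes `pip install pypdf`; "sample.pdf" is a placeholder):
# pages with little or no extractable text are probably scans or screenshots.
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
for i, page in enumerate(reader.pages, 1):
    text = page.extract_text() or ""
    print(f"Page {i}: {len(text.strip())} extractable characters")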
Optical Character Recognition (OCR) converts pictures of text into machine-readable characters. In practice, you rasterize each PDF page (often at 300–400 DPI), feed the image to an OCR engine such as Tesseract, and get back the characters it “sees.” OCR is powerful, but it’s not magic. Low resolution, motion blur, skewed scans, glare, small fonts, and decorative typefaces can lead to misreads, such as 0 ↔ O or 1 ↔ l, dropping diacritics, or breaking words at line wraps. Complex layouts are hard too: tables tend to be returned as plain lines of text, columns may be read in the wrong order, and headers/footers sometimes leak into the main content. Good preprocessing (grayscale, autocontrast, deskew), the right page-segmentation mode, and correct language packs help a lot—but you should still expect some noise.
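One common preprocessing step worth automating is orientation detection: scans that come in sideways will OCR as garbage unless you rotate them upright first. A minimal sketch, assuming Tesseract's bundled orientation-and-script-detection (OSD) data is available, looks like this; how you apply the suggested rotation (for example via PIL's Image.rotate, which turns counter-clockwise for positive angles) is left to you.
# Detect how many degrees Tesseract suggests rotating a page to make it upright.
import re
import pytesseract

def detect_rotation(img):
    osd = pytesseract.image_to_osd(img)            # orientation & script detection
    match = re.search(r"Rotate: (\d+)", osd)
    return int(match.group(1)) if match else 0     # 0 means the page is already upright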
An LLM is the missing cleanup crew. After OCR, you can ask the model to:
- fix common character confusions (0 ↔ O, 1 ↔ l) and rejoin words broken by hyphenation,
- infer headings and section structure from the raw text,
- reconstruct tabular data as clean Markdown tables, and
- produce an executive summary of the whole document.
This pipeline is especially effective when your PDFs are screenshots of tables. The OCR recovers the numbers; the LLM then organizes them into human-friendly tables and narratives, while you steer quality with strict prompts (e.g., “do not invent numbers; note uncertainty”). The result feels like a readable, searchable document—even when the original was just pixels.
Below is a ready-to-run example that:
- rasterizes every PDF page and OCRs it with Tesseract,
- sends the page-by-page OCR text to the Kimi K2 model on DeepInfra, and
- returns a structured JSON result plus a saved Markdown report.
System prerequisites
Ubuntu/Debian: sudo apt-get install poppler-utils
Windows: install Poppler and add bin to PATH.
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Windows: install Tesseract and add it to PATH.
Python packages
pip install --upgrade openai pdf2image pytesseract pillow
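Before running the pipeline, it is worth confirming that both system tools are actually reachable from Python. A small check like the sketch below (using only calls that pytesseract and the standard library provide) saves you from cryptic errors later; on Windows you can also pass poppler_path=... to convert_from_path if Poppler is not on PATH.
# Quick sanity check that Tesseract and Poppler are installed and on PATH.
import shutil
import pytesseract

print("Tesseract:", pytesseract.get_tesseract_version())   # raises if the binary is missing
print("Poppler pdftoppm:", shutil.which("pdftoppm") or "NOT FOUND - install poppler-utils")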
To use the LLM later on, we need to create a client for DeepInfra's OpenAI-compatible API using your API key. Either you already have it in your environment variables, or you can use this code to prompt for it directly:
# Setup: store your token securely and create a DeepInfra OpenAI-compatible client
import os, getpass
from openai import OpenAI
os.environ["DEEPINFRA_API_TOKEN"] = getpass.getpass("Paste your DeepInfra API token: ")
client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

The following block turns a PDF into plain text by rendering each page as an image and then reading that image with Tesseract. The work starts in ocr_pdf_to_pages: convert_from_path rasterizes every PDF page at the chosen resolution (dpi=300 by default). That step relies on Poppler; if Poppler isn’t installed or on your PATH, pdf2image can’t produce images, and you’ll see a “PDFInfoNotInstalledError”. Once the pages are images, the function loops in order and passes each image to ocr_page. The return value is a Python list of strings where item 0 is page 1’s text, item 1 is page 2’s text, and so on—ideal to keep page order intact for downstream summarization.
from pdf2image import convert_from_path
from PIL import Image, ImageOps
import pytesseract
def ocr_page(img, lang="eng", psm=6):
    # Light cleanup only
    g = ImageOps.grayscale(img)
    g = ImageOps.autocontrast(g, cutoff=1)
    cfg = f"--oem 3 --psm {psm} -l {lang} --dpi 300"
    return pytesseract.image_to_string(g, config=cfg)

def ocr_pdf_to_pages(pdf_path, dpi=300, lang="eng", psm=6):
    images = convert_from_path(pdf_path, dpi=dpi)
    ocr_texts = []
    for i, img in enumerate(images):
        text = ocr_page(img, lang=lang, psm=psm)
        ocr_texts.append(text)
    return ocr_texts

Inside ocr_page, the image is lightly preprocessed to support OCR while avoiding heavy computer-vision operations. Converting the image to grayscale reduces color noise, and applying autocontrast stretches the histogram so that dark text becomes darker and faint text becomes more visible. This often improves recognition quality on scanned documents and photos of tables.
The cfg string contains the main Tesseract settings:
- --oem 3 selects the default OCR engine mode (the LSTM engine, falling back to the legacy engine where available),
- --psm 6 sets the page-segmentation mode to “assume a single uniform block of text”, which works well for simple pages and table screenshots,
- -l {lang} picks the language pack (here eng), and
- --dpi 300 tells Tesseract the resolution the page was rendered at, matching the dpi used in convert_from_path.
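If a page OCRs badly, lang and psm are the two knobs you will most often touch. The sketch below is a small experiment, not part of the pipeline: it lists the language packs your Tesseract install actually has and compares two segmentation modes on the same image. The page_image variable is a placeholder for any PIL image you already rasterized.
# Exploratory sketch: inspect available language packs and compare two
# page-segmentation modes on an already-rasterized page (a PIL.Image object).
import pytesseract

print("Installed language packs:", pytesseract.get_languages(config=""))

# page_image is assumed to come from convert_from_path(...)[0]
# psm 6: one uniform block of text; psm 4: a single column of variable-size text
for psm in (6, 4):
    text = pytesseract.image_to_string(page_image, config=f"--oem 3 --psm {psm} -l eng")
    print(f"--psm {psm}: {len(text.split())} words recognized")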
The next block is the LLM orchestration that turns raw OCR text into a clean, structured result. First, MODEL selects the DeepInfra-hosted Kimi K2 model that will do the heavy lifting. The SYSTEM prompt defines the model’s job description: it should act like a document restoration and summarization specialist. Those instructions are intentionally specific—clean common OCR mistakes (like 0/O, 1/l and hyphenation), infer structure (headings/sections), and rebuild tables as Markdown. Crucially, it also tells the model to return a compact JSON object only. That single requirement makes the output predictable and easy to parse later.
MODEL = "moonshotai/Kimi-K2-Instruct-0905"
SYSTEM = (
    "You are a document restoration and summarization specialist. "
    "Given raw OCR text of each PDF page, clean obvious OCR errors (hyphenation, 0/O, 1/l), "
    "infer headings and structure, reconstruct any tabular data as Markdown tables with headers, "
    "and produce an executive summary that captures purpose, key figures, dates, and decisions. "
    "If something is ambiguous, note uncertainty briefly. Respond with a compact JSON object only."
)

USER_TEMPLATE = (
    "Document title: {title}\n\n"
    "Here are the OCR texts for each page (in order). Use them to build a cleaned, structured report.\n\n"
    "<PAGES>\n{pages}\n</PAGES>\n\n"
    "Output JSON schema:\n"
    "{{\n"
    ' "title": string,\n'
    ' "executive_summary": string,\n'
    ' "sections": [ {{ "heading": string, "summary": string }} ],\n'
    ' "tables_markdown": [ string ],\n'
    ' "entities": {{ "people": [string], "orgs": [string], "dates": [string], "figures": [string] }}\n'
    "}}\n"
)
def pages_block(ocr_texts):
    lines = []
    for i, t in enumerate(ocr_texts, 1):
        lines.append(f"<PAGE index=\"{i}\">\n{t}\n</PAGE>")
    return "\n\n".join(lines)
def llm_restore_and_summarize(ocr_pages, title, max_tokens=1200, temperature=0):
    user = USER_TEMPLATE.format(title=title, pages=pages_block(ocr_pages))
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user}
    ]
    r = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # usage (when available)
    try:
        u = r.usage
        print(f"→ pt={getattr(u,'prompt_tokens',None)}, ct={getattr(u,'completion_tokens',None)}, est={getattr(u,'estimated_cost',None)}")
    except Exception:
        pass
    return r.choices[0].message.content

The USER_TEMPLATE is the envelope for your actual content. It inserts two things: {title} (a human-friendly document title) and {pages} (the full OCR content). The OCR text is wrapped in <PAGE …> … </PAGE> blocks to preserve order and give the model clear boundaries. At the end of the template, you show a target JSON schema.
Notice the doubled braces {{ … }}—that’s how you escape literal braces so Python’s .format() doesn’t treat them as placeholders. Showing the schema nudges the model to emit the exact structure you’ll parse downstream: title, an executive_summary, a list of sections, any tables_markdown, and key entities.
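As a tiny illustration of that escaping rule (not part of the pipeline), here is how .format() treats a placeholder versus doubled braces:
# {value} is substituted; {{ and }} become literal braces in the output.
template = 'Schema: {{ "title": {value} }}'
print(template.format(value="string"))   # -> Schema: { "title": string }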
pages_block(ocr_texts) just assembles those <PAGE> blocks in order (1-based indexing). Keeping page order matters because many PDFs carry meaning across pages, and this gives the LLM a clean, chronological transcript.
llm_restore_and_summarize(…) builds the final messages with the system role (policy) and the user role (your OCR content + schema). It then calls client.chat.completions.create with a conservative temperature=0 for reproducibility and a generous max_tokens so the model has room to return JSON and any reconstructed tables. After the call, it prints usage metrics when available—prompt tokens, completion tokens, and the provider’s estimated cost—so you can keep an eye on spend. The function returns the raw JSON string from the model; in the next step of your pipeline, you typically json.loads(…) it and render a Markdown report or store structured fields in a database.
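Parsing that return value is usually a one-liner, but models occasionally wrap JSON in a Markdown code fence despite the instructions. A small defensive helper like the sketch below (an optional addition, not in the original script) keeps the rest of the pipeline simple:
import json

def parse_llm_json(raw):
    """Parse the model's reply, tolerating an optional ```json ... ``` code fence."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]        # drop the opening ``` line (e.g. ```json)
        cleaned = cleaned.rsplit("```", 1)[0]      # drop the closing ``` fence
    return json.loads(cleaned)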
A few important implications: keeping the SYSTEM text byte-identical across calls improves cache hits (cheaper warm requests). Tight, explicit output instructions (“JSON only”) reduce parsing headaches. If your PDFs get very large, you can pre-chunk the OCR pages and call this function per chunk, then run a final merge pass—but the interface stays the same. Finally, because the schema includes tables_markdown, this flow works even when the PDF only contains screenshots of tables: the OCR gives you plaintext, and the model reorganizes it back into readable Markdown tables.
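A minimal sketch of that chunked variant might look like the following; the chunk size and the per-chunk title are illustrative choices, and the final merge pass is left out:
def summarize_in_chunks(ocr_pages, title, pages_per_chunk=5):
    """Call the LLM per group of pages and collect the partial JSON strings.
    A real pipeline would add a final merge pass over the partial results."""
    partial_results = []
    for start in range(0, len(ocr_pages), pages_per_chunk):
        chunk = ocr_pages[start:start + pages_per_chunk]
        chunk_title = f"{title} (pages {start + 1}-{start + len(chunk)})"
        partial_results.append(llm_restore_and_summarize(chunk, title=chunk_title))
    return partial_results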
This function is the end-to-end runner that turns a PDF into both structured data and a readable Markdown report. You give it a file path; it handles OCR, calls the LLM, parses the model’s JSON, formats nice Markdown, saves it, and returns everything for further use.
import json, pathlib

def run_llm_heavy_pipeline(pdf_path, dpi=300, lang="eng", psm=6, title=None):
    title = title or pathlib.Path(pdf_path).name
    print(f"[INFO] OCR… {pdf_path}")
    pages = ocr_pdf_to_pages(pdf_path, dpi=dpi, lang=lang, psm=psm)
    print(f"[INFO] {len(pages)} pages OCR'd. Sending to LLM…")
    raw = llm_restore_and_summarize(pages, title=title)
    data = json.loads(raw)  # Expecting strict JSON per system prompt
    # Build a human-readable Markdown
    md_lines = [f"# {data.get('title', title)}\n", data.get('executive_summary','').strip(), "\n\n## Sections\n"]
    for s in data.get('sections', []):
        md_lines.append(f"### {s.get('heading','(untitled)')}\n{s.get('summary','').strip()}\n")
    tables = data.get('tables_markdown', [])
    if tables:
        md_lines.append("\n## Tables\n")
        for t in tables:
            md_lines.append(t.strip() + "\n")
    ents = data.get('entities', {})
    if ents:
        md_lines.append("\n## Entities\n")
        for k, vs in ents.items():
            if vs:
                md_lines.append(f"**{k.capitalize()}**: " + ", ".join(vs) + "\n")
    out = "\n".join(md_lines).strip()
    out_path = pathlib.Path("llm_heavy_summary_" + pathlib.Path(pdf_path).stem + ".md")
    out_path.write_text(out, encoding="utf-8")
    print(f"\n[SAVED] {out_path.resolve()}\n")
    return data, out

The function keeps roles clear: OCR yields raw text; the LLM fixes noise, restores structure, and standardizes output into a predictable JSON schema; the renderer turns that schema into Markdown. Because the JSON contains both prose and tables, you get the best of both worlds: a quick report for humans and structured fields for programmatic use. If a document is very large, you can adapt this runner to chunk the pages and then merge several partial JSONs—without changing the output format your consumers rely on.
The WHO document referenced here is a small, public quick-reference handout with cardiovascular disease (CVD) risk charts for the High-income Asia Pacific region (countries like Japan, Singapore, Brunei, and South Korea). It typically spans two pages: one with the laboratory-based risk chart (which uses total cholesterol) and one with the non-laboratory version (which substitutes BMI so no blood test is needed).
Both pages are dominated by large color-coded tables (grid “heatmaps”) where each cell shows a risk band such as <5%, 5–<10%, 10–<20%, 20–<30%, or ≥30%. Because the information is packed into image-style tables, this PDF is an ideal stress test for your pipeline: OCR must read numbers and headers from pixels, and the LLM must reconstruct those grids into clean Markdown tables and a readable summary.
# Example: https://cdn.who.int/media/docs/default-source/cardiovascular-diseases/high-income-asia-pacific.pdf?sfvrsn=186ec883_2
pdf_path = "who_example_screenshot.pdf"
run_llm_heavy_pipeline(pdf_path)

In this example, pdf_path = "who_example_screenshot.pdf" points at a rasterized copy of the WHO file—i.e., a version where the tables are screenshots rather than selectable text. That’s intentional: it proves the workflow works even when the PDF contains no real text, only images. When you call run_llm_heavy_pipeline(pdf_path), the function:
- rasterizes both pages with Poppler and OCRs them with Tesseract,
- sends the page-by-page OCR text to Kimi K2 with the strict JSON instructions,
- parses the returned JSON, and
- renders and saves a Markdown report next to your script.
As the output below shows, the pipeline recovered the document’s content well enough for the LLM to interpret it and build on it for further analysis:
[INFO] OCR… who_example_screenshot.pdf
[INFO] 2 pages OCR’d. Sending to LLM…
→ pt=3703, ct=841, est=0.0035325
[SAVED] /Users/niklaslang/Desktop/Privat/Blog/DeepInfra/llm_heavy_summary_who_example_screenshot.md
({'title': 'WHO CVD Risk Charts – High-income Asia Pacific (Non-lab & Lab-based)',
'executive_summary': 'Two WHO quick-reference charts estimate 10-year fatal/non-fatal cardiovascular-disease risk for adults in Brunei Darussalam, Japan, South Korea and Singapore. Page 1 gives laboratory-based risk using age, sex, smoking, systolic BP and total cholesterol; page 2 gives a non-laboratory version that replaces cholesterol with BMI. Colour bands show risk <5 %, 5–<10 %, 10–<20 %, 20–<30 % and ≥30 %. No explicit date or authorship visible; charts are intended for primary-care screening and treatment decisions.',
'sections': [{'heading': 'Laboratory-based risk chart',
'summary': '10-year CVD risk by age (30-74), sex, smoking status, systolic BP (120-<180 mmHg) and total cholesterol (3-8 mmol/L). Each cell gives the risk category colour code.'},
{'heading': 'Non-laboratory risk chart',
'summary': 'Same structure but replaces cholesterol with BMI (18-34 kg/m²) so no blood test is required.'}],
'tables_markdown': ['### Laboratory-based 10-year CVD risk (%) – Men & Women without diabetes\n| Age | Sex-Smoke | SBP <120 | 120-139 | 140-159 | 160-179 | ≥180 |\n|-----|-----------|----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 | 10-<20 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 | 20-<30 |\n| 30-34 | F non-s | <5 | <5 | <5 | <5 | 5-<10 |\n| 30-34 | F smoker | <5 | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |',
'### Non-laboratory 10-year CVD risk (%) – Men & Women (BMI instead of cholesterol)\n| Age | Sex-Smoke | BMI <20 | 20-<25 | 25-<30 | ≥30 |\n|-----|-----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 |'],
'entities': {'people': [],
'orgs': ['WHO'],
'dates': [],
'figures': ['10-year CVD risk <5 %, 5–<10 %, 10–<20 %, 20–<30 %, ≥30 %',
'Total cholesterol 3–8 mmol/L',
'BMI 18–34 kg/m²',
'Systolic BP 120–≥180 mmHg']}},
'# WHO CVD Risk Charts – High-income Asia Pacific (Non-lab & Lab-based)\n\nTwo WHO quick-reference charts estimate 10-year fatal/non-fatal cardiovascular-disease risk for adults in Brunei Darussalam, Japan, South Korea and Singapore. Page 1 gives laboratory-based risk using age, sex, smoking, systolic BP and total cholesterol; page 2 gives a non-laboratory version that replaces cholesterol with BMI. Colour bands show risk <5 %, 5–<10 %, 10–<20 %, 20–<30 % and ≥30 %. No explicit date or authorship visible; charts are intended for primary-care screening and treatment decisions.\n\n\n## Sections\n\n### Laboratory-based risk chart\n10-year CVD risk by age (30-74), sex, smoking status, systolic BP (120-<180 mmHg) and total cholesterol (3-8 mmol/L). Each cell gives the risk category colour code.\n\n### Non-laboratory risk chart\nSame structure but replaces cholesterol with BMI (18-34 kg/m²) so no blood test is required.\n\n\n## Tables\n\n### Laboratory-based 10-year CVD risk (%) – Men & Women without diabetes\n| Age | Sex-Smoke | SBP <120 | 120-139 | 140-159 | 160-179 | ≥180 |\n|-----|-----------|----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 | 10-<20 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 | 20-<30 |\n| 30-34 | F non-s | <5 | <5 | <5 | <5 | 5-<10 |\n| 30-34 | F smoker | <5 | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |\n\n### Non-laboratory 10-year CVD risk (%) – Men & Women (BMI instead of cholesterol)\n| Age | Sex-Smoke | BMI <20 | 20-<25 | 25-<30 | ≥30 |\n|-----|-----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 |\n\n\n## Entities\n\n**Orgs**: WHO\n\n**Figures**: 10-year CVD risk <5 %, 5–<10 %, 10–<20 %, 20–<30 %, ≥30 %, Total cholesterol 3–8 mmol/L, BMI 18–34 kg/m², Systolic BP 120–≥180 mmHg')
If you prefer to use the original WHO URL directly, you can download it to a local file first and pass that path in; the rest of the pipeline remains exactly the same.
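For instance, a minimal sketch of that (using only the standard library; the local filename is just a placeholder) could look like this:
# Download the WHO PDF to a local file, then run the existing pipeline on it.
import urllib.request

url = "https://cdn.who.int/media/docs/default-source/cardiovascular-diseases/high-income-asia-pacific.pdf?sfvrsn=186ec883_2"
local_path = "who_high_income_asia_pacific.pdf"   # placeholder filename
urllib.request.urlretrieve(url, local_path)

run_llm_heavy_pipeline(local_path)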
Once you have a reliable OCR + LLM pipeline in place, the real value comes from refining it for scale, cost control, and domain accuracy. Add lightweight usage logging (token counts, estimated cost) to spot regressions early, and introduce spend guardrails that alert you when a single call or session grows beyond your budget.
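One lightweight way to do that is sketched below; the budget value and helper name are illustrative, and the estimated_cost field is the same DeepInfra usage attribute the earlier function already prints.
# Illustrative guardrail: track the reported estimated_cost per call and warn
# when the running total for this session exceeds a budget you set.
SESSION_BUDGET_USD = 0.50   # example threshold, adjust to your needs
_session_cost = 0.0

def track_usage(response):
    global _session_cost
    usage = getattr(response, "usage", None)
    cost = getattr(usage, "estimated_cost", None) if usage else None
    if cost is not None:
        _session_cost += cost
        if _session_cost > SESSION_BUDGET_USD:
            print(f"[WARN] Session cost ${_session_cost:.4f} exceeds budget ${SESSION_BUDGET_USD:.2f}")
    return response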
If you routinely process structured documents such as invoices or technical tables, consider augmenting your pipeline with a specialized table-OCR step for layout-aware extraction, while reserving the LLM for interpretation, normalization, and narrative output.
For multilingual workflows, don’t forget to install the appropriate Tesseract language packs (e.g., deu, fra) and pass the correct lang parameter—this alone can dramatically reduce hallucinations and misreads.
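For example, after installing the German pack (tesseract-ocr-deu on Ubuntu/Debian), a call might look like this; the filename is just a placeholder, and Tesseract accepts combined packs with a + separator:
# OCR a German/English document (assumes tesseract-ocr-deu is installed;
# "bericht_scan.pdf" is a placeholder path).
pages = ocr_pdf_to_pages("bericht_scan.pdf", lang="deu+eng")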
By iterating on these components, you can evolve a simple OCR routine into a robust, scalable, and language-flexible document understanding system ready for production use.