
This guide walks you from zero to working: you’ll learn what OCR is (and why PDFs can be tricky), how to turn any PDF—including those with screenshots of tables—into text, and how to let an LLM do the heavy lifting to clean OCR noise, reconstruct tables, and summarize the document. We’ll use DeepInfra’s OpenAI-compatible API and the Kimi K2 model.
You’ll also get a beginner-friendly explanation of a complete, working script that you can paste into a notebook or .py file and run immediately.
A PDF isn’t a “document” in the word-processor sense—it’s a set of drawing instructions. On one page, you might have vector text (actual characters placed at x/y coordinates), while on another, you only have embedded images (such as scans, screenshots, or photos). In many real files, you get hybrids where both appear on the same page. Vector text is easy: libraries can read the characters directly. Images are not: the text is just pixels and must be recognized. Even when text exists, extraction can be brittle due to ligatures (“fi”), kerning tricks, unusual encodings, multi-column layouts, or invisible clipping layers. That’s why a naïve “copy text from PDF” often returns jumbled paragraphs, missing numbers, or headers mixed into the body.
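If you want to know which case you are dealing with before reaching for OCR, a quick probe with a PDF library helps. The sketch below is a minimal example assuming the pypdf package is installed; "sample.pdf" is a placeholder path. Pages that return almost no extractable text are likely image-only and will need OCR.
# Minimal probe (assumes `pip install pypdf`; "sample.pdf" is a placeholder):
# pages with little or no extractable text are probably scans or screenshots.
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
for i, page in enumerate(reader.pages, 1):
    text = page.extract_text() or ""
    print(f"Page {i}: {len(text.strip())} extractable characters")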
Optical Character Recognition (OCR) converts pictures of text into machine-readable characters. In practice, you rasterize each PDF page (often at 300–400 DPI), feed the image to an OCR engine such as Tesseract, and get back the characters it “sees.” OCR is powerful, but it’s not magic. Low resolution, motion blur, skewed scans, glare, small fonts, and decorative typefaces can lead to misreads, such as 0 ↔ O or 1 ↔ l, dropping diacritics, or breaking words at line wraps. Complex layouts are hard too: tables tend to be returned as plain lines of text, columns may be read in the wrong order, and headers/footers sometimes leak into the main content. Good preprocessing (grayscale, autocontrast, deskew), the right page-segmentation mode, and correct language packs help a lot—but you should still expect some noise.
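One common preprocessing step worth automating is orientation detection: scans that come in sideways will OCR as garbage unless you rotate them upright first. A minimal sketch, assuming Tesseract's bundled orientation-and-script-detection (OSD) data is available, looks like this; how you apply the suggested rotation (for example via PIL's Image.rotate, which turns counter-clockwise for positive angles) is left to you.
# Detect how many degrees Tesseract suggests rotating a page to make it upright.
import re
import pytesseract

def detect_rotation(img):
    osd = pytesseract.image_to_osd(img)            # orientation & script detection
    match = re.search(r"Rotate: (\d+)", osd)
    return int(match.group(1)) if match else 0     # 0 means the page is already upright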
An LLM is the missing cleanup crew. After OCR, you can ask the model to:
- fix common character confusions (0 ↔ O, 1 ↔ l) and rejoin words broken by hyphenation,
- infer headings and section structure from the raw text,
- reconstruct tabular data as clean Markdown tables, and
- produce an executive summary of the whole document.
This pipeline is especially effective when your PDFs are screenshots of tables. The OCR recovers the numbers; the LLM then organizes them into human-friendly tables and narratives, while you steer quality with strict prompts (e.g., “do not invent numbers; note uncertainty”). The result feels like a readable, searchable document—even when the original was just pixels.
Below is a ready-to-run example that:
- rasterizes every PDF page and OCRs it with Tesseract,
- sends the page-by-page OCR text to the Kimi K2 model on DeepInfra, and
- returns a structured JSON result plus a saved Markdown report.
System prerequisites
Ubuntu/Debian: sudo apt-get install poppler-utils
Windows: install Poppler and add bin to PATH.
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Windows: install Tesseract and add it to PATH.
Python packages
pip install --upgrade openai pdf2image pytesseract pillow
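Before running the pipeline, it is worth confirming that both system tools are actually reachable from Python. A small check like the sketch below (using only calls that pytesseract and the standard library provide) saves you from cryptic errors later; on Windows you can also pass poppler_path=... to convert_from_path if Poppler is not on PATH.
# Quick sanity check that Tesseract and Poppler are installed and on PATH.
import shutil
import pytesseract

print("Tesseract:", pytesseract.get_tesseract_version())   # raises if the binary is missing
print("Poppler pdftoppm:", shutil.which("pdftoppm") or "NOT FOUND - install poppler-utils")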
To use the LLM later on, we need to create a client for DeepInfra's OpenAI-compatible API using your API key. Either you already have it in your environment variables, or you can use this code to prompt for it directly:
# Setup: store your token securely and create a DeepInfra OpenAI-compatible client
import os, getpass
from openai import OpenAI
os.environ["DEEPINFRA_API_TOKEN"] = getpass.getpass("Paste your DeepInfra API token: ")
client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

The following block turns a PDF into plain text by rendering each page as an image and then reading that image with Tesseract. The work starts in ocr_pdf_to_pages: convert_from_path rasterizes every PDF page at the chosen resolution (dpi=300 by default). That step relies on Poppler; if Poppler isn’t installed or on your PATH, pdf2image can’t produce images, and you’ll see a “PDFInfoNotInstalledError”. Once the pages are images, the function loops in order and passes each image to ocr_page. The return value is a Python list of strings where item 0 is page 1’s text, item 1 is page 2’s text, and so on—ideal to keep page order intact for downstream summarization.
from pdf2image import convert_from_path
from PIL import Image, ImageOps
import pytesseract
def ocr_page(img, lang="eng", psm=6):
    # Light cleanup only
    g = ImageOps.grayscale(img)
    g = ImageOps.autocontrast(g, cutoff=1)
    cfg = f"--oem 3 --psm {psm} -l {lang} --dpi 300"
    return pytesseract.image_to_string(g, config=cfg)

def ocr_pdf_to_pages(pdf_path, dpi=300, lang="eng", psm=6):
    images = convert_from_path(pdf_path, dpi=dpi)
    ocr_texts = []
    for i, img in enumerate(images):
        text = ocr_page(img, lang=lang, psm=psm)
        ocr_texts.append(text)
    return ocr_texts

Inside ocr_page, the image is lightly preprocessed to support OCR while avoiding heavy computer-vision operations. Converting the image to grayscale reduces color noise, and applying autocontrast stretches the histogram so that dark text becomes darker and faint text becomes more visible. This often improves recognition quality on scanned documents and photos of tables.
The cfg string contains the main Tesseract settings:
- --oem 3 selects the default OCR engine mode (the LSTM engine, falling back to the legacy engine where available),
- --psm 6 sets the page-segmentation mode to “assume a single uniform block of text”, which works well for simple pages and table screenshots,
- -l {lang} picks the language pack (here eng), and
- --dpi 300 tells Tesseract the resolution the page was rendered at, matching the dpi used in convert_from_path.
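If a page OCRs badly, lang and psm are the two knobs you will most often touch. The sketch below is a small experiment, not part of the pipeline: it lists the language packs your Tesseract install actually has and compares two segmentation modes on the same image. The page_image variable is a placeholder for any PIL image you already rasterized.
# Exploratory sketch: inspect available language packs and compare two
# page-segmentation modes on an already-rasterized page (a PIL.Image object).
import pytesseract

print("Installed language packs:", pytesseract.get_languages(config=""))

# page_image is assumed to come from convert_from_path(...)[0]
# psm 6: one uniform block of text; psm 4: a single column of variable-size text
for psm in (6, 4):
    text = pytesseract.image_to_string(page_image, config=f"--oem 3 --psm {psm} -l eng")
    print(f"--psm {psm}: {len(text.split())} words recognized")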
The next block is the LLM orchestration that turns raw OCR text into a clean, structured result. First, MODEL selects the DeepInfra-hosted Kimi K2 model that will do the heavy lifting. The SYSTEM prompt defines the model’s job description: it should act like a document restoration and summarization specialist. Those instructions are intentionally specific—clean common OCR mistakes (like 0/O, 1/l and hyphenation), infer structure (headings/sections), and rebuild tables as Markdown. Crucially, it also tells the model to return a compact JSON object only. That single requirement makes the output predictable and easy to parse later.
MODEL = "moonshotai/Kimi-K2-Instruct-0905"
SYSTEM = (
    "You are a document restoration and summarization specialist. "
    "Given raw OCR text of each PDF page, clean obvious OCR errors (hyphenation, 0/O, 1/l), "
    "infer headings and structure, reconstruct any tabular data as Markdown tables with headers, "
    "and produce an executive summary that captures purpose, key figures, dates, and decisions. "
    "If something is ambiguous, note uncertainty briefly. Respond with a compact JSON object only."
)

USER_TEMPLATE = (
    "Document title: {title}\n\n"
    "Here are the OCR texts for each page (in order). Use them to build a cleaned, structured report.\n\n"
    "<PAGES>\n{pages}\n</PAGES>\n\n"
    "Output JSON schema:\n"
    "{{\n"
    ' "title": string,\n'
    ' "executive_summary": string,\n'
    ' "sections": [ {{ "heading": string, "summary": string }} ],\n'
    ' "tables_markdown": [ string ],\n'
    ' "entities": {{ "people": [string], "orgs": [string], "dates": [string], "figures": [string] }}\n'
    "}}\n"
)
def pages_block(ocr_texts):
    lines = []
    for i, t in enumerate(ocr_texts, 1):
        lines.append(f"<PAGE index=\"{i}\">\n{t}\n</PAGE>")
    return "\n\n".join(lines)
def llm_restore_and_summarize(ocr_pages, title, max_tokens=1200, temperature=0):
    user = USER_TEMPLATE.format(title=title, pages=pages_block(ocr_pages))
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user}
    ]
    r = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    # usage (when available)
    try:
        u = r.usage
        print(f"→ pt={getattr(u,'prompt_tokens',None)}, ct={getattr(u,'completion_tokens',None)}, est={getattr(u,'estimated_cost',None)}")
    except Exception:
        pass
    return r.choices[0].message.content

The USER_TEMPLATE is the envelope for your actual content. It inserts two things: {title} (a human-friendly document title) and {pages} (the full OCR content). The OCR text is wrapped in <PAGE …> … </PAGE> blocks to preserve order and give the model clear boundaries. At the end of the template, you show a target JSON schema.
Notice the doubled braces {{ … }}—that’s how you escape literal braces so Python’s .format() doesn’t treat them as placeholders. Showing the schema nudges the model to emit the exact structure you’ll parse downstream: title, an executive_summary, a list of sections, any tables_markdown, and key entities.
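As a tiny illustration of that escaping rule (not part of the pipeline), here is how .format() treats a placeholder versus doubled braces:
# {value} is substituted; {{ and }} become literal braces in the output.
template = 'Schema: {{ "title": {value} }}'
print(template.format(value="string"))   # -> Schema: { "title": string }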
pages_block(ocr_texts) just assembles those <PAGE> blocks in order (1-based indexing). Keeping page order matters because many PDFs carry meaning across pages, and this gives the LLM a clean, chronological transcript.
llm_restore_and_summarize(…) builds the final messages with the system role (policy) and the user role (your OCR content + schema). It then calls client.chat.completions.create with a conservative temperature=0 for reproducibility and a generous max_tokens so the model has room to return JSON and any reconstructed tables. After the call, it prints usage metrics when available—prompt tokens, completion tokens, and the provider’s estimated cost—so you can keep an eye on spend. The function returns the raw JSON string from the model; in the next step of your pipeline, you typically json.loads(…) it and render a Markdown report or store structured fields in a database.
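Parsing that return value is usually a one-liner, but models occasionally wrap JSON in a Markdown code fence despite the instructions. A small defensive helper like the sketch below (an optional addition, not in the original script) keeps the rest of the pipeline simple:
import json

def parse_llm_json(raw):
    """Parse the model's reply, tolerating an optional ```json ... ``` code fence."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]        # drop the opening ``` line (e.g. ```json)
        cleaned = cleaned.rsplit("```", 1)[0]      # drop the closing ``` fence
    return json.loads(cleaned)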
A few important implications: keeping the SYSTEM text byte-identical across calls improves cache hits (cheaper warm requests). Tight, explicit output instructions (“JSON only”) reduce parsing headaches. If your PDFs get very large, you can pre-chunk the OCR pages and call this function per chunk, then run a final merge pass—but the interface stays the same. Finally, because the schema includes tables_markdown, this flow works even when the PDF only contains screenshots of tables: the OCR gives you plaintext, and the model reorganizes it back into readable Markdown tables.
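A minimal sketch of that chunked variant might look like the following; the chunk size and the per-chunk title are illustrative choices, and the final merge pass is left out:
def summarize_in_chunks(ocr_pages, title, pages_per_chunk=5):
    """Call the LLM per group of pages and collect the partial JSON strings.
    A real pipeline would add a final merge pass over the partial results."""
    partial_results = []
    for start in range(0, len(ocr_pages), pages_per_chunk):
        chunk = ocr_pages[start:start + pages_per_chunk]
        chunk_title = f"{title} (pages {start + 1}-{start + len(chunk)})"
        partial_results.append(llm_restore_and_summarize(chunk, title=chunk_title))
    return partial_results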
This function is the end-to-end runner that turns a PDF into both structured data and a readable Markdown report. You give it a file path; it handles OCR, calls the LLM, parses the model’s JSON, formats nice Markdown, saves it, and returns everything for further use.
import json, pathlib

def run_llm_heavy_pipeline(pdf_path, dpi=300, lang="eng", psm=6, title=None):
    title = title or pathlib.Path(pdf_path).name
    print(f"[INFO] OCR… {pdf_path}")
    pages = ocr_pdf_to_pages(pdf_path, dpi=dpi, lang=lang, psm=psm)
    print(f"[INFO] {len(pages)} pages OCR'd. Sending to LLM…")
    raw = llm_restore_and_summarize(pages, title=title)
    data = json.loads(raw)  # Expecting strict JSON per system prompt
    # Build a human-readable Markdown
    md_lines = [f"# {data.get('title', title)}\n", data.get('executive_summary','').strip(), "\n\n## Sections\n"]
    for s in data.get('sections', []):
        md_lines.append(f"### {s.get('heading','(untitled)')}\n{s.get('summary','').strip()}\n")
    tables = data.get('tables_markdown', [])
    if tables:
        md_lines.append("\n## Tables\n")
        for t in tables:
            md_lines.append(t.strip() + "\n")
    ents = data.get('entities', {})
    if ents:
        md_lines.append("\n## Entities\n")
        for k, vs in ents.items():
            if vs:
                md_lines.append(f"**{k.capitalize()}**: " + ", ".join(vs) + "\n")
    out = "\n".join(md_lines).strip()
    out_path = pathlib.Path("llm_heavy_summary_" + pathlib.Path(pdf_path).stem + ".md")
    out_path.write_text(out, encoding="utf-8")
    print(f"\n[SAVED] {out_path.resolve()}\n")
    return data, out

The function keeps roles clear: OCR yields raw text; the LLM fixes noise, restores structure, and standardizes output into a predictable JSON schema; the renderer turns that schema into Markdown. Because the JSON contains both prose and tables, you get the best of both worlds: a quick report for humans and structured fields for programmatic use. If a document is very large, you can adapt this runner to chunk the pages and then merge several partial JSONs—without changing the output format your consumers rely on.
The WHO document referenced here is a small, public quick-reference handout with cardiovascular disease (CVD) risk charts for the High-income Asia Pacific region (countries like Japan, Singapore, Brunei, and South Korea). It typically spans two pages: one with the laboratory-based risk chart (which uses total cholesterol) and one with the non-laboratory version (which substitutes BMI so no blood test is needed).
Both pages are dominated by large color-coded tables (grid “heatmaps”) where each cell shows a risk band such as <5%, 5–<10%, 10–<20%, 20–<30%, or ≥30%. Because the information is packed into image-style tables, this PDF is an ideal stress test for your pipeline: OCR must read numbers and headers from pixels, and the LLM must reconstruct those grids into clean Markdown tables and a readable summary.
# Example: https://cdn.who.int/media/docs/default-source/cardiovascular-diseases/high-income-asia-pacific.pdf?sfvrsn=186ec883_2
pdf_path = "who_example_screenshot.pdf"
run_llm_heavy_pipeline(pdf_path)

In this example, pdf_path = "who_example_screenshot.pdf" points at a rasterized copy of the WHO file—i.e., a version where the tables are screenshots rather than selectable text. That’s intentional: it proves the workflow works even when the PDF contains no real text, only images. When you call run_llm_heavy_pipeline(pdf_path), the function:
- rasterizes both pages with Poppler and OCRs them with Tesseract,
- sends the page-by-page OCR text to Kimi K2 with the strict JSON instructions,
- parses the returned JSON, and
- renders and saves a Markdown report next to your script.
As the output below shows, the pipeline recovered the document’s content well enough for the LLM to interpret it and build on it for further analysis:
[INFO] OCR… who_example_screenshot.pdf
[INFO] 2 pages OCR’d. Sending to LLM…
→ pt=3703, ct=841, est=0.0035325
[SAVED] /Users/niklaslang/Desktop/Privat/Blog/DeepInfra/llm_heavy_summary_who_example_screenshot.md
({'title': 'WHO CVD Risk Charts – High-income Asia Pacific (Non-lab & Lab-based)',
'executive_summary': 'Two WHO quick-reference charts estimate 10-year fatal/non-fatal cardiovascular-disease risk for adults in Brunei Darussalam, Japan, South Korea and Singapore. Page 1 gives laboratory-based risk using age, sex, smoking, systolic BP and total cholesterol; page 2 gives a non-laboratory version that replaces cholesterol with BMI. Colour bands show risk <5 %, 5–<10 %, 10–<20 %, 20–<30 % and ≥30 %. No explicit date or authorship visible; charts are intended for primary-care screening and treatment decisions.',
'sections': [{'heading': 'Laboratory-based risk chart',
'summary': '10-year CVD risk by age (30-74), sex, smoking status, systolic BP (120-<180 mmHg) and total cholesterol (3-8 mmol/L). Each cell gives the risk category colour code.'},
{'heading': 'Non-laboratory risk chart',
'summary': 'Same structure but replaces cholesterol with BMI (18-34 kg/m²) so no blood test is required.'}],
'tables_markdown': ['### Laboratory-based 10-year CVD risk (%) – Men & Women without diabetes\n| Age | Sex-Smoke | SBP <120 | 120-139 | 140-159 | 160-179 | ≥180 |\n|-----|-----------|----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 | 10-<20 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 | 20-<30 |\n| 30-34 | F non-s | <5 | <5 | <5 | <5 | 5-<10 |\n| 30-34 | F smoker | <5 | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |',
'### Non-laboratory 10-year CVD risk (%) – Men & Women (BMI instead of cholesterol)\n| Age | Sex-Smoke | BMI <20 | 20-<25 | 25-<30 | ≥30 |\n|-----|-----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 |'],
'entities': {'people': [],
'orgs': ['WHO'],
'dates': [],
'figures': ['10-year CVD risk <5 %, 5–<10 %, 10–<20 %, 20–<30 %, ≥30 %',
'Total cholesterol 3–8 mmol/L',
'BMI 18–34 kg/m²',
'Systolic BP 120–≥180 mmHg']}},
'# WHO CVD Risk Charts – High-income Asia Pacific (Non-lab & Lab-based)\n\nTwo WHO quick-reference charts estimate 10-year fatal/non-fatal cardiovascular-disease risk for adults in Brunei Darussalam, Japan, South Korea and Singapore. Page 1 gives laboratory-based risk using age, sex, smoking, systolic BP and total cholesterol; page 2 gives a non-laboratory version that replaces cholesterol with BMI. Colour bands show risk <5 %, 5–<10 %, 10–<20 %, 20–<30 % and ≥30 %. No explicit date or authorship visible; charts are intended for primary-care screening and treatment decisions.\n\n\n## Sections\n\n### Laboratory-based risk chart\n10-year CVD risk by age (30-74), sex, smoking status, systolic BP (120-<180 mmHg) and total cholesterol (3-8 mmol/L). Each cell gives the risk category colour code.\n\n### Non-laboratory risk chart\nSame structure but replaces cholesterol with BMI (18-34 kg/m²) so no blood test is required.\n\n\n## Tables\n\n### Laboratory-based 10-year CVD risk (%) – Men & Women without diabetes\n| Age | Sex-Smoke | SBP <120 | 120-139 | 140-159 | 160-179 | ≥180 |\n|-----|-----------|----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 | 10-<20 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 | 20-<30 |\n| 30-34 | F non-s | <5 | <5 | <5 | <5 | 5-<10 |\n| 30-34 | F smoker | <5 | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F non-s | 10-<20 | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | F smoker | ≥30 | ≥30 | ≥30 | ≥30 | ≥30 |\n\n### Non-laboratory 10-year CVD risk (%) – Men & Women (BMI instead of cholesterol)\n| Age | Sex-Smoke | BMI <20 | 20-<25 | 25-<30 | ≥30 |\n|-----|-----------|----------|----------|----------|------|\n| 30-34 | M non-s | <5 | <5 | <5 | 5-<10 |\n| 30-34 | M smoker | <5 | 5-<10 | 10-<20 | 10-<20 |\n| 70-74 | M non-s | 20-<30 | ≥30 | ≥30 | ≥30 |\n| 70-74 | M smoker | ≥30 | ≥30 | ≥30 | ≥30 |\n\n\n## Entities\n\n**Orgs**: WHO\n\n**Figures**: 10-year CVD risk <5 %, 5–<10 %, 10–<20 %, 20–<30 %, ≥30 %, Total cholesterol 3–8 mmol/L, BMI 18–34 kg/m², Systolic BP 120–≥180 mmHg')
If you prefer to use the original WHO URL directly, you can download it to a local file first and pass that path in; the rest of the pipeline remains exactly the same.
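For instance, a minimal sketch of that (using only the standard library; the local filename is just a placeholder) could look like this:
# Download the WHO PDF to a local file, then run the existing pipeline on it.
import urllib.request

url = "https://cdn.who.int/media/docs/default-source/cardiovascular-diseases/high-income-asia-pacific.pdf?sfvrsn=186ec883_2"
local_path = "who_high_income_asia_pacific.pdf"   # placeholder filename
urllib.request.urlretrieve(url, local_path)

run_llm_heavy_pipeline(local_path)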
Once you have a reliable OCR + LLM pipeline in place, the real value comes from refining it for scale, cost control, and domain accuracy. Add lightweight usage logging (token counts, estimated cost) to spot regressions early, and introduce spend guardrails that alert you when a single call or session grows beyond your budget.
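One lightweight way to do that is sketched below; the budget value and helper name are illustrative, and the estimated_cost field is the same DeepInfra usage attribute the earlier function already prints.
# Illustrative guardrail: track the reported estimated_cost per call and warn
# when the running total for this session exceeds a budget you set.
SESSION_BUDGET_USD = 0.50   # example threshold, adjust to your needs
_session_cost = 0.0

def track_usage(response):
    global _session_cost
    usage = getattr(response, "usage", None)
    cost = getattr(usage, "estimated_cost", None) if usage else None
    if cost is not None:
        _session_cost += cost
        if _session_cost > SESSION_BUDGET_USD:
            print(f"[WARN] Session cost ${_session_cost:.4f} exceeds budget ${SESSION_BUDGET_USD:.2f}")
    return response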
If you routinely process structured documents such as invoices or technical tables, consider augmenting your pipeline with a specialized table-OCR step for layout-aware extraction, while reserving the LLM for interpretation, normalization, and narrative output.
For multilingual workflows, don’t forget to install the appropriate Tesseract language packs (e.g., deu, fra) and pass the correct lang parameter—this alone can dramatically reduce hallucinations and misreads.
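For example, after installing the German pack (tesseract-ocr-deu on Ubuntu/Debian), a call might look like this; the filename is just a placeholder, and Tesseract accepts combined packs with a + separator:
# OCR a German/English document (assumes tesseract-ocr-deu is installed;
# "bericht_scan.pdf" is a placeholder path).
pages = ocr_pdf_to_pages("bericht_scan.pdf", lang="deu+eng")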
By iterating on these components, you can evolve a simple OCR routine into a robust, scalable, and language-flexible document understanding system ready for production use.