Comparing DeepSeek-OCR and PaddleOCR on Math Books

Posted on 2025-10-30 in AI

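I ran OCR on the Little Blue Books with both DeepSeek-OCR and PaddleOCR; DeepSeek-OCR gave the better results.

I recently got the PDF recognition pipelines of DeepSeek-OCR and PaddleOCR working, so I tried using them to recognize complete books.

If the OCR output is basically usable, it would save a lot of the work of typing in problems.

For this test I chose the junior high volumes of the Little Blue Books (小蓝本, 《数学奥林匹克小丛书》), because I happen to have an electronic copy of the third edition, and the scan quality is very good.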

1. Running OCR

DeepSeek-OCR is covered first, then PaddleOCR.

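DeepSeek-OCR can also run locally, but it runs out of VRAM during inference and spills into system RAM, which makes it much slower.

By my rough measurement, with both running locally on the same input (a multi-page PDF), DeepSeek-OCR takes about 5 times as long as PaddleOCR.

Running DeepSeek-OCR on Kaggle, where there is enough VRAM, takes about 3 times as long as running PaddleOCR locally. The bottleneck there appears to be the CPU: CPU usage stays at 100%, GPU usage sits at 20%~30%, and about 10 GB of VRAM is used.

The code is mostly taken from the official run_dpsk_ocr_pdf.py, with the inference part replaced by the code from run_dpsk_ocr.py. This avoids having to configure vllm.

DeepSeek-OCR code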

https://github.com/wangjiezhe/deepseek_ocr_app/blob/main/backend/pdf2md.py

import io
import os
import re
import tempfile
from pathlib import Path
from typing import List

import fitz  # type: ignore
import img2pdf  # type: ignore
import numpy as np
import torch
import typer
from PIL import Image, ImageDraw, ImageFont
from rich.progress import track
from transformers import AutoModel, AutoTokenizer


def pdf_to_images_high_quality(
    pdf_path: Path, temp_dir: Path, dpi=144, image_format="PNG"
) -> List[Path]:
    image_files = []

    pdf_document = fitz.open(pdf_path)

    zoom = dpi / 72.0
    matrix = fitz.Matrix(zoom, zoom)

    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]

        pixmap = page.get_pixmap(matrix=matrix, alpha=False)
        Image.MAX_IMAGE_PIXELS = None

        if image_format.upper() == "PNG":
            img_data = pixmap.tobytes("png")
            img = Image.open(io.BytesIO(img_data))
        else:
            img_data = pixmap.tobytes("png")
            img = Image.open(io.BytesIO(img_data))
            if img.mode in ("RGBA", "LA"):
                background = Image.new("RGB", img.size, (255, 255, 255))
                background.paste(
                    img, mask=img.split()[-1] if img.mode == "RGBA" else None
                )
                img = background  # type: ignore

        img_path = temp_dir / f"{page_num}.png"
        img.save(img_path)
        img.close()
        image_files.append(img_path)

    pdf_document.close()
    return image_files


def pil_to_pdf_img2pdf(pil_images, output_path: Path):
    if not pil_images:
        return

    image_bytes_list = []

    for img in pil_images:
        if img.mode != "RGB":
            img = img.convert("RGB")

        img_buffer = io.BytesIO()
        img.save(img_buffer, format="JPEG", quality=95)
        img_bytes = img_buffer.getvalue()
        image_bytes_list.append(img_bytes)

    try:
        pdf_bytes = img2pdf.convert(image_bytes_list)
        assert pdf_bytes is not None
        with open(output_path, "wb") as f:
            f.write(pdf_bytes)

    except Exception as e:
        print(f"error: {e}")


def re_match(text):
    # Each match is (whole tag, label inside <|ref|>, coordinates inside <|det|>).
    pattern = r"(<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>)"
    matches = re.findall(pattern, text, re.DOTALL)

    matches_image = []
    matches_other = []
    for a_match in matches:
        if "<|ref|>image<|/ref|>" in a_match[0]:
            matches_image.append(a_match[0])
        else:
            matches_other.append(a_match[0])
    return matches, matches_image, matches_other


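def extract_coordinates_and_label(ref_text, image_width, image_height):
    import ast  # local import so this function does not depend on the header

    try:
        label_type = ref_text[1]
        # literal_eval parses the coordinate list without executing model output.
        cor_list = ast.literal_eval(ref_text[2])
    except Exception as e:
        print(e)
        return None

    return (label_type, cor_list)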


def draw_bounding_boxes(image, refs, jdx, out_path: Path):
    image_width, image_height = image.size
    img_draw = image.copy()
    draw = ImageDraw.Draw(img_draw)

    overlay = Image.new("RGBA", img_draw.size, (0, 0, 0, 0))
    draw2 = ImageDraw.Draw(overlay)

    # Pillow's built-in default font is used for the box labels.
    font = ImageFont.load_default()

    img_idx = 0

    for i, ref in enumerate(refs):
        try:
            result = extract_coordinates_and_label(ref, image_width, image_height)
            if result:
                label_type, points_list = result

                color = (
                    np.random.randint(0, 200),
                    np.random.randint(0, 200),
                    np.random.randint(0, 255),
                )

                color_a = color + (20,)
                for points in points_list:
                    x1, y1, x2, y2 = points

                    x1 = int(x1 / 999 * image_width)
                    y1 = int(y1 / 999 * image_height)

                    x2 = int(x2 / 999 * image_width)
                    y2 = int(y2 / 999 * image_height)

                    if label_type == "image":
                        try:
                            cropped = image.crop((x1, y1, x2, y2))
                            cropped.save(out_path / f"images/{jdx}_{img_idx}.jpg")
                        except Exception as e:
                            print(e)
                            pass
                        img_idx += 1

                    try:
                        if label_type == "title":
                            draw.rectangle([x1, y1, x2, y2], outline=color, width=4)
                            draw2.rectangle(
                                [x1, y1, x2, y2],
                                fill=color_a,
                                outline=(0, 0, 0, 0),
                                width=1,
                            )
                        else:
                            draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
                            draw2.rectangle(
                                [x1, y1, x2, y2],
                                fill=color_a,
                                outline=(0, 0, 0, 0),
                                width=1,
                            )

                        text_x = x1
                        text_y = max(0, y1 - 15)

                        text_bbox = draw.textbbox((0, 0), label_type, font=font)
                        text_width = text_bbox[2] - text_bbox[0]
                        text_height = text_bbox[3] - text_bbox[1]
                        draw.rectangle(
                            [text_x, text_y, text_x + text_width, text_y + text_height],
                            fill=(255, 255, 255, 30),
                        )

                        draw.text((text_x, text_y), label_type, font=font, fill=color)
                    except Exception:
                        pass
        except Exception:
            continue
    img_draw.paste(overlay, (0, 0), overlay)
    return img_draw


def process_image_with_refs(image, ref_texts, jdx, out_path):
    result_image = draw_bounding_boxes(image, ref_texts, jdx, out_path)
    return result_image


app = typer.Typer(help="Convert PDF to Markdown using DeepSeek-OCR")


@app.command()
def convert(
    input_file: Path = typer.Argument(..., help="Input PDF file path"),
    out_path: Path = typer.Option(
        "output", "-o", "--output", help="Output directory for markdown file"
    ),
):
    os.makedirs(out_path / "images", exist_ok=True)
    temp_dir = tempfile.TemporaryDirectory()

    typer.echo(f"📄 Converting {input_file} to images...")
    image_files = pdf_to_images_high_quality(input_file, Path(temp_dir.name))

    MODEL_NAME = "deepseek-ai/DeepSeek-OCR"

    typer.echo("🤖 Loading DeepSeek-OCR model...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        MODEL_NAME,
        attn_implementation="flash_attention_2",
        trust_remote_code=True,
        use_safetensors=True,
        torch_dtype=torch.bfloat16,  # type: ignore
    )
    model = model.eval().cuda()

    prompt = "<image>\n<|grounding|>Convert the document to markdown."

    mmd_det_path = out_path / (Path(input_file).stem + "_det.md")
    mmd_path = out_path / (Path(input_file).stem + ".md")
    pdf_out_path = out_path / (Path(input_file).stem + "_layouts.pdf")

    contents_det = ""
    contents = ""
    draw_images = []
    jdx = 0

    typer.echo("🔍 Processing pages with OCR...")
    for image_file in track(image_files):
        content = model.infer(
            tokenizer,
            prompt=prompt,
            image_file=image_file,
            output_path=temp_dir.name,
            base_size=1024,
            image_size=640,
            crop_mode=True,
            save_results=False,
            test_compress=True,
            eval_mode=True,
        )

        page_num = "\n<--- Page Split --->"
        contents_det += content + f"\n{page_num}\n"

        matches_ref, matches_images, matches_other = re_match(content)

        with Image.open(image_file) as image_draw:
            result_image = process_image_with_refs(
                image_draw, matches_ref, jdx, out_path
            )

        draw_images.append(result_image)

        for idx, a_match_image in enumerate(matches_images):
            content = content.replace(
                a_match_image, "![](images/" + str(jdx) + "_" + str(idx) + ".jpg)\n"
            )

        for idx, a_match_other in enumerate(matches_other):
            content = (
                content.replace(a_match_other, "")
                .replace("\\coloneqq", ":=")
                .replace("\\eqqcolon", "=:")
                .replace("\n\n\n\n", "\n\n")
                .replace("\n\n\n", "\n\n")
            )

        contents += content + f"\n{page_num}\n"

        jdx += 1

    typer.echo(f"💾 Saving markdown to {mmd_path}...")
    with open(mmd_det_path, "w", encoding="utf-8") as afile:
        afile.write(contents_det)

    with open(mmd_path, "w", encoding="utf-8") as afile:
        afile.write(contents)

    pil_to_pdf_img2pdf(draw_images, pdf_out_path)

    temp_dir.cleanup()
    typer.echo("✅ Conversion completed successfully!")


if __name__ == "__main__":
    app()

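PaddleOCR ran locally, because with table recognition turned off PP-StructureV3 loads entirely into VRAM (8 GB) and completes inference there.

Also, use_doc_unwarping must be turned off when the scan is very clean; otherwise it may crop pages unnecessarily.

The code is mostly from the official documentation. Compared with the stock PP-StructureV3, PP-StructureV3-notable.yaml turns off the table structure recognition module, the text line orientation classification module, and the text image rectification module. Disabling them directly in the config file means the corresponding models are never loaded, which reduces VRAM usage.

Deploying PaddleOCR locally is also simple: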

uv venv --seed --python python3.12
uv pip install paddlepaddle-gpu==3.2.0 --default-index https://www.paddlepaddle.org.cn/packages/stable/cu129/
uv pip install "paddleocr[all]" typer

After that it is ready to use.

PaddleOCR code

https://github.com/wangjiezhe/PaddleX-local/blob/main/pdf2md.py

from pathlib import Path

import typer
from paddlex import create_pipeline  # type: ignore

app = typer.Typer(
    help="Convert PDF and image files to Markdown using PaddleX PP-StructureV3"
)

def process_pdf_file(pdf_path: Path, pipeline, output_dir: Path) -> Path:
    typer.echo(f"Processing PDF file: {pdf_path}")

    output = pipeline.predict(
        input=str(pdf_path),
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_textline_orientation=False,
    )

    markdown_list = []
    markdown_images = []

    for res in output:
        md_info = res.markdown
        markdown_list.append(md_info)
        markdown_images.append(md_info.get("markdown_images", {}))

    markdown_texts = pipeline.concatenate_markdown_pages(markdown_list)

    mkd_file_path = output_dir / f"{pdf_path.stem}.md"
    mkd_file_path.parent.mkdir(parents=True, exist_ok=True)

    with open(mkd_file_path, "w", encoding="utf-8") as f:
        f.write(markdown_texts)

    for item in markdown_images:
        if item:
            for path, image in item.items():
                file_path = output_dir / path
                file_path.parent.mkdir(parents=True, exist_ok=True)
                image.save(file_path)

    return mkd_file_path


@app.command()
def convert(
    input_file: Path = typer.Argument(..., help="Input PDF or image file path"),
    output_dir: Path = typer.Option(
        "./output", "-o", "--output", help="Output directory path"
    ),
    hpip: bool = typer.Option(
        False, "--hpip", help="Enable high performance inference"
    ),
):

    pipeline_config = "./PP-StructureV3-notable.yaml"

    if hpip:
        typer.echo("🚀 Enabling high performance inference mode")

    pipeline = create_pipeline(
        pipeline=pipeline_config,
        use_hpip=hpip,
        hpi_config={"auto_config": "False", "backend": "onnxruntime"},
    )

    output_path = process_pdf_file(input_file, pipeline, output_dir)

    typer.echo(f"✅ Conversion completed! Markdown file saved to: {output_path}")


if __name__ == "__main__":
    app()

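2. Simple fixes

Some simple errors can be fixed directly.

2.1. Multiplication sign

The multiplication dot in the Little Blue Books is printed very bold, so it is often recognized as \bullet. Replacing it with the correct \cdot is enough.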
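A minimal sketch of this replacement, assuming the OCR output is plain text containing LaTeX commands. The helper name `fix_multiplication_dot` is made up for illustration. The full script later in this post uses a plain string replace; the lookahead here is an extra precaution so that a longer command name that merely starts with `\bullet` would be left alone.

```python
import re


def fix_multiplication_dot(text: str) -> str:
    r"""Replace the LaTeX command \bullet with \cdot."""
    # (?![A-Za-z]) skips longer command names that only start with \bullet.
    return re.sub(r"\\bullet(?![A-Za-z])", r"\\cdot", text)


print(fix_multiplication_dot(r"\(3 \bullet 5 = 15\)"))
```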

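2.2. Parallel symbol

The parallel symbol // that Chinese books conventionally use sometimes fails to be recognized as the corresponding $\LaTeX$ command \parallel, so it needs to be replaced.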
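As a rough sketch, the replacement can be limited to math spans so that a // in ordinary text (a URL, for instance) is not touched. This assumes inline math is delimited by single dollar signs, as in the PaddleOCR output; `fix_parallel` is a hypothetical helper name, not part of either tool.

```python
import re

# Inline math delimited by dollar signs (an assumption about the input format).
MATH_SPAN = re.compile(r"\$(.+?)\$")


def fix_parallel(text: str) -> str:
    r"""Rewrite '//' as '\parallel' inside $...$ spans only."""

    def repl(match: re.Match) -> str:
        # Surrounding spaces are consumed and re-added so that the command
        # never runs into the following letters (AB\parallelCD would break).
        body = re.sub(r"\s*/\s*/\s*", lambda _: r" \parallel ", match.group(1))
        return "$" + body + "$"

    return MATH_SPAN.sub(repl, text)


print(fix_parallel("If $AB//CD$, see http://example.com"))
```

Padding the command with spaces matters: without a separator, LaTeX would read `\parallelCD` as one undefined command.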

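2.3. Multi-line formulas

This is mainly a DeepSeek-OCR problem. When it encounters a multi-line formula, it mostly does not use the align or array environment and instead outputs several separate display formulas. Here they are all converted to the aligned environment, though the alignment points still need to be corrected by hand afterwards.

Also, Typora only renders a multi-line formula correctly if there is a line break after the opening $$ or \[, so that is changed as well.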
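The idea can be sketched on a single line of output. This is a simplified illustration, not the script used for the book: it assumes the line contains nothing but `\[...\]` formulas separated by spaces, and `merge_display_formulas` is a name chosen here for the example.

```python
import re

DISPLAY = re.compile(r"\\\[(.*?)\\\]")


def merge_display_formulas(line: str) -> str:
    r"""Turn '\[a\] \[b\]' on one line into a single aligned block."""
    rows = [body.strip() for body in DISPLAY.findall(line)]
    leftover = DISPLAY.sub("", line).strip()
    if len(rows) < 2 or leftover:
        # A single formula, or formulas mixed with text: leave untouched.
        return line
    joined = " \\\\\n".join("&" + row for row in rows)
    # The line break after \[ is what Typora needs to render the block.
    return "\\[\n\\begin{aligned}\n" + joined + "\n\\end{aligned}\n\\]"


print(merge_display_formulas(r"\[x+y=1\] \[x-y=3\]"))
```

Every row gets a leading `&`, so all rows are left-aligned; moving the `&` to the relation sign is the manual step mentioned above.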

Full post-processing code

https://gist.github.com/wangjiezhe/9b74cf9d492a958c90360a16780a2d12

import re
from pathlib import Path

import typer

app = typer.Typer()


def parse_multiline_formula(match: re.Match[str]) -> str:
    block = match.group(1)

    formulas = re.split(r"\\\] \\\[", block)
    if len(formulas) == 1:
        return match.group(0)

    res = r"\[\begin{aligned}"
    res += "\n"
    for i, formula in enumerate(formulas):
        res += f"&{formula}"
        if i < len(formulas) - 1:
            res += r"\\"
        res += "\n"
    res += r"\end{aligned}\]"
    return res


def parse_parallel(match: re.Match[str]) -> str:
    res = match.group(2)
    # Pad with spaces so "AB//CD" becomes "AB \parallel CD", not "AB\parallelCD".
    res = re.sub(r"\s*/\s*/\s*", r" \\parallel ", res)
    return match.group(1) + res + match.group(3)


def format_deepseek(content: str) -> str:
    content = content.replace("<--- Page Split --->\n", "")
    content = content.replace(r"\bullet", r"\cdot")
    content = re.sub(r"(\\\()(.*?)(\\\))", parse_parallel, content)
    content = re.sub(r"(\\\[)(.*?)(\\\])", parse_parallel, content)
    content = re.sub(
        r"\\\((.*?)\\\) // \\\((.*?)\\\)", r"\(\1 \\parallel \2\)", content
    )
    content = re.sub(r"\\\[(.*)\\\]", parse_multiline_formula, content)
    content = re.sub(r"\\\[(.*?)\\\]", r"\[\n\1\n\]", content, flags=re.DOTALL)
    return content


def format_paddle(content: str) -> str:
    content = content.replace(r"\bullet", r"\cdot")
    content = re.sub(r"(\$)(.+?)(\$)", parse_parallel, content)
    content = re.sub(r"\$\$(.+?)\$\$", r"$$\n\1\n$$", content)
    return content


@app.command()
def main(
    input_file: Path = typer.Argument(..., help="Input markdown file"),
    formatter: str = typer.Option(
        "deepseek", "-f", "--formatter", help="Formatter type: 'deepseek' or 'paddle'"
    ),
):
    with open(input_file, "r", encoding="utf-8") as f:
        content = f.read()

    if formatter == "deepseek":
        content = format_deepseek(content)
    elif formatter == "paddle":
        content = format_paddle(content)
    else:
        raise typer.BadParameter("Formatter must be either 'deepseek' or 'paddle'")

    with open(
        f"{input_file.stem}_modified{input_file.suffix}",
        "w",
        encoding="utf-8",
        newline="\n",
    ) as f:
        _ = f.write(content)


if __name__ == "__main__":
    app()

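3. Comparing the results

Judging by the final output, DeepSeek-OCR does somewhat better than PaddleOCR.

In the comparison images below, the DeepSeek-OCR result is on the left and the PaddleOCR result is on the right.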

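3.1. Main problems with DeepSeek-OCR

3.1.1. Multi-line formulas

One problem with DeepSeek-OCR is that it handles multi-line display formulas poorly. Most of the time it recognizes each line as a separate display formula. This seems to be only a matter of output tendency, though, since DeepSeek did detect the formula as a single block. For example,

[Image: SumatraPDF_TJil1agvnD]

DeepSeek detected the whole block of formulas above, but the Markdown it generated contains several display formulas:

[Image: Typora_Yxfd5fswNi]

This is easy to fix, though. DeepSeek puts these display formulas on a single line, so a simple replacement is enough. The post-processing in section 2.3 already handles it.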

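3.1.2. Missing braces

The biggest problem with DeepSeek-OCR is that it occasionally drops the final closing brace when recognizing display formulas. For example,

[Image: Typora_5udpTLGboN]

This matches my impression of DeepSeek. When I used DeepSeek to generate code before, I ran into similar problems: the generated code would not run, and the cause turned out to be nothing more than a few missing braces, all of them closing braces.

PaddleOCR does this far less often.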
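Formulas with a dropped brace can be located mechanically before proofreading. The following is a sketch under the assumption that display formulas are delimited by `\[` and `\]`, as in the DeepSeek-OCR output; the function names are hypothetical.

```python
import re


def brace_balance(formula: str) -> int:
    r"""Count '{' minus '}', ignoring the escaped forms \{ and \}."""
    # Drop line breaks (\\) first so that the brace in '\\{' still counts.
    unescaped = re.sub(r"\\\\|\\[{}]", "", formula)
    return unescaped.count("{") - unescaped.count("}")


def find_unbalanced(markdown: str):
    r"""Yield (balance, formula) for each \[...\] block with unmatched braces."""
    for match in re.finditer(r"\\\[(.*?)\\\]", markdown, flags=re.DOTALL):
        balance = brace_balance(match.group(1))
        if balance != 0:
            yield balance, match.group(1).strip()


sample = r"\[\frac{a}{b\] some text \[\frac{c}{d}\]"
for balance, formula in find_unbalanced(sample):
    print(balance, formula)
```

A positive balance means closing braces are missing, which is the case described here. The check only counts braces, so it cannot say where the missing brace belongs.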

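3.2. Main problems with PaddleOCR

PaddleOCR has considerably more problems.

3.2.1. Degeneration

What surprised me most is that PaddleOCR degenerated several times while recognizing math formulas. For example,

[Image: Typora_h6WfmGkkkQ]

[Image: Typora_ky3igXXlEy]

In sharp contrast, DeepSeek-OCR did not degenerate even once.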
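Degenerated passages can be flagged automatically if one assumes that degeneration shows up as the same short fragment repeated many times in a row. That assumption, the thresholds (fragments of 2 to 30 characters, at least 8 copies), and the name `find_repetition` are all choices made for this sketch rather than anything taken from either tool.

```python
import re

# A fragment of 2 to 30 characters followed by at least 7 more copies of itself.
REPEAT = re.compile(r"(.{2,30}?)\1{7,}", flags=re.DOTALL)


def find_repetition(text: str):
    """Return (offset, fragment, count) for each run of consecutive repeats."""
    hits = []
    for match in REPEAT.finditer(text):
        fragment = match.group(1)
        count = len(match.group(0)) // len(fragment)
        hits.append((match.start(), fragment, count))
    return hits


print(find_repetition("x = " + r"\quad " * 10 + "y"))
```

The backreference makes this pattern slow on very long inputs, so running it page by page is safer than running it on a whole book at once.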

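3.2.2. Layout errors

Another problem with PaddleOCR is that the layout is wrong in many places. There are two main causes.

First, PaddleOCR often fails to detect item numbers. For example,

[Image: Typora_LF5kebPvvW]

Second, the Little Blue Books contain some unconventional typesetting. For example,

[Image: SumatraPDF_2twj5XkvPU]

In such cases PaddleOCR cannot correctly locate the text and the formulas, and tends to treat everything in between as one continuous display formula, which breaks the final layout:

[Image: Typora_t2PD3tD0Jf]

DeepSeek-OCR, by contrast, correctly identifies the relationship between text and formulas and keeps the lines in order.

In severe cases this causes not only layout problems but also recognition errors:

[Image: Typora_UN4FbjamSL]

The original page:

[Image: SumatraPDF_HclttytMxi]

3.2.3. Figure detection

The image above shows another problem: when a row contains several figures, PaddleOCR tends to treat them as a single figure, whereas DeepSeek-OCR usually separates them correctly and saves each one as its own image file.

3.2.4. Single letters

PaddleOCR is conservative about inline formulas and almost never treats a standalone letter as a formula:

[Image: Typora_lXQxRafuAj]

[Image: Typora_XqobVdCMmf]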
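Part of this can be repaired automatically. The sketch below wraps a single ASCII letter in `$...$` when it sits directly between two Chinese characters. It is a heuristic of my own for illustration, not something either tool provides, and it only understands single-dollar inline math, so display formulas should be handled separately.

```python
import re

CJK = r"\u4e00-\u9fff"  # basic CJK Unified Ideographs range
# One ASCII letter with a CJK character directly before and after it.
LONE_LETTER = re.compile(rf"(?<=[{CJK}])([A-Za-z])(?=[{CJK}])")
# Existing inline math; the capture group makes re.split keep these chunks.
MATH_SPAN = re.compile(r"(\$[^$]*\$)")


def wrap_lone_letters(text: str) -> str:
    """Wrap lone letters between CJK characters in $...$, outside existing math."""
    chunks = MATH_SPAN.split(text)
    # With one capture group, odd-indexed chunks are the math spans themselves.
    for i in range(0, len(chunks), 2):
        chunks[i] = LONE_LETTER.sub(r"$\1$", chunks[i])
    return "".join(chunks)


print(wrap_lone_letters("设a为正整数，求证$a^2>0$"))
```

Letters next to punctuation or at the start of a line are deliberately left alone, since those cases are harder to tell apart from ordinary Latin text.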

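3.2.5. Boundary between formulas and text

PaddleOCR often misplaces the boundary between text and formulas and frequently pulls adjacent text into the formula. The image below makes this especially clear: the same format is recognized in four different ways:

[Image: Typora_qTVJ8P2qp8]

3.2.6. Special symbols

PaddleOCR recognizes special symbols (such as ∥ and ⊥) much worse than DeepSeek-OCR. For example,

[Image: Typora_wusiMAVZTB]

[Image: Typora_irgHcxvBG3]

Both make mistakes, but PaddleOCR clearly makes many more.

3.2.7. Spurious bold/italic styling

The images above show yet another kind of error: PaddleOCR often applies unnecessary font styles to letters in formulas. The upper one adds italics and the lower one adds bold, neither of which is in the original.

[Image: Typora_kdP8MCImCg]

3.2.8. Headings

For some reason PaddleOCR recognizes headings much worse than DeepSeek-OCR, although neither does well.

As shown, PaddleOCR recognized the entire heading as an image:

[Image: Typora_FF1iUdH37i]

And this happened many times.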

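3.3. Problems common to both

3.3.1. Plain recognition errors

Both have them, but only in isolated cases:

[Image: Typora_KC7JOatbFB]

[Image: Typora_4623oSFPcC]

3.3.2. Special symbols

The document contains some less common symbols. For example,

[Image: Typora_8FY5hrHkMW]

This should actually be $\Leftarrow$.

Another example is the "similar" symbol. The similarity symbol used in Chinese books differs from the one used abroad, and $\LaTeX$ has no exactly matching command, so it is also easily misrecognized:

[Image: Typora_7IZuqijyQB]

There are also symbols with no corresponding $\LaTeX$ command at all, which therefore cannot be recognized correctly, such as the "parallel and equal" symbol:

[Image: SumatraPDF_56jSOZBLQX]

[Image: Typora_mIXTni8Z1T]

DeepSeek-OCR recognized it as perpendicular; PaddleOCR recognized it as parallel and also degenerated.

The "parallelogram" symbol is a similar case.

I also found that PaddleOCR is especially prone to degeneration when it meets symbols resembling the parallel sign ∥. For example,

[Image: Typora_fKJcbUnSBd]

3.3.3. Missing figures

Both failed to detect some figures correctly, but only in isolated cases.

[Image: Typora_cmwC2uRS5k]

[Image: Typora_UfGce2tKyn]

Also, cases like the one in the image above, where text is nested inside a display formula, are almost never handled correctly. (This was within my expectations, though.)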

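4. Summary

Overall, the results of both are usable, but DeepSeek-OCR is clearly better.

In addition, the errors DeepSeek-OCR makes are all fairly easy to fix. PaddleOCR needs far more corrections, including a large number of unrecognized inline formulas (mostly single letters), multi-line layout errors, and so on. The main issue is that these errors occur so frequently that fixes are needed everywhere.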