
Recognizing text in images with tesseract.js


OCR

2025-01-20

Last time I installed Tesseract via brew and tested how well it worked. This time, let's try tesseract.js.

Usage

Install tesseract.js

$ pnpm add -D tesseract.js

Import and use it

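import { createWorker } from 'tesseract.js'

// Click handler: run OCR on the image URL and log the recognized text.
// imageUrl is assumed to be a ref defined elsewhere in the component.
const buttonClick = async () => {
    const url = imageUrl.value
    if (!url) {
        return ''
    }
    console.log(url)
    // The logger receives status and progress messages while the worker runs
    const worker = await createWorker({
        logger: (m) => console.log(m),
    })
    // Load and initialize the English language data
    await worker.loadLanguage('eng')
    await worker.initialize('eng')
    const {
        data: { text },
    } = await worker.recognize(url)
    console.log(text)
    // Release the worker once recognition is done
    await worker.terminate()
}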

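The output hangs at loading eng.traineddata, with no error reported. This happens when the language pack still has to be downloaded and the download does not complete; in the run above, it is the load of eng.traineddata that failed. The Simplified Chinese pack is available at the download link listed at the end of this post, and other language packs can be found in the corresponding directories there.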
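Because the hang is silent, the logger output is the only signal of where the worker stopped. Below is a minimal sketch of a formatter for those messages, assuming each message is an object with a string status and a numeric progress between 0 and 1. The name `formatProgress` is a hypothetical helper, not part of tesseract.js.

```javascript
// Hypothetical helper: turn a logger message into one readable line.
// Assumption: the message carries a string `status` and a numeric
// `progress` between 0 and 1; missing fields fall back to defaults.
function formatProgress(m = {}) {
    const status = typeof m.status === 'string' ? m.status : 'unknown'
    const progress = Number.isFinite(m.progress) ? m.progress : 0
    // Clamp to [0, 1] before converting to a whole percentage
    const percent = Math.round(Math.min(Math.max(progress, 0), 1) * 100)
    return `${status}: ${percent}%`
}

console.log(formatProgress({ status: 'loading eng.traineddata', progress: 0 }))
```

It could be plugged in as logger: (m) => console.log(formatProgress(m)). A status that stays at the same percentage for a long time then points at the step that is stuck.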

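Once the download finishes, place the chi_sim.traineddata.gz file in /static/tesseract under the project root, then point the worker at that directory with langPath.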

import { createWorker } from 'tesseract.js'

const buttonClick = async () => {
    const url = imageUrl.value
    if (!url) {
        return ''
    }
    console.log(url)
    const worker = await createWorker({
        logger: (m) => console.log(m),
        // Path to the language data, relative to the project root
        langPath: './static/tesseract',
    })
    await worker.loadLanguage('chi_sim+eng')
    await worker.initialize('chi_sim+eng')
    const {
        data: { text },
    } = await worker.recognize(url)
    console.log(text)
    await worker.terminate()
}
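To check which files the directory needs to contain, here is a minimal sketch that expands a language string into file paths. It assumes each language in a '+'-separated string is looked up as <langPath>/<lang>.traineddata.gz; that naming follows the file used above and is an assumption about the lookup, not something confirmed here. `trainedDataPaths` is a hypothetical helper.

```javascript
// Hypothetical helper: list the files expected for a language string.
// Assumption: each language in 'a+b' maps to
// `<langPath>/<lang>.traineddata.gz`.
function trainedDataPaths(langs, langPath) {
    const base = langPath.replace(/\/+$/, '') // drop trailing slashes
    return langs
        .split('+')
        .map((lang) => lang.trim())
        .filter((lang) => lang.length > 0)
        .map((lang) => `${base}/${lang}.traineddata.gz`)
}

console.log(trainedDataPaths('chi_sim+eng', './static/tesseract'))
```

If that assumption holds, requesting chi_sim+eng means eng.traineddata.gz would have to sit next to chi_sim.traineddata.gz in /static/tesseract, since both languages are resolved against the same langPath.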

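Set a whitelist. If you only need to recognize digits or a similarly limited character set, setting a whitelist is recommended.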

import { createWorker } from 'tesseract.js'

const buttonClick = async () => {
    const url = imageUrl.value
    if (!url) {
        return ''
    }
    console.log(url)
    const worker = await createWorker({
        logger: (m) => console.log(m),
        // Path to the language data, relative to the project root
        langPath: './static/tesseract',
    })
    await worker.loadLanguage('chi_sim+eng')
    await worker.initialize('chi_sim+eng')
    // Set the whitelist once the worker is initialized, and await the call
    await worker.setParameters({
        tessedit_char_whitelist:
            // 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.,:-\'"!?-/# ',
            'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.*[]',
    })
    const {
        data: { text },
    } = await worker.recognize(url)
    console.log(text)
    await worker.terminate()
}
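Long whitelist strings like the ones above are easy to mistype. As a rough sketch, the value for tessedit_char_whitelist could be assembled from named character groups. `CHAR_GROUPS` and `buildWhitelist` are hypothetical names introduced for illustration, not part of tesseract.js.

```javascript
// Hypothetical character groups for assembling a whitelist string.
const CHAR_GROUPS = {
    upper: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
    lower: 'abcdefghijklmnopqrstuvwxyz',
    digits: '0123456789',
}

// Hypothetical helper: join the named groups, append extra characters,
// and drop duplicates while keeping first-seen order.
function buildWhitelist(groups, extra = '') {
    const chars = groups
        .map((name) => {
            if (!Object.prototype.hasOwnProperty.call(CHAR_GROUPS, name)) {
                throw new Error(`unknown group: ${name}`)
            }
            return CHAR_GROUPS[name]
        })
        .join('') + extra
    return [...new Set(chars)].join('')
}

console.log(buildWhitelist(['upper', 'digits'], '.*[]'))
// prints "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.*[]"
```

For digits only, buildWhitelist(['digits']) gives '0123456789', which can be passed as the tessedit_char_whitelist value.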

References

[OCR] Tesseract image recognition

tesseract.js GitHub

tesseract.js documentation

Simplified Chinese language pack download link