Recognizing text in images with tesseract.js
Last time I installed Tesseract via brew and tested how well it worked. This time, let's try tesseract.js.
Usage
Install tesseract.js
$ pnpm add -D tesseract.js
Import and use it
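import { createWorker } from 'tesseract.js'

// imageUrl is defined elsewhere; assumed here to be a ref-like object with a .value
const buttonClick = async () => {
  const url = imageUrl.value
  if (!url) {
    return ''
  }
  console.log(url)
  // Create a worker and log its progress messages
  const worker = await createWorker({
    logger: (m) => console.log(m),
  })
  // Load and initialize the English language data
  await worker.loadLanguage('eng')
  await worker.initialize('eng')
  const {
    data: { text },
  } = await worker.recognize(url)
  console.log(text)
  // Release the worker once recognition is done
  await worker.terminate()
}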
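The logger option above prints every raw message object. As a minimal sketch, assuming each message carries a status string and a progress number between 0 and 1 (the helper name formatLogMessage is hypothetical, not part of tesseract.js), the output can be condensed to one line per update, which makes it easier to see at which step loading stops:

```javascript
// Turn a logger message into one readable line.
// Assumption: messages look like { status: 'recognizing text', progress: 0.42 }.
function formatLogMessage(m) {
  const status = m && m.status ? m.status : 'unknown'
  const progress = m && typeof m.progress === 'number' ? m.progress : 0
  const percent = Math.round(progress * 100)
  return `${status}: ${percent}%`
}
```

It would be passed in the createWorker options as logger: (m) => console.log(formatLogMessage(m)).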
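The output stalls at loading eng.traineddata, but no error is reported. This happens when the language data has to be downloaded and the download does not succeed, which is what went wrong while loading eng.traineddata above. Download the language pack instead: Simplified Chinese download link (other language packs are in the corresponding directory).
After downloading, place the chi_sim.traineddata.gz file under /static/tesseract in the project root and set the language path. Because the code below loads chi_sim+eng from that path, eng.traineddata.gz should be placed there as well.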
import { createWorker } from 'tesseract.js'

const buttonClick = async () => {
  const url = imageUrl.value
  if (!url) {
    return ''
  }
  console.log(url)
  const worker = await createWorker({
    logger: (m) => console.log(m),
    // Location of the language files, relative to the project root
    langPath: './static/tesseract',
  })
  await worker.loadLanguage('chi_sim+eng')
  await worker.initialize('chi_sim+eng')
  const {
    data: { text },
  } = await worker.recognize(url)
  console.log(text)
  await worker.terminate()
}
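Since a missing language file stalls without an error, it can help to check the files before creating the worker. The sketch below assumes that tesseract.js requests <langPath>/<lang>.traineddata.gz for each language in the language string. trainedDataUrls and findMissingLanguageFiles are hypothetical helpers, not part of the library.

```javascript
// Build the URLs assumed to be requested for a language string such as 'chi_sim+eng'.
function trainedDataUrls(langPath, langs) {
  const base = langPath.replace(/\/+$/, '') // drop trailing slashes
  return langs
    .split('+')
    .filter(Boolean)
    .map((lang) => `${base}/${lang}.traineddata.gz`)
}

// Return the URLs that do not answer with a successful status.
// fetchFn is injectable so the helper can be exercised without a network.
async function findMissingLanguageFiles(langPath, langs, fetchFn = fetch) {
  const urls = trainedDataUrls(langPath, langs)
  const results = await Promise.all(
    urls.map(async (url) => {
      try {
        const res = await fetchFn(url, { method: 'HEAD' })
        return res.ok ? null : url
      } catch {
        return url // a network error counts as missing
      }
    }),
  )
  return results.filter((url) => url !== null)
}
```

Calling findMissingLanguageFiles('./static/tesseract', 'chi_sim+eng') before createWorker and logging the result would list any file that cannot be fetched. A dev server that answers every path with its index page would pass this check, so an empty result is a hint rather than proof.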
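Set a whitelist. If you only need to recognize digits or a similarly restricted set of characters, setting a whitelist is recommended.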
import { createWorker } from 'tesseract.js'

const buttonClick = async () => {
  const url = imageUrl.value
  if (!url) {
    return ''
  }
  console.log(url)
  const worker = await createWorker({
    logger: (m) => console.log(m),
    // Location of the language files, relative to the project root
    langPath: './static/tesseract',
  })
  await worker.loadLanguage('chi_sim+eng')
  await worker.initialize('chi_sim+eng')
  // Set the whitelist after initialize, and await it so it applies before recognize
  await worker.setParameters({
    tessedit_char_whitelist:
      // 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789.,:-\'"!?-/# ',
      'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.*[]',
  })
  const {
    data: { text },
  } = await worker.recognize(url)
  console.log(text)
  await worker.terminate()
}
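For the digits-only case mentioned above, the whitelist string can be assembled from named character groups instead of being typed out by hand. This is only a sketch: CHAR_GROUPS and whitelistParams are hypothetical names, and the only tesseract.js detail relied on is the tessedit_char_whitelist parameter already used in the code above.

```javascript
// Named character groups; the names are illustrative, not part of tesseract.js.
const CHAR_GROUPS = {
  digits: '0123456789',
  upper: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
  lower: 'abcdefghijklmnopqrstuvwxyz',
}

// Build a parameters object suitable for worker.setParameters().
function whitelistParams(groups, extra = '') {
  const chars = groups
    .map((name) => {
      if (!Object.prototype.hasOwnProperty.call(CHAR_GROUPS, name)) {
        throw new Error(`unknown character group: ${name}`)
      }
      return CHAR_GROUPS[name]
    })
    .join('') + extra
  // Remove duplicate characters while keeping first-seen order
  return { tessedit_char_whitelist: [...new Set(chars)].join('') }
}
```

To recognize digits only, the call would be await worker.setParameters(whitelistParams(['digits'])). The whitelist used in the code above corresponds to whitelistParams(['upper', 'digits'], '.*[]').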