OCR Tools on Linux

For me, Linux is alluring, fascinating, and mysterious all at once. There is always something I want to learn about how to build something or get a job done on Linux. OCR tools, for example.

Optical character recognition (also optical character reader, OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape …
— from Wikipedia

In short, OCR tools are programs that can read and extract text from image files, whether those images are scans of printed documents or output from other programs. Linux, and Ubuntu in particular, has plenty of OCR programs. One of them is tesseract-ocr.

tesseract is a commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.
— from man tesseract

Installation

sudo apt-get install tesseract-ocr
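
After installing, it is worth confirming that the binary is actually on the PATH and seeing which language data came with it. Below is a minimal sketch; the function name check_tesseract is my own invention for illustration, while --version and --list-langs are regular tesseract options.

```shell
# Hypothetical helper: report whether tesseract is available, then print
# its version and the list of installed language data.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found; try: sudo apt-get install tesseract-ocr" >&2
        return 1
    fi
    tesseract --version
    tesseract --list-langs
}
```

Run check_tesseract in the terminal after pasting the function. If the language you need is missing from the list, Ubuntu usually ships it as a separate package (for Indonesian I would expect tesseract-ocr-ind, but check your own repository).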

Usage

Format:

tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
tesseract INPUT OUTPUT

Example:

tesseract gambar.png teks

gambar.png is a scanned image file in .png format, while teks is the name of the text file that will hold the text extracted from that .png. The output name does not need a .txt extension, because tesseract adds it automatically and the result is saved as teks.txt.
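
The format line above also allows stdout in place of an output name, which is handy for a quick look without creating a file. Here is a small sketch of that; the function name ocr_to_stdout and the default language eng are assumptions for illustration, while -l is the tesseract option for choosing the language.

```shell
# Hypothetical helper: print the recognized text to the terminal.
# $1 = image file
# $2 = language code (optional; defaulting to "eng" is an assumption)
ocr_to_stdout() {
    image=$1
    lang=${2:-eng}
    tesseract "$image" stdout -l "$lang"
}
```

For example, ocr_to_stdout gambar.png prints the text directly, and ocr_to_stdout gambar.png ind would use Indonesian, provided that language data is installed.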

Besides plain text, tesseract can also save the result as a .pdf file.

Example:

tesseract gambar.png teks pdf

Personally, I prefer plain text output, because it is easier to process further.

How long the conversion takes depends on the resolution and quality of the image. The cleaner and sharper the image, the faster the process, and vice versa.
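
Because each image takes a while, a folder full of scans is easier to hand to a loop than to convert one by one. The sketch below assumes all scans are .png files in a single directory; the function name ocr_all_png is hypothetical.

```shell
# Hypothetical helper: run tesseract on every .png in a directory.
# $1 = directory (optional, defaults to the current directory)
ocr_all_png() {
    dir=${1:-.}
    for image in "$dir"/*.png; do
        # If nothing matched, the pattern stays literal; skip it.
        [ -e "$image" ] || continue
        # Strip the .png suffix; tesseract appends .txt on its own.
        tesseract "$image" "${image%.png}"
    done
}
```

Calling ocr_all_png scans/ would leave a .txt file next to each .png in that folder.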

PDF to text

tesseract cannot convert a PDF file to text directly. The PDF first has to be converted into an image using convert (read about it here), and only then into text (PDF → image → text).

convert -density 300 dokumen.pdf dokumen.png
tesseract dokumen.png dokumen
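
One catch: for a PDF with more than one page, convert writes numbered files (dokumen-0.png, dokumen-1.png, and so on), so the second command above no longer finds dokumen.png. Below is a sketch that handles any number of pages. Note that it swaps convert for pdftoppm, which comes from the same poppler-utils package as pdftotext; the function name ocr_pdf and the temporary-directory layout are my own assumptions.

```shell
# Hypothetical helper: OCR every page of a PDF into one text file.
# $1 = PDF file; the result is written next to it with a .txt extension.
ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}
    tmpdir=$(mktemp -d) || return 1

    # Render each page at 300 dpi as page-1.png, page-2.png, ...
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page" || { rm -rf "$tmpdir"; return 1; }

    : > "$base.txt"                      # start with an empty output file
    for image in "$tmpdir"/page-*.png; do
        [ -e "$image" ] || continue      # no pages were rendered
        tesseract "$image" stdout >> "$base.txt"
    done
    rm -rf "$tmpdir"
}
```

With this, ocr_pdf dokumen.pdf should produce dokumen.txt containing the pages in order.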

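Another, faster and more practical way is pdftotext (read about it here). It reads the text already embedded in the PDF, so it only helps when the PDF is more than a scanned image.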

pdftotext dokumen.pdf
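
pdftotext returns nothing useful for a PDF that contains only scanned images, so it can help to check first whether there is any text to extract. This is a rough sketch; the function name pdf_has_text is hypothetical, and it relies on pdftotext writing to standard output when the output name is "-".

```shell
# Hypothetical helper: succeed if the PDF has any extractable text.
# $1 = PDF file
pdf_has_text() {
    # Drop all whitespace (including the form feeds pdftotext puts
    # between pages) and test whether anything is left.
    [ -n "$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')" ]
}
```

A possible use: if pdf_has_text dokumen.pdf succeeds, run pdftotext dokumen.pdf; otherwise fall back to the PDF → image → text route with tesseract.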

Happy experimenting, and I hope this helps.

gambar.png
Dummy text

He heard quiet steps behind him. That didn't bode well. Who
could be following him this late at night and in this deadbeat
part of town? And at this particular moment, just after he pulled
off the big time and was making off with the greenbacks. Was
there another crook who'd had the same idea, and was now
watching him and waiting for a chance to grab the fruit of his
labor? Or did the steps behind him mean that one of many law
officers in town was on to him and just waiting to pounce and
snap those cuffs on his wrists? He nervously looked all around.

Suddenly he saw the alley. Like lightning he darted off to the
left and disappeared between the two warehouses almost
falling over the trash can lying in the middle of the sidewalk.
He tried to nervously tap his way along in the inky darkness
and suddenly stiffened: it was a dead-end, he would have to go
back the way he had come.

The steps got louder and louder, he saw the black outline of a
figure coming around the comer. Is this the end of the line? he
thought pressing himself back against the wall trying to make
himself invisible in the dark, was all that planning and energy
wasted? He was dripping with sweat now, cold and wet, he
could smell the fear coming off his clothes. Suddenly next to
him, with a barely noticeable squeak, a door swung quietly to
and fro in the night’s breeze. Could this be the haven he'd
prayed for? Slowly he slid toward the door, pressing himself
more and more into the wall, into the dark, away from his
enemy. Would this door save his hide?
