To me, Linux is fascinating and mysterious at the same time. There is always something I want to learn about how to get a job done with the applications available on Linux, for example OCR tools.
Optical character recognition (also optical character reader, OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape …
In short, OCR tools are programs that read the text in an image file (whether a scan of a printed document or the output of another program) and convert it into a text file. Linux, and Ubuntu in particular, has many OCR programs. One of them is tesseract.
tesseract is a commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has been developed at Google since then.
Open a terminal and type:
sudo apt-get update
sudo apt-get install tesseract-ocr
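To confirm that the installation worked, a small check like the sketch below could be used. check_tesseract is a hypothetical helper name chosen for this example, not something tesseract ships; it relies only on the standard --version flag.

```shell
# Report whether the tesseract binary is available on PATH.
# check_tesseract is a hypothetical helper name used for illustration.
check_tesseract() {
    if command -v tesseract >/dev/null 2>&1; then
        echo "tesseract found"
        # Older releases print the version on stderr, hence 2>&1.
        tesseract --version 2>&1 | head -n 1
    else
        echo "tesseract not found; install it with: sudo apt-get install tesseract-ocr"
        return 1
    fi
}
```

Calling check_tesseract in the terminal should print the first line of the version banner when the package is installed.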
The general syntax is:
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]
or, put simply:
tesseract INPUT OUTPUT
For example:
tesseract gambar.png teks
gambar.png is the scanned image file, here in PNG format. teks is the name of the text file produced from that .png file. The name does not need a .txt suffix, because tesseract appends it automatically and the result is saved as teks.txt.
Besides plain text, tesseract can also save its output as PDF:
tesseract gambar.png teks pdf
In my opinion, though, plain text is the more convenient format, because it is easy to process further.
How long the conversion takes depends on the resolution and quality of the image file. The cleaner and clearer the image, the faster the process, and vice versa.
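When there are many scans to convert, the basic tesseract command shown earlier can be wrapped in a loop. The following is only a sketch: ocr_dir is a hypothetical function name, and it assumes that all images are .png files in a single directory.

```shell
# Run tesseract on every .png file in a directory (default: current directory).
# ocr_dir is a hypothetical helper name; each NAME.png yields NAME.txt beside it.
ocr_dir() {
    dir="${1:-.}"
    for img in "$dir"/*.png; do
        # Skip the unexpanded pattern when no .png files match.
        [ -e "$img" ] || continue
        # Strip the extension; tesseract appends .txt to the output base itself.
        base="${img%.png}"
        echo "OCR: $img -> $base.txt"
        tesseract "$img" "$base"
    done
}
```

For example, ocr_dir ~/scans would process every PNG in a (hypothetical) ~/scans directory, one file at a time.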
PDF to text
tesseract cannot convert a PDF file into text directly. The PDF must first be turned into an image file with ImageMagick's convert, and only then converted into a text file (PDF → image → text).
convert -density 300 dokumen.pdf dokumen.png
tesseract dokumen.png dokumen
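For a PDF with more than one page, convert writes one image per page, so each page needs its own tesseract run. The sketch below shows one possible way to handle that. pdf_ocr is a hypothetical function name, and the %03d numbering pattern in the output name is my own choice so that the pages sort correctly in the shell.

```shell
# Convert every page of a PDF to PNG, OCR each page, and join the text.
# pdf_ocr is a hypothetical helper name used for illustration.
pdf_ocr() {
    pdf="$1"
    base="${pdf%.pdf}"
    # %03d makes ImageMagick write base-000.png, base-001.png, ...
    # so the glob below visits the pages in order.
    convert -density 300 "$pdf" "$base-%03d.png" || return 1
    : > "$base.txt"                      # start with an empty result file
    for page in "$base"-[0-9][0-9][0-9].png; do
        [ -e "$page" ] || continue
        tesseract "$page" "${page%.png}" || return 1
        cat "${page%.png}.txt" >> "$base.txt"
    done
}
```

Running pdf_ocr dokumen.pdf would leave the per-page images and text files in place and collect the combined text in dokumen.txt.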
A quicker and more practical alternative is pdftotext.
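Keep in mind that pdftotext (part of the poppler-utils package on Ubuntu) only extracts text that is already embedded in the PDF. It does no OCR, so a PDF made purely of scanned images comes out empty. The sketch below tries pdftotext and reports when the OCR route is needed instead. pdf_to_text is a hypothetical function name, and "contains at least one letter or digit" is just my rough test for a usable result.

```shell
# Extract embedded text with pdftotext and report whether anything came out.
# pdf_to_text is a hypothetical helper name used for illustration.
pdf_to_text() {
    pdf="$1"
    out="${pdf%.pdf}.txt"
    pdftotext "$pdf" "$out" || return 1
    # A scanned PDF typically yields a file without any letters or digits.
    if grep -q '[[:alnum:]]' "$out"; then
        echo "text layer found: $out"
    else
        echo "no text layer in $pdf; OCR is needed instead"
        return 2
    fi
}
```

A return status of 2 signals that the file should go through the convert and tesseract steps described above.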
Give it a try, and I hope this helps.
Dummy text
He heard quiet steps behind him. That didn't bode well. Who could be following him this late at night and in this deadbeat part of town? And at this particular moment, just after he pulled off the big time and was making off with the greenbacks. Was there another crook who'd had the same idea, and was now watching him and waiting for a chance to grab the fruit of his labor? Or did the steps behind him mean that one of many law officers in town was on to him and just waiting to pounce and snap those cuffs on his wrists? He nervously looked all around. Suddenly he saw the alley. Like lightning he darted off to the left and disappeared between the two warehouses almost falling over the trash can lying in the middle of the sidewalk. He tried to nervously tap his way along in the inky darkness and suddenly stiffened: it was a dead-end, he would have to go back the way he had come. The steps got louder and louder, he saw the black outline of a figure coming around the comer. Is this the end of the line? he thought pressing himself back against the wall trying to make himself invisible in the dark, was all that planning and energy wasted? He was dripping with sweat now, cold and wet, he could smell the fear coming off his clothes. Suddenly next to him, with a barely noticeable squeak, a door swung quietly to and fro in the night’s breeze. Could this be the haven he'd prayed for? Slowly he slid toward the door, pressing himself more and more into the wall, into the dark, away from his enemy. Would this door save his hide?