
OCR Engines: What is the best OCR engine?

OCR Engines perform the "heavy lift" of the OCR operation by getting the raw character data off document images. They analyze the pixels on the image and figure out what text characters they match.

First and foremost, OCR applications require a black and white image in order to determine which pixels on a page are text. So, color and grayscale images must be converted to black and white. This is done by a process called "thresholding", which determines a midpoint between the light pixels and the dark pixels on the page. Pixels lighter than the midpoint are then turned white and darker ones are turned black. You are left with only black and white pixels, with (ideally) all text in black and everything else faded into a white background. The image is turned black and white to divide the page into black pixels (text) and white pixels (the background).

Some OCR Engines also contain de-skewing, despeckling, line removal, aspect ratio normalization, or other pre-processing functions to improve OCR results. OCR Engines typically place these pre-processing functions in a "black box" for users. At best, the OCR Engine may allow you to turn a function "on" or "off" but may not allow you to configure it further to fine-tune its results. Grooper has its own pre-processing capabilities through its Image Processing operations. Custom Image Processing can be performed using IP Profiles made of highly configurable IP Commands.

One of the most important aspects of pre-processing is "segmentation". This is the process of breaking up a page into first lines, then words, and, finally, individual characters. In general, this process involves distinguishing between text and the white space between text. Lines of text are distinguished by the horizontal band of white space between one line and the next. This can be seen using a histogram projection profile. Modern OCR Engines perform more sophisticated character-level segmenting than just looking for small gaps between characters. Characters connected by small artifacts can be isolated from each other and characters that are broken apart can be linked together.
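
The thresholding and line segmentation ideas described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not how Grooper or any particular OCR Engine implements these steps. All function names (threshold_midpoint, binarize, projection_profile, column_profile, find_runs) and the tiny sample page are hypothetical. The cutoff is assumed to be the simple midpoint between the darkest and lightest pixel; real engines generally use more robust methods, such as Otsu's method or adaptive thresholding.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale values, 0 (black) to 255 (white)


def threshold_midpoint(gray: Image) -> int:
    """Assumed rule: cutoff halfway between the darkest and lightest pixel."""
    flat = [p for row in gray for p in row]
    return (min(flat) + max(flat)) // 2


def binarize(gray: Image, cutoff: int) -> Image:
    """Mark dark pixels (text) as 1 and light pixels (background) as 0."""
    return [[1 if p <= cutoff else 0 for p in row] for row in gray]


def projection_profile(binary: Image) -> List[int]:
    """Count text pixels in each row of the image."""
    return [sum(row) for row in binary]


def column_profile(binary: Image) -> List[int]:
    """Count text pixels in each column of the image."""
    return [sum(col) for col in zip(*binary)]


def find_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for runs of non-zero counts.

    The zero-count gaps between runs are the white space that separates
    lines (row profile) or characters and words (column profile).
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# Hypothetical 6x6 grayscale "page" with two lines of dark marks.
page = [
    [250, 250, 250, 250, 250, 250],
    [250,  30,  40, 250,  35, 250],
    [250,  20, 250, 250,  25, 250],
    [250, 250, 250, 250, 250, 250],
    [250,  45,  30,  20, 250, 250],
    [250, 250, 250, 250, 250, 250],
]

cutoff = threshold_midpoint(page)
binary = binarize(page, cutoff)
lines = find_runs(projection_profile(binary))
print(lines)  # [(1, 3), (4, 5)]
```

Running find_runs on column_profile for the rows of a single line (for example binary[1:3]) splits that line at its vertical gaps in the same way. That gap-only approach is the simple baseline this section contrasts with the more sophisticated character-level segmenting done by modern OCR Engines.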
