Converting image-based document files (TIFFs and PDFs) into editable, searchable electronic files requires specialized Optical Character Recognition (OCR) software, which is widely available on the market. However, most of these products are geared toward the English language and fail to produce quality results for Asian, Cyrillic and accented Latin-script languages, whose text relies on characters and accents that English does not use.
In the computer world, these character sets are handled under the Unicode standard, which "provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." The latest version of the Unicode standard represents over 109,000 characters across more than 90 scripts, which obviously poses a significant challenge to OCR tools that have been geared toward the 26 letters of the English alphabet.
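To make the "unique number for every character" idea concrete, the short Python sketch below prints the Unicode code point and official name of one character each from English, an accented Latin-script language, Cyrillic and Chinese. It is only an illustration using Python's standard `unicodedata` module; the helper name `describe` and the sample characters are assumptions chosen for this example and are not part of any OCR product.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


# Sample characters (illustrative only): A, e with acute accent,
# Cyrillic zhe, and a common Chinese ideograph, written as escapes
# so the code points are unambiguous.
sample = "A\u00e9\u0436\u4e2d"

for ch, code_point, name in describe(sample):
    print(ch, code_point, name)
```

Each character maps to exactly one number regardless of platform or program, which is why OCR output for any language can be stored in a single Unicode text file.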
Global EDD Group has invested in specialized OCR tools that accept the Unicode standard and create editable and searchable electronic files for languages from around the world. Additionally, a subset of languages can receive Enhanced OCR Processing, which includes dictionary lookups, format retention and image enhancement. The following are typical service options, though some may not be available for every language.
INPUT FORMATS: Scanned Paper Documents (TIFF, PDF) or Digital Photographs (JPG)
OUTPUT FORMATS: Text Files (TXT), Documents (DOC), Spreadsheets (XLS), Web Pages (HTML)