The robot understands a document based on its structure: it interprets the text together with its position on the page. Every document is converted to a Document Object Model, which captures both the structure and the content.
Document Object Model (DOM)
The Document Object Model is an object-oriented representation of a structured document. The document object sits at the root of the object model. The DOM API makes it possible to manipulate a PDF document and is especially helpful for retrieving the attributes of text segments (text, position, etc.).
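As a rough illustration of the idea, a document object model can be sketched as a tree with the document at the root, pages below it, and words that carry both text and position. The class names, the `find` helper, and the bounding-box layout below are assumptions made for this sketch; they are not the schema UiPath uses.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Word:
    text: str
    # Bounding box as (left, top, width, height); this layout is an assumption.
    box: Tuple[float, float, float, float]


@dataclass
class Page:
    number: int
    words: List[Word] = field(default_factory=list)


@dataclass
class Document:
    # The document object is the root of the model.
    pages: List[Page] = field(default_factory=list)

    def find(self, text: str) -> List[Tuple[int, Word]]:
        """Return (page number, word) pairs whose text matches exactly."""
        return [(p.number, w) for p in self.pages for w in p.words if w.text == text]


# Sample data, made up for illustration.
doc = Document(pages=[
    Page(number=1, words=[
        Word("Invoice", (50.0, 40.0, 80.0, 12.0)),
        Word("Total", (50.0, 700.0, 40.0, 12.0)),
    ]),
])

hits = doc.find("Total")
for page_number, word in hits:
    print(page_number, word.text, word.box)
```

Keeping the position next to the text is what lets later steps reason about layout, for example locating the value printed next to a "Total" label.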
High-level steps in Document Understanding
Document Understanding is used to extract data from documents. The steps followed for data extraction are:
- Digitization
- Classification
- Extraction
In addition to the steps above, a taxonomy is used for classification (document types) and extraction (fields to be extracted).
Taxonomy
A taxonomy defines the types of documents and the fields to be extracted for each document type. It is defined in Taxonomy Manager, which can be accessed from the Design ribbon.

Document types can be organized into groups and categories.
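To make the hierarchy concrete, a taxonomy can be pictured as nested data: groups contain categories, categories contain document types, and each document type lists its fields. The group, category, document type, and field names below are hypothetical, and the nested dictionary is only a sketch, not the format Taxonomy Manager stores.

```python
# Hypothetical taxonomy: group -> category -> document type -> fields.
taxonomy = {
    "Finance": {
        "Payables": {
            "Invoice": ["Invoice Number", "Invoice Date", "Total"],
            "Receipt": ["Vendor", "Date", "Total"],
        },
    },
}


def document_types(tax):
    """Flatten the taxonomy into {document type: fields}."""
    return {
        doc_type: fields
        for categories in tax.values()
        for doc_types in categories.values()
        for doc_type, fields in doc_types.items()
    }


types = document_types(taxonomy)
print(types["Invoice"])
```

The flattened view serves both later steps: its keys are the candidate document types for classification, and its values are the fields to look for during extraction.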
Digitization
Digitization converts a document into a machine-readable form by obtaining its text and structure. It returns the document text and the Document Object Model, a JSON object containing basic information about the document such as text, pages, and page orientation.
Digitization may require an OCR engine when the document is a scan or an image, whereas a digital (native) PDF does not require an OCR engine.
OCR Engine
OCR stands for Optical Character Recognition. It reads the text from scanned documents or images when native files are not available.
OCR Engines available in UiPath
UiPath Document OCR, OmniPage OCR, Google Cloud Vision OCR, Microsoft Azure Computer Vision OCR, Microsoft OCR, Tesseract OCR, and ABBYY Document OCR.
Confidence Score
The OCR engine returns a confidence score for the data it reads from the document, ranging from 0 to 1. The confidence score indicates the quality of the characters recognized from the document or image by the OCR engine.
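A minimal sketch of how such scores might be used: given recognized words with a confidence between 0 and 1, flag the ones that fall below a chosen cut-off. The sample words, their scores, and the 0.8 threshold are made up for illustration and do not come from any OCR engine.

```python
from statistics import mean

# (text, confidence) pairs; values are invented for illustration.
ocr_words = [("Invoice", 0.98), ("N0.", 0.52), ("12345", 0.91)]

THRESHOLD = 0.8  # assumed cut-off, to be tuned per use case

# Words the engine was unsure about, for example "No." misread as "N0.".
low_confidence = [text for text, conf in ocr_words if conf < THRESHOLD]

# Average confidence across all recognized words.
average = mean(conf for _, conf in ocr_words)

print(low_confidence, round(average, 2))
```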
Classification
Classification identifies the type of a document, based on the document types defined in the taxonomy. It relies on the structure and content of the document. The automatic classification can be validated by a human, who corrects it where necessary. The validated classification can then be used to train the classifier.
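One simple way to picture content-based classification is keyword matching: each document type has a list of keywords, and the type whose keywords appear most often in the text wins. This is a toy sketch under that assumption, not the classifier UiPath ships; the `classify` function and the keyword lists are hypothetical.

```python
def classify(text, keywords_by_type):
    """Return (document type, score), where score is the fraction of keywords found."""
    words = set(text.lower().split())
    best_type, best_score = None, 0.0
    for doc_type, keywords in keywords_by_type.items():
        score = sum(k in words for k in keywords) / len(keywords)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score


# Hypothetical keywords per document type.
keywords_by_type = {
    "Invoice": ["invoice", "total", "due"],
    "Receipt": ["receipt", "paid", "change"],
}

result = classify("Invoice 12345 total amount due on receipt", keywords_by_type)
print(result)
```

A document that matches no keywords comes back as `(None, 0.0)`, which is the kind of case that would be left to a human to classify.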
Extraction
Extraction retrieves information from the document based on the fields defined in the taxonomy. Once extraction is complete, the extracted information can be validated by a human, depending on the confidence score. The validated information can later be used to train the extractor and improve the quality of the extracted data.
The extracted information can then be used in the process as required.
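The confidence-based validation described above can be sketched as a routing step: fields extracted with high confidence are accepted automatically, and the rest are queued for a human to check. The `route` function, the field values, and the 0.8 threshold are assumptions made for this example.

```python
def route(extracted, threshold=0.8):
    """Split extracted fields into accepted values and fields needing human validation."""
    accepted, needs_review = {}, []
    for name, (value, confidence) in extracted.items():
        if confidence >= threshold:
            accepted[name] = value
        else:
            needs_review.append(name)
    return accepted, needs_review


# field name -> (extracted value, confidence); values are invented for illustration.
extracted = {
    "Invoice Number": ("12345", 0.95),
    "Total": ("1,250.00", 0.61),
}

accepted, needs_review = route(extracted)
print(accepted, needs_review)
```

Lowering the threshold sends fewer fields to human review at the cost of accepting more uncertain values, so the cut-off is a trade-off to decide per process.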