Concept of Document Understanding

The robot understands a document from its structure: it reads the text together with the position of that text on the page. Every document is converted to a document object model, which holds both the structure and the content.

Document Object Model (DOM)

The Document Object Model represents a structured document as an object-oriented model, with the document object at the root. The DOM API makes it possible to manipulate a PDF document programmatically and is especially helpful for retrieving the attributes of text segments (text, position, etc.).
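To make the idea concrete, here is a minimal sketch of such an object model in Python. The class and attribute names (`Document`, `TextSegment`, `box`) are assumptions chosen for illustration and are not the actual UiPath DOM types.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextSegment:
    """A piece of text plus where it sits on the page (hypothetical shape)."""
    text: str
    page: int
    box: Tuple[float, float, float, float]  # (left, top, width, height)


@dataclass
class Document:
    """Root object of this illustrative model."""
    segments: List[TextSegment] = field(default_factory=list)

    def full_text(self) -> str:
        # Reading order: by page, then top coordinate, then left coordinate.
        ordered = sorted(self.segments, key=lambda s: (s.page, s.box[1], s.box[0]))
        return " ".join(s.text for s in ordered)

    def segments_on_page(self, page: int) -> List[TextSegment]:
        return [s for s in self.segments if s.page == page]


doc = Document([
    TextSegment("12345", page=0, box=(120.0, 40.0, 50.0, 12.0)),
    TextSegment("Invoice", page=0, box=(20.0, 40.0, 60.0, 12.0)),
    TextSegment("Total", page=1, box=(20.0, 500.0, 40.0, 12.0)),
])
print(doc.full_text())
```

Because every segment carries its position, the text can be put back into reading order even when the segments arrive unordered.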

High level steps in Document Understanding

Document Understanding is used to extract data from documents. Data extraction follows these steps:

Digitization
Classification
Extraction

In addition to the steps above, a Taxonomy is used for classification (document types) and extraction (fields to be extracted).
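The flow of the three steps, with the taxonomy feeding classification and extraction, can be sketched roughly as follows. Every function here is a simplified stand-in written for this illustration; none of them is a UiPath activity or API, and the sample taxonomy is invented.

```python
from typing import Dict, List

# Hypothetical taxonomy: document type -> fields to extract.
TAXONOMY: Dict[str, List[str]] = {
    "Invoice": ["Invoice Number", "Total"],
    "Receipt": ["Total"],
}


def digitize(raw: str) -> str:
    # Stand-in for digitization: the input is already text here,
    # so this only normalises whitespace.
    return " ".join(raw.split())


def classify(text: str, taxonomy: Dict[str, List[str]]) -> str:
    # Stand-in classifier: first document type whose name appears in the text.
    for doc_type in taxonomy:
        if doc_type.lower() in text.lower():
            return doc_type
    return "Unknown"


def extract(text: str, fields: List[str]) -> Dict[str, str]:
    # Stand-in extractor: take the token that follows "<field>:".
    result: Dict[str, str] = {}
    for name in fields:
        marker = name + ":"
        if marker in text:
            tokens = text.split(marker, 1)[1].split()
            if tokens:
                result[name] = tokens[0]
    return result


def process(raw: str) -> Dict[str, str]:
    text = digitize(raw)
    doc_type = classify(text, TAXONOMY)
    return extract(text, TAXONOMY.get(doc_type, []))


print(process("Invoice\nInvoice Number:  A-17   Total: 250.00"))
```

The point of the sketch is the order of the calls: the document type found by classification decides which taxonomy fields the extraction step looks for.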

Taxonomy

The taxonomy defines the document types and the fields to be extracted for each document type. It is defined in the Taxonomy Manager, which can be accessed from the Design ribbon.

Document types can be organized into groups and categories.
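As a rough picture of what a taxonomy holds, the sketch below models document types with their group, category, and fields. The structure and all sample values ("Finance", "Payables", and so on) are assumptions for illustration; the Taxonomy Manager stores its data in its own format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DocumentType:
    group: str
    category: str
    name: str
    fields: Tuple[str, ...]  # fields to be extracted for this type


# Invented sample taxonomy.
taxonomy: List[DocumentType] = [
    DocumentType("Finance", "Payables", "Invoice", ("Invoice Number", "Total")),
    DocumentType("Finance", "Payables", "Receipt", ("Total",)),
    DocumentType("HR", "Onboarding", "ID Card", ("Name", "Date of Birth")),
]


def fields_for(name: str) -> Tuple[str, ...]:
    # Look up the fields defined for one document type.
    for doc_type in taxonomy:
        if doc_type.name == name:
            return doc_type.fields
    return ()


def types_in(group: str, category: str) -> List[str]:
    # List the document types filed under a group and category.
    return [t.name for t in taxonomy if t.group == group and t.category == category]


print(fields_for("Invoice"))
print(types_in("Finance", "Payables"))
```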

Digitization

Digitization converts a document into a machine-readable form by obtaining its text and structure. It returns the text and the document object model, a JSON object containing basic information about the document such as text, pages, and page orientation.

Scanned documents and images may need an OCR engine to be digitized, whereas digital (native) PDFs do not require one.
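The decision between the two paths can be illustrated with a small sketch. It assumes that a page with an empty embedded text layer is a scan, which is a simplification, and the JSON key names are made up for this example rather than taken from the UiPath output.

```python
import json
from typing import List


def needs_ocr(page_texts: List[str]) -> bool:
    # Assumption: a page whose embedded text layer is empty is a scan or image.
    return any(not text.strip() for text in page_texts)


def summarize(page_texts: List[str], rotations: List[int]) -> str:
    # Build a small JSON summary with text, pages, and page orientation.
    pages = [
        {"index": i, "rotation": rot, "text": text}
        for i, (text, rot) in enumerate(zip(page_texts, rotations))
    ]
    return json.dumps({"text": "\n".join(page_texts), "pages": pages})


native_pdf = ["Invoice 12345", "Total 250.00"]  # text layer present
scanned_pdf = ["", ""]                          # no text layer

print(needs_ocr(native_pdf))
print(needs_ocr(scanned_pdf))
print(summarize(native_pdf, [0, 0]))
```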

OCR Engine

OCR stands for Optical Character Recognition. It reads the text from scanned documents or images when native files are not available.

OCR Engines available in UiPath

UiPath Document OCR, OmniPage OCR, Google Cloud Vision OCR, Microsoft Azure Computer Vision OCR, Microsoft OCR, Tesseract OCR, and ABBYY Document OCR.

Confidence Score

The OCR engine returns a confidence score for the data it reads from the document, ranging from 0 to 1. The score indicates how reliably the characters in the document or image were recognised by the OCR engine.
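One simple way to work with such scores is shown below. The per-word scores are invented, and averaging them into a document-level score is an assumption made for this sketch; it is not how any particular OCR engine computes its result.

```python
from statistics import mean
from typing import List, Tuple

# Invented OCR output: (recognised word, confidence between 0 and 1).
words: List[Tuple[str, float]] = [
    ("Invoice", 0.98),
    ("N0", 0.41),      # likely a misread of "No"
    ("12345", 0.93),
]


def document_confidence(items: List[Tuple[str, float]]) -> float:
    # Average of the word scores; 0.0 when nothing was recognised.
    return mean(score for _, score in items) if items else 0.0


def low_confidence_words(items: List[Tuple[str, float]], threshold: float = 0.8) -> List[str]:
    # Words that fall below the threshold and may need a second look.
    return [word for word, score in items if score < threshold]


print(round(document_confidence(words), 3))
print(low_confidence_words(words))
```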

Classification

Classification identifies the type of a document, based on the document types defined in the taxonomy. It is performed using the structure and content of the document. A human can validate the result to correct the automatic classification, and the validated results can then be used to train the classifier.
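A very small keyword-based classifier gives a feel for content-based classification. This is a toy sketch under stated assumptions: the keyword sets are invented, and the scoring rule (share of keywords found) is only one of many possible choices and is not the method used by UiPath classifiers.

```python
from typing import Dict, Set, Tuple

# Invented keyword sets, one per document type in a hypothetical taxonomy.
KEYWORDS: Dict[str, Set[str]] = {
    "Invoice": {"invoice", "due", "bill"},
    "Receipt": {"receipt", "paid", "cashier"},
}


def classify(text: str) -> Tuple[str, float]:
    # Score each type by the share of its keywords that occur in the text,
    # then return the best type together with its score.
    words = set(text.lower().split())
    scores = {
        doc_type: len(words & keywords) / len(keywords)
        for doc_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


label, score = classify("Invoice 42 due on 1 May")
print(label, round(score, 2))
```

A low score is a natural signal to send the document to a human for validation before the result is trusted.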

Extraction

Extraction retrieves information from the document based on the fields defined in the taxonomy. Once extraction is done, a human can validate the extracted information, guided by the confidence score. The validated information can later be used to train the extractor and improve the quality of the extracted data.
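The confidence-based hand-off to a human can be sketched as a simple threshold check. The `ExtractedField` class, the sample values, and the 0.85 threshold are all assumptions for illustration; a suitable threshold depends on the documents and the process.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # between 0 and 1


def split_for_validation(
    fields: List[ExtractedField], threshold: float = 0.85
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    # Fields at or above the threshold are accepted automatically;
    # the rest are queued for human validation.
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Invented extraction result.
extracted = [
    ExtractedField("Invoice Number", "A-17", 0.97),
    ExtractedField("Total", "250.00", 0.62),
]

accepted, needs_review = split_for_validation(extracted)
print([f.name for f in accepted])
print([f.name for f in needs_review])
```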

The extracted information can then be used in the process as required.

Pathrudu Chintakayala

© 2025, All Rights Reserved.