Publishing our work.
Common statistical or data mining methods cannot cope with unstructured data (e.g., texts, images or videos) and the business analyses are often based on the structured information only. The unstructured sources are thus omitted even though they usually contain important insights. The DATLOWE solution can help to process the text data, so that they can be analyzed by classic statistical algorithms.
We closely collaborate with Faculty of Mathematics and Physics at Charles University in Prague and we use their latest results of research on computational linguistics and ontology engineering.
Lately, we focus on projects in customer analytics, healthcare and automated processing of legal documents. In all of these areas people are struggling with huge amount of text data: emails and online reviews from customers, medical reports or all sorts of contracts. In this paper, we will shortly describe our legal documents pipeline.
Law or real estate companies manage tens of thousands of cases, which contain related legal documents (usually some contracts or licenses, their amendments, invoices, blueprints, letters, etc.). Currently, looking up some specific information is almost impossible. Our goal is to automate the processing of these cases: to categorize each document and to extract the most critical data about contract subject, mentioned parties, terms or fees. Information extraction along with full text search would enable the owners to filter their documents and easily perform other reporting tasks and analyses.
The documents are usually only scanned PDFs so we start with OCR engine to get them into machine readable form. We also collect information about formatting and positions of pages, paragraphs, pictures and tables as the side product of OCR analysis.
Then we run our linguistic tools, which include state-of-the-art language models, on machine readable texts from OCR. The most essential part is lemmatization, which transforms the words into its base forms – lemmas. The base form of nouns is nominative of singular, the lemma of verbs is the infinitive etc. This is superior to stemming, which simply trims prefixes and suffixes.
After linguistic analysis, the text data are stored in a structured table (one row for each lemma) and can be further transformed into document term matrix – pivot table where documents are in rows, lemmas are in columns and cells contain (normalized) frequencies. We also add semantic features (e.g., frequencies of mentioned dates, personal names, amounts, towns), which are products of our annotation application.
We build another set of attributes from document layout. We compute number of pages, number of characters per page, absolute and relative page proportions, absolute and relative areas of paragraphs, tables and pictures and many other indicators. Relationships between some of these variables and the type of the document is obvious even without any data mining technique. E.g., letters are usually shorter than contracts and reports contain more pictures than contracts.
We also create several page sectors by dividing each page to quarters, halves and header, body and footer (see figure 1). We then observe our attributes per whole page and (when it is possible) per these sectors. This is based on typical layout of the documents, e.g., business letters usually contain address in Q1 sector on the first page.
This means that we create more than 1300 variables based solely on layout and we add them to the document term matrix. Document term matrix by itself is usually large and sparse and therefore feature selection can’t be omitted before following analysis to prevent overfitting.
Both of our tasks (document and paragraph classification) belong to supervised learning, so we need target variables. We use manually annotated data (thousands of documents) to create our models and we train separate model for each business case as we believe that each data set is unique and some general model wouldn’t provide required performance.
Data set is aggregated on document level (one row per document) and divided into training, testing and evaluation partitions. Document type attribute from manually annotated data serves as the target variable. Document type is nominal variable having a number of categories (e.g., contract, plan, report) and each document belongs to exactly one of them. Therefore, we use multinomial classification methods.
We train several classification models (e.g., random forests, gradient boost machines, AdaBoost) and apply ensemble modeling. We tune our hyperparameters on the testing data and keep the evaluation data intact for final performance assessment.
The model performance depends on the data set, number of categories and selected machine learning method, however it usually achieves accuracy over 0.95 and sensitivity over 0.9 for most of the categories. It is common, that the most important variables are mixture of lemmas, semantic features and layout variables (so the target significantly depends not only on the text, but on the layout of the document as well).
During the information extraction phase, we try to highlight specific text chunks about parties, options, clauses and other mentioned facts.
When analyzing the paragraphs, each paragraph can belong to more than one category, therefore we use several binary classifiers (one per each paragraph type). Data set is aggregated on paragraph level and the training procedure is similar to the one for categorizing whole documents.
After we preselect paragraphs (e.g., all the paragraphs that are likely to contain options), we apply sets of rules to determine the details (e.g., what is the option type, can it be exercised by the landlord or tenant, what are the dates). These can be simple regular expressions or very complex business rules.
At the end, we implement our models, so they can be used for batch or real time scoring of newly arrived documents.
We usually display the results in our application, which can be used to browse the processed documents easily, to check the details prefilled by machine learning models (and to correct them when the scoring isn’t exact) or to create advanced reports.
One of the biggest obstacles can be the first step in our analysis pipeline - to prepare data for analysis. The data of our customers is stored in different document formats, e.g. machine readable PDFs, scans, MS office documents like words, excels and others. There are two scenarios how to deploy our pipeline. We can run the pipeline in the cloud or in the environment of our customers. A setup of the pipeline should be as simple as possible and without any additional license costs for our customers.
We had a solution for document transformation implemented in .NET on Windows machines with dependency on MS Outlook and MS Office products. In order to automate this very important step in the pipeline and make it scalable, we wanted to build this as an asynchronous, highly scalable microservice without heavy dependencies.
The Aspose SDK provided us a Java API which we are able to run without the dependency on Windows environment. We are now able to build lightweight text extraction and file conversion microservice, packaged as a docker image. It is now possible to run these microservices in a highly parallel and scalable way in our Apache Mesos environment. It is also very simple to setup a docker environment and deploy these microservices in the environment of our customers.
We have implemented basic functionality of office documents converter by ourselves in the past and tried several other free and paid solutions. But we found out the Aspose.Total for Java be the clear winner in this case. It has a lot of functionality, is easy to use with clear documentation and no dependency on Windows or native Office .NET API. It is easily scalable and embeddable, and has convenient license policy. We are also using the functionality of text extraction from machine readable PDF and Word files with metadata.
We plan to experiment with the OCR-library and try to collaborate with Aspose to provide support for East European Languages. In the next coming months, we are going to implement the document generation as well and we are looking forward to use Aspose for this scenario too.
We see Aspose library as the key driver of several of our core components in document analysis pipeline. It is easy to recommend Aspose to anyone who has a need to deal with different Microsoft Office and PDF documents in different scenarios ranging from text extraction to document conversion.