Publishing our work.
Have you ever thought about the amount of data produced every day in every hospital? All those anamneses, daily diagnoses, laboratory reports, nursing records, X-ray reports, patient education protocols and agreements, discharge summaries and many more. The majority of these files are written in plain text; they are meant to be archived and never read again once the treatment is over. However, that is about to change, as we have spent the past year working intensively on our platform for processing medical records.
Fortunately, most of these documents are nowadays typed on computers (even though some doctors in Czechia still prefer to write by hand). The share of typed documents will probably increase further as governments push physicians to adopt information technologies (e.g., the introduction of electronic prescriptions).
Going through all medical records manually is extremely demanding, since hospitals produce an enormous volume of text.
We found that a hospital department with 4,000 hospitalizations per year can produce more than 100,000 standard pages of routine medical records describing those hospitalizations each year. Just one department! Suppose you wanted to read through all of these records and prepare an analysis. Even a fast reader (50 pages an hour, 8 hours a day, 225 days a year) would need more than a year just to get through the texts. And this is only one department. Preparing such an analysis for a whole medium-sized hospital with 20,000 hospitalizations per year would take approximately 5 person-years of reading. Not to mention that another comparable analysis would mean 5 person-years all over again.
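The reading-time estimate above can be reproduced with a few lines of arithmetic, using the figures quoted in the text:

```python
# Back-of-the-envelope estimate of the manual reading effort,
# using the figures quoted above.
pages_per_department = 100_000   # standard pages per department per year
pages_per_hour = 50              # a fast reader
hours_per_day = 8
work_days_per_year = 225

pages_per_person_year = pages_per_hour * hours_per_day * work_days_per_year
# 50 * 8 * 225 = 90,000 pages per person per year

years_one_department = pages_per_department / pages_per_person_year
print(round(years_one_department, 2))   # more than a year for one department

# A medium-sized hospital with 20,000 hospitalizations (5x the department)
# therefore needs roughly 5 person-years of reading:
print(round(5 * pages_per_department / pages_per_person_year, 1))
```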
Therefore, we have applied natural language processing (NLP) methods, so our platform can process medical records and extract relevant information in a fraction of the time needed by humans (hours, not years). Our solution gains insights about patient history (diagnoses, symptoms, medication, etc.) directly from unstructured medical records, connects them into compendious timelines and makes them available in a database or an information discovery platform (where users can search and browse more than just the extracted information). We further analyze these data to discover high-risk patients or to classify hospitalizations.
We are successful in detecting potential healthcare-associated infections (HAI) and their risk factors.
These are infections that patients acquire while receiving treatment in a healthcare facility (e.g., ventilator-associated pneumonia or catheter-associated urinary tract infection). Such infections can be life-threatening for patients and are definitely costly for hospitals. Moreover, we also look for the factors that are likely to cause HAI, in order to improve preventive measures.
We can help with the design of medical research experiments: the extracted data can be browsed in our information discovery application, so we can easily define groups of patients with specific combinations of conditions and laboratory results.
We are now working on another use case: utilizing the information extracted from medical records to support the classification of hospitalizations into diagnosis-related groups (DRG), which can improve coding quality and raise reimbursements from DRG-based payment systems.
Common statistical and data mining methods cannot cope with unstructured data (e.g., texts, images or videos), so business analyses are often based on structured information only. The unstructured sources are thus omitted even though they usually contain important insights. The DATLOWE solution helps to process the text data so that they can be analyzed by classic statistical algorithms.
We collaborate closely with the Faculty of Mathematics and Physics at Charles University in Prague and build on its latest research results in computational linguistics and ontology engineering.
Lately, we have focused on projects in customer analytics, healthcare and automated processing of legal documents. In all of these areas, people struggle with huge amounts of text data: emails and online reviews from customers, medical reports, or all sorts of contracts. In this paper, we briefly describe our legal documents pipeline.
Law firms and real estate companies manage tens of thousands of cases containing related legal documents (typically contracts or licenses, their amendments, invoices, blueprints, letters, etc.). Currently, looking up specific information is almost impossible. Our goal is to automate the processing of these cases: to categorize each document and to extract the most critical data about the contract subject, the parties mentioned, terms and fees. Information extraction, combined with full-text search, would enable the owners to filter their documents and easily perform further reporting tasks and analyses.
The documents are usually only scanned PDFs, so we start with an OCR engine to get them into machine-readable form. As a by-product of the OCR analysis, we also collect information about formatting and the positions of pages, paragraphs, pictures and tables.
Then we run our linguistic tools, which include state-of-the-art language models, on the machine-readable texts from OCR. The most essential part is lemmatization, which transforms words into their base forms, or lemmas: the base form of a noun is the nominative singular, the lemma of a verb is the infinitive, and so on. This is superior to stemming, which simply trims prefixes and suffixes.
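The difference between lemmatization and stemming can be illustrated with a toy example. Our production pipeline uses trained language models; the hand-made dictionary and crude suffix stripper below are purely illustrative:

```python
# Toy lemmatizer (dictionary lookup) vs. naive stemmer (suffix trimming).
# The dictionary entries are illustrative English examples only.
LEMMA_DICT = {
    "was": "be", "were": "be", "is": "be",   # irregular verb forms -> infinitive
    "mice": "mouse",                         # irregular plural -> nominative singular
    "contracts": "contract", "signed": "sign",
}

def lemmatize(word: str) -> str:
    """Look the word up; fall back to the lowercased word itself."""
    return LEMMA_DICT.get(word.lower(), word.lower())

def naive_stem(word: str) -> str:
    """Crude stemmer: just strip a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Stemming misses irregular forms that lemmatization handles:
print(lemmatize("was"), naive_stem("was"))    # 'be' vs. 'was'
print(lemmatize("mice"), naive_stem("mice"))  # 'mouse' vs. 'mice'
```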
After the linguistic analysis, the text data are stored in a structured table (one row per lemma) and can be further transformed into a document-term matrix: a pivot table where documents are in rows, lemmas in columns, and cells contain (normalized) frequencies. We also add semantic features (e.g., frequencies of mentioned dates, personal names, amounts or towns), which are produced by our annotation application.
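A minimal sketch of building such a document-term matrix from already-lemmatized documents (the document names and lemmas are made-up examples):

```python
from collections import Counter

# Document-term matrix: rows = documents, columns = lemmas,
# cells = relative (normalized) frequencies.
docs = {
    "doc1": ["contract", "party", "fee", "contract"],
    "doc2": ["letter", "party"],
}

# Columns are the sorted vocabulary across all documents.
vocabulary = sorted({lemma for lemmas in docs.values() for lemma in lemmas})

def dtm_row(lemmas):
    """One matrix row: frequency of each vocabulary term, normalized by length."""
    counts = Counter(lemmas)
    total = len(lemmas)
    return [counts[term] / total for term in vocabulary]

dtm = {doc_id: dtm_row(lemmas) for doc_id, lemmas in docs.items()}
print(vocabulary)   # ['contract', 'fee', 'letter', 'party']
print(dtm["doc1"])  # [0.5, 0.25, 0.0, 0.25]
```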
We build another set of attributes from the document layout. We compute the number of pages, the number of characters per page, absolute and relative page proportions, absolute and relative areas of paragraphs, tables and pictures, and many other indicators. The relationship between some of these variables and the type of the document is obvious even without any data mining technique: letters are usually shorter than contracts, and reports contain more pictures than contracts do.
We also create several page sectors by dividing each page into quarters, halves, and header, body and footer (see Figure 1). We then observe our attributes per whole page and, where possible, per sector. This reflects the typical layout of the documents; e.g., business letters usually contain the address in the Q1 sector of the first page.
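The sector assignment can be sketched as follows. The coordinate convention and threshold values below are assumptions for illustration; the sector names (Q1–Q4, left/right, top/bottom, header/body/footer) follow the scheme described above:

```python
# Assign a point on a page to overlapping page sectors. Coordinates are
# normalized: (0, 0) = top-left corner, (1, 1) = bottom-right corner.
# The 10% header/footer bands are an illustrative assumption.
def sectors(x: float, y: float) -> set[str]:
    """Return every sector that contains the point (x, y)."""
    result = set()
    # Quarters Q1..Q4, numbered left-to-right, top-to-bottom.
    result.add(("Q1" if x < 0.5 else "Q2") if y < 0.5 else ("Q3" if x < 0.5 else "Q4"))
    # Vertical and horizontal halves.
    result.add("left" if x < 0.5 else "right")
    result.add("top" if y < 0.5 else "bottom")
    # Header / body / footer bands.
    if y < 0.1:
        result.add("header")
    elif y > 0.9:
        result.add("footer")
    else:
        result.add("body")
    return result

# An address block in the upper-left corner of a business letter:
print(sorted(sectors(0.2, 0.05)))  # ['Q1', 'header', 'left', 'top']
```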
In total, we create more than 1,300 variables based solely on layout and add them to the document-term matrix. The document-term matrix by itself is usually large and sparse, so feature selection must not be omitted before the subsequent analysis, in order to prevent overfitting.
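One simple feature-selection step for a sparse document-term matrix is document-frequency filtering: dropping terms that occur in too few or too many documents. This is only a sketch of one possible criterion, not our full selection procedure:

```python
# Document-frequency filter: keep terms whose document frequency lies
# between min_df and max_df_ratio * number of documents.
def select_by_document_frequency(dtm, min_df=2, max_df_ratio=0.9):
    """dtm is a sparse matrix as {doc_id: {term: frequency}}."""
    n_docs = len(dtm)
    df = {}
    for terms in dtm.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    keep = {t for t, f in df.items() if min_df <= f <= max_df_ratio * n_docs}
    return {doc: {t: v for t, v in terms.items() if t in keep}
            for doc, terms in dtm.items()}

dtm = {
    "d1": {"contract": 0.5, "rare": 0.5},
    "d2": {"contract": 0.3, "party": 0.7},
    "d3": {"contract": 0.4, "party": 0.6},
}
pruned = select_by_document_frequency(dtm)
# 'rare' appears in only one document, 'contract' in all three;
# both are dropped, only 'party' survives.
print(sorted({t for d in pruned.values() for t in d}))  # ['party']
```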
Both of our tasks (document and paragraph classification) belong to supervised learning, so we need target variables. We use manually annotated data (thousands of documents) to create our models, and we train a separate model for each business case, because we believe that each data set is unique and a general model would not provide the required performance.
The data set is aggregated at the document level (one row per document) and divided into training, testing and evaluation partitions. The document type attribute from the manually annotated data serves as the target variable. Document type is a nominal variable with a number of categories (e.g., contract, plan, report), and each document belongs to exactly one of them; therefore, we use multinomial classification methods.
We train several classification models (e.g., random forests, gradient boosting machines, AdaBoost) and apply ensemble modeling. We tune the hyperparameters on the testing data and keep the evaluation data untouched for the final performance assessment.
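The ensemble step can be sketched as a majority vote over several models. The three trivial rule-based "classifiers" below are placeholders for the trained models mentioned above; their rules and thresholds are made-up assumptions:

```python
from collections import Counter

# Stand-ins for trained models: each "classifier" votes for a document type.
def clf_keywords(doc):   # votes based on a strong lemma
    return "contract" if "lease" in doc else "letter"

def clf_length(doc):     # votes based on document length
    return "contract" if len(doc) > 40 else "letter"

def clf_layout(doc):     # stand-in for a layout-based model
    return "letter" if doc.startswith("dear") else "contract"

MODELS = [clf_keywords, clf_length, clf_layout]

def ensemble_predict(doc):
    """Majority vote across all member models."""
    votes = Counter(model(doc) for model in MODELS)
    return votes.most_common(1)[0][0]

doc = "dear sir, regarding the lease agreement enclosed herewith"
print(ensemble_predict(doc))  # two of three models vote 'contract'
```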
Model performance depends on the data set, the number of categories and the selected machine learning method; however, we usually achieve accuracy over 0.95 and sensitivity over 0.9 for most of the categories. The most important variables are commonly a mixture of lemmas, semantic features and layout variables, so the target depends significantly not only on the text, but on the layout of the document as well.
During the information extraction phase, we try to highlight specific text chunks about parties, options, clauses and other mentioned facts.
When analyzing paragraphs, each paragraph can belong to more than one category, so we use several binary classifiers (one per paragraph type). The data set is aggregated at the paragraph level, and the training procedure is similar to the one for categorizing whole documents.
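The multi-label setup can be sketched as one independent binary decision per paragraph type. The keyword rules below are hypothetical placeholders for the trained binary classifiers:

```python
# One binary classifier per paragraph type; a paragraph may match several.
# The keyword rules are illustrative stand-ins for trained models.
PARAGRAPH_CLASSIFIERS = {
    "option": lambda p: "option" in p or "renew" in p,
    "fee":    lambda p: "fee" in p or "payment" in p,
    "party":  lambda p: "landlord" in p or "tenant" in p,
}

def classify_paragraph(paragraph: str) -> set[str]:
    """Apply every binary classifier; return all matching labels."""
    text = paragraph.lower()
    return {label for label, clf in PARAGRAPH_CLASSIFIERS.items() if clf(text)}

p = "The Tenant may exercise the option to renew upon payment of a fee."
print(sorted(classify_paragraph(p)))  # ['fee', 'option', 'party']
```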
After we preselect paragraphs (e.g., all paragraphs that are likely to contain options), we apply sets of rules to determine the details (e.g., what the option type is, whether it can be exercised by the landlord or the tenant, and what the relevant dates are). These rules range from simple regular expressions to very complex business rules.
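On the simple end of that spectrum, such rules look like the regular expressions below. The patterns and the extracted fields are simplified assumptions, not our actual rule set:

```python
import re

# Illustrative extraction rules for a preselected "option" paragraph.
OPTION_HOLDER = re.compile(r"\b(landlord|tenant)\b", re.IGNORECASE)
OPTION_DATE = re.compile(r"\b(\d{1,2}\s+\w+\s+\d{4})\b")  # e.g. "31 December 2025"

def extract_option_details(paragraph: str) -> dict:
    """Pull out who may hold the option and any mentioned dates."""
    holders = [m.lower() for m in OPTION_HOLDER.findall(paragraph)]
    dates = OPTION_DATE.findall(paragraph)
    return {"holders": sorted(set(holders)), "dates": dates}

p = ("The Tenant may exercise the renewal option by written notice "
     "to the Landlord no later than 31 December 2025.")
print(extract_option_details(p))
# {'holders': ['landlord', 'tenant'], 'dates': ['31 December 2025']}
```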
Finally, we deploy our models so they can be used for batch or real-time scoring of newly arrived documents.
We usually display the results in our application, which can be used to browse the processed documents easily, to check the details prefilled by the machine learning models (and to correct them when the scoring is not exact), or to create advanced reports.
One of the biggest obstacles can be the first step in our analysis pipeline: preparing the data for analysis. Our customers' data is stored in many different document formats, e.g., machine-readable PDFs, scans, and MS Office documents such as Word and Excel files. There are two deployment scenarios for our pipeline: we can run it in the cloud or in the customer's own environment. The setup of the pipeline should be as simple as possible and should incur no additional license costs for our customers.
We used to have a document transformation solution implemented in .NET on Windows machines, with dependencies on MS Outlook and MS Office products. In order to automate this very important step in the pipeline and make it scalable, we wanted to rebuild it as an asynchronous, highly scalable microservice without heavy dependencies.
The Aspose SDK provides a Java API that we can run without depending on a Windows environment. This lets us build a lightweight text extraction and file conversion microservice, packaged as a Docker image. These microservices can run in a highly parallel and scalable way in our Apache Mesos environment, and it is also very simple to set up a Docker environment and deploy them on the customer's side.
In the past, we implemented basic office document conversion functionality ourselves and tried several other free and paid solutions, but we found Aspose.Total for Java to be the clear winner. It offers a lot of functionality, is easy to use, has clear documentation, and has no dependency on Windows or the native Office .NET API. It is easily scalable and embeddable, and its license policy is convenient. We also use its text extraction from machine-readable PDF and Word files, including metadata.
We plan to experiment with the OCR library and hope to collaborate with Aspose on support for Eastern European languages. In the coming months, we are also going to implement document generation, and we look forward to using Aspose for that scenario too.
We see the Aspose library as a key driver of several core components of our document analysis pipeline. We can easily recommend Aspose to anyone who needs to work with various Microsoft Office and PDF documents, in scenarios ranging from text extraction to document conversion.