Automated data extraction from scanned documents with MATLAB
Overview
Companies and government authorities store huge amounts of data in scanned PDF files such as invoices, maintenance reports, forms, and contracts. Although these files contain important information in structured or semi-structured form, such as tables, charts, or images, accessing and processing that data in a convenient, ideally automated way is often a challenge. Efficient data digitization is therefore a high priority for many organizations.
MathWorks offers a broad range of solutions to extract and process data of various types, including text, charts, graphs, and tables, from scanned PDF files. Advanced image and text processing capabilities enable efficient post-processing and seamless integration into existing workflows.
In this webinar, we will show a live example of how to extract and process tabular data from scanned PDFs using a publicly available dataset. Intermediate steps in this workflow include converting the scanned PDF files to high-quality images and generating free-form text using optical character recognition (OCR) techniques. The text data is then cleaned and preprocessed before being exported, for example to Excel files.
You will also get insight into various other data extraction use cases that can be solved using MathWorks technology. The session concludes with a discussion of how to apply machine learning and deep learning methods to the resulting data. Prior knowledge of MATLAB or data extraction techniques is not required.
Highlights
We will cover the following topics:
- Efficient data digitization with MATLAB using different data extraction techniques
- Overview of advanced text analytics and image processing options in MATLAB
- Case study: extraction of tabular data from scanned reports containing agricultural data
- Machine Learning and Deep Learning capabilities for automated data extraction
About the Presenter
Sagar Hukkire, Application Engineer
Sagar Hukkire works as an Application Engineer, focusing on the use of MATLAB in data science, text analytics, and image processing applications. Before joining MathWorks in 2020, he worked for an international consultancy, developing AI solutions for business applications. Sagar holds a Master of Science degree in Information Technology from the University of Stuttgart, Germany.
Recorded: 19 Oct 2021
Dataset Sources
Clemens, Michael, 2017, "Raw scanned PDFs of primary sources for workers, wages, and crops", Harvard Dataverse, V1