Intelligently Extract Text & Data from Document with OCR NER

Develop a document scanner app that performs named entity extraction on scanned documents with OpenCV, Pytesseract, and spaCy

Welcome to the course “Intelligently Extract Text & Data from Document with OCR NER”!
In this course, you will learn how to develop a custom Named Entity Recognizer. The main idea is to extract entities from scanned documents such as invoices, business cards, shipping bills, and bills of lading. For the sake of data privacy, we restrict our examples to business cards, but you can apply the framework explained here to all kinds of financial documents. The curriculum we follow to develop the project is given below.
To develop this project, we will use two main technologies in data science:
  1. Computer Vision
  2. Natural Language Processing
In the Computer Vision module, we will scan the document, identify the location of the text, and finally extract the text from the image. Then, in the Natural Language Processing module, we will extract the entities from the text, do the necessary text cleaning, and parse the entities.
Python libraries used in the Computer Vision module:
  • OpenCV
  • NumPy
  • Pytesseract
Python libraries used in the Natural Language Processing module:
  • spaCy
  • Pandas
  • Regular Expression
  • String
As we are combining two major technologies to develop the project, we divide the course into several stages of development for ease of understanding.
Stage -1: We will set up the project by doing the necessary installations and requirements.
  • Install Python
  • Install Dependencies
Stage -2: We will do data preparation. That is, we will extract text from images using Pytesseract and do the necessary cleaning.
  • Gather Images
  • Overview on Pytesseract
  • Extract Text from All Images
  • Clean and Prepare text
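The extraction and cleaning steps in this stage can be sketched roughly as below. This is a minimal sketch, not the course's exact code: the helper names (`clean_text`, `words_from_ocr`, `ocr_image`), the set of punctuation characters that are kept, and the confidence threshold of 30 are all assumptions made for illustration. The only library call relied on is `pytesseract.image_to_data` with `Output.DICT`, which returns one entry per detected word together with its bounding box and confidence.

```python
import string

# Assumption: these characters carry meaning on a business card (emails,
# URLs, phone numbers), so they are kept. Everything else found in
# string.punctuation is removed.
KEEP_CHARS = "@.-/:&"
_WHITESPACE = str.maketrans("", "", string.whitespace)
_PUNCTUATION = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP_CHARS)
)


def clean_text(text):
    """Remove whitespace and unwanted punctuation from one OCR token."""
    return str(text).translate(_WHITESPACE).translate(_PUNCTUATION)


def words_from_ocr(data, min_conf=30):
    """Turn a pytesseract-style dict into a list of cleaned word rows.

    `data` is expected to look like the result of
    pytesseract.image_to_data(..., output_type=Output.DICT).
    `min_conf` is a hypothetical threshold; tune it for your images.
    """
    rows = []
    for i, raw in enumerate(data["text"]):
        word = clean_text(raw)
        conf = float(data["conf"][i])  # conf may arrive as str, int or float
        if word and conf >= min_conf:
            rows.append({
                "text": word,
                "conf": conf,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            })
    return rows


def ocr_image(image_path):
    """Run Tesseract on one image (needs OpenCV, pytesseract and Tesseract)."""
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    return pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
```

With the dependencies installed, `words_from_ocr(ocr_image("card.jpg"))` (the file name is a placeholder) gives a list of rows that can be loaded into a Pandas DataFrame and saved for labeling.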
Stage -3: We will see how to label NER data using BIO tagging.
  • Manually Labeling with BIO technique
    • B – Beginning
    • I  –  Inside
    • O – Outside
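As an illustration of BIO tagging, the sketch below assigns a tag to every token given a list of entity spans. The function name `bio_tags` and the labels `NAME`, `DES` and `ORG` are hypothetical; in the course the labeling is done manually, and this only shows what the resulting tags look like.

```python
def bio_tags(tokens, entities):
    """Return one BIO tag per token.

    `entities` is a list of (start, end, label) token indices with `end`
    exclusive. The first token of an entity gets B-<label>, the following
    ones get I-<label>, and every other token gets O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["John", "Doe", "Sales", "Manager", "at", "Acme"]
entities = [(0, 2, "NAME"), (2, 4, "DES"), (5, 6, "ORG")]  # hypothetical labels
for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(f"{token}\t{tag}")
```

Here "John" is tagged B-NAME, "Doe" is tagged I-NAME, and "at" is tagged O because it belongs to no entity.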
Stage -4: We will further clean the text and preprocess the data to train the machine learning model.
  • Prepare Training Data for spaCy
  • Convert data into the spaCy format
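One common way to express NER training data for spaCy is a text plus a list of `(start_char, end_char, label)` entity offsets. The sketch below converts BIO-tagged tokens into that shape. It is an assumption-laden illustration: the function name `to_spacy_format` is hypothetical, tokens are joined with single spaces, and a B- tag followed by matching I- tags is merged into one span, which may differ from the convention used in the course.

```python
def to_spacy_format(tokens, tags):
    """Convert BIO-tagged tokens to (text, {"entities": [(start, end, label)]})."""
    text = " ".join(tokens)
    entities = []
    current = None  # [start_char, end_char, label] of the open entity
    pos = 0
    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        pos = end + 1  # skip the joining space
        if tag.startswith("B-"):
            if current:
                entities.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end  # extend the open entity
        else:
            # "O" tags, and I- tags with no matching B-, close the open entity
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return text, {"entities": entities}


tokens = ["John", "Doe", "Sales", "Manager", "at", "Acme"]
tags = ["B-NAME", "I-NAME", "B-DES", "I-DES", "O", "B-ORG"]
text, annotations = to_spacy_format(tokens, tags)
for start, end, label in annotations["entities"]:
    print(label, repr(text[start:end]))
```

Slicing the text with each offset pair should give back the entity text, which is a quick way to test the converted entities before training.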
Stage -5: With the preprocessed data, we will train the Named Entity Recognition model.
  • Configuring NER Model
  • Train the model
Stage -6: We will predict the entities using the NER model and create a data pipeline for parsing text.
  • Load Model
  • Render and Serve with Displacy
  • Draw Bounding Box on Image
  • Parse Entities from Text
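The parsing step at the end of this stage can be sketched as grouping the predicted BIO tags back into entities and cleaning each one according to its label. This is only a sketch under assumptions: the names `parse_entities` and `clean_by_label`, the labels `PHONE`, `EMAIL` and `WEB`, and the cleaning rules are hypothetical and would need adapting to the labels your model was trained on.

```python
import re

# Assumption: pieces of these entities are glued together without spaces.
JOIN_WITHOUT_SPACE = {"PHONE", "EMAIL", "WEB"}


def clean_by_label(token, label):
    """Apply a simple, label-specific cleanup to one token."""
    if label == "PHONE":
        return re.sub(r"\D", "", token)  # keep digits only
    if label in ("EMAIL", "WEB"):
        return token.lower().strip()
    return token.strip()


def parse_entities(tagged_tokens):
    """Group (token, bio_tag) pairs into {label: [entity text, ...]}."""
    entities = {}
    previous = "O"
    for token, tag in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if tag == "O" or not label:
            previous = "O"
            continue
        text = clean_by_label(token, label)
        if not text:
            continue
        values = entities.setdefault(label, [])
        if prefix == "I" and previous == label and values:
            joiner = "" if label in JOIN_WITHOUT_SPACE else " "
            values[-1] += joiner + text  # continue the current entity
        else:
            values.append(text)  # start a new entity
        previous = label
    return entities


predicted = [
    ("John", "B-NAME"), ("Doe", "I-NAME"), ("Tel:", "O"),
    ("(555)", "B-PHONE"), ("123-4567", "I-PHONE"),
    ("JOHN@Example.com", "B-EMAIL"),
]
print(parse_entities(predicted))
```

In a full pipeline, the `(token, tag)` pairs would come from the trained NER model rather than being written by hand as they are here.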
Finally, we will put it all together and create a document scanner app.
Are you ready?
Let's start developing the Artificial Intelligence project.

What you’ll learn?
  • Develop and Train Named Entity Recognition Model
  • Extract not only text from images but also entities from business cards
  • Develop a business card scanner like ABBYY from scratch
  • High-level data preprocessing techniques for natural language problems
  • Real-Time NER apps
Requirements:
  • At least beginner-level knowledge of Python
  • Understand aggregation techniques with Pandas DataFrames
  • Read and write images with OpenCV and draw rectangles on images
  • Understand HTML and Bootstrap
Course content
1. Introduction
—————
3. Facing any Issue with the Course? Here is the solution
—————
2. Project Setup
1. Install Python
—————
2. Install Virtual Environment
—————
3. Install Packages into Virtual Environment
—————
4. Install Tesseract OCR and Pytesseract
—————
5. Install spaCy
—————
6. Test that the packages are installed
—————
3. Data Preparation
1. Project Plan
—————
2. Load Business Card using OpenCV and PIL
—————
3. Pytesseract Extract text from Image
—————
4. Pytesseract Tesseract Error
—————
5. Pytesseract How Pytesseract will work
—————
6. Pytesseract Image to text to dataframe
—————
7. Pytesseract Clean Text in Dataframe
—————
8. Pytesseract Draw Bounding Box around each word
—————
9. Extract Text and Data from all Business Cards
—————
10. Save data in csv
—————
11. Labeling
—————
4. Data Preprocessing and Cleaning
1. Spacy Training Data Format
—————
2. Load Data and convert into Pandas DataFrame
—————
3. Cleaning Text
—————
4. Convert Data into spacy format
—————
5. Testing Entities
—————
6. Convert data into spacy format for all Business card text
—————
7. Splitting Data into Training and Testing Set
—————
5. Train Named Entity Recognition (NER) model
1. Spacy Fill the Configuration
—————
2. Spacy Prepare Data
—————
3. Spacy Train NER pipeline model
—————
4. Spacy Save NER Model
—————
6. Predictions
1. Import Required Libraries
—————
2. Clean Text Function
—————
3. Load Spacy NER Model
—————
4. Extract Text from Image and Convert into Data Frame
—————
5. Convert Data Frame into Content
—————
6. Get Named Entities from model
—————
7. Displacy render
—————
8. Tagging Each Word
—————
9. Join Label to tokens dataframe
—————
10. Join token dataframe with Pytesseract data
—————
11. Bounding Box and Tagging Predicted Entities
—————
12. Combine the BIO information
—————
13. Bounding Box
—————
14. Parsing Function
—————
15. Testing
—————
16. Parse Entities
—————
17. Predictions Function
—————
18. Final Prediction Pipeline
—————
7. Improve Model Performance
1. Ideas to Improve model accuracy
—————
2. Version-2 model framework Data Preprocessing
—————
3. Train Version 2 model
—————
4. Get Predictions from the model
—————
8. Document Scanner
2. What and Why Document Scanner in OpenCV
—————
3. Setup and Read Image
—————
4. Resize Image with same aspect ratio
—————
5. Edge Detection (Enhance, Blur and Canny) to Document
—————
6. Dilate Edges with morphological transform
—————
7. Find Four Point Contours (Identify Location of document)
—————
8. Apply Warp transform and crop only document
—————
9. Document Scanner Function Putting All together
—————
10. Magic Color to Image
—————
11. Integrate NER Predictions
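
Two small pieces of the document scanner section above can be illustrated without any image library: computing a new size that keeps the aspect ratio, and putting the four detected corners into a fixed order before the warp transform. This is a sketch with hypothetical helper names (`resize_dims`, `order_corners`), not the course's implementation. In an OpenCV pipeline, the size would typically be passed to `cv2.resize` and the ordered corners to `cv2.getPerspectiveTransform`, followed by `cv2.warpPerspective`.

```python
def resize_dims(width, height, target_width):
    """Return (new_width, new_height) that keeps the original aspect ratio."""
    scale = target_width / width
    return target_width, int(round(height * scale))


def order_corners(points):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left.

    Image coordinates are assumed, with y growing downward. The top-left
    corner has the smallest x + y and the bottom-right the largest; the
    top-right has the smallest y - x and the bottom-left the largest.
    """
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    top_right = min(points, key=lambda p: p[1] - p[0])
    bottom_left = max(points, key=lambda p: p[1] - p[0])
    return [top_left, top_right, bottom_right, bottom_left]


print(resize_dims(4000, 3000, 640))
print(order_corners([(300, 320), (20, 30), (310, 25), (15, 300)]))
```

Ordering the corners consistently matters because the warp maps each source corner to a specific corner of the output rectangle; a shuffled order gives a flipped or twisted result.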

Source: https://www.udemy.com/course/business-card-reader-app/