What is NER?
Named Entity Recognition (NER) is a Natural Language Processing technique used to extract named entities from a given text and classify them under pre-defined classes. Put simply, NER extracts entities such as person names, location names, company names, etc. from a given text. NER plays an important role in information retrieval.
How does NER work?
After reading a text, humans can naturally recognize common entities such as a person's name, a date and so on. To do the same with computers, we have to help the computer learn the task. For this we can draw on Natural Language Processing (NLP) and Machine Learning (ML). The role of NLP is to enable the computer to read text, communicate with humans, understand their sentiments and interpret them by knowing the patterns and rules of language. The role of ML is to help machines learn and improve over time.
Just as we define a heartbeat as a two-part pumping action, we can define the working of NER as a two-step process: 1. Identify the named entity. 2. Categorize the named entity.
Let us take an example.
An NER algorithm can highlight and extract particular entities from a given text.
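The two steps described earlier (identify, then categorize) can be illustrated with a toy rule-based sketch. This is not how spaCy's statistical model works; the lookup table, the year pattern and the function names are invented purely for illustration.

```python
import re

# Hypothetical lookup table and pattern, invented for this illustration only.
KNOWN_ENTITIES = {
    "Anna University": "SCHOOL",
    "Chennai": "LOCATION",
}
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def identify(text):
    """Step 1: find candidate entity spans as (start, end) character offsets."""
    spans = []
    for name in KNOWN_ENTITIES:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end()))
    for match in YEAR_PATTERN.finditer(text):
        spans.append((match.start(), match.end()))
    return sorted(spans)


def categorize(text, span):
    """Step 2: assign a class to one identified span."""
    surface = text[span[0]:span[1]]
    # Anything not in the lookup table was matched by the year pattern.
    return KNOWN_ENTITIES.get(surface, "DATE")


def toy_ner(text):
    return [(text[s:e], categorize(text, (s, e))) for s, e in identify(text)]


print(toy_ner("She graduated from Anna University, Chennai in 2019."))
# [('Anna University', 'SCHOOL'), ('Chennai', 'LOCATION'), ('2019', 'DATE')]
```

A trained NER model replaces both the lookup table and the pattern with what it has learned from annotated examples, which is what lets it handle names it has never seen.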
The spaCy library allows us to train NER either by updating an existing model for a specific context or by training a fresh NER model. In this article we explore how to build a custom NER model to extract education details from resume data.
Building a custom NER model
Importing necessary libraries
Like performing rituals before kick-starting a new project, we have to import the necessary libraries.
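A minimal set of imports for the steps in this article might look like the following. It assumes spaCy v3, where the Example class lives in spacy.training; tqdm, used later for progress bars, is optional and left out here.

```python
import random  # shuffling the training data between iterations

import spacy  # provides the blank 'en' model and the NER component
from spacy.training import Example  # wraps one training instance (spaCy v3)

print(spacy.__version__)
```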
Training Data
First we need to define entity categories such as Degree, School name, Location, Percentage and Date, and feed the NER model with relevant training data.
spaCy accepts the training data in the form of tuples containing the text and a dictionary. The dictionary should contain the start and end character indices of each named entity in the text, along with the category of that entity.
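As a sketch of that format, here are two invented resume lines with hand-labelled entities. The sentences and the label names (DEGREE, SCHOOL, LOCATION, DATE, PERCENTAGE) are assumptions made for illustration. The loop prints each labelled slice, which is a quick way to catch off-by-one errors in the indices.

```python
# Each item is (text, {"entities": [(start_char, end_char, label), ...]}).
# The end index is exclusive, exactly like Python slicing.
TRAIN_DATA = [
    (
        "B.Tech from Anna University, Chennai in 2019 with 82%",
        {
            "entities": [
                (0, 6, "DEGREE"),
                (12, 27, "SCHOOL"),
                (29, 36, "LOCATION"),
                (40, 44, "DATE"),
                (50, 53, "PERCENTAGE"),
            ]
        },
    ),
    (
        "M.Sc from Delhi University, Delhi in 2021 with 75%",
        {
            "entities": [
                (0, 4, "DEGREE"),
                (10, 26, "SCHOOL"),
                (28, 33, "LOCATION"),
                (37, 41, "DATE"),
                (47, 50, "PERCENTAGE"),
            ]
        },
    ),
]

# Sanity check: show the text that every annotation actually covers.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(f"{label:<10} {text[start:end]!r}")
```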
Creating Blank Model
The very first baby step in building a custom model is to create a blank ‘en’ model. This blank model is built to carry out the NER process.
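In spaCy this step is a single call. A blank model contains only the language's tokenizer and vocabulary, so its list of pipeline components starts out empty.

```python
import spacy

# Create an empty English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")

print(nlp.lang)        # en
print(nlp.pipe_names)  # []
```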
Pipeline Set-up
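The next step is to set up the pipeline with only NER using the create_pipe function. Note that creating the component with nlp.create_pipe("ner") and then passing the object to nlp.add_pipe() is the spaCy v2 pattern; in spaCy v3, which the Example class used during training belongs to, the component is added by name with nlp.add_pipe("ner").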
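A minimal sketch of the pipeline set-up, assuming spaCy v3, where a built-in component is added by its string name and the call returns the component object:

```python
import spacy

nlp = spacy.blank("en")

# Add the entity recognizer as the only pipeline component (spaCy v3 style).
if "ner" not in nlp.pipe_names:
    ner = nlp.add_pipe("ner")
else:
    ner = nlp.get_pipe("ner")

print(nlp.pipe_names)  # ['ner']
```

The membership check is not needed for a blank model, but it keeps the same snippet usable when starting from an existing pipeline that already has an ‘ner’ component.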
Training the model
Before starting to train the model, we have to add the categories of the named entities (labels) to ‘ner’ using the ner.add_label() method. We then disable the pipeline components other than ‘ner’ using the nlp.disable_pipes() method, since these components should not be affected by training.
To train the ‘ner’ model, we loop over the training data for a sufficient number of iterations. For that, we use n_iter, which is set to 100. To ensure that the model does not make generalizations based on the order of the examples, we shuffle the training data randomly before every iteration using the random.shuffle() function.
We use the tqdm() function to display progress bars. The Example class holds the information for one training instance. It stores two objects, one holding the predictions of the pipeline and the other holding the reference data. The Example.from_dict(doc, annotations) method constructs an Example object from the predicted document (doc) and the reference annotations provided as a dictionary (annotations). The nlp.update() function is then used to train the recognizer.
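Put together, the training steps above might look like the sketch below. It assumes spaCy v3 and uses a tiny invented dataset. The dropout value is an arbitrary choice, nlp.select_pipes() is used as the v3 name for the disable_pipes step, and the tqdm progress bar is left out to keep dependencies minimal.

```python
import random

import spacy
from spacy.training import Example

# Tiny invented dataset; real training needs many more annotated lines.
TRAIN_DATA = [
    ("B.Tech from Anna University, Chennai in 2019 with 82%",
     {"entities": [(0, 6, "DEGREE"), (12, 27, "SCHOOL"), (29, 36, "LOCATION"),
                   (40, 44, "DATE"), (50, 53, "PERCENTAGE")]}),
    ("M.Sc from Delhi University, Delhi in 2021 with 75%",
     {"entities": [(0, 4, "DEGREE"), (10, 26, "SCHOOL"), (28, 33, "LOCATION"),
                   (37, 41, "DATE"), (47, 50, "PERCENTAGE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")

# Register every entity category that appears in the annotations.
for _text, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)


def make_examples():
    """Build one Example per training item from a fresh, unannotated doc."""
    return [Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in TRAIN_DATA]


n_iter = 100
random.seed(0)  # only to make this sketch repeatable
other_pipes = [name for name in nlp.pipe_names if name != "ner"]

# Disable everything except 'ner' for the duration of training.
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.initialize(get_examples=make_examples)
    for iteration in range(n_iter):
        random.shuffle(TRAIN_DATA)  # avoid learning the order of examples
        losses = {}
        for text, annotations in TRAIN_DATA:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.2, losses=losses, sgd=optimizer)
        if iteration % 20 == 0:
            print(iteration, losses)
```

The losses dictionary is reset at the start of each iteration, so every printed value is the loss accumulated over one pass through the data.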
Training Losses
Saving the model
Save the trained model to the location given by the output_dir variable, and export the model as a .pkl file.
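One way to sketch the saving step is with spaCy's own directory-based serialization, nlp.to_disk() and spacy.load(). The pickle export mentioned above is not shown here. The temporary directory, the folder name and the single training line are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example

# One invented training line, just enough to have a model worth saving.
TRAIN_DATA = [
    ("B.Tech from Anna University, Chennai in 2019 with 82%",
     {"entities": [(0, 6, "DEGREE"), (12, 27, "SCHOOL"), (29, 36, "LOCATION"),
                   (40, 44, "DATE"), (50, 53, "PERCENTAGE")]}),
]

nlp = spacy.blank("en")
nlp.add_pipe("ner")
examples = [Example.from_dict(nlp.make_doc(text), annotations)
            for text, annotations in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)  # labels are read from the examples
for _ in range(10):
    nlp.update(examples, sgd=optimizer)

# Write the pipeline to a directory, then load it back as a new object.
output_dir = Path(tempfile.mkdtemp()) / "education_ner"
nlp.to_disk(output_dir)
reloaded = spacy.load(output_dir)

print(reloaded.pipe_names)
print(reloaded.get_pipe("ner").labels)
```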
Testing the trained model
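A self-contained sketch of the test step is shown below, assuming spaCy v3. The train_ner helper, the training lines and the test sentence are all invented for illustration. With so little data the predictions are not expected to be reliable, so no particular result is claimed here.

```python
import random

import spacy
from spacy.training import Example

TRAIN_DATA = [
    ("B.Tech from Anna University, Chennai in 2019 with 82%",
     {"entities": [(0, 6, "DEGREE"), (12, 27, "SCHOOL"), (29, 36, "LOCATION"),
                   (40, 44, "DATE"), (50, 53, "PERCENTAGE")]}),
    ("M.Sc from Delhi University, Delhi in 2021 with 75%",
     {"entities": [(0, 4, "DEGREE"), (10, 26, "SCHOOL"), (28, 33, "LOCATION"),
                   (37, 41, "DATE"), (47, 50, "PERCENTAGE")]}),
]


def train_ner(train_data, n_iter=30):
    """Hypothetical helper: train a blank English NER model on train_data."""
    nlp = spacy.blank("en")
    nlp.add_pipe("ner")
    examples = [Example.from_dict(nlp.make_doc(text), annotations)
                for text, annotations in train_data]
    optimizer = nlp.initialize(lambda: examples)
    for _ in range(n_iter):
        random.shuffle(examples)
        for example in examples:
            nlp.update([example], drop=0.2, sgd=optimizer)
    return nlp


random.seed(0)
nlp = train_ner(TRAIN_DATA)

# Run the trained pipeline on unseen text and list what it found.
test_text = "B.Sc from Madras University, Chennai in 2018 with 79%"
doc = nlp(test_text)
found = [(ent.text, ent.label_) for ent in doc.ents]
for entity_text, label in found:
    print(f"{label:<10} {entity_text}")
```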
The output lists each entity the model recognized in the test text together with its predicted category.
Use Cases
- Extracting structure from unstructured text data: entity extraction, such as education and other professional information from resumes.
- Recommendation systems: NER can aid recommendation algorithms by extracting entities from one document and storing these entities in a relational database. Data science teams can then create tools to recommend other documents that mention similar entities.
- Customer support: NER can be used to categorize the complaints registered by customers and assign them to the relevant department within the organization.
- Efficient search algorithms: NER can be run on all documents to extract entities, which are then stored separately. The next time a user searches for a term, that search term is matched against the smaller list of entities in each document, which leads to faster search execution.
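The search use case can be sketched as a small inverted index from entity to documents. The document ids and entity lists below are hypothetical stand-ins for the output of an NER pass, and the function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical result of running NER over three resumes.
DOC_ENTITIES = {
    "resume_1": ["Anna University", "Chennai", "B.Tech"],
    "resume_2": ["Delhi University", "Delhi", "M.Sc"],
    "resume_3": ["Anna University", "Chennai", "M.Tech"],
}


def build_entity_index(doc_entities):
    """Map each lower-cased entity to the set of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity.lower()].add(doc_id)
    return index


def search(index, term):
    """Look the term up among stored entities instead of scanning full text."""
    return sorted(index.get(term.lower(), set()))


index = build_entity_index(DOC_ENTITIES)
print(search(index, "Anna University"))  # ['resume_1', 'resume_3']
print(search(index, "Delhi"))            # ['resume_2']
```

The same index also supports the recommendation idea: two documents that share many entity keys are candidates for being recommended together.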
Pros
- A spaCy NER model can learn quickly from only a few lines of annotated data. The more training data it is given, the better the model performs.
- Many open-source annotation tools are available for creating training data for spaCy NER models.
Cons
- Ambiguity and abbreviations: one of the major challenges in identifying named entities is language itself. Recognizing words that can have multiple meanings is difficult.
- Words that are not used very frequently these days, such as some person names and location names, are another major challenge.
Conclusion
For entity extraction from resumes, we prefer a custom NER model over a pre-trained one. This is because a pre-trained NER model has only common categories such as PERSON, ORG and GPE. When we build a custom NER model, we can define our own set of categories suited to the context we are working in.