Notes on "A Survey on Deep Learning for Named Entity Recognition"


Li, Jing, et al. “A survey on deep learning for named entity recognition.” IEEE Transactions on Knowledge and Data Engineering 34.1 (2020): 50-70.

  • We delve into the application of deep learning techniques in NER to inspire and guide researchers and practitioners in the field.
  • We compile NER corpora and ready-to-use NER systems (from academia and industry) in tabular form to provide useful resources to the NER research community.
  • We propose a new taxonomy that systematically organizes DL-based NER methods along three axes: distributed representations of inputs, context encoders (for capturing contextual dependencies for tag decoding), and tag decoders (for predicting word labels in a sequence).
  • Additionally, we survey the most representative DL techniques recently applied in novel NER problem settings and applications.
  • Finally, we present the challenges faced by NER systems and outline future directions in the field.

Background

Techniques for NER:

  1. Rule-based methods that do not require annotated data and rely on handcrafted rules.
  2. Unsupervised learning methods relying on algorithms without labeled training examples.
  3. Feature-based supervised learning methods using supervised algorithms with careful feature engineering.
  4. Deep learning-based methods that automatically learn representations in an end-to-end manner from raw input.

What is NER?

A named entity is a word or phrase that distinctly identifies an item from a group of entities with similar attributes.
Examples include organization, person, and location names in general domains; and genes, proteins, drugs, and disease names in biomedical domains.

NER is the process of locating and classifying named entities in text into predefined categories.

For example, given a sequence \(s = \langle w_1, w_2, \ldots, w_N \rangle\), NER outputs a list of tuples \(\langle I_s, I_e, t \rangle\), where \(I_s\) and \(I_e\) are the start and end indices of a named entity mention and \(t\) is its type from a predefined set of categories.
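
To make the definition concrete, here is a minimal sketch (with a made-up sentence and 1-based word indices) of how the output tuples pick out entity mentions:

```python
# A made-up example of the tuple formulation above. Indices are 1-based
# word positions, matching the definition of I_s and I_e.
sentence = ["Michael", "Jordan", "was", "born", "in", "Brooklyn"]

# Each entity is a tuple (I_s, I_e, t): start index, end index, entity type.
entities = [
    (1, 2, "PERSON"),    # "Michael Jordan"
    (6, 6, "LOCATION"),  # "Brooklyn"
]

for i_s, i_e, t in entities:
    print(" ".join(sentence[i_s - 1 : i_e]), "->", t)
```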

NER Evaluation Metrics

Exact Match Evaluation

NER involves identifying entity boundaries and types. In “exact match evaluation,” both must be correct for the entity to be considered correctly recognized.

  • True Positive (TP): Entities identified by the NER system that match the ground truth.
  • False Positive (FP): Entities identified by the NER system that do not match the ground truth.
  • False Negative (FN): Entities in the ground truth not recognized by the NER system.

Precision measures the ability of the NER system to return only correct entities

\[Precision = \frac{TP}{TP+FP}\]

Recall measures the system’s ability to identify all entities in the corpus

\[Recall = \frac{TP}{TP+FN}\]

F-score is the harmonic mean of Precision and Recall

\[F\text{-}score = 2\times \frac{Precision \times Recall}{Precision + Recall}\]
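
Putting the three metrics together, here is a minimal sketch of exact-match evaluation over entity tuples (the gold and predicted sets below are made up):

```python
# Exact-match evaluation: an entity counts as correct only if boundaries
# AND type both match the ground truth. Entities are (I_s, I_e, t) tuples;
# the two sets below are illustrative.
gold = {(1, 2, "PERSON"), (6, 6, "LOCATION"), (9, 10, "ORGANIZATION")}
pred = {(1, 2, "PERSON"), (6, 6, "ORGANIZATION"), (12, 12, "LOCATION")}

tp = len(gold & pred)   # exact matches
fp = len(pred - gold)   # predicted, but not in the ground truth
fn = len(gold - pred)   # in the ground truth, but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
```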

As most NER systems involve multiple entity types, evaluation across all types is necessary. Two methods:

  1. Macro-averaged F-score: Compute F-score per type and average equally.
  2. Micro-averaged F-score: Aggregate all entities from all types before computing. May be biased by performance on large entity classes.
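
The sketch below, with illustrative per-type counts, shows how the two averages diverge when one entity class dominates:

```python
# Macro- vs micro-averaged F-score over per-type counts (illustrative).
# Micro-averaging pools TP/FP/FN across types, so frequent types dominate;
# macro-averaging weights every type equally.
counts = {  # type -> (TP, FP, FN)
    "PERSON": (90, 10, 10),    # large, well-recognized class
    "DISEASE": (5, 15, 15),    # small, poorly-recognized class
}

def f1(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r) if p + r else 0.0

macro = sum(f1(*c) for c in counts.values()) / len(counts)
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro = f1(tp, fp, fn)
print(f"macro-F1={macro:.3f}  micro-F1={micro:.3f}")  # micro is pulled up by PERSON
```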

Lenient Match Evaluation

Under lenient match, a correct type is credited as long as the predicted entity overlaps the ground-truth entity, regardless of exact boundaries; conversely, correct boundary detection is credited regardless of the assigned type.

Traditional NER Approaches

  1. Rule-based methods
  2. Unsupervised learning
  3. Feature-based supervised learning

Rule-Based

Rule-based NER systems rely on hand-crafted rules, often built from domain-specific dictionaries and syntactic patterns.

Examples:

  • Brill rule inference method with POS taggers
  • ProMiner for biomedical text using synonym dictionaries
  • Dictionary-based methods in EHRs

Systems like LaSIE-II, NetOwl, Facile, SAR, FASTUS, LTG rely on extensive handcrafted lexicons. They typically yield high precision but low recall and are hard to transfer across domains.

Unsupervised

A typical approach is clustering based on contextual similarity in large corpora.

  • Collins et al. reduced supervision to 7 seed rules
  • KNOWITALL used predicate names and extraction patterns
  • Nadeau et al. combined extraction and disambiguation heuristics
  • Zhang and Elhadad used IDF, context vectors, and shallow parsing

Feature-Based Supervised Learning

NER is modeled as a multi-class classification or sequence labeling task using crafted features.

Common algorithms:

  1. Hidden Markov Models (HMM)
  2. Decision Trees
  3. Maximum Entropy Models
  4. Support Vector Machines (SVM)
  5. Conditional Random Fields (CRF)

Examples:

  • IdentiFinder (HMM-based)
  • Szarvas et al. (C4.5 + AdaBoost)
  • MENE (Maximum Entropy)
  • McNamee and Mayfield (SVMs)
  • McCallum and Li (CRFs)
  • Krishnan and Manning (coupled CRFs in two-stage system)

Deep Learning in NER

Three main advantages:

  1. Nonlinear transformations provide powerful input-output mappings.
  2. Reduces engineering effort in feature design.
  3. Enables end-to-end training via gradient descent.

DL-based NER models include three key components:

  1. Input distributed representations: word-level and character-level embeddings, possibly with added features.
  2. Context encoders: CNNs, RNNs, or Transformers to capture contextual dependencies.
  3. Tag decoders: Predict output labels.

Distributed Representations

Word-Level Embedding

Word embeddings are mappings

\[f: X \rightarrow Y\]

that are injective and structure-preserving: each word in the vocabulary \(X\) is mapped to a low-dimensional real-valued vector in \(Y\) so that words with similar meanings lie close together.

Popular embeddings: Word2Vec (Google), GloVe (Stanford), fastText (Facebook), SENNA.
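
In practice a word embedding is a lookup table. A minimal sketch follows, with random vectors standing in for pretrained Word2Vec/GloVe weights and a hypothetical four-word vocabulary:

```python
# Word-embedding lookup table. The random matrix stands in for pretrained
# weights (Word2Vec, GloVe, ...); the vocabulary here is hypothetical.
import numpy as np

vocab = {"<unk>": 0, "cancer": 1, "gene": 2, "protein": 3}
dim = 50
embeddings = np.random.randn(len(vocab), dim)  # rows would be pretrained

def embed(word: str) -> np.ndarray:
    """Map a word to its vector; unknown words fall back to <unk>."""
    return embeddings[vocab.get(word, vocab["<unk>"])]

print(embed("gene").shape)  # (50,)
```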

Character-Level Embedding

Character-level embeddings capture subword information such as prefixes and suffixes, and naturally handle out-of-vocabulary tokens.

Architectures:

  • CNN-based
  • RNN-based (LSTM, GRU)
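
As a minimal sketch of the CNN-based option (hyperparameters are illustrative, not from any particular paper): characters are embedded, convolved, and max-pooled into a single vector per word.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    """Character-level CNN: one fixed-size vector per word."""

    def __init__(self, n_chars=100, char_dim=30, n_filters=50, kernel=3):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=1)

    def forward(self, char_ids):                  # (batch, word_len)
        x = self.emb(char_ids).transpose(1, 2)    # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))              # (batch, n_filters, word_len)
        return x.max(dim=2).values                # max-pool over characters

word = torch.randint(0, 100, (1, 8))  # one 8-character word (random char ids)
print(CharCNN()(word).shape)          # torch.Size([1, 50])
```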

Hybrid Representations

Combining DL representations with traditional features may improve performance, though at a cost to generalizability.

Context Encoders

Common architectures:

  • CNNs
  • RNNs
  • Recursive NNs
  • Transformers
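
As a concrete instance of the RNN option above, the sketch below runs a bidirectional LSTM over (random) word representations, yielding one context-aware vector per token; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

words = torch.randn(1, 6, 100)  # (batch, seq_len, input_dim): word representations
encoder = nn.LSTM(100, 128, bidirectional=True, batch_first=True)
context, _ = encoder(words)     # (1, 6, 256): forward and backward states concatenated
print(context.shape)
```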

Neural Language Models

For a token sequence \(t_1, t_2, ..., t_N\), the forward LM:

\[p(t_1, ..., t_N) = \prod_{k=1}^N p(t_k|t_1, ..., t_{k-1})\]

Backward LM:

\[p(t_1, ..., t_N) = \prod_{k=1}^N p(t_k|t_{k+1}, ..., t_N)\]

Forward and backward outputs are combined as context-aware embeddings.
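
The forward factorization can be applied directly; in the sketch below, dummy per-step conditionals stand in for a trained language model's outputs:

```python
import math

# p(t_k | t_1, ..., t_{k-1}) for each position k, as a trained LM would produce
step_probs = [0.2, 0.5, 0.1, 0.4]

# Summing log-probabilities is the numerically stable form of the product
log_p = sum(math.log(p) for p in step_probs)
print(f"p(t_1..t_N) = {math.exp(log_p):.4f}")  # 0.2 * 0.5 * 0.1 * 0.4 = 0.0040
```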

Deep Transformers

Transformers (Vaswani et al.) dispense with recurrence and convolution entirely, relying on stacked self-attention and point-wise feed-forward layers. They achieve higher quality while being more parallelizable and faster to train.

Tag Decoders

Final NER stage: tag decoders output the label sequence from encoded context representations.

Four architectures:

  1. MLP + Softmax
  2. CRF
  3. RNN
  4. Pointer Networks

MLP + Softmax: Treats tagging as a multi-class classification problem, predicting each word's label independently of its neighbors.

CRF: Models label dependencies across the whole sequence and is the most common tag decoder in DL-based NER, though it cannot fully exploit segment-level information.
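
At inference time a linear-chain CRF decoder recovers the highest-scoring label sequence with the Viterbi algorithm; a minimal sketch (with random placeholder scores) follows:

```python
import numpy as np

def viterbi(emissions: np.ndarray, trans: np.ndarray) -> list[int]:
    """emissions[i][y]: score of tag y at position i; trans[y][y2]: score of y -> y2."""
    n, k = emissions.shape
    score = emissions[0].copy()            # best score of a path ending in each tag
    back = np.zeros((n, k), dtype=int)     # backpointers
    for i in range(1, n):
        cand = score[:, None] + trans + emissions[i][None, :]  # (k, k)
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):          # follow backpointers
        tags.append(int(back[i][tags[-1]]))
    return tags[::-1]

emissions = np.random.randn(5, 3)  # 5 tokens, 3 tags (placeholder scores)
trans = np.random.randn(3, 3)
print(viterbi(emissions, trans))   # e.g., [2, 0, 1, 1, 0]
```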

RNN decoders: Generate the tag sequence greedily, token by token; reported to train faster than CRFs and to perform well when the number of entity types is large.

Pointer Networks: Use RNNs with attention-like “pointers” to positions in the input sequence.

Recent DL Techniques in NER

Deep Multi-Task Learning for NER

Multi-task learning (MTL) jointly trains related tasks to leverage shared representations. Applications: joint entity and relation extraction, or splitting NER into segmentation + classification sub-tasks.

Deep Transfer Learning for NER

Transfer learning leverages knowledge from a source domain to perform tasks in a target domain. Also called domain adaptation.

Deep Active Learning for NER

Active learning selects informative samples to label, reducing annotation cost. Useful for DL models that require large training datasets.

Deep Reinforcement Learning for NER

RL is inspired by behavioral psychology. An agent interacts with the environment to learn optimal actions by maximizing cumulative reward.

Key components:

  • State transition function
  • Observation function
  • Reward function
  • Agent’s own state update function and policy

Deep Adversarial Networks for NER

Adversarial training aims to improve robustness or reduce test error by exposing models to adversarial examples. GANs involve:

  • Generator: learns mapping from latent space to data
  • Discriminator: distinguishes between generated and real data

Neural Attention in NER

The attention mechanism is loosely inspired by human visual attention: it allows a model to focus on the most informative regions of the input. NER models can thus learn to attend to contextually important tokens.
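
A minimal dot-product attention sketch (all vectors and shapes are illustrative): a query scores every token representation, softmax turns scores into weights, and the attended context is the weighted sum.

```python
import numpy as np

tokens = np.random.randn(6, 64)   # six token representations
query = np.random.randn(64)       # e.g., the representation of the current token

scores = tokens @ query                    # one relevance score per token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # softmax attention weights
context = weights @ tokens                 # weighted sum, shape (64,)
print(weights.round(2), context.shape)
```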