Check Factual Consistency of Auto-generated Summaries Using Masked Summarization

Key information in source texts and reference summaries is masked to artificially generate factually inconsistent summaries


by Sue Hyun Park

The world is flooded with textual content, and we want to find the most important information quickly. Automatic text summarization systems have helped us rapidly browse news articles, customer reviews, financial documents, and more. With advances in neural text generation, instead of simply concatenating selected sentences from the source text, a system can generate paraphrases to construct a more coherent, informative summary. This approach is called abstractive summarization.

Extractive summarization copies key phrases verbatim, while abstractive summarization paraphrases the document. (Source: AWS blog)

Because an abstractive summarization system generates sentences that differ from the original ones, the outputs are often (in nearly 30% of cases) factually inconsistent. A model may misrepresent information present in the source text (an intrinsic error) or add information not inferable from the source text (an extrinsic error).

Types of errors in abstractive summarization (Based on Maynez et al., 2020)

Since a single error can sharply decrease the reliability of a generated summary, it is crucial to check the factual consistency of abstractive summaries. A recently developed approach is to train a factual consistency classifier on two sets of summaries: positive (factually consistent) and negative (factually inconsistent). Positive summaries are readily available from existing text summarization datasets. The challenge lies in generating quality negative summaries, because the injected factual inconsistency should not be readily apparent.

In this blog post, we introduce Mask-and-Fill with Masked Article (MFMA), a novel method for generating plausible but factually inconsistent summaries. Experiments on seven benchmark datasets demonstrate that factual consistency classifiers trained on negative summaries generated with our method mostly outperform existing models, and they show a competitive correlation with human judgment. Our paper was accepted to the Findings of NAACL 2022.

How to Improve the Mask-and-Fill (MF) Method

Our negative summary generation method builds on the previous Mask-and-Fill (MF) approach. Suppose an article (i.e., a source text) and a positive summary of the article are given. The MF model masks salient information in the positive summary and lets a masked language model predict and fill in each masked span to generate a negative summary. Words with specific part-of-speech tags (such as verbs, nouns, proper nouns, and numbers), entities, or a randomly selected sentence are the candidates for masking.

Example of a negative summary generated by the MF model. Spans that are highlighted are masked when generating the negative summary. Red spans are factually inconsistent with the given article and blue spans are factually consistent.
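To make the procedure concrete, here is a minimal Python sketch of the mask-and-fill idea. It is a simplification, not the paper's implementation: it masks only the first named entity with a single mask token, and it takes the masked LM's second-ranked prediction so the fill does not simply restore the original word.

```python
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def mf_negative_summary(summary: str) -> str:
    """Replace one salient entity with a masked-LM prediction (sketch)."""
    doc = nlp(summary)
    if not doc.ents:
        return summary
    ent = doc.ents[0]  # mask the first entity, for illustration only
    mask = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    masked = summary[:ent.start_char] + mask + summary[ent.end_char:]
    # Take the second-ranked prediction so we don't just restore the original.
    candidates = fill_mask(masked)
    return masked.replace(mask, candidates[1]["token_str"])
```

Because the masked LM sees only the summary, nothing ties its fill to the article, which is exactly the weakness discussed next.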

However, the MF model's outputs are often barely relevant to the article, and this unnaturalness makes them too easy to discern as negatives. Based on this finding, we identify two conditions an effectively augmented negative summary should satisfy:

  1. guarantee of inconsistency: the generated negative summaries should indeed be inconsistent with the source article.
  2. relevance to the source article: the generated negative summaries should include content related to the article.

1. Mask-and-Fill with Masked Article (MFMA)

To generate negative summaries that remain related to the article, we propose a new model: Mask-and-Fill with Masked Article (MFMA). The MFMA model generates a negative summary from both a masked article and a masked positive summary. The difference from the MF model is that MFMA additionally masks the source article and trains a summarizer model to reconstruct the masked positive summary with reference to the masked article. The key to increasing relevance is to feed the article's context into the generation process.

Example of a negative summary generated by our MFMA model

We treat noun phrases and entities as salient information and mask a portion of them in both articles and positive summaries, using spaCy to extract them. Then, we concatenate the masked article and the masked summary, prepending a prefix token to each input text. Next, we train a summarizer based on an encoder-decoder model, BART, to reconstruct the original positive summary. Finally, the trained summarizer generates negative summaries at inference time.

Overall flow of our proposed MFMA method
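Below is a hedged sketch of this input construction. The "article:" / "summary:" prefix tokens, the use of spaCy entities and noun chunks as salient spans, and the simple overlap filter are all assumptions for illustration; the paper's exact formatting may differ. The resulting (masked input, original summary) pairs would then fine-tune BART with the standard sequence-to-sequence loss.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")
MASK = "<mask>"  # BART's mask token

def mask_salient_spans(text: str, ratio: float) -> str:
    """Mask a given portion of the entities and noun phrases in `text`."""
    doc = nlp(text)
    spans = sorted(list(doc.ents) + list(doc.noun_chunks),
                   key=lambda s: (s.start_char, -s.end_char))
    # Drop spans overlapping an earlier one (a simplification).
    kept, end = [], -1
    for s in spans:
        if s.start_char >= end:
            kept.append(s)
            end = s.end_char
    chosen = random.sample(kept, int(len(kept) * ratio))
    # Replace from right to left so character offsets stay valid.
    for s in sorted(chosen, key=lambda s: s.start_char, reverse=True):
        text = text[:s.start_char] + MASK + text[s.end_char:]
    return text

def mfma_input(article: str, summary: str,
               article_ratio: float = 0.6, summary_ratio: float = 0.8) -> str:
    # Prefix tokens mark which segment is which; the exact tokens
    # used here are illustrative assumptions.
    return (f"article: {mask_salient_spans(article, article_ratio)} "
            f"summary: {mask_salient_spans(summary, summary_ratio)}")
```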

2. Masked Summarization (MSM)

As a variant of MFMA, our Masked SuMmarization (MSM) model generates a negative summary using only the masked article. We use T5 as the summarizer instead of BART due to better empirical performance; otherwise, the overall process is similar to that of MFMA. Since the MSM model is not bound to the structure of a masked positive summary, it can create more diverse forms of negative summaries.
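Reusing mask_salient_spans from the sketch above, the MSM input simply drops the masked summary; the "summarize:" prefix is T5's conventional task prefix, assumed here for illustration.

```python
# MSM (sketch): the T5 summarizer sees only the masked article and must
# produce a summary from it, so the output is unconstrained by any
# reference summary's structure.
def msm_input(article: str, article_ratio: float = 0.6) -> str:
    return f"summarize: {mask_salient_spans(article, article_ratio)}"
```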

Training the Factual Consistency Checking Model

A binary classifier trained on positive summaries and on negative summaries generated by our MFMA or MSM model serves as the final factual consistency checking model. For implementation, each summary is concatenated with its corresponding article and fed into the classification model, ELECTRA, as a single input.
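As a minimal sketch of this interface, the snippet below pairs a summary with its article for a HuggingFace ELECTRA classification head. The checkpoint and the label order are assumptions, and the head would of course need to be fine-tuned on the positive/negative pairs before its scores are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "google/electra-base-discriminator"  # checkpoint choice is assumed
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

def consistency_score(summary: str, article: str) -> float:
    # The summary and article are encoded as a single sentence pair.
    inputs = tokenizer(summary, article, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability of the "consistent" class (label order is an assumption).
    return torch.softmax(logits, dim=-1)[0, 1].item()
```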

Our pre-trained model and code are released here.

Experiments

We used the CNN/Daily Mail (CNN/DM) dataset to train and test our negative summary generators, MFMA and MSM. For evaluation, we used seven benchmark datasets that provide human judgments of the factual consistency of abstractive summaries; the summaries in these benchmarks are generated from either the CNN/DM dataset or the BBC eXtreme Summarization (XSum) dataset. We compared against baseline metrics based on entailment, question answering, n-gram similarity, and other methods.

Classification Accuracy

The macro F1 score of our proposed MFMA model outperforms the baseline entailment metrics on five of the seven benchmark datasets. Our proposed MSM model shows competitive performance on the CNN/DM benchmarks but falls behind MFMA. This is because the MSM model does not follow the informational guidance of positive summaries and thus often produces noisy samples.

Macro F1 score and class-balanced accuracy on human-annotated factual consistency for the benchmark datasets based on CNN/DM
Macro F1 score and class-balanced accuracy on human-annotated factual consistency for the benchmark datasets based on XSum

Correlation with Human Judgments

Our MFMA model also exhibits a high correlation with human judgments, even when compared with general metrics that do not operate at the classification level.

Summary-level Pearson correlation (r) and Spearman correlation (ρ) between various automatic metrics and human judgments of factual consistency for model-generated summaries

Masked Ratio

We demonstrate a tradeoff in adjusting the article and summary masking ratios for generating negative summaries. Too high a masking ratio decreases the generated negative summary's relevance to the source article, while too low a masking ratio leads the model to generate summaries that are in fact consistent, i.e., positive. The optimal masking ratio combination is 0.6 for articles and 0.8 for summaries.

Validation performance across masking ratios for MFMA
Negative summaries generated for an example article in the CNN/DM dataset. For MFMA, the summary masking ratio is 0.6.

Wrapping Up

Misinformation is fatal to the reliability of automatically generated summaries. Since summaries generated in an abstractive way are prone to containing incorrect statements, evaluating the factual consistency of a summarization system is a prerequisite for safe deployment. In this regard, we propose an effective method for generating factually inconsistent summaries to train factual consistency classifiers. The key idea is to feed both a masked article and a masked reference summary to a summarizer model so that the generated summary remains relevant to the article yet is factually inconsistent with it.

We have designed an improved system for checking the factual consistency of generated summaries. We end this blog post by pointing to promising pathways for generating summaries that are factually consistent in the first place. Notable approaches include optimizing for factual consistency during generation and correcting factual errors as a post-processing step.

Acknowledgement

This blog post is based on the following paper:

  • Hwanhee Lee, Kang Min Yoo, Joonsuk Park, Hwaran Lee, and Kyomin Jung. Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking. Findings of NAACL 2022. arXiv:2205.02035. (paper)

Thanks to Hwanhee Lee for helpful comments on this blog post.
