Automatic Text Summarization
Posted on September 10, 2015 in Natural Language Processing
Background
Automatic text summarization is an area of machine learning that has made significant progress over the past few years. We read hundreds or thousands of articles on our desktops, tablets, and mobile devices, and we simply don't have the time to peruse them all. As the problem of information overload has grown more salient with the exponential growth in the quantity of data (online and offline articles, books, and newspapers), so has our interest in automatic text summarization as a way to cut the time and inefficiency of digesting it all. Automatic summarization uses natural language processing to condense a text document into a coherent summary that preserves the important ideas of the original. I've decided to build a naive but powerful text summarization tool in R that feeds in and summarizes TechCrunch and The New York Times articles from their respective Facebook pages.
There are two major approaches to text summarization: extraction-based and abstraction-based. I'll be focusing on the extractive method, which works by selecting a subset of words, phrases, or sentences that already exist in the original text to form the summary. The abstractive method, on the other hand, tries to build the summary from scratch, generating words, phrases, or sentences that may not appear in the original. The latter is much more complicated than the former, so much of the research has focused on extraction.
Algorithm
In extractive methods, culling key words, sentences, or paragraphs is very important. The naive algorithm I've written in R extracts the key sentence from each paragraph in the document. So if there are five paragraphs in an article, the algorithm will return five key sentences as a summary. Why am I doing this? Well, I assume that each paragraph is well written, meaning it contains one main idea and purpose. This is a pretty solid assumption in that writers of news and blog articles want to make sure they convey their ideas succinctly and coherently. If their writing were sloppy and unclear, readers would get confused about what they are trying to say. So for each paragraph, the algorithm computes a similarity score between every pair of sentences. The similarity score is roughly calculated as two times the number of overlapping words divided by the total number of words in the two sentences:
\[ similarity(Sentence_1, Sentence_2) = \frac{2 \times |Sentence_1 \cap Sentence_2|}{|Sentence_1| + |Sentence_2|} \]

where \(Sentence_1\) represents the set of words in the first sentence and \(Sentence_2\) the set of words in the second. The intuition is that we split each sentence into words or tokens, count how many words the two sentences have in common, and then normalize the result by the average length of the two sentences. This similarity function is known as the Dice coefficient, but feel free to use other similarity metrics such as Jaccard or cosine similarity, or even Euclidean distance. Once we have a list of similarity scores for every sentence in a paragraph, we choose the sentence with the highest sum of similarity scores as the key sentence of the paragraph.
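To make the formula concrete, here is a minimal sketch of the Dice coefficient in Python (my implementation is in R, so this is purely illustrative; the function names `tokenize` and `dice_similarity` are my own):

```python
import re

def tokenize(sentence):
    # Lowercase the sentence and keep alphanumeric tokens only
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def dice_similarity(s1, s2):
    """Dice coefficient: 2 * |overlap| / (|words1| + |words2|)."""
    w1, w2 = tokenize(s1), tokenize(s2)
    if not w1 or not w2:
        return 0.0
    return 2 * len(w1 & w2) / (len(w1) + len(w2))

# Two 5-word sentences sharing 3 words -> 2 * 3 / 10 = 0.6
print(dice_similarity("The cat sat on the mat", "The cat lay on the rug"))
```

Note that tokenizing into a *set* means repeated words count once, which is why "the" doesn't inflate the score here.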
Pseudocode
- Split a text document into pre-existing paragraphs.
- Remove stopwords and non-alphanumeric characters in every sentence. Stem all words.
- For every paragraph in order:
    - For each sentence:
        - Calculate a similarity score with every other sentence
        - Find the sum of its similarity scores
    - Choose the sentence that has the highest sum as the key sentence
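The pseudocode above can be sketched end to end as follows. This is a Python illustration of the algorithm, not my R implementation: the stopword list is a toy one, stemming is omitted, and the sentence splitter is deliberately naive.

```python
import re

# Toy stopword list for illustration; a real implementation would use a full list
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are"}

def split_sentences(paragraph):
    # Naive sentence split on ., !, and ?
    return [s.strip() for s in re.split(r"[.!?]+", paragraph) if s.strip()]

def tokens(sentence):
    # Lowercase, keep alphanumeric tokens, drop stopwords (stemming omitted)
    return {w for w in re.findall(r"[a-z0-9]+", sentence.lower())
            if w not in STOPWORDS}

def dice(w1, w2):
    return 2 * len(w1 & w2) / (len(w1) + len(w2)) if w1 and w2 else 0.0

def summarize(document):
    summary = []
    for paragraph in document.split("\n\n"):  # pre-existing paragraphs
        sents = split_sentences(paragraph)
        if not sents:
            continue
        toks = [tokens(s) for s in sents]
        # For each sentence, sum its similarity with every other sentence
        scores = [sum(dice(toks[i], toks[j])
                      for j in range(len(sents)) if j != i)
                  for i in range(len(sents))]
        # The sentence with the highest total similarity is the key sentence
        summary.append(sents[scores.index(max(scores))])
    return summary
```

One key sentence comes back per paragraph, so a five-paragraph article yields a five-sentence summary, exactly as described above.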
Of course, this is not a foolproof method of summarizing text, and there are a lot of ways my naive algorithm could be adjusted and improved. For instance, the title usually contains keywords from the text, so you can give more weight in the similarity function to sentences that contain one or more title keywords. You can also experiment with different similarity metrics, as mentioned above, or return only key sentences whose similarity scores exceed a heuristic cutoff. Also, instead of picking the one best sentence from each paragraph, try picking the two or three most important sentences.
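One of those tweaks, boosting sentences that share words with the title, could look like this sketch (the flat bonus of 0.1 per matching title word is an arbitrary choice of mine, not something from my implementation):

```python
import re

def words(text):
    # Lowercase alphanumeric tokens as a set
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def title_boosted_score(sentence, base_score, title, boost=0.1):
    """Add a fixed bonus to a sentence's score for every title keyword it contains."""
    overlap = words(sentence) & words(title)
    return base_score + boost * len(overlap)

print(title_boosted_score("Common opens its first co-living building", 1.0,
                          "Co-living startup Common opens in Brooklyn"))
```

A multiplicative boost, or a bonus proportional to a keyword's rarity, would be reasonable alternatives to the flat per-word bonus shown here.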
Data and Results
Let's use this naive algorithm to summarize this article from the TechCrunch Facebook post. The article is summarized using sentences whose similarity scores exceed 1.
Common, a co-living startup from General Assembly co-founder Brad Hargreaves, is unveiling its first building today in Brooklyn's Crown Heights. The Common opening comes at a time when venture-backed companies like WeWork are piling into co-living as a way to use urban residential space more cost-efficiently and to attract Millennials, who are delaying marriage and families later and later. Over the summer, Common partnered with a local New York City real estate developer to buy Crown Heights building earlier this year. "The whole idea here is to use common areas and activate typically under-utilized space," Hargreaves told me in a video tour via Skype. They're bringing in Common residents to the next one. About 250 people have applied for the 19 available Common spots so far in the neighborhood. They later fell out of favor as the American middle-class left for the suburbs in the mid-20th century, turning residential hotels into the housing choice of last resort for the poor left behind in the urban core. Unless low-rise, suburban areas ringing the urban core step up and add housing and unless cities start scrutinizing the higher-end of the market where external capital is sitting in vacant units as pure investment rather than shelter, there will be an extraordinary amount of pressure to make urban residential space more efficient. For now, Hargreaves is focused on New York City with two buildings slated for opening in Brooklyn. He has raised about $7.35 million from Maveron and other investors this year.
The summary, though not perfect, generally makes sense, keeps the main ideas in place, and is much shorter than the original article! I used the Rfacebook package in R to grab articles from the TechCrunch and The New York Times Facebook pages, and the rvest package to scrape information relevant to each article. I plan to create a Shiny app that summarizes articles from their Facebook pages, outputs the main idea, purpose, and summary, suggests relevant hashtags, and perhaps recommends similar articles from their past posts. Stay tuned!