Analyze News Headlines with newsgrab and spaCy

Analyze News Headlines with newsgrab and spaCy

Here’s an overview of how to use newsgrab to get news headlines from Google News. Then, the data can be analyzed using the spaCy natural language processing library.

The motivation behind newgrab was to pull data on New York colleges to compare headlines about how institutions were being affected by COVID-19. I used the College Navigator from the National Center for Education Statistics to get a list of 4-year colleges in New York to use as the search data.

I had trouble finding a clean way to scrape headlines from Google News. My brother Randy helped me use Javascript and playwright to write the code for newsgrab.

Run a Search with newsgrab

First, install newsgrab globally through npm from the command line.

npm install -g newsgrab

Run a line with the package name and specify the file path (if outside current working directory) of a line-separated list of desired search terms. For my example, I used the names of New York colleges.

newsgrab ny_colleges.txt

The output of newsgrab is a JSON file called output and will follow the array structure below:

[{"search_term":"term1","results":["result1","result2","result3"]},{"search_term":"term2","results":["result1","result2","result3"]}...]

Afterwards, the output can be handled with Python.

Analyze the JSON Data with spaCy

Import the necessary packages for handling the data. These include: json, pandas, matplotlib, seaborn, re, and spaCy. Specific modules to import are the json_normalize module from pandas and the counter module from collections.

import json
import pandas as pd
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import seaborn as sb
import re
import spacy
from collections import Counter

Bring in one of the pre-trained models from spaCy. I use the model called en_core_web_sm. There are other options in their docs for English models, as well as those for different languages.

nlp = spacy.load("en_core_web_sm")

Read in the JSON data as a list and then normalize it with pandas. Specify the record path as ‘results’ and the meta as ‘search_term’ to correspond with the JSON array data structure from the output file.

with open('output.json',encoding="utf8") as raw_file1:
    list1 = json.load(raw_file1)

search_data = pd.json_normalize(list1, record_path='results', meta='search_term',record_prefix='results')

Gather all separate data through spaCy. I wanted to pull noun chunks, named entities, and tokens from my results column. For the token output, I use the attributes of rule-based matching to specify that I want all tokens except for stop words or punctuation. Then, each output is put into a column of the main dataframe.

noun_chunks = []
named_entity = []
tokens = []

for doc in nlp.pipe(df['results_lower'].astype('unicode').values, batch_size=50,
                        n_process=5):
    if doc.is_parsed:
        noun_chunks.append([chunk.text for chunk in doc.noun_chunks])
        named_entity.append([ent.text for ent in doc.ents])
        tokens.append([token.text for token in doc if not token.is_stop and not token.is_punct])
    else:
        noun_chunks.append(None)
        named_entity.append(None)       
        tokens.append(None)
        
df['results_noun_chunks'] = noun_chunks
df['results_named_entities'] = named_entity
df['results_tokens_clean'] = tokens

Process Tokens

Take the tokens column and flatten it into a list. Perform some general data cleaning like removing special characters and taking out line breaks and the remnants of ampersands. Then, use the counter module to get a frequency count of each of the words in the list.

word_frequency = Counter(string_list_of_words)
A raw output from the counter in collections shows words and their associated frequency in the text.
Raw output from the counter module shows tokens and their associated value counts in the total text.

Before analyzing the list, I also remove the tokens for my list of original search terms to keep it more focused on the terms outside of these. Then, I create a dataframe of the top results and plot those with seaborn.

A horizontal countplot shows descending value counts for the top tokens found in the text.
A countplot shows all keyword tokens with value counts over 21 for the college news headline data.

Process Noun Chunks

Perform some cleaning to separate the noun chunks lists per each individual search term. I remove excess characters after converting the output to strings, and then use the explode function from pandas to separate them.

Then, create a variable for the value count of each of the noun chunks, turn that into a dictionary, then map it to the dataframe for the following result.

A pandas dataframe shows news headlines, noun chunks, and separated noun segments and value counts.
A dataframe shows headlines, search terms, noun chunks, and new columns for separated noun chunks and associated value counts.

Then, I sort the values in a new dataframe in descending order, remove duplicates, and narrow down to the top 20 noun chunks with frequencies above 10 to graph in a countplot.

A horizontal countplot shows descending value counts for the top noun chunks found in the text.
A countplot shows all noun chunks with value counts over 9 for the college news headline data.

Process Named Entities

Cleaning the named entity outputs for each headline is nearly the same in process as cleaning the noun chunks. The lists are converted to strings, are cleaned, and use the explode function to separate individually. The outputs for named entities can be customized depending on desired type.

After separating the individual named entities, I use spaCy to identify the type of each and create a new column for these.

named_entity_type = []

for doc in nlp.pipe(named['named_entity'].astype('unicode').values, batch_size=50,
                        n_process=5):
    if doc.is_parsed:
        named_entity_type.append([ent.label_ for ent in doc.ents])
    else:
        named_entity_type.append(None)        

named['named_entities_type'] = named_entity_type

Then, I get the value counts for the named entities and append these to a dictionary. I map the dictionary to the named entity column, and put the result in a new column.

As seen in the snippet of the full dataframe below, the model for identifying named entity values and types is not always accurate. There is documentation for training spaCy’s models for those interested in increased accuracy.

A pandas dataframe shows news headlines, named entities, and separated named entities, named entity type, and value counts.
A dataframe shows headlines, search terms, named entities, and new columns for separated named entities, their type, and associated value counts.

From the dataframe, I narrow down the entity types to exclude cardinal and ordinal types to take out any numbers that may have high frequencies within the headlines. Then, I get the top named entity types with frequencies over 6 to graph.

A horizontal countplot shows descending value counts for the top non-numerical named entities found in the text.
A countplot shows all non-numerical named entities with value counts over 6 for the college news headline data.

For full details and cleaning steps to create the visualizations above, please reference below for the associated gist from Github.

Additional Resources

Natural Langauge Processing with Python and spaCy by Yuli Vasiliev

Natural Language Processing with spaCy in Python by Taranjeet Singh