Here’s an overview of how to use newsgrab to get news headlines from Google News. Then, the data can be analyzed using the spaCy natural language processing library.
The motivation behind newsgrab was to pull data on New York colleges to compare headlines about how institutions were being affected by COVID-19. I used the College Navigator from the National Center for Education Statistics to get a list of 4-year colleges in New York to use as the search data.
I had trouble finding a clean way to scrape headlines from Google News. My brother Randy helped me use JavaScript and Playwright to write the code for newsgrab.
Run a Search with newsgrab
First, install newsgrab globally through npm from the command line.
npm install -g newsgrab
Run the newsgrab command followed by the file path (if outside the current working directory) of a line-separated list of desired search terms. For my example, I used the names of New York colleges.
newsgrab ny_colleges.txt
The output of newsgrab is a JSON file called output.json that follows the array structure below:
[{"search_term":"term1","results":["result1","result2","result3"]},{"search_term":"term2","results":["result1","result2","result3"]}...]
Afterwards, the output can be handled with Python.
Analyze the JSON Data with spaCy
Import the necessary packages for handling the data: json, pandas, matplotlib, seaborn, re, and spaCy, along with the Counter class from collections. The json_normalize function is available directly through pandas as pd.json_normalize.
import json
import pandas as pd  # provides pd.json_normalize for flattening the JSON records
import matplotlib.pyplot as plt
import seaborn as sb
import re
import spacy
from collections import Counter
Bring in one of the pre-trained models from spaCy. I use the model called en_core_web_sm. There are other options in their docs for English models, as well as those for different languages.
nlp = spacy.load("en_core_web_sm")
Read in the JSON data as a list and then normalize it with pandas. Specify the record path as ‘results’ and the meta as ‘search_term’ to correspond with the JSON array data structure from the output file.
# load the newsgrab output and flatten it into one row per headline
with open('output.json', encoding='utf8') as raw_file1:
    list1 = json.load(raw_file1)

search_data = pd.json_normalize(list1, record_path='results', meta='search_term', record_prefix='results')
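The spaCy steps below reference a dataframe named df with a lowercased headline column. As a minimal bridge, assuming json_normalize names the flattened headline column results0 (the record_prefix plus the default column name) and that the remaining cleaning steps live in the gist linked at the end:
df = search_data.copy()
# lowercase the headline text so matching and counting are case-insensitive
df['results_lower'] = df['results0'].astype(str).str.lower()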
Gather the separate data through spaCy. I wanted to pull noun chunks, named entities, and tokens from my results column. For the token output, I use token attributes to keep every token that is not a stop word or punctuation. Then, each output is put into a column of the main dataframe.
noun_chunks = []
named_entity = []
tokens = []

# run the headlines through the spaCy pipeline in batches
for doc in nlp.pipe(df['results_lower'].astype(str).values, batch_size=50, n_process=5):
    if doc.has_annotation("DEP"):  # replaces Doc.is_parsed, which was removed in spaCy 3
        noun_chunks.append([chunk.text for chunk in doc.noun_chunks])
        named_entity.append([ent.text for ent in doc.ents])
        # keep only tokens that are not stop words or punctuation
        tokens.append([token.text for token in doc if not token.is_stop and not token.is_punct])
    else:
        noun_chunks.append(None)
        named_entity.append(None)
        tokens.append(None)
df['results_noun_chunks'] = noun_chunks
df['results_named_entities'] = named_entity
df['results_tokens_clean'] = tokens
Process Tokens
Take the tokens column and flatten it into a list. Perform some general data cleaning, like removing special characters, line breaks, and the remnants of ampersands, as sketched below. Then, use the Counter class to get a frequency count of each word in the list.
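A minimal sketch of that flattening and cleanup, assuming the variable name string_list_of_words and these particular cleaning rules (the exact steps are in the gist at the end):
# flatten the per-headline token lists into a single list of words
string_list_of_words = [
    word for token_list in df['results_tokens_clean'] if token_list
    for word in token_list
]
# strip special characters and line breaks, then drop empties and "&amp;" remnants
string_list_of_words = [re.sub(r'[^a-z0-9]+', '', word.lower()) for word in string_list_of_words]
string_list_of_words = [word for word in string_list_of_words if word and word != 'amp']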
word_frequency = Counter(string_list_of_words)
Before analyzing the list, I also remove the tokens from my original search terms, to keep the focus on words outside of those names. Then, I create a dataframe of the top results and plot those with seaborn.
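As a sketch of those two steps (the cutoff of 20 words, the column names, and the plot styling are my assumptions):
# drop words that come from the original search terms (the college names)
search_term_words = set()
for term in df['search_term'].astype(str).str.lower():
    search_term_words.update(re.sub(r'[^a-z0-9 ]+', '', term).split())

top_words = pd.DataFrame(
    [(word, count) for word, count in word_frequency.most_common()
     if word not in search_term_words][:20],
    columns=['word', 'frequency']
)

sb.barplot(data=top_words, x='frequency', y='word', color='steelblue')
plt.title('Top Words in Headlines')
plt.tight_layout()
plt.show()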
Process Noun Chunks
Perform some cleaning to separate the noun chunk lists for each individual search term. I remove excess characters after converting the output to strings, and then use the explode function from pandas to separate them, as sketched below.
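A sketch of that step, where the intermediate dataframe name chunks and the cleaning pattern are assumptions:
# one row per (search term, noun chunk) pair
chunks = df[['search_term', 'results_noun_chunks']].copy()
chunks['noun_chunk'] = (
    chunks['results_noun_chunks'].astype(str)
    .str.replace(r"[\[\]']", '', regex=True)  # drop the list brackets and quotes
    .str.split(', ')
)
chunks = chunks.explode('noun_chunk')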
Next, create a variable for the value count of each noun chunk, turn that into a dictionary, and map it back onto the dataframe, as in the sketch below.
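Continuing the sketch above (the column name noun_chunk_frequency is an assumption):
# frequency of each noun chunk across all headlines, mapped back to each row
chunk_counts = chunks['noun_chunk'].value_counts().to_dict()
chunks['noun_chunk_frequency'] = chunks['noun_chunk'].map(chunk_counts)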
I then sort the values in a new dataframe in descending order, remove duplicates, and narrow the data down to the top 20 noun chunks with frequencies above 10 to graph in a countplot.
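One way to get there, under the same assumptions:
top_chunks = (
    chunks.drop_duplicates(subset='noun_chunk')
    .sort_values('noun_chunk_frequency', ascending=False)
)
top_chunks = top_chunks[top_chunks['noun_chunk_frequency'] > 10].head(20)

# countplot over the exploded rows, limited to the top chunks
sb.countplot(
    data=chunks[chunks['noun_chunk'].isin(top_chunks['noun_chunk'])],
    y='noun_chunk',
    order=top_chunks['noun_chunk'],
)
plt.title('Top Noun Chunks in Headlines')
plt.tight_layout()
plt.show()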
Process Named Entities
Cleaning the named entity outputs for each headline is nearly the same process as cleaning the noun chunks. The lists are converted to strings, cleaned, and separated individually with the explode function. The outputs for named entities can be customized depending on the desired type.
After separating the individual named entities, I use spaCy to identify the type of each and create a new column for these.
named_entity_type = []

# classify each separated entity string to get its entity label
for doc in nlp.pipe(named['named_entity'].astype(str).values, batch_size=50, n_process=5):
    if doc.has_annotation("DEP"):  # replaces Doc.is_parsed, which was removed in spaCy 3
        named_entity_type.append([ent.label_ for ent in doc.ents])
    else:
        named_entity_type.append(None)

named['named_entities_type'] = named_entity_type
Then, I get the value counts for the named entities, turn these into a dictionary, map the dictionary to the named entity column, and put the result in a new column.
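As a sketch, mirroring the noun chunk step (the column name named_entity_frequency is an assumption):
# frequency of each named entity across all headlines, mapped back to each row
entity_counts = named['named_entity'].value_counts().to_dict()
named['named_entity_frequency'] = named['named_entity'].map(entity_counts)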
As the resulting dataframe shows, the model for identifying named entity values and types is not always accurate. There is documentation for training spaCy’s models for those interested in increased accuracy.
From the dataframe, I narrow down the entity types to exclude the cardinal and ordinal types, which takes out any numbers that may have high frequencies within the headlines. Then, I get the top named entities with frequencies over 6 to graph, as sketched below.
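A sketch of that filter and plot; spaCy’s labels for these numeric types are CARDINAL and ORDINAL, and the remaining names continue the assumptions above:
# drop numeric entity types, then keep entities with frequency over 6
named_filtered = named[~named['named_entities_type'].astype(str).str.contains('CARDINAL|ORDINAL')]
top_entities = (
    named_filtered[named_filtered['named_entity_frequency'] > 6]
    .drop_duplicates(subset='named_entity')
    .sort_values('named_entity_frequency', ascending=False)
)

sb.countplot(
    data=named_filtered[named_filtered['named_entity'].isin(top_entities['named_entity'])],
    y='named_entity',
    order=top_entities['named_entity'],
)
plt.title('Top Named Entities in Headlines')
plt.tight_layout()
plt.show()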
For full details and the cleaning steps behind the visualizations above, please reference the associated gist from GitHub below.
Additional Resources
Natural Language Processing with Python and spaCy by Yuli Vasiliev
Natural Language Processing with spaCy in Python by Taranjeet Singh