In this tutorial, I review ways to take raw categorical survey data and create new variables for analysis and visualizations with Python using pandas and GeoPy. I’ll show how to make new pandas columns from encoding complex responses, geocoding locations, and measuring distances.
Here’s the associated GitHub repository for this workshop, which includes the data set and a Jupyter Notebook for the code.
Thanks to the St. Lawrence Eastern Lake Ontario Partnership for Regional Invasive Species Management (SLELO PRISM), I was able to use boat launch steward data from 2016 for this virtual workshop. The survey data was collected by boat launch stewards around Lake Ontario in upstate New York. Boaters were asked a series of survey questions and their watercrafts were inspected for aquatic invasive species.
This tutorial was originally designed for the Syracuse Women in Machine Learning and Data Science (Syracuse WiMLDS) Meetup group.
Visualizing qualitative data can be difficult if care is not taken for hierarchical characteristics. Variables representing levels of feelings can be presented in a horizontal range to improve comprehension. The online bank, Simple, includes a poll in its newsletter to account holders and often asks for levels of confidence with financial topics. Here’s how to present hierarchical qualitative data in a few different ways based on visualizations from Simple’s monthly newsletter.
To represent qualitative data, careful consideration should be given to:
Graph Type
Logical Order of Data
Color Scheme
Original Graphs
Graph 1
In September, Simple’s poll question was: “How confident do you feel making big purchases in today’s financial environment?” Here is the visualization that accompanied it.
Although the legend is presented in a sensible high-to-low order, this graph is pretty confusing. The choice of a pie chart muddles the range of emotions being presented. The viewer’s eye, if moving clockwise, hits ‘Not at all Confident’ at about the same time as ‘Very Confident’. The color palette has no inherent significance for the survey responses. It does not travel on an easily understood color spectrum of high to low.
Graph 2
In November, Simple’s poll question was: “How do you feel about the money you’ll be spending this holiday season?” Below is the graph that illustrated these results.
Simple’s graph shows various emotions, but does not show them in any particular order, whether by percentage or type of feeling. Similar to the pie chart, the color palette does not have any particular significance.
Improved Graphs
Using Python and matplotlib’s horizontal stacked bar chart, I created different representations of the survey data for big purchase confidence and feelings about holiday spending. A bar chart presents results for viewers to read logically from left to right.
Graph 1
I associated the levels of confidence with a green to red spectrum to signify the range of positive to negative feelings. Another variation could have been a monochrome spectrum where a dark shade moving to a lighter shades would signify decreasing confidence.
Graph 2
I arranged the emotions from negative to positive feelings so they could show a spectrum. The color palette reflects the movements from troubled to excited by moving from red to green.
References
The survey data, as mentioned, comes from Simple‘s monthly newsletter.
This article from matplotlib on discrete distribution provided me with the base for these graphs. The main distinction is that I only included one bar to achieve the singular spectrum of survey results. I found variations of tree maps and waffle plots did not divide sections horizontally in rectangles as well as the stacked bar plot would.
Code
Visual #1 – September Survey Data
category_names1 = ['very \nconfident', 'somewhat \nconfident', 'mixed \nfeelings', 'not really \nconfident', 'not at all \nconfident']
results1 = {'': [14,16,30,19,21]}
def survey1(results, category_names):
labels = list(results.keys())
data = np.array(list(results.values()))
data_cum = data.cumsum(axis=1)
category_colors = plt.get_cmap('RdYlGn_r')(
np.linspace(0.15, 0.85, data.shape[1]))
fig, ax = plt.subplots(figsize=(12, 4))
ax.invert_yaxis()
ax.xaxis.set_visible(False)
ax.set_xlim(0, np.sum(data, axis=1).max())
for i, (colname, color) in enumerate(zip(category_names, category_colors)):
widths = data[:, i]
starts = data_cum[:, i] - widths
ax.barh(labels, widths, left=starts, height=0.5,
label=colname, color=color)
xcenters = starts + widths / 2
r, g, b, _ = color
text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
for y, (x, c) in enumerate(zip(xcenters, widths)):
ax.text(x, y, str(int(c))+'%', ha='center', va='center',
color=text_color, fontsize=20, fontweight='bold',
fontname='Gill Sans MT')
ax.legend(ncol=len(category_names), bbox_to_anchor=(0.007, 1),
loc='lower left',prop={'family':'Gill Sans MT', 'size':'15'})
ax.axis('off')
return fig, ax
survey1(results1, category_names1)
plt.suptitle(t ='How confident do you feel making big purchases in today\'s financial environment?', x=0.515, y=1.16,
fontsize=22, style='italic', fontname='Gill Sans MT')
#plt.savefig('big_purchase_confidence.jpeg', bbox_inches = 'tight')
plt.show()
Visual #2 – November Survey Data
category_names2 = ['in a pickle','worried','fine','calm','excited']
results2 = {'': [14,32,16,29,9]}
def survey2(results, category_names):
labels = list(results.keys())
data = np.array(list(results.values()))
data_cum = data.cumsum(axis=1)
category_colors = plt.get_cmap('RdYlGn')(
np.linspace(0.15, 0.85, data.shape[1]))
fig, ax = plt.subplots(figsize=(10.5, 4))
ax.invert_yaxis()
ax.xaxis.set_visible(False)
ax.set_xlim(0, np.sum(data, axis=1).max())
for i, (colname, color) in enumerate(zip(category_names,
category_colors)):
widths = data[:, i]
starts = data_cum[:, i] - widths
ax.barh(labels, widths, left=starts, height=0.5,
label=colname, color=color)
xcenters = starts + widths / 2
r, g, b, _ = color
text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
for y, (x, c) in enumerate(zip(xcenters, widths)):
ax.text(x, y, str(int(c))+'%', ha='center', va='center',
color=text_color, fontsize=20, fontweight='bold', fontname='Gill Sans MT')
ax.legend(ncol=len(category_names), bbox_to_anchor=(- 0.01, 1),
loc='lower left', prop={'family':'Gill Sans MT', 'size':'16'})
ax.axis('off')
return fig, ax
survey2(results2, category_names2)
plt.suptitle(t ='How do you feel about the money you\'ll be spending this holiday season?', x=0.509, y=1.1, fontsize=22,
style='italic', fontname='Gill Sans MT')
#plt.savefig('holiday_money.jpeg', bbox_inches = 'tight')
plt.show()