Improving Visualizations of Hierarchical Qualitative Data

Improving Visualizations of Hierarchical Qualitative Data

Visualizing qualitative data can be difficult if care is not taken for hierarchical characteristics. Variables representing levels of feelings can be presented in a horizontal range to improve comprehension. The online bank, Simple, includes a poll in its newsletter to account holders and often asks for levels of confidence with financial topics. Here’s how to present hierarchical qualitative data in a few different ways based on visualizations from Simple’s monthly newsletter.

To represent qualitative data, careful consideration should be given to:

  • Graph Type
  • Logical Order of Data
  • Color Scheme

Original Graphs

Graph 1

In September, Simple’s poll question was: “How confident do you feel making big purchases in today’s financial environment?” Here is the visualization that accompanied it.

A pie graph created by Simple bank shows levels of confidence for account holder confidence making big purchases in today's financial climate.
Simple’s pie chart of its September survey results for: “How confident do you feel making big purchases in today’s financial environment?”

Although the legend is presented in a sensible high-to-low order, this graph is pretty confusing. The choice of a pie chart muddles the range of emotions being presented. The viewer’s eye, if moving clockwise, hits ‘Not at all Confident’ at about the same time as ‘Very Confident’. The color palette has no inherent significance for the survey responses. It does not travel on an easily understood color spectrum of high to low.

Graph 2

In November, Simple’s poll question was: “How do you feel about the money you’ll be spending this holiday season?” Below is the graph that illustrated these results.

A bar chart shows
Simple’s bar chart of November survey results for: “How do you feel about the money you’ll be spending this holiday season?”

Simple’s graph shows various emotions, but does not show them in any particular order, whether by percentage or type of feeling. Similar to the pie chart, the color palette does not have any particular significance.

Improved Graphs

Using Python and matplotlib’s horizontal stacked bar chart, I created different representations of the survey data for big purchase confidence and feelings about holiday spending. A bar chart presents results for viewers to read logically from left to right.

Graph 1

A horizontal bar chart shows Simple's survey results from high to low confidence levels for making big purchases in today's financial climate.
A horizontal stacked bar chart shows a variation of Simple’s September survey results.

I associated the levels of confidence with a green to red spectrum to signify the range of positive to negative feelings. Another variation could have been a monochrome spectrum where a dark shade moving to a lighter shades would signify decreasing confidence.

Graph 2

A horizontal stacked bar chart shows a range of emotions for holiday spending.
A horizontal stacked bar chart shows a variation of Simple’s November survey results.

I arranged the emotions from negative to positive feelings so they could show a spectrum. The color palette reflects the movements from troubled to excited by moving from red to green.

References

The survey data, as mentioned, comes from Simple‘s monthly newsletter.

This article from matplotlib on discrete distribution provided me with the base for these graphs. The main distinction is that I only included one bar to achieve the singular spectrum of survey results. I found variations of tree maps and waffle plots did not divide sections horizontally in rectangles as well as the stacked bar plot would.

Code

Visual #1 – September Survey Data

category_names1 = ['very \nconfident', 'somewhat \nconfident', 'mixed \nfeelings', 'not really \nconfident', 'not at all \nconfident']
results1 = {'': [14,16,30,19,21]}

def survey1(results, category_names):

    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn_r')(
        np.linspace(0.15, 0.85, data.shape[1]))

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c))+'%', ha='center', va='center',
                    color=text_color, fontsize=20, fontweight='bold',
                   fontname='Gill Sans MT')
    ax.legend(ncol=len(category_names), bbox_to_anchor=(0.007, 1),
              loc='lower left',prop={'family':'Gill Sans MT', 'size':'15'})
    ax.axis('off')
    return fig, ax

survey1(results1, category_names1)

plt.suptitle(t ='How confident do you feel making big purchases in today\'s financial environment?', x=0.515, y=1.16, 
    fontsize=22, style='italic', fontname='Gill Sans MT')
#plt.savefig('big_purchase_confidence.jpeg', bbox_inches = 'tight')
plt.show()

Visual #2 – November Survey Data

category_names2 = ['in a pickle','worried','fine','calm','excited']
results2 = {'': [14,32,16,29,9]}


def survey2(results, category_names):

    labels = list(results.keys())
    data = np.array(list(results.values()))
    data_cum = data.cumsum(axis=1)
    category_colors = plt.get_cmap('RdYlGn')(
        np.linspace(0.15, 0.85, data.shape[1]))

    fig, ax = plt.subplots(figsize=(10.5, 4))
    ax.invert_yaxis()
    ax.xaxis.set_visible(False)
    ax.set_xlim(0, np.sum(data, axis=1).max())

    for i, (colname, color) in enumerate(zip(category_names, 
category_colors)):
        widths = data[:, i]
        starts = data_cum[:, i] - widths
        ax.barh(labels, widths, left=starts, height=0.5,
                label=colname, color=color)
        xcenters = starts + widths / 2

        r, g, b, _ = color
        text_color = 'white' if r * g * b < 0.5 else 'darkgrey'
        for y, (x, c) in enumerate(zip(xcenters, widths)):
            ax.text(x, y, str(int(c))+'%', ha='center', va='center',
                    color=text_color, fontsize=20, fontweight='bold', fontname='Gill Sans MT')
    ax.legend(ncol=len(category_names), bbox_to_anchor=(- 0.01, 1),
              loc='lower left', prop={'family':'Gill Sans MT', 'size':'16'})
    ax.axis('off')
    return fig, ax


survey2(results2, category_names2)
plt.suptitle(t ='How do you feel about the money you\'ll be spending this holiday season?', x=0.509, y=1.1, fontsize=22,
            style='italic', fontname='Gill Sans MT')
#plt.savefig('holiday_money.jpeg', bbox_inches = 'tight')
plt.show()

Data Exploration with Student Test Scores

Data Exploration with Student Test Scores

I explored a set of student test scores from Kaggle for my Udacity Data Analyst Nanodegree program. The data consists of 1000 entries for students with the following categories: gender, race/ethnicity, parental level of education, lunch assistance, test preparation, math score, reading score, writing score. My main objective was to explore trends through the stages of univariate, bivariate, and multivariate analysis.

Preliminary Data Cleaning

For this project, I used numpy, pandas, matplotlib.pyplot, and seaborn libraries. The original data has all test scores as integer data types. I added a column for a combined average of math, reading, and writing scores and three columns for the test scores converted into letter grade.

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
labels = ['gender', 'race/ethnicity', 'par_level_educ', 'lunch', 'test_prep', 'math', 'reading', 'writing']
tests = pd.read_csv('StudentsPerformance.csv', header=0, names=labels)
tests.info()
Output of info() for student test scores.
tests.head(10)
Output of head() for student test scores.

Univariate Analysis

Histograms provide a sense of the spread of test scores across subject. Count plots provide counts for test preparation course attendance and parental level of education.

plt.figure(figsize=[10,4])
plt.subplot(1, 3, 1)
plt.hist(data=tests, x='math', bins=20)
plt.title('Math')
plt.xlim(0,100)
plt.ylim(0,160)
plt.ylabel('Number of Students', fontsize=12)
plt.subplot(1, 3, 2)
plt.hist(data=tests, x='reading', bins=20)
plt.title('Reading')
plt.xlim(0,100)
plt.ylim(0,160)
plt.subplot(1, 3, 3)
plt.hist(data=tests, x='writing', bins=20)
plt.title('Writing')
plt.xlim(0,100)
plt.ylim(0,160)
plt.suptitle('Test Scores', fontsize=16, y=1.0);
Histograms showing the spread of student test scores across all topics.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
base_color = sb.color_palette()[9]
sb.countplot(data=tests, x='par_level_educ', color=base_color, order=ed_order)
n_points = tests.shape[0]
cat_counts = tests['par_level_educ'].value_counts()
locs, labels = plt.xticks()
for loc, label in zip(locs, labels):
    count = cat_counts[label.get_text()]
    pct_string = count
    plt.text(loc, count-35, pct_string, ha='center', color='black', fontsize=12)
plt.xticks(rotation=25)
plt.xlabel('')
plt.ylabel('')
plt.title('Parental Education Level of Student Test Takers');
Bivariate count plots of student test scores across parental levels of education.

Bivariate Analysis

Violin plots illustrate average test scores and test preparation course attendance. Box plots provide visual representation of the quartiles within each subject area. I sorted level of education from the lowest to highest level captured by the data.

base_color=sb.color_palette()[0]
g = sb.violinplot(data=tests, y='test_prep', x='avg_score', color=base_color)
plt.xlabel('')
plt.ylabel('')
plt.title('Average Test Scores and Preparation Course Completion', fontsize=14)
g.set_yticklabels(['Did Not Complete', 'Completed Course'], fontsize=12);
Violin plots that show average student test scores base on level of test preparation.
ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.boxplot(data=tests, x='reading', y='par_level_educ', order=ed_order, palette="Blues")
plt.xlabel('')
plt.ylabel('')
plt.title('Reading Scores and Parental Level of Education', fontsize=14);
Box plots that show reading scores across varying levels of parental education.

Multivariate Analysis

A swarm plot explores average test scores, parental level of education, and test preparation course attendance. Box plots show test scores for each subject, divided by gender and test preparation course attendance.

ed_order = ['some high school', 'high school', 'some college', 
            'associate\'s degree', 'bachelor\'s degree', 'master\'s degree']
sb.swarmplot(data=tests, x='par_level_educ', y='avg_score', hue='test_prep', order=ed_order, edgecolor='black')
legend = plt.legend(loc=6, bbox_to_anchor=(1.0,0.5))
plt.xticks(rotation=15)
plt.xlabel('')
plt.ylabel('')
legend.get_texts()[0].set_text('Did Not Complete')
legend.get_texts()[1].set_text('Completed')
plt.ylim(0,110)
plt.title('Average Test Scores by Parental Level of Education and Test Preparation Course Participation');
A swarm plot that shows student test scores, test preparation level, and the highest levels of parental education.
plt.figure(figsize=[15,4])
plt.subplot(1, 3, 1)
g = sb.boxplot(data=tests, x='test_prep', y='math', hue='gender')
plt.title('Math')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,2)
g = sb.boxplot(data=tests, x='test_prep', y='reading', hue='gender')
plt.title('Reading')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course'])
plt.subplot(1,3,3)
g = sb.boxplot(data=tests, x='test_prep', y='writing', hue='gender')
plt.title('Writing')
plt.xlabel('')
plt.ylabel('')
plt.ylim(0,110)
g.set_xticklabels(['Did Not Complete', 'Completed Course']);
Multivariate box plots showing test scores in Math, Reading, and Writing based on student gender and test preparation level.

Data Analysis and UFO Reports

Data Analysis and UFO Reports

Data analysis and unidentified flying object (UFO) reports go hand-in-hand. I attended a talk by author Cheryl Costa who analyzes records of UFO sightings and explores their patterns. Cheryl and her wife Linda Miller Costa co-authored a book that compiles UFO reports called UFO Sightings Desk Reference: United States of America 2001-2015.

Records of UFO sightings are considered citizen science because people voluntarily report their experiences. This is similar to wildlife sightings recorded on websites like eBird that help illustrate bird distributions across the world. People report information about UFO sighting events including date, time, and location.

A dark night sky with the moon barely visible and trees below.
Night sky along the roadside outside Wayquecha Biological Field Station in Peru, taken April 2015.

Cheryl spoke about gathering data from two main online databases, MUFON (Mutual UFO Network) and NUFORC (National UFO Reporting Network). NUFORC’s database is public and reports can be sorted by date, UFO shape, and state. MUFON’s database requires a paid membership to access the majority of their data. This talk was not a session to discuss conspiracy theories, but a chance to look at trends in citizen science reports.

The use of data analysis on UFO reports requires careful consideration of potential bias and reasonable explanations for numbers in question. For example, a high volume of reports in the summer could be because more people are spending time outside and would be more likely to notice something strange in the sky.

This talk showed me that conclusions may be temptingly easy to draw when looking at UFO data as a whole, but speculations should be met with careful criticism. The use of the scientific method when approaching ufology, or the study of UFO sightings, seems key for a field often met with overwhelming skepticism.

I have yet to work with any open-source data on UFO reports, but this talk reminded me of the importance of a methodical approach to data analysis. Data visualization for any field of study starts with asking questions, being mindful of outside factors, and being able to communicate messages within large data sets to any audience.

Reducing Plastic Use

Reducing Plastic Use

Various pieces of plastic trash debris are strewn alongside seaweed and rocks on a beach.
Assorted plastic trash on the beach at Pelican Cove Park in Rancho Palos Verdes, CA, 2017.

In the spirit of this year’s Earth Day theme (‘End Plastic Pollution’), I researched the fate of plastic. The Environmental Protection Agency (EPA) prepared a report for 2014 municipal waste stream data for the United States. Plastic products were either recycled, burned for energy production, or sent to landfills. I used pandas to look at the data and Matplotlib to create a graph. I included percentages for each fate and compared the categories of total plastics, containers and packaging, durable goods, and nondurable goods.

A graph compares different types of plastic products and their fate in the municipal waste stream.
Percentages of total plastics and plastic types that get recycled, burned for energy, or sent to a landfill, according to the EPA.

The EPA data shows a majority of plastic products reported in the waste stream were sent to landfills. Obviously, not all plastic waste actually reaches a recycling facility or landfill. Roadsides, waterways, and beaches are all subject to plastic pollution. Decreasing personal use of plastic products can help reduce the overall production of waste.

Here are some ideas for cutting back on plastic use:

  • Bring reusable shopping bags to every store.
    • Utilize cloth bags for all purchases.
    • Opt for reusable produce bags for fresh fruit and vegetables instead of store-provided plastic ones.
  • Ditch party plasticware.
    • Buy an assortment of silverware from a thrift store for party use.
    • Snag a set of used glassware for drinks instead of buying single-use plastic cups.
  • Use Bee’s Wrap instead of plastic wrap.
    • Bee’s Wrap is beeswax covered cloth for food storage. It works exactly the same as plastic wrap, but it can be used over and over.
  • Choose glassware instead of plastic zip-locked bags for storing food.
    • Glass containers like Pyrex can be used in place of single-use plastic storage bags.
  • Say ‘no’ to plastic straws.
    • Get in the habit of refusing a straw at restaurants when you go out.
    • Bring a reusable straw made out of bamboo, stainless steel, or glass to your favorite drink spot.

 

To check out the code for the figure I created, here’s the repository for it.