Transforming Categorical Survey Data with pandas and GeoPy

In this tutorial, I review ways to take raw categorical survey data and create new variables for analysis and visualization with Python, using pandas and GeoPy. I'll show how to create new pandas columns by encoding complex responses, geocoding locations, and measuring distances.
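
As a preview of the workflow, here is a minimal sketch of those three steps on a toy dataframe. The column names and places below are hypothetical stand-ins for the steward survey fields covered in the notebook.

```python
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.distance import great_circle

# Toy survey responses (hypothetical columns standing in for the steward data).
df = pd.DataFrame({
    "species_found": ["none", "eurasian watermilfoil", "none"],
    "launch_site": ["Oswego, NY", "Sackets Harbor, NY", "Clayton, NY"],
})

# 1. Encode a categorical response as a new binary column.
df["invasive_present"] = (df["species_found"] != "none").astype(int)

# 2. Geocode each launch site to (latitude, longitude) coordinates.
#    Nominatim is rate-limited, so cache or throttle calls on real data.
geolocator = Nominatim(user_agent="survey_tutorial")

def to_coords(place):
    location = geolocator.geocode(place)
    return (location.latitude, location.longitude) if location else None

df["coords"] = df["launch_site"].apply(to_coords)

# 3. Measure great-circle distance (in miles) from a reference point.
reference = (43.4553, -76.5105)  # approximate coordinates for Oswego, NY
df["miles_from_oswego"] = df["coords"].apply(
    lambda point: great_circle(reference, point).miles if point else None
)
```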

Here’s the associated GitHub repository for this workshop, which includes the data set and a Jupyter Notebook for the code.

Thanks to the St. Lawrence Eastern Lake Ontario Partnership for Regional Invasive Species Management (SLELO PRISM), I was able to use boat launch steward data from 2016 for this virtual workshop. The survey data was collected by boat launch stewards around Lake Ontario in upstate New York. Boaters were asked a series of survey questions, and their watercraft were inspected for aquatic invasive species.

This tutorial was originally designed for the Syracuse Women in Machine Learning and Data Science (Syracuse WiMLDS) Meetup group.

Project Overview: FoodPact

Logo for the FoodPact program: a fork and knife flank a plate bearing the earth, with FoodPact written below.

A few months ago, I began a project with my brother to create a calculator for the environmental footprint of food. It's called FoodPact, a blend of "food" and "ecological impact." It's a work in progress, and I'm excited to share the code for it.

Data sources to inform the calculator include:

  • Water footprint data for crops from a 2011 study by M. M. Mekonnen and A. Y. Hoekstra.
  • Greenhouse gas emissions data from Business for Social Responsibility (BSR) and Environmental Protection Agency (EPA) documents on transport via boat, rail, and truck freight.
  • Food waste data from the United States Department of Agriculture (USDA) Economic Research Service (ERS).
  • Global food import data from the USDA Foreign Agricultural Service's Global Agricultural Trade System (GATS).
  • Country centroid data from a 2015 data file from the President and Fellows of Harvard College.
  • US city locations from SimpleMaps.

We used a Bootswatch Bootstrap theme for the web application's layout and Flask as the microframework.

Python packages used in the program include:

  • pandas to create more refined dataframes for use within the application
  • NumPy for equations and numerical operations
  • geopy for calculating great-circle distances between latitude/longitude pairs
  • Matplotlib's pyplot module for creating graphs

The program takes a user's location, a food product, and the product's country of origin, and generates the estimated distance the food traveled, the approximate amount of carbon dioxide that travel generated, and the water requirements for the product.

Conversions include: cubic meters of water to gallons, metric tons of crops to pounds, and grams of carbon dioxide per kilometer to pounds per mile.
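
To make the pipeline concrete, here is a minimal sketch of the distance and conversion steps. The coordinates, emissions factor, and water footprint value are illustrative placeholders, not the actual figures FoodPact uses.

```python
from geopy.distance import great_circle

# Illustrative inputs: a user city and a country-of-origin centroid.
user_city = (43.0481, -76.1474)         # Syracuse, NY
origin_centroid = (23.6345, -102.5528)  # approximate centroid of Mexico

# Estimated distance the food traveled.
distance_km = great_circle(user_city, origin_centroid).kilometers
distance_miles = distance_km * 0.621371

# Conversion constants mentioned above.
GRAMS_PER_POUND = 453.592
GALLONS_PER_CUBIC_METER = 264.172
POUNDS_PER_METRIC_TON = 2204.62

# Placeholder emissions factor: grams of CO2 per kilometer for one TEU container.
grams_co2_per_km = 100.0
co2_pounds = distance_km * grams_co2_per_km / GRAMS_PER_POUND

# Placeholder water footprint: cubic meters of water per metric ton of crop.
water_m3_per_ton = 300.0
water_gallons_per_pound = (water_m3_per_ton * GALLONS_PER_CUBIC_METER
                           / POUNDS_PER_METRIC_TON)

print(f"Distance: {distance_miles:,.0f} miles")
print(f"Transport CO2: {co2_pounds:,.0f} lb")
print(f"Water footprint: {water_gallons_per_pound:,.1f} gal per lb of crop")
```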

One limitation of the calculator is that the carbon dioxide values assume the entire journey is made by ship, train, or truck, not a combination of the three. Emissions refer to the amount required to ship a twenty-foot equivalent unit (TEU) container full of the food product across the world. The country of origin is represented by its centroid rather than the exact location of food production. Similarly, the list of cities includes only the five most populated cities in each state. The only exception is New York, for which I considered New York City close enough in latitude and longitude to account for Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.

The data referenced in the calculator is meant to give consumers a relative sense of the inputs required to produce and transport food products. Ideally, the calculator will encourage conversations about the food system and inspire people to reduce their personal food waste.

Highlights from Data Science Day 2018

Columbia University hosted Data Science Day 2018 on March 28th at its campus in Manhattan. I traveled to New York to attend the event and learn more about how data science plays a role in health, climate, and finance research. A few of the presentations stood out, including the environmental talks and a keynote address from Diane Greene, the CEO of Google Cloud.

Grand Army Plaza in Manhattan, New York, 2013

I was extremely excited when I first saw the program for Data Science Day because I noticed a series of lightning talks on climate change. The session entitled ‘Climate + Finance: Use of Environmental Data to Measure and Anticipate Financial Risk’ brought together Columbia staff who specialize in economics, climate research, and environmental policy.

Geoffrey Heal gave a talk called ‘Rising Waters: The Economic Impact of Sea Level Rise’ that addressed financial models associated with sea level rise projections. Heal presented major cities and associated data for property values, historic flooding, and flood maps to illustrate the overall financial impact of sea level rise. This talk highlighted the importance of interdisciplinary data science work when addressing complex issues like climate change. Collaboration between academic researchers and national groups like NOAA and FEMA provides a platform for data science work that can inform professionals across career fields.

Lisa Goddard spoke about ‘Data & Finance in the Developing World’. The main topics of her talk were food security and drought impacts in developing countries. Goddard’s research included rain gauge measurements, satellite imagery, soil moisture levels, and crop yield records. She addressed the use of various climate data to advise appropriate resilience tactics, such as crop insurance for financial security. Overall, dealing with food security will be essential when handling the impacts of climate change on small scale farms across the world. Data science can help the agricultural sector by providing farmers with more information to consider when planning for effects of climate change.

Wolfram Schlenker gave a talk called 'Agricultural Yields and Prices in a Warming World'. He addressed the impact of weather shocks on common crops, such as unanticipated exposure to hot temperatures. Corn, a tropical plant, can potentially see higher yields when there are sudden, extreme instances of warm weather. Schlenker presented a fresh perspective on how climate change can impact crop yields differently according to species. A combination of climate models, market conditions, and yield data can provide a foundation for better understanding climate change's impacts on agricultural commodities on a case-by-case basis.

Diane Greene's keynote session for Data Science Day 2018 raised important considerations for navigating the world of data science. Greene said Google Cloud's main goal is to deliver intuitive technological capabilities. Google Cloud offers a wide range of APIs that ease the flow of information across the world. For example, Google Cloud's Translation API makes it possible for online articles to be translated into different languages, reaching a wider audience. Diane Greene's talk inspired me to be creative with innovation in data science and to consider usability and collaboration on all fronts.

This event was a great opportunity to learn from leaders in the field of data science. Communication and collaboration were major themes of these talks and I left Data Science Day 2018 feeling empowered to address challenges like climate change.

How Does Environmental Science Relate to Computer Programming?

This is a question I have received quite frequently in recent weeks. Computer programming languages can make scientific analysis much easier. This applies directly to environmental science because ecological studies generate a wealth of data. Coding offers a way for scientists to automate repetitive tasks with a few lines of code, freeing up time for other work. It can also result in new software that scientists across disciplines can use.

Statistics and environmental science go hand in hand. Science experiments follow a natural order: forming a hypothesis, establishing test methods, collecting data, analyzing the data, and drawing conclusions.

The data analysis portion is where statistical models are important. Oftentimes, scientists want to know whether the results of their experiments hold statistical significance: that the trends observed in the data reflect real effects rather than random chance or mistakes in the experiment's design or execution.

Programming languages such as Python can help scientists carry out this analysis. Python packages like NumPy provide a basis for computational work, and libraries like SciPy offer modules such as scipy.stats for hypothesis tests, including t-tests and analysis of variance (ANOVA) for numerical data and the chi-square test for categorical data. In R, packages such as car provide ANOVA tables, while base R itself includes functions such as t.test. Both languages also offer packages for creating graphs and visuals to display analytical results, such as Matplotlib in Python and ggplot2 in R.
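
As a quick illustration, the scipy.stats module exposes each of these tests directly. The samples below are simulated, not data from any real experiment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated numerical measurements from three hypothetical field sites.
site_a = rng.normal(loc=5.0, scale=1.0, size=30)
site_b = rng.normal(loc=5.6, scale=1.0, size=30)
site_c = rng.normal(loc=5.3, scale=1.0, size=30)

# Independent two-sample t-test between two sites.
t_stat, t_p = stats.ttest_ind(site_a, site_b)

# One-way ANOVA across all three sites.
f_stat, anova_p = stats.f_oneway(site_a, site_b, site_c)

# Chi-square test on a contingency table of categorical counts.
observed = np.array([[18, 12],
                     [ 9, 21]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)

print(f"t-test p={t_p:.3f}, ANOVA p={anova_p:.3f}, chi-square p={chi_p:.3f}")
```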

Subfields of environmental science, such as conservation biology, can benefit from programming through computer models. As a college student, the first software I was introduced to that was created specifically for conservation science was a population viability analysis (PVA) program called Vortex. Population viability analysis measures the likelihood that a population of organisms will persist or decline under a given set of circumstances. The Vortex software lets users adjust those circumstances, such as genetic diversity, number of organisms, and mortality rate. I used the software in a classroom setting while studying in Peru, running various tests to see which factors would be detrimental to the population of a theoretical species. It is one of many tools environmental science professionals can use to inform management decisions for threatened species.
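
Vortex itself models far more detail, but the core idea of a PVA can be sketched in a few lines: simulate a population under random annual variation and estimate how often it falls below a quasi-extinction threshold. All parameter values below are illustrative, not drawn from any real species.

```python
import numpy as np

rng = np.random.default_rng(0)

def extinction_probability(n0=50, years=100, mean_growth=1.01,
                           growth_sd=0.15, threshold=10, trials=5000):
    """Monte Carlo estimate of quasi-extinction risk for a toy population."""
    extinct = 0
    for _ in range(trials):
        n = n0
        for _ in range(years):
            # Annual growth rate varies randomly around its mean.
            n *= rng.normal(mean_growth, growth_sd)
            if n < threshold:
                extinct += 1
                break
    return extinct / trials

print(f"Estimated extinction risk: {extinction_probability():.2%}")
```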

A treeline in the lowland Amazon rain forest along the Madre de Dios River in Peru, taken February 2015. Advancements in computer programming can provide tools to help increase scientific understanding of biologically rich areas, such as the Amazon.

Within the field of environmental science, computer programming is a great advantage because it allows scientists to analyze data efficiently. Programming languages and modeling software put computers to work on tasks that humans would otherwise perform by hand, giving scientists more time to make discoveries and inform decisions that make the world a better place.