Project Overview: FoodPact

A fork and knife surround a plate with the Earth on it, and the name FoodPact is written below.
Logo for the FoodPact program

A few months ago I began a project with my brother to create a calculator for the environmental footprint of food. It’s called FoodPact, a name that merges ‘food’ and ‘ecological impact.’ It’s a work in progress, and I’m excited to share the code for it.

Data sources to inform the calculator include:

  • Water footprint data for crops from a 2011 study by M.M. Mekonnen and A.Y. Hoekstra.
  • Greenhouse gas emissions data from Business for Social Responsibility (BSR) and Environmental Protection Agency (EPA) documents on transport via boat, rail, and truck freight.
  • Food waste data from the United States Department of Agriculture (USDA) Economic Research Service (ERS).
  • Global food import data from the USDA Foreign Agricultural Service’s Global Agricultural Trade System (GATS).
  • Country centroid data from a 2015 President and Fellows of Harvard College data file.
  • US city locations from SimpleMaps.

We used a Bootswatch theme for Bootstrap for the web application’s layout and Flask as the microframework.

Python packages used in the program include:

  • Pandas to create more refined dataframes for use within the application
  • NumPy for equations
  • geopy for calculating the great-circle distance between pairs of latitude and longitude coordinates
  • Matplotlib (via pyplot) for creating graphs

The whole point of the program is to take a user’s location, food product, and the product’s country of origin to generate the estimated distance the food traveled, the approximate amount of carbon dioxide that travel generated, and the water requirements for the product.
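
As a rough sketch of that core calculation, here’s how the great-circle distance between a user’s city and a country-of-origin centroid can be computed with geopy. The coordinates below are placeholders for illustration, not values pulled from the FoodPact data files:

```python
from geopy.distance import great_circle

# Placeholder coordinates (latitude, longitude); the real application looks these
# up in the SimpleMaps city file and the Harvard country-centroid file.
user_city = (43.0481, -76.1474)       # Syracuse, NY
origin_centroid = (-9.19, -75.0152)   # approximate centroid of Peru

# geopy's great_circle returns a distance object with unit attributes like .miles.
distance_miles = great_circle(user_city, origin_centroid).miles
print(f"Estimated travel distance: {distance_miles:,.0f} miles")
```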

Conversions include: cubic meters of water to gallons, metric tons of crops to pounds, and grams of carbon dioxide per kilometer to pounds per mile.
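
Here’s a minimal sketch of those unit conversions. The constants are standard conversion factors, and the function names are for illustration only, not the actual FoodPact code:

```python
# Standard unit-conversion factors.
GALLONS_PER_CUBIC_METER = 264.172   # 1 cubic meter of water is about 264.172 US gallons
POUNDS_PER_METRIC_TON = 2204.62     # 1 metric ton is about 2,204.62 pounds
GRAMS_PER_POUND = 453.592
KM_PER_MILE = 1.60934

def cubic_meters_to_gallons(volume_m3):
    """Convert a water volume in cubic meters to US gallons."""
    return volume_m3 * GALLONS_PER_CUBIC_METER

def metric_tons_to_pounds(mass_tons):
    """Convert a crop mass in metric tons to pounds."""
    return mass_tons * POUNDS_PER_METRIC_TON

def g_per_km_to_lb_per_mile(rate_g_per_km):
    """Convert an emissions rate from grams of CO2 per kilometer to pounds per mile."""
    return rate_g_per_km / GRAMS_PER_POUND * KM_PER_MILE
```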

Selected graphics from FoodPact project:

One limitation of the calculator is that the carbon dioxide values assume travel entirely by ship, train, or truck, not a combination of the three. Emissions refer to the amount generated by shipping a twenty-foot equivalent unit (TEU) container full of the food product across the world. The country of origin is represented by its centroid rather than the exact location of food production. Similarly, the list of cities displays the five most populated cities in each state. The only exception is New York, for which I considered New York City close enough in latitude and longitude to account for Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.

The data referenced in the calculator is meant to give consumers a relative sense of the inputs required to produce and transport food products. Ideally, the calculator will encourage conversations about the food system and inspire people to reduce their personal food waste.

The Role of Open Access

Open access research gives the public the opportunity to learn from and use data freely, but it is not overwhelmingly common. For researchers outside of academia, accessibility barriers can make pulling together useful data difficult.

About two months ago, I began looking for data to create a model of biological inputs and energy requirements in the United States food system. Open data resources such as FAOSTAT, the Economic Research Service, and the Bureau of Transportation Statistics provided helpful figures on land use, food imports, and food transportation. Aside from these resources, much of the information I wanted to reference in building a model came from scientific papers that require journal subscriptions or charge a per-article fee.

Three articles that may have been helpful in my research illustrate the cost of access:

Upon closer investigation, Appetite claims it ‘supports open access’ but, according to publisher Elsevier, charges authors $3,000 to make an article available to everyone. Providing affordable open access options clearly isn’t a priority for these publishers.

There may have been useful data in the articles mentioned above. However, I won’t find out because I’m sticking with open access resources for my food systems project.

Public government databases are great, but specific scientific studies may hold more value for independent researchers. Journals like PLOS ONE lead the way in open access articles for those looking for specific research to complement information from public databases. A 2016 article by Paul Basken in The Chronicle of Higher Education, ‘As an Open-Access Megajournal Cedes Some Ground, a Movement Gathers Steam,’ shows a rise in open access papers, but I had to get the figures through Boston College because the article itself is ‘premium content for subscribers.’

Rise in published open access articles between 2008 and 2015. Data from: Basken, P. 2016. As an open-access megajournal cedes some ground, a movement gathers steam. The Chronicle of Higher Education, 62(19), 5.

Charging fees for access creates an elitist barrier between academia and those who want to learn more about certain topics. I’m not proposing that everyone would take advantage of open access research articles if publishing were cheaper or access fees disappeared. But if more studies were open access, members of the public would have more opportunities to digest scientific studies on their own terms.

There’s immense value in the open-source, collaborative culture of the tech community that I hope spills over into academia. I’m optimistic about a continued increase in open access publications in the science community. For now, I’m looking forward to creating open source projects that take advantage of public data.

Data Analysis and UFO Reports

Data analysis and unidentified flying object (UFO) reports go hand-in-hand. I attended a talk by author Cheryl Costa who analyzes records of UFO sightings and explores their patterns. Cheryl and her wife Linda Miller Costa co-authored a book that compiles UFO reports called UFO Sightings Desk Reference: United States of America 2001-2015.

Records of UFO sightings are considered citizen science because people voluntarily report their experiences. This is similar to wildlife sightings recorded on websites like eBird that help illustrate bird distributions across the world. People report information about UFO sighting events including date, time, and location.

A dark night sky with the moon barely visible and trees below.
Night sky along the roadside outside Wayquecha Biological Field Station in Peru, taken April 2015.

Cheryl spoke about gathering data from two main online databases, MUFON (Mutual UFO Network) and NUFORC (National UFO Reporting Center). NUFORC’s database is public, and reports can be sorted by date, UFO shape, and state. MUFON’s database requires a paid membership to access the majority of its data. This talk was not a session to discuss conspiracy theories, but a chance to look at trends in citizen science reports.

The use of data analysis on UFO reports requires careful consideration of potential bias and reasonable explanations for numbers in question. For example, a high volume of reports in the summer could be because more people are spending time outside and would be more likely to notice something strange in the sky.

This talk showed me that conclusions may be temptingly easy to draw when looking at UFO data as a whole, but speculations should be met with careful criticism. The use of the scientific method when approaching ufology, or the study of UFO sightings, seems key for a field often met with overwhelming skepticism.

I have yet to work with any open-source data on UFO reports, but this talk reminded me of the importance of a methodical approach to data analysis. Data visualization for any field of study starts with asking questions, being mindful of outside factors, and being able to communicate messages within large data sets to any audience.

Reducing Plastic Use

Various pieces of plastic trash debris are strewn alongside seaweed and rocks on a beach.
Assorted plastic trash on the beach at Pelican Cove Park in Rancho Palos Verdes, CA, 2017.

In the spirit of this year’s Earth Day theme (‘End Plastic Pollution’), I researched the fate of plastic. The Environmental Protection Agency (EPA) prepared a report on 2014 municipal waste stream data for the United States. Plastic products were either recycled, burned for energy production, or sent to landfills. I used pandas to look at the data and Matplotlib to create a graph. I included percentages for each fate and compared the categories of total plastics, containers and packaging, durable goods, and nondurable goods.
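
As an illustration of that workflow, here’s a minimal pandas and Matplotlib sketch for a grouped bar chart of plastic fates. The percentages below are placeholder values, not the actual figures from the EPA report:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder percentages for each fate by plastic category (not the real EPA values).
data = pd.DataFrame(
    {
        "Recycled": [10, 15, 7, 6],
        "Burned for Energy": [15, 17, 14, 13],
        "Landfilled": [75, 68, 79, 81],
    },
    index=["Total Plastics", "Containers and Packaging", "Durable Goods", "Nondurable Goods"],
)

# Grouped bar chart comparing the fate of each plastic category.
ax = data.plot(kind="bar", figsize=(9, 5), rot=15)
ax.set_ylabel("Percent of plastic waste")
ax.set_title("Fate of plastics in the 2014 US municipal waste stream (placeholder values)")
plt.tight_layout()
plt.show()
```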

A graph compares different types of plastic products and their fate in the municipal waste stream.
Percentages of total plastics and plastic types that get recycled, burned for energy, or sent to a landfill, according to the EPA.

The EPA data shows a majority of plastic products reported in the waste stream were sent to landfills. Obviously, not all plastic waste actually reaches a recycling facility or landfill. Roadsides, waterways, and beaches are all subject to plastic pollution. Decreasing personal use of plastic products can help reduce the overall production of waste.

Here are some ideas for cutting back on plastic use:

  • Bring reusable shopping bags to every store.
    • Utilize cloth bags for all purchases.
    • Opt for reusable produce bags for fresh fruit and vegetables instead of store-provided plastic ones.
  • Ditch party plasticware.
    • Buy an assortment of silverware from a thrift store for party use.
    • Snag a set of used glassware for drinks instead of buying single-use plastic cups.
  • Use Bee’s Wrap instead of plastic wrap.
    • Bee’s Wrap is beeswax covered cloth for food storage. It works exactly the same as plastic wrap, but it can be used over and over.
  • Choose glassware instead of plastic zip-locked bags for storing food.
    • Glass containers like Pyrex can be used in place of single-use plastic storage bags.
  • Say ‘no’ to plastic straws.
    • Get in the habit of refusing a straw at restaurants when you go out.
    • Bring a reusable straw made out of bamboo, stainless steel, or glass to your favorite drink spot.

To check out the code for the figure I created, see the project repository.

Highlights from Data Science Day 2018

Columbia University hosted Data Science Day 2018 on March 28th at their campus in Manhattan. I traveled to New York to attend the event and learn more about how data science plays a role in health, climate, and finance research. A few of the presentations stood out, including the environmental talks and a keynote address from Diane Greene, the CEO of Google Cloud.

View of Grand Army Plaza
Grand Army Plaza in Manhattan, New York, 2013

I was extremely excited when I first saw the program for Data Science Day because I noticed a series of lightning talks on climate change. The session entitled ‘Climate + Finance: Use of Environmental Data to Measure and Anticipate Financial Risk’ brought together Columbia staff who specialize in economics, climate research, and environmental policy.

Geoffrey Heal gave a talk called ‘Rising Waters: The Economic Impact of Sea Level Rise’ that addressed financial models associated with sea level rise projections. Heal presented major cities and associated data for property values, historic flooding, and flood maps to illustrate the overall financial impact of sea level rise. This talk highlighted the importance of interdisciplinary data science work when addressing complex issues like climate change. Collaboration between academic researchers and national groups like NOAA and FEMA provides a platform for data science work that can inform professionals across career fields.

Lisa Goddard spoke about ‘Data & Finance in the Developing World’. The main topics of her talk were food security and drought impacts in developing countries. Goddard’s research included rain gauge measurements, satellite imagery, soil moisture levels, and crop yield records. She addressed the use of various climate data to advise appropriate resilience tactics, such as crop insurance for financial security. Overall, dealing with food security will be essential when handling the impacts of climate change on small scale farms across the world. Data science can help the agricultural sector by providing farmers with more information to consider when planning for effects of climate change.

Wolfram Schlenker gave a talk called ‘Agricultural Yields and Prices in a Warming World’. He addressed the impact of weather shocks to common crops, such as unanticipated exposure to hot temperatures. Corn, a tropical plant, can potentially see higher yields when there are sudden, extreme instances of warm weather. Schlenker presented a fresh perspective on how climate change can impact crop yields differently according to species. A combination of climate models, market conditions, and yield data can provide a foundation for better understanding climate change’s impacts on agricultural commodities on a case-by-case basis.

Diane Greene’s keynote session for Data Science Day 2018 raised important considerations for navigating the world of data science. Greene mentioned that Google Cloud’s main goal is to deliver intuitive technological capabilities. Google Cloud offers a wide range of APIs that make the flow of information across the world easier. For example, Google Cloud’s Translation API makes it possible for online articles to be translated into different languages and reach a wider audience. Diane Greene’s talk inspired me to be creative with innovation in data science and to consider usability and collaboration on all fronts.

This event was a great opportunity to learn from leaders in the field of data science. Communication and collaboration were major themes of these talks and I left Data Science Day 2018 feeling empowered to address challenges like climate change.

Creating a Data Science Resume for Career Switchers

As a career switcher, I had no idea where to begin in my efforts to create a data science resume or how to incorporate my background in ecological field work. I’ve only ever needed a resume for one specific field for my career thus far. Thankfully, Kaggle hosted a virtual CareerCon last week and it helped me develop new strategies for tweaking my work experience to target data science.

William Chen, a Data Science Manager at Quora, led a session called ‘How to Build a Compelling Data Science Portfolio & Resume’ that included tips for formatting a data science resume. William spoke directly from his experience reviewing data science portfolios. Major advice from his talk included:

  • Keep it concise. A one page resume with simple readability is recommended.
  • Include relevant coursework and order it accordingly, from most to least relevant.
  • Mention your technical skills, especially those included in the posting for the desired position.
  • Highlight projects and include results and references, like web links.
  • Avoid including impersonal projects such as homework assignments.
  • Tailor your experience toward the job and include relevant capstone projects and independent research if you don’t have direct data science work experience to mention.

Below, I’ve included some of the changes I made to my resume to take my existing project experience in data analysis and tweak it to fit a data science resume.

First, here’s an overview visual comparing a recent version of my resume, tailored for a job in land management, with my edited resume for data science. The text quality isn’t amazing, but the point is to show the increased readability and concise, relevant content.

Comparison of environmental science resume (left) and newly edited data science resume (right).

William Chen’s advice led me to get to the point about why I would be a good candidate for an opportunity in data science. This meant I had to get my message across quickly. Previously, my resume was a wall of text divided into education, work experience, and relevant community service. That format is dense and confusing, and it would be a poor choice to send to a hiring manager in response to a data science posting.

I broke down my data science resume into the categories of experience, education, projects, skills, and relevant coursework. In experience, I highlighted potentially relevant duties such as data collection, analysis, and visualization that show my personal connection to data science. Next, I cut down the text in my education section from my previous resume to reveal only my school, its location, my degree earned, and my enrollment dates. The projects section includes three research projects I worked on in my undergraduate career that involved data collection, analysis, and synthesis. Lastly, I included a section for skills and a section for relevant coursework.

No matter what your academic or work background, you can find ways to make a data science resume. William Chen’s advice brought me to the realization that I had relevant technical skills and project experience in environmental science that I could translate into a purposeful foundation for a job in data science. When you think about your qualifications outside the context of a specific career field, creating a data science resume becomes a simple task.

How to Choose an Online Data Science Course

Multiple factors can play a role in your decision process when selecting an online data science course. It is important to remember that no two educational resources are exactly the same. I recommend carefully considering your needs and learning goals, and trying multiple websites before making a decision.

Here’s a quick overview of the major components of three educational resources I have been using to learn data science, based on my experiences with Codecademy, DataCamp, and Udacity. There are plenty of other educational websites to choose from, including Coursera and Udemy.

| | Codecademy | DataCamp | Udacity |
| --- | --- | --- | --- |
| Languages for Data Science | Python and SQL | Python, R, and SQL | Python, R, and SQL |
| Format | Interactive lessons and exercises | Interactive lessons, exercises, and videos | Videos and exercises |
| Unique Features | No videos | Available via mobile app | Videos feature industry professionals |
| Helpful Resources | Hints, ‘Get Help’ live chat for Pro users, and a community forum | Hints, a show-answer option, and a community forum | Community forum |
| Free Content | Free courses | Free courses and access to the first section of all premium lessons | Free courses |
| Premium Program Costs | Codecademy Pro: $15.99-$19.99 per month | DataCamp membership: $25-$29 per month | Data Analyst Nanodegree: $200 per month |
| Features of Premium Courses | Quizzes, advisers, and projects | Projects and course completion certificates | Project review and career services |

Pick a Language

Two of the most popular languages for data science are Python and R. Another language called SQL (Structured Query Language) is also helpful to know because you can use it to work with specific data in a database. Python and R are both widely used, so I recommend trying out each language if you’re aiming to focus on just one. Depending on your preferences, the offerings of Codecademy, DataCamp, and Udacity may play a role in your decision. Codecademy offers Python and SQL. DataCamp has lessons in Python, SQL, and R, with career tracks for data scientists with Python and R. Udacity has a selection of courses that cater to all three languages. At the end of the day, choosing a language depends on how you seek to use your data science skills.

Learning Style

Test out different websites and make sure you enjoy the format of lessons before committing to one, and especially before paying for a subscription or program. If video lessons play to your strengths, I recommend Udacity. Its course videos are taught by a wide range of data science industry professionals, which offers a unique perspective on how people use data science in specific career areas.

Websites like Codecademy and DataCamp are designed for hands-on, visual learners. Both websites offer a console with instant feedback when you run lines of code. Codecademy, unlike DataCamp and Udacity, does not include video lessons in the curriculum. If you prefer reading at your own pace and executing lines of code without trying to absorb a video lecture, Codecademy might be right for you. DataCamp provides video introductions before coding lessons and tasks. Also, DataCamp offers an app for on-the-go coding lessons. However, the preferred format for learning with DataCamp is on the computer.

Helpful Resources 

There are tools in all three websites that help you if you get stuck on a problem. Codecademy and DataCamp offer hints specific to assigned tasks, as well as access to community forums where users can post questions for others to answer. Codecademy also offers live chat assistance for Pro members, where a tutor will review code in real time. DataCamp features an option to show the answer code for an assigned task, if you are still having trouble after reviewing a hint. The format of Udacity does not involve an interactive console, so when your code is incorrect, the best place to find help is on their community forums.

Free Content

Codecademy, DataCamp, and Udacity all offer free courses that can cater to your interests in data science. Free lessons on each website are self-paced and designed to adapt to your schedule and lifestyle.

Premium Programs

Each website offers the option to pay for access to additional content and benefits.

  • Codecademy Pro offers three levels of subscription: one month ($19.99), six months ($17.99 per month), and a year ($15.99 per month). There’s also an option for Pro Intensive courses, such as Intro to Data Analysis, that cost $199 each. Membership benefits include quizzes and projects.
  • DataCamp membership comes as a monthly plan ($29 per month) or a yearly plan ($25 per month). Members gain unlimited access to all programs.
  • Udacity offers a Data Analyst Nanodegree program made up of two three-month terms. Term 1 ($499) and Term 2 ($699) add up to $1,198, or about $200 per month over six months. Benefits of this program include project feedback and exclusive career services.

DataCamp’s membership offers the most flexibility of these three platforms because premium lessons are self-paced. For Codecademy members, the Intro to Data Analysis Pro Intensive has an outlined time frame of four months, though you can work ahead as much as you’d like depending on your schedule. Udacity’s Data Analyst Nanodegree program is made up of two three-month terms, for an estimated six months to complete the program.