A sampler data package called lterdatasampler from the Long Term Ecological Research (LTER) program allows anyone to work with some neat environmental data. Data samples include weights for bison, fiddler crab body size, and meteorological data from a field station. The package homepage gives suggestions for modeling relationships, such as linear relationships and time series analysis. This will go over simple linear regression using R statistical software and a sugar maple dataset from the lterdatasampler package.
The data was collected at Hubbard Brook Experimental Forest in New Hampshire, which I am partial to having collected and analyzed data from there during an undergraduate internship.
Background
The sugar maple data comes from a study and paper from Stephanie Juice and Tim Fahey from Cornell University on the ‘Health of Sugar Maple (Acer saccharum) Seedlings in Response to Calcium Addition (2003-2004), Hubbard Brook LTER’. The data summary page points out the leaf samples were collected in transects from a watershed treated with calcium, and reference watershed sites. The data sample is 359 rows with the following 11 variables: year, watershed, elevation, transect, sample, stem_length, leaf1area, leaf2area, leaf_dry_mass, stem_dry_mass, and corrected_leaf_area.
R Code
The code that follows can be found in an associated GitHub repository here in an R-Markdown file. As mentioned above, the sugar maple data was chosen for simple linear regression due to the linear relationship noted from the data package site.
First, install the lterdatasampler package. If this is your first usage, use the following:
install.packages("remotes")
remotes::install_github("lter/lterdatasampler")
Once the package is installed, call it using the library function and load the car (Companion to Applied Regression) package and caTools (Tools: Moving Window Statistics, GIF, Base64, ROC AUC, etc) package for later use.
library(lterdatasampler)
library(car)
library(caTools)
Call hbr_maples to preview the data.
hbr_maples
For background to the study, I plotted the corrected leaf area (in centimeters squared) between the reference and calcium treated Watershed 1. This shows the overall differences in samples collected between each area. This simple linear regression analysis will look at the overall measurements without respect to watershed, but this boxplot can help give an idea of the spread of data.
plot(hbr_maples$watershed, hbr_maples$corrected_leaf_area,
xlab='Watershed',
ylab='Corrected Leaf Area (cm^2)',
main='Sugar Maple Leaf Area for Watershed Samples')
Next, I created scatterplots between the corrected leaf area and stem length, stem dry mass, and leaf dry mass. These plots are meant to show a preliminary relationship between these variables to confirm a linear relationship appears to be proper to explore further.
par(mfrow=c(1,3))
plot(x=hbr_maples$corrected_leaf_area,
y=hbr_maples$stem_length,
xlab='Leaf Area (cm^2)',
ylab='Stem Length (mm)')
plot(x=hbr_maples$corrected_leaf_area,
y=hbr_maples$stem_dry_mass,
xlab='Leaf Area (cm^2)',
ylab='Stem Dry Mass (g)')
plot(x=hbr_maples$corrected_leaf_area,
y=hbr_maples$leaf_dry_mass,
xlab='Leaf Area (cm^2)',
ylab='Leaf Dry Mass (g)')
title("Scatterplots of Stem Length, Stem Dry Mass, and Leaf Dry Mass", line = -1, outer = TRUE)
I decided to work with the leaf dry mass as the predicting variable and corrected leaf area as the response for this regression. As a precaution, I check to see if there are null or NA values in the variable I plan on using for the response variable. Missing values here would not be helpful in training the model so those can be found and removed.
hbr_maples_cleaned <- hbr_maples[!is.na(hbr_maples$corrected_leaf_area),]
The preliminary graphs above seem to show at least some indication of an outlier. That can be checked with another quick plot and summary statistics for the response variable.
plot(hbr_maples_cleaned$leaf_dry_mass)
summary(hbr_maples_cleaned$leaf_dry_mass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01170 0.03540 0.04745 0.05169 0.06105 0.38700
The scatterplot shows one value in particular that can be removed. Find the respective index and then drop it from a newly declared variable.
hbr_maples_cleaned[match(0.387,hbr_maples_cleaned$leaf_dry_mass),]
hbr_maples_cleaned2 <- hbr_maples_cleaned[-53, ]
Split the data into training and test sets for later evaluation using the split function from caTools. With such a small dataset, the type of accuracy we will generate will not be reliable, but I would like to show these steps as good practice for larger datasets. I chose a 70% for training and 30% for test split. This splits as 167 records for the training set and 72 records for the test set.
set.seed(1)
maple_split <- sample.split(hbr_maples_cleaned2$corrected_leaf_area, SplitRatio = 0.7)
train_data <- hbr_maples_cleaned2[maple_split==TRUE,]
test_data <- hbr_maples_cleaned2[maple_split==FALSE,]
Create a simple linear regression model to generate the corrected leaf area based on the leaf dry mass using the lm() function. Preview the model output using summary().
leafarea_model <- lm(corrected_leaf_area ~ leaf_dry_mass, data=train_data)
summary(leafarea_model)
##
## Call:
## lm(formula = corrected_leaf_area ~ leaf_dry_mass, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5812 -1.7899 -0.3127 1.5761 6.9762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3121 0.5963 15.62 <2e-16 ***
## leaf_dry_mass 350.8474 11.0620 31.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 165 degrees of freedom
## Multiple R-squared: 0.8591, Adjusted R-squared: 0.8582
## F-statistic: 1006 on 1 and 165 DF, p-value: < 2.2e-16
Check the confidence interval of the linear regression model using the confint() function. This shows at the default of 95% confidence, the coefficient for leaf dry mass is between about 329 and 373.
confint(leafarea_model)
## 2.5 % 97.5 %
## (Intercept) 8.134648 10.48949
## leaf_dry_mass 329.006035 372.68880
Next, plot the residuals from the model against the fitted values. Declare variables for each and then plot in a scatterplot. The variance among the plotted values appears generally constant, but with some widely spread-out points as the fitted values increase.
leafarea_resids <- residuals(leafarea_model)
leafarea_fitted <- leafarea_model$fitted
plot(leafarea_fitted, leafarea_resids,
main='Residuals vs Fitted Values of Leaf Area Model',
xlab='Fitted Values',
ylab='Residual Values')
lines(lowess(leafarea_fitted, leafarea_resids), col='red')
Plot a histogram of the residuals from the model to check and see if there is normal distribution. The normality assumption for linear regression would suggest that the data is about normal if we see a standard distribution.
par(mfrow=c(1,2))
hist(leafarea_resids,main="Histogram of Residuals",xlab="Residuals")
qqnorm(leafarea_resids)
We can see from the histogram of residuals that the data might benefit from a transformation to give the errors a more normal distribution.
To identify outliers that may influence the model, we can use the Cook’s distance. This shows which data points could potentially be influencing the model and points them out for potential removal. This provied the index of 175 as an outlier that may need to be removed.
cd_leafarea_model <- cooks.distance(leafarea_model)
leafarea_model_abovethreshold <- as.numeric(names(cd_leafarea_model)[(cd_leafarea_model > 1)])
print(leafarea_model_abovethreshold)
## [1] 175
Then, we can locate the row that held the outlier and remove it. Overall, the handling of outliers depends on the goals for regression analysis and sometimes they are better to keep in.
train_data_cleaned <- train_data[-leafarea_model_abovethreshold, ]
Create a new linear regression model in the data where the outlier has been removed. We can see from the summary output that the model is the same as above, so removing the outlier did not have an effect.
leafarea_model2 <- lm(corrected_leaf_area ~ leaf_dry_mass, data=train_data_cleaned)
summary(leafarea_model2)
##
## Call:
## lm(formula = corrected_leaf_area ~ leaf_dry_mass, data = train_data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5812 -1.7899 -0.3127 1.5761 6.9762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3121 0.5963 15.62 <2e-16 ***
## leaf_dry_mass 350.8474 11.0620 31.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.061 on 165 degrees of freedom
## Multiple R-squared: 0.8591, Adjusted R-squared: 0.8582
## F-statistic: 1006 on 1 and 165 DF, p-value: < 2.2e-16
Check to see if the model would benefit from a Box-Cox transformation to improve the model’s fit. This can help better meet the assumptions of simple linear regression, such as residual distribution. The lambda value given will dictate the transformation action that is suggested. From the output below, the optimal lambda value rounded to the nearest 0.5 would be 1. This means no transformation is suggested.
bc_leafarea <- boxCox(leafarea_model2)
lambda_bc_model2 <- bc_leafarea$x[which(bc_leafarea$y==max(bc_leafarea$y))]
print(lambda_bc_model2)
## [1] 1.070707
The linear regression equation we can construct from the model is:
Corrected Leaf Area (cm2) = 9.3121 + 350.8474*(Leaf Dry Mass (g))
The Multiple R-Squared value from the summary means about 86% of the variance found can be explained by the model.
Last, use the test data to check the model’s performance by using the predict() function and calculate the mean squared prediction error (MSPE). The MSPE is about 9.347519 and that value can be used as a performance comparison if other models are created.
pred_test <- predict(leafarea_model, test_data)
mse.model <- mean((pred_test-test_data$corrected_leaf_area)^2)
cat("The mean squared prediction error is",mse.model,"\n")
## The mean squared prediction error is 9.347519
Overview
This is an simple example of linear regression, and more practice can be gained with lm() when looking at other variables in the data. Given the linear relationship, multiple linear regression models can be good to explore as well after working through standard variable selection processes. There is plenty more to discover in the other lterdatapackage samples.
Source
Juice, S. and T. Fahey. 2019. Health and mycorrhizal colonization response of sugar maple (Acer saccharum) seedlings to calcium addition in Watershed 1 at the Hubbard Brook Experimental Forest ver 3. Environmental Data Initiative. https://doi.org/10.6073/pasta/0ade53ede9a916a36962799b2407097e