Description of the dataset
The dataset we have chosen for our project is derived from Project FeederWatch, a comprehensive initiative that compiles observational data on bird visits to various locations across North America, including nature centers, backyards, community areas, and other sites conducive to birdwatching. This dataset is rich in details, providing not only the species of birds observed but also a variety of variables that describe the environmental and contextual conditions under which these observations were made. These variables offer insights into the specific circumstances and settings that attract different bird species, such as weather conditions, time of year, and the types of feed available. By analyzing this dataset, we aim to uncover patterns and trends in bird behavior and preferences, enhancing our understanding of avian life in North America.
The observation dataset contains 100,000 rows and 22 columns, while the site dataset contains 49,999 rows and 62 columns. In addition, 361 unique species have been observed. In observation_data, there are a total of 370,672 sightings.
In observation_data, the 22 columns hold the key data points about each observation. Namely, subnational1_code gives the country abbreviation together with the state or province abbreviation of each survey site, while the columns month, day, and year give the date on which the observation was recorded. Other important columns include, but are not limited to, species_code and how_many: the former is the code of the bird species observed, and the latter is the number of individuals of that species observed in a given row. Lastly, snow_dep_atleast describes the snow depth during an observation, in cm.
Description of the variables (dimensions):
observation_data and site_data show which bird species visit feeders at thousands of locations across the continent every winter. The data also indicate how many individuals of each species are seen. This information can be used to measure changes in the winter ranges and abundances of bird species over time.
- Observation data refer to the specific instances when birds are seen and recorded at a feeding station. They are the primary data points collected by participants and provide insights into bird presence, abundance, and behavior at specific times and locations.
- Site description data, on the other hand, describe the physical characteristics and context of the location where the bird feeding station is set up, and are essential for understanding the context within which observations are made. They help researchers and participants correlate bird behaviors and presence with specific environmental conditions or site characteristics.
- Observation data provide the “what” (what birds are seen, when, and in what numbers), while site description data provide the “where” and “how” (the conditions and characteristics of the location where observations are made).
species_dictionary maps the species_code column in observation_data and site_data to actual bird species.
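As a rough illustration of this lookup, the sketch below joins a toy observation table to a toy dictionary with base R's merge(). The column name common_name and all rows are assumptions made for the example; the real species_dictionary may use different column names.

```r
# Toy stand-ins for observation_data and species_dictionary (made-up rows).
observation_data <- data.frame(
  species_code = c("norcar", "amegfi", "norcar"),
  how_many     = c(2, 5, 1)
)
species_dictionary <- data.frame(
  species_code = c("amegfi", "norcar"),
  common_name  = c("American Goldfinch", "Northern Cardinal")  # hypothetical column
)

# Left join: keep every observation and attach the readable name where one exists.
labeled <- merge(observation_data, species_dictionary,
                 by = "species_code", all.x = TRUE)
print(labeled)
```

Using all.x = TRUE keeps observations whose code is missing from the dictionary, so unmatched codes show up as NA instead of silently disappearing.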
Provenance:
- Our dataset comes from Project FeederWatch, which is operated by the Cornell Lab of Ornithology and Birds Canada. Since 2016, Project FeederWatch has been sponsored by Wild Birds Unlimited.
- It was collected by the Cornell Lab of Ornithology and Birds Canada.
- The transformations the dataset has undergone are:
  - Version 1.1. Corrected ‘Data Level’ field for CREATION_DT and LAST_EDITED_DT to read “site” instead of “season”
  - Version 1.2. Added information on taxonomic updates implemented in August 2021.
  - Version 1.3. Added supp_food column to the site description fields. July 2022.
The reason why we chose this dataset
We chose this dataset because we share a fascination with birds but know little about their behavioral patterns, such as the time of year at which different species forage or migrate and their feeding preferences. We aim to deepen our knowledge of these aspects by applying the data visualization techniques introduced in the course, focusing particularly on the visiting timeline and feeding habits of various bird species.
The two questions we want to answer
Question 1
Are birds, by species, more likely to be observed when there is a lot of snow, when there is less snow, or does snow depth not matter?
- observation_data: species_code (categorical), snow_dep_atleast (continuous).
- species_code: Bird species observed, stored as 6-letter species codes
- snow_dep_atleast: Participant estimate of minimum snow depth during a checklist
Question 2
Do certain species of birds prefer different types of feeders?
- observation_data: species_code (categorical), feeder type columns.
Knowing both the months of the year and the types of feeder at which certain species are likely to be observed, we will know which feeder type to put out in which months to attract certain species of interest, or more birds in general.
Analysis plan
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
Question 1
- To answer whether birds, by species, are more likely to be observed when there is a lot of snow, when there is less snow, or whether snow depth does not matter:
  - Visualization
    - Select species_code (categorical) and snow_dep_atleast (continuous)
    - Turn snow_dep_atleast from numerical to categorical
    - Group by species_code, use how_many and n() to compute the number of observations, store the result in a variable, and plot it on the y-axis, split by snow_dep_atleast, with species on the x-axis
    - Use the species_dictionary dataset to get the names of birds by species_code, and use them to label the legend.
  - Logistic regression (with each species on the x-axis and snow depth on the y-axis)
    - Turn snow_dep_atleast into a categorical variable (low, medium, high)
    - Use glm() to fit a logistic regression model of the likelihood of bird observations on the snow depth categories, and inspect the fitted coefficients with summary()
    - Although the regression model produces no visualization, it gives valuable insight into which specific parts of our dataset we want to visualize and communicate.
  - Chi-square test of independence
    - Test whether the frequency of bird observations is independent of the snow depth categories
    - Turn snow_dep_atleast into a categorical variable (low, medium, high)
    - Make a contingency table by species and snow_dep_atleast
    - Use chisq.test() to test whether the bird species distribution is independent of snow_dep_atleast
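The snow-depth steps above could be sketched in base R roughly as follows. Everything in the data frames is made up for illustration: the cut points for low, medium, and high are arbitrary assumptions, and the observed column (1 if a species was recorded on a checklist, 0 otherwise) is hypothetical, since it would have to be constructed before a logistic regression can be fit.

```r
# Toy stand-in for observation_data; all values are invented.
obs <- data.frame(
  species_code     = rep(c("amegfi", "daejun", "norcar"), each = 3),
  snow_dep_atleast = rep(c(0, 5, 15), times = 3),
  how_many         = c(30, 20, 10, 10, 25, 40, 20, 20, 20)
)

# Bin snow depth into three categories (cut points are assumptions).
obs$snow_cat <- cut(
  obs$snow_dep_atleast,
  breaks = c(-Inf, 0, 10, Inf),
  labels = c("low", "medium", "high")
)

# Contingency table: total individuals per species and snow category.
snow_table <- xtabs(how_many ~ species_code + snow_cat, data = obs)

# Chi-square test of independence between species and snow category.
chi_result <- chisq.test(snow_table)
print(chi_result)

# Hypothetical checklist-level data for ONE species:
# observed = 1 if the species was recorded on that checklist, 0 otherwise.
checklists <- data.frame(
  snow_cat = factor(rep(c("low", "medium", "high"), each = 10),
                    levels = c("low", "medium", "high")),
  observed = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0,   # low snow
               1, 1, 1, 1, 1, 0, 0, 0, 0, 0,   # medium snow
               1, 1, 1, 1, 1, 1, 1, 1, 0, 0)   # high snow
)

# Logistic regression of presence on snow category ("low" is the baseline).
fit <- glm(observed ~ snow_cat, data = checklists, family = binomial)
print(summary(fit))

# Fitted probability of observing the species in each snow category.
fitted_prob <- predict(
  fit,
  newdata = data.frame(snow_cat = factor(c("low", "medium", "high"),
                                         levels = c("low", "medium", "high"))),
  type = "response"
)
```

Putting how_many on the left-hand side of xtabs() counts individual birds; a table of row counts would count records instead, and the two choices can lead to different test results.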
Question 2
- To answer: which species of birds does each type of feeder attract, in general?
  - Data transformation
    - Group by feeder type and species
    - Calculate the count of each species observed at each feeder type
    - Store in a data frame where feeder types are on one axis, species are on another axis, and the cells represent the counts of each species attracted by each feeder type.
    - Optionally retain only the top visiting species based on the counts
  - Visualization
    - Create a grouped bar chart: x-axis = feeder types, y-axis = count of sightings (frequency), fill = species
    - If species are very different: continue with grouped bar chart as visualization
    - Alternatively, if the same species are the top visitors for all feeder types but with different frequencies: create a heat map where x-axis = feeder types, y-axis = species, color intensity = frequencies. We plan to switch to a heat map if the top 30 observations are the same.
    - Label species in the legend with the species dictionary dataset
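One possible shape for the Question 2 transformation, again in base R. The long-format feeder_type column is an assumption: the real feeder information may be spread over several columns and need reshaping first. All rows and the top_n cutoff are illustrative only.

```r
# Made-up long-format data: one row per (feeder type, species) record.
sightings <- data.frame(
  feeder_type  = c("suet", "suet", "hopper", "hopper", "hopper",
                   "tube", "tube", "tube"),
  species_code = c("dowwoo", "norcar", "norcar", "amegfi", "daejun",
                   "amegfi", "amegfi", "daejun"),
  how_many     = c(2, 1, 3, 4, 1, 6, 2, 2)
)

# Species x feeder-type matrix of summed counts.
counts <- tapply(
  sightings$how_many,
  list(species_code = sightings$species_code,
       feeder_type  = sightings$feeder_type),
  sum
)
counts[is.na(counts)] <- 0  # combinations never seen count as zero

# Keep only the top-visiting species overall (top_n is an arbitrary choice).
top_n <- 2
totals <- sort(rowSums(counts), decreasing = TRUE)
top_species <- names(totals)[seq_len(top_n)]
top_counts <- counts[top_species, , drop = FALSE]

# Long format with one row per cell, usable for a grouped bar chart or heat map.
plot_data <- as.data.frame(as.table(top_counts), responseName = "count")
print(plot_data)
```

Keeping the counts as a matrix first makes it easy to compare the top species across feeder types before deciding between the grouped bar chart and the heat map.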