Project proposal

Author

Gold Emu

Description of the dataset

The dataset we have chosen for our project is derived from Project FeederWatch, a comprehensive initiative that compiles observational data on bird visits to various locations across North America, including nature centers, backyards, community areas, and other sites conducive to birdwatching. This dataset is rich in details, providing not only the species of birds observed but also a variety of variables that describe the environmental and contextual conditions under which these observations were made. These variables offer insights into the specific circumstances and settings that attract different bird species, such as weather conditions, time of year, and the types of feed available. By analyzing this dataset, we aim to uncover patterns and trends in bird behavior and preferences, enhancing our understanding of avian life in North America.



The observation data set has 100,000 rows and 22 columns, while the site 
data set has 49,999 rows and 62 columns. In addition, 361 unique species 
have been observed. In observation_data, there are a total of 370,672 
sightings.

In observation_data, the 22 columns hold the key data points of the project. 
Namely, subnational1_code gives the country abbreviation together with the 
state or province abbreviation of each survey site, while the columns month, 
day, and year record the date of the observation. Other important columns 
include, but are not limited to, species_code and how_many: the former is the 
code of the bird species observed, and the latter is how many individuals of 
that species were observed in a given row. Lastly, snow_dep_atleast describes 
the participant's estimate of the minimum snow depth during an observation, in cm.
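
The summary figures quoted above (dimensions, number of unique species, total sightings) could be reproduced along the following lines. This is a minimal sketch in base R on an invented toy data frame, not the real data. Treating the total number of sightings as the sum of how_many is our assumption.

```r
# Toy stand-in for observation_data; all values are invented for illustration.
observation_data <- data.frame(
  species_code     = c("amegfi", "norcar", "amegfi", "dowwoo"),
  how_many         = c(3, 1, 2, 1),
  snow_dep_atleast = c(0, 5, 15, 0)
)

# Number of rows and columns
data_dims <- dim(observation_data)

# Number of distinct species codes
n_species <- length(unique(observation_data$species_code))

# Total birds counted (assumption: sightings = sum of how_many)
n_sightings <- sum(observation_data$how_many)

print(data_dims)
print(n_species)
print(n_sightings)
```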

Description of the variables (dimensions):

observation_data and site_data show which bird species visit feeders at thousands of locations across the continent every winter. The data also indicate how many individuals of each species are seen. This information can be used to measure changes in the winter ranges and abundances of bird species over time.
- Observation data refer to the specific instances when birds are seen and recorded at a feeding station. They are the primary data points collected by participants and provide insights into bird presence, abundance, and behavior at specific times and locations.
- Site description data, on the other hand, describe the physical characteristics and context of the location where the bird feeding station is set up. They are essential for understanding the context within which observations are made, and they help researchers and participants correlate bird behaviors and presence with specific environmental conditions or site characteristics.
- Observation data provide the “what” (what birds are seen, when, and in what numbers), while site description data provide the “where” and “how” (the conditions and characteristics of the location where observations are made).
species_dictionary maps the species_code column in observation_data and site_data to actual bird species.
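
A minimal sketch of how this mapping could be applied, using base R merge() on invented toy data. The column name common_name is hypothetical. The real species_dictionary may name its columns differently.

```r
# Toy stand-ins; values and the common_name column are assumptions for illustration.
observation_data <- data.frame(
  species_code = c("amegfi", "norcar", "amegfi"),
  how_many     = c(3, 1, 2)
)
species_dictionary <- data.frame(
  species_code = c("amegfi", "norcar"),
  common_name  = c("American Goldfinch", "Northern Cardinal")
)

# Left join: keep every observation and attach the readable species name.
labelled <- merge(
  observation_data, species_dictionary,
  by = "species_code", all.x = TRUE
)

print(labelled)
```

Using all.x = TRUE keeps observations whose code is missing from the dictionary, so unmatched codes show up as NA rather than silently disappearing.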

Provenance:

Our data-set comes from Project FeederWatch, which is operated by the Cornell Lab of Ornithology and Birds Canada. Since 2016, Project FeederWatch has been sponsored by Wild Birds Unlimited.
It was collected by the Cornell Lab of Ornithology and Birds Canada.
The transformations the data-set has undergone are:
- Version 1.1. Corrected ‘Data Level’ field for CREATION_DT and LAST_EDITED_DT to read “site” instead of “season”
- Version 1.2. Added information on taxonomic updates implemented in August 2021.
- Version 1.3. Added supp_food column to the site description fields. July 2022.

The reason why we chose this dataset

Our selection of this data set stems from a shared fascination with birds, coupled with a recognition of our limited understanding of their behavioral patterns. This includes, but is not limited to, the time of the year in which different species forage or migrate and their feeding preferences. We aim to deepen our knowledge of these aspects by employing data visualization techniques introduced in the course, focusing particularly on understanding the visiting timeline and feeding habits of various bird species.

The two questions we want to answer

Question 1

For each bird species, are birds more likely to be observed when there’s a lot of snow or when there’s less snow, or does snow depth not matter? Variables from observation_data: species_code (categorical), snow_dep_atleast (continuous).

species_code: Bird species observed, stored as 6-letter species codes
snow_dep_atleast: Participant estimate of minimum snow depth during a checklist

Question 2

Do certain species of birds prefer different types of feeders? Variables: species_code (categorical) from observation_data, and the feeder type columns.

Knowing both the months of the year and the types of feeder at which certain species are likely to be observed, we will know which feeder type to put out in which months to attract certain species of interest, or more birds in general.

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Question 1

To answer whether birds, by species, are more likely to be observed when there’s a lot of snow, when there’s less snow, or whether it doesn’t matter:
1. Visualization
  1. Select species_code (categorical), snow_dep_atleast (continuous)
  2. Turn snow_dep_atleast from numerical to categorical
  3. Group by species_code and snow depth category, use how_many and n() to compute the number of observations, store the result in a variable, and plot it with species on the x-axis and the counts, split by snow_dep_atleast category, on the y-axis
  4. Use species_dictionary data-set to get names of birds by species_code, and use it to label legends.
2. Logistic Regression (modeling the likelihood of observing each species as a function of snow depth)
  1. Turn snow_dep_atleast into a categorical variable (low, medium, high)
  2. Use glm() to fit a logistic regression model of the likelihood of bird observations on the snow depth categories, and inspect the estimated effects using summary()
  3. Although the regression model itself produces no visualization, it indicates which specific parts of our data-set we want to visualize and communicate.
3. Chi-Square Test of Independence:
  1. Test if the frequency of bird observations is independent of the snow depth categories
  2. Turn snow_dep_atleast into a categorical variable (low, medium, high)
  3. Make a contingency table by species and snow_dep_atleast
  4. Use chisq.test() to test whether the bird species distribution is independent of snow_dep_atleast
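
As a rough sketch of the steps above, the binning, contingency table, chi-square test, and logistic regression could look as follows in base R. All data here are invented toy values. The cut points for low, medium, and high snow depth and the binary outcome seen (whether a species of interest appeared on a checklist) are our assumptions for illustration, not definitions taken from the data set.

```r
# Toy observations; all values are invented for illustration.
obs <- data.frame(
  species_code     = rep(c("amegfi", "norcar"), each = 6),
  snow_dep_atleast = rep(c(0, 0, 5, 5, 15, 15), times = 2),
  how_many         = c(8, 6, 5, 4, 2, 1, 3, 2, 4, 5, 7, 9)
)

# Bin snow depth into three categories (cut points are an assumption).
obs$snow_cat <- cut(
  obs$snow_dep_atleast,
  breaks = c(-Inf, 0, 5, Inf),
  labels = c("low", "medium", "high")
)

# Contingency table: birds counted per species and snow category.
snow_table <- xtabs(how_many ~ species_code + snow_cat, data = obs)
print(snow_table)

# Chi-square test of independence between species and snow category.
chi <- chisq.test(snow_table)
print(chi)

# Logistic regression on hypothetical checklist-level data:
# seen = 1 if the species of interest was recorded on the checklist.
checklists <- data.frame(
  snow_cat = factor(
    rep(c("low", "medium", "high"), each = 10),
    levels = c("low", "medium", "high")
  ),
  seen = c(
    rep(c(1, 0), c(8, 2)),  # low snow: seen on 8 of 10 checklists
    rep(c(1, 0), c(5, 5)),  # medium snow: seen on 5 of 10
    rep(c(1, 0), c(2, 8))   # high snow: seen on 2 of 10
  )
)
fit <- glm(seen ~ snow_cat, data = checklists, family = binomial)
print(summary(fit))
```

Note that summing how_many treats every counted bird as an independent observation, which is a simplification. Counting checklists with n() instead would be a more conservative choice.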

Question 2

To answer: which species of birds does each type of feeder attract, in general?
1. Data Transformation
  1. Group by feeder type and species
  2. Calculate the count of each species observed at each feeder type
  3. Store in a data frame where feeder types are on one axis, species are on another axis, and the cells represent the counts of each species attracted by each feeder type.
  4. Optionally, retain only the top visiting species based on the counts
2. Visualization
  1. Create a grouped bar chart: x-axis = feeder types, y-axis = count of sightings (frequency), fill = species
  2. If the top species differ markedly across feeder types: continue with the grouped bar chart as the visualization
  3. Alternatively, if the same species are the top visitors for all feeder types but with different frequencies: create a heat map where x-axis = feeder types, y-axis = species, color intensity = frequencies. We plan to switch to a heat map if the top 30 observations are the same.
3. Label species in the legend using the species_dictionary data-set
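
The transformation and the choice between the two chart types could be sketched as below, in base R on invented toy data. The long-format feeder_type column is hypothetical (the real feeder information may need reshaping first), and top_n is set to 2 only to keep the example small.

```r
# Toy sightings in long format; the feeder_type column and all values are assumptions.
sightings <- data.frame(
  feeder_type  = rep(c("suet", "tube", "platform"), each = 3),
  species_code = rep(c("dowwoo", "amegfi", "norcar"), times = 3),
  how_many     = c(9, 1, 2, 1, 8, 3, 2, 4, 6)
)

# Species-by-feeder count table (rows = species, columns = feeder types).
counts <- xtabs(how_many ~ species_code + feeder_type, data = sightings)

# Keep only the top visiting species overall.
top_n <- 2
top_species <- names(sort(rowSums(counts), decreasing = TRUE))[seq_len(top_n)]
top_counts <- counts[top_species, , drop = FALSE]
print(top_counts)

# Top species per feeder type, sorted so the sets can be compared.
top_sets <- lapply(colnames(counts), function(ft) {
  sort(names(sort(counts[, ft], decreasing = TRUE))[seq_len(top_n)])
})

# Same top species everywhere -> heat map; otherwise -> grouped bar chart.
same_top_everywhere <- length(unique(top_sets)) == 1
chart_type <- if (same_top_everywhere) "heat map" else "grouped bar chart"
print(chart_type)
```

The count table has the layout described in the data transformation step, so it can feed either chart without further reshaping beyond converting it back to a data frame.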