INFO 2950 Final Project

Exploratory data analysis

library(tidyverse)
library(skimr)
library(rvest)
library(httr)

Research question(s)

  1. How have different sectors of the economy (e.g., agriculture, manufacturing, services) contributed to GDP and population growth among the 2020 UN Security Council members from 2000 to 2020?
  2. How do population growth, population density, and GDP per capita (in USD) affect inward and outward foreign direct investment for the UN Security Council members from 2000 to 2020?
  3. What are the key factors that explain differences in GDP per capita across countries, and how do these factors vary by region? What roles do education, natural resources, infrastructure, and political institutions play in shaping GDP per capita, and how do these factors compare across regions such as Asia, Europe, and Africa?
  4. How do different policies and economic systems (such as socialism, capitalism, and mixed economies) affect GDP growth and development in different countries, and what are the trade-offs associated with these different approaches?
  5. How do a country’s resources affect its population, if at all? We will assess the potential correlation between GDP and population measures (population size, death rates, birth rates).

Data collection and cleaning

# https://databank.worldbank.org/source/world-development-indicators

world_bank_files <- list.files(
  path = "world_bank_data",
  pattern = "\\.csv$",  # list.files() expects a regular expression, not a glob
  full.names = TRUE
)

world_bank_data <- read_csv(file = world_bank_files)
Rows: 6370 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (25): Country Name, Country Code, Series Name, Series Code, 2000 [YR2000...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
world_bank_data
# A tibble: 6,370 × 25
   `Country Name` `Country Code` `Series Name`     `Series Code` `2000 [YR2000]`
   <chr>          <chr>          <chr>             <chr>         <chr>          
 1 Angola         AGO            Maternal mortali… SH.STA.MMRT   827            
 2 Angola         AGO            Mortality rate, … SP.DYN.AMRT.… 360.01         
 3 Angola         AGO            Mortality rate, … SP.DYN.AMRT.… 469.15         
 4 Angola         AGO            Mortality rate, … SP.DYN.IMRT.… 121.9          
 5 Angola         AGO            Mortality rate, … SP.DYN.IMRT.… 111.7          
 6 Angola         AGO            Mortality rate, … SP.DYN.IMRT.… 131.6          
 7 Angola         AGO            Suicide mortalit… SH.STA.SUIC.… 8.7            
 8 Angola         AGO            Suicide mortalit… SH.STA.SUIC.… 3.3            
 9 Angola         AGO            Suicide mortalit… SH.STA.SUIC.… 14.2           
10 Angola         AGO            Military expendi… MS.MIL.XPND.… 6.392603154274…
# ℹ 6,360 more rows
# ℹ 20 more variables: `2001 [YR2001]` <chr>, `2002 [YR2002]` <chr>,
#   `2003 [YR2003]` <chr>, `2004 [YR2004]` <chr>, `2005 [YR2005]` <chr>,
#   `2006 [YR2006]` <chr>, `2007 [YR2007]` <chr>, `2008 [YR2008]` <chr>,
#   `2009 [YR2009]` <chr>, `2010 [YR2010]` <chr>, `2011 [YR2011]` <chr>,
#   `2012 [YR2012]` <chr>, `2013 [YR2013]` <chr>, `2014 [YR2014]` <chr>,
#   `2015 [YR2015]` <chr>, `2016 [YR2016]` <chr>, `2017 [YR2017]` <chr>, …

Data description

The dataset used for this project was pre-curated and provided as a CSV file. It contains 6370 observations and 25 attributes, selected based on the project’s research questions and overarching theme. To keep the data relevant to our focus on the economy, we narrowed our scope to the UN Security Council members during 2022 and excluded indicators that were not relevant to our questions, such as gender, primary education level, and access to electricity. Selecting the most pertinent economic indicators required much discussion. We ultimately chose to focus on key indicators of economic growth: GDP, investment, goods and services, primary economic sectors, imports/exports, population growth, and CO2 emissions. Each observation is associated with a country, and no personal information was involved in the data collection process.

Data limitations

There are a few potential issues with our dataset that should be noted. First, some of the observations are missing (labeled as N/A). This could pose a challenge when plotting graphs or estimating specific values such as GDP, population growth, and imported goods and services. Second, while most of the monetary observations are reported in USD, the dataset does not always specify which currency was used. Because it is unclear how to handle non-USD currencies, the project is limited to working with USD data. These limitations should be kept in mind when analyzing and interpreting the data.
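The missing-value issue can be quantified before any plotting. The following is a minimal sketch on a made-up stand-in for `world_bank_data`, not our actual cleaning code. It assumes that missing entries arrive as the text marker `".."` inside the character-typed year columns; the marker, the toy values, and the names `toy`, `toy_numeric`, and `missing_by_year` are all assumptions for illustration and should be checked against the real files.

```r
library(tidyverse)

# Made-up stand-in for world_bank_data (values are illustrative only).
# Assumption: missing entries are stored as the string "..".
toy <- tibble(
  `Country Name`  = c("Angola", "Angola", "Brazil"),
  `Series Name`   = c("Indicator A", "Indicator B", "Indicator A"),
  `2000 [YR2000]` = c("100", "..", "250"),
  `2001 [YR2001]` = c("..", "3.3", "..")
)

# Turn the assumed missing marker into NA, then convert the year columns
# from character to numeric.
toy_numeric <- toy %>%
  mutate(across(contains("[YR"), function(x) as.numeric(na_if(x, ".."))))

# Count the missing values in each year column.
missing_by_year <- toy_numeric %>%
  summarize(across(where(is.numeric), function(x) sum(is.na(x))))

missing_by_year
```

Running the same two steps on the full table would show which years and indicators are too sparse to plot reliably.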

Exploratory data analysis

We are exploring observations such as birth rate, death rate, GDP, rates of education, and urbanization for 265 countries and territories to see how these factors change over a two-decade span (from 2000 to 2020). There are 234 rows of data in our table, corresponding to 234 different factors/observations we can analyze, each associated with a particular country and year. Initially the different observations were in rows, and each country had at least some data for each observation, so there were 234 rows for each of the 265 countries, which is neither practical nor clean. The tidied data is better suited to analysis and gives a cleaner, more concise data table.
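One way to arrive at a tidied layout like the one described above is to pivot the year columns into rows and the indicators into columns. This is a hedged sketch on a made-up two-indicator table, not necessarily the exact steps we used. The indicator codes `IND.A` and `IND.B`, the `".."` missing marker, and the names `toy` and `tidy` are assumptions for illustration.

```r
library(tidyverse)

# Made-up stand-in with the same column layout as world_bank_data.
toy <- tibble(
  `Country Name`  = c("Angola", "Angola"),
  `Country Code`  = c("AGO", "AGO"),
  `Series Name`   = c("Indicator A", "Indicator B"),
  `Series Code`   = c("IND.A", "IND.B"),
  `2000 [YR2000]` = c("1", "2"),
  `2001 [YR2001]` = c("3", "..")
)

tidy <- toy %>%
  # One row per country, indicator, and year.
  pivot_longer(
    cols = contains("[YR"),
    names_to = "year",
    values_to = "value"
  ) %>%
  mutate(
    # Keep the leading four-digit year from names like "2000 [YR2000]".
    year = as.integer(str_sub(year, 1, 4)),
    # Assumption: ".." marks a missing value.
    value = as.numeric(na_if(value, ".."))
  ) %>%
  # Drop the long label so each indicator code becomes exactly one column.
  select(-`Series Name`) %>%
  # One row per country and year, one column per indicator.
  pivot_wider(names_from = `Series Code`, values_from = value)

tidy
```

In this layout each row is a country-year pair, so indicators can be plotted against each other or over time without further reshaping.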

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.