library(tidyverse)
Project proposal
Dataset
Dataset Description
The Carbon Majors database compiles historical production data from 122 of the world’s largest producers of oil, gas, coal, and cement. It helps measure both direct emissions from operations and emissions resulting from the use of their products. The database covers 75 investor-owned companies, 36 state-owned companies, and 11 national entities, including 82 oil producers, 81 gas producers, 49 coal producers, and 6 cement producers. With records dating back to 1854, it accounts for over 1.42 trillion tonnes of CO₂-equivalent emissions, representing 72% of global fossil fuel and cement emissions since the Industrial Revolution began in 1751. The dataset is built primarily from self-reported production data (e.g., annual reports, SEC filings) but also incorporates third-party sources like the U.S. Energy Information Administration (EIA) and industry journals when necessary.
Dataset Origins and Gathering Process
Production data is standardized (e.g., oil in barrels, coal in tonnes). Emissions are estimated using IPCC default emission factors, adjusted for non-energy uses (e.g., petrochemicals that store carbon). Scope 3 emissions (88% of the total) are calculated from net fossil fuel production (not sales) to avoid double counting. Scope 1 emissions include flaring, venting, fugitive methane, and own fuel use. The data was first compiled in 2013 by Richard Heede of the Climate Accountability Institute (CAI) and is updated annually. It includes investor-owned companies, state-owned entities, and, in certain cases, nation-states (e.g., the Soviet Union before 1991). Gaps in early data are interpolated where possible, but missing historical production is not estimated. The dataset's goal is to attribute emissions directly to producers and track their impact on total global fossil fuel and cement emissions since the Industrial Revolution.
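As a rough illustration of the production-to-emissions step described above, the sketch below back-calculates the factor implied by the first three Oil & NGL rows printed later in this proposal. This is only a consistency check on the published columns, not the CAI methodology itself; the name implied_factor is ours and does not appear in the dataset.

```r
# First three Oil & NGL rows, copied from the head() output in this proposal
sample_rows <- data.frame(
  production_value        = c(0.9125, 1.8250, 7.3000),       # Million bbl/yr
  product_emissions_MtCO2 = c(0.3389277, 0.6778554, 2.7114216)
)

# Implied factor (hypothetical helper column): MtCO2 of product emissions
# per million barrels produced
sample_rows$implied_factor <-
  sample_rows$product_emissions_MtCO2 / sample_rows$production_value

print(round(sample_rows$implied_factor, 4))
```

For these rows the ratio is constant at roughly 0.371 MtCO2 per million barrels, which is consistent with a fixed per-commodity factor being applied to production.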
Why this dataset: This is an urgent, real-world topic. Climate change and corporate emissions remain a foremost issue for humanity, especially with the new administration's withdrawal from the Paris climate accords. This visualization seeks to be a reminder of the dangers presented by climate change, as well as a call to action to hold those most responsible accountable. Furthermore, the level of detail in the data allows for a nuanced comparative analysis. To tackle this topic, we need to ask: To what extent do the major companies emit pollutants? What are the major pollutants that these companies use and emit? Which companies produce the most pollutants? By answering these questions, we can learn where to concentrate our efforts to control pollution levels and abate the climate crisis to some extent.
Dimensions of The Dataset
To answer our questions, we use the following variables contained in our dataset. The dataset has 16 columns: 10 numerical (including year) and 6 categorical (character) columns. The main ones are described below:
Numerical variables:
production_value – The amount of production (e.g., 0.9125).
product_emissions_MtCO2 – Emissions from the use of the products sold, measured in megatonnes of CO₂.
flaring_emissions_MtCO2 – CO₂ emissions due to gas flaring.
venting_emissions_MtCO2 – CO₂ emissions from gas venting.
own_fuel_use_emissions_MtCO2 – CO₂ emissions from fuel used in the producer's own operations.
fugitive_methane_emissions_MtCO2e – Methane leakage emissions, converted to CO₂ equivalent.
fugitive_methane_emissions_MtCH4 – Methane leakage emissions, measured in megatonnes of CH₄.
total_operational_emissions_MtCO2e – The sum of all operational emissions (flaring, venting, own fuel use, and fugitive methane) in CO₂ equivalent.
total_emissions_MtCO2e – The overall emissions; in the sample rows shown below, this equals product emissions plus total operational emissions.
Categorical Dimensions:
production_unit – The unit of production measurement (e.g., “Million bbl/yr”).
source – The reference or data source (e.g., “Abu Dhabi National Oil Company Annual Report 1975, pp. 35-37”).
This dataset primarily quantifies emissions across various categories for fossil fuel and cement production while documenting the production scale and data source.
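The relationships between the emission columns can be checked against the data. The sketch below uses the first row of the head() output printed in this proposal and tests two assumptions of ours, which the dataset documentation would need to confirm: that the operational total is the sum of the four operational components, and that the overall total is product emissions plus the operational total.

```r
# First row of the head() output printed in this proposal
first_row <- data.frame(
  product_emissions_MtCO2            = 0.3389277,
  flaring_emissions_MtCO2            = 0.005404077,
  venting_emissions_MtCO2            = 0.001298972,
  own_fuel_use_emissions_MtCO2       = 0,
  fugitive_methane_emissions_MtCO2e  = 0.01825408,
  total_operational_emissions_MtCO2e = 0.02495713,
  total_emissions_MtCO2e             = 0.3638848
)

# Assumption 1: operational total = flaring + venting + own fuel use + fugitive methane
operational_sum <- with(
  first_row,
  flaring_emissions_MtCO2 + venting_emissions_MtCO2 +
    own_fuel_use_emissions_MtCO2 + fugitive_methane_emissions_MtCO2e
)

# Assumption 2: overall total = product emissions + operational total
total_sum <- with(
  first_row,
  product_emissions_MtCO2 + total_operational_emissions_MtCO2e
)

# Both differences should be within the rounding of the printed values
operational_ok <- abs(operational_sum - first_row$total_operational_emissions_MtCO2e) < 1e-6
total_ok       <- abs(total_sum - first_row$total_emissions_MtCO2e) < 1e-6
print(c(operational_ok = operational_ok, total_ok = total_ok))
```

Running the same comparison over all rows of carbon_df would show whether the pattern holds beyond this one row.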
carbon_df <- read.csv("data/emissions_high_granularity.csv")
# column names
names(carbon_df)
[1] "year" "parent_entity"
[3] "parent_type" "reporting_entity"
[5] "commodity" "production_value"
[7] "production_unit" "product_emissions_MtCO2"
[9] "flaring_emissions_MtCO2" "venting_emissions_MtCO2"
[11] "own_fuel_use_emissions_MtCO2" "fugitive_methane_emissions_MtCO2e"
[13] "fugitive_methane_emissions_MtCH4" "total_operational_emissions_MtCO2e"
[15] "total_emissions_MtCO2e" "source"
# how many rows & columns
cat("Number of rows:", nrow(carbon_df), "\n")
Number of rows: 15797
cat("Number of columns:", ncol(carbon_df), "\n")
Number of columns: 16
# a preview of the first 10 rows of each column
print(head(carbon_df, 10))
year parent_entity parent_type reporting_entity
1 1962 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
2 1963 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
3 1964 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
4 1965 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
5 1966 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
6 1967 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
7 1968 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
8 1969 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
9 1970 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
10 1971 Abu Dhabi National Oil Company State-owned Entity Abu Dhabi
commodity production_value production_unit product_emissions_MtCO2
1 Oil & NGL 0.9125 Million bbl/yr 0.3389277
2 Oil & NGL 1.8250 Million bbl/yr 0.6778554
3 Oil & NGL 7.3000 Million bbl/yr 2.7114216
4 Oil & NGL 10.9500 Million bbl/yr 4.0671324
5 Oil & NGL 13.5050 Million bbl/yr 5.0161300
6 Oil & NGL 14.6000 Million bbl/yr 5.4228432
7 Oil & NGL 18.2500 Million bbl/yr 6.7785540
8 Oil & NGL 22.2650 Million bbl/yr 8.2698359
9 Oil & NGL 25.7325 Million bbl/yr 9.5577611
10 Oil & NGL 34.3100 Million bbl/yr 12.7436815
flaring_emissions_MtCO2 venting_emissions_MtCO2 own_fuel_use_emissions_MtCO2
1 0.005404077 0.001298972 0
2 0.010808155 0.002597944 0
3 0.043232620 0.010391775 0
4 0.064848929 0.015587663 0
5 0.079980346 0.019224785 0
6 0.086465239 0.020783551 0
7 0.108081549 0.025979439 0
8 0.131859490 0.031694915 0
9 0.152394984 0.036631008 0
10 0.203193312 0.048841345 0
fugitive_methane_emissions_MtCO2e fugitive_methane_emissions_MtCH4
1 0.01825408 0.0006519315
2 0.03650816 0.0013038630
3 0.14603266 0.0052154520
4 0.21904898 0.0078231780
5 0.27016041 0.0096485862
6 0.29206531 0.0104309040
7 0.36508164 0.0130386299
8 0.44539960 0.0159071285
9 0.51476511 0.0183844682
10 0.68635348 0.0245126243
total_operational_emissions_MtCO2e total_emissions_MtCO2e
1 0.02495713 0.3638848
2 0.04991426 0.7277697
3 0.19965705 2.9110786
4 0.29948558 4.3666180
5 0.36936554 5.3854955
6 0.39931410 5.8221573
7 0.49914263 7.2776966
8 0.60895400 8.8787899
9 0.70379110 10.2615522
10 0.93838814 13.6820696
source
1 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-37
2 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-38
3 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-39
4 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-40
5 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-41
6 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-42
7 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-43
8 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-44
9 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-45
10 Abu Dhabi National Oil Company Annual Report 1975, pp. 35-46
# summary stats for each column (for numeric columns, etc.)
summary(carbon_df)
year parent_entity parent_type reporting_entity
Min. :1854 Length:15797 Length:15797 Length:15797
1st Qu.:1970 Class :character Class :character Class :character
Median :1993 Mode :character Mode :character Mode :character
Mean :1986
3rd Qu.:2007
Max. :2022
commodity production_value production_unit
Length:15797 Min. : 0.00 Length:15797
Class :character 1st Qu.: 11.80 Class :character
Mode :character Median : 59.97 Mode :character
Mean : 327.88
3rd Qu.: 246.38
Max. :27192.00
product_emissions_MtCO2 flaring_emissions_MtCO2 venting_emissions_MtCO2
Min. : 0.000 Min. : 0.00000 Min. : 0.00000
1st Qu.: 5.996 1st Qu.: 0.00000 1st Qu.: 0.00000
Median : 21.502 Median : 0.01591 Median : 0.04525
Mean : 79.392 Mean : 0.51723 Mean : 0.46246
3rd Qu.: 62.192 3rd Qu.: 0.19725 3rd Qu.: 0.32972
Max. :7769.222 Max. :27.02687 Max. :41.45866
own_fuel_use_emissions_MtCO2 fugitive_methane_emissions_MtCO2e
Min. : 0.0000 Min. : 0.0000
1st Qu.: 0.0000 1st Qu.: 0.6071
Median : 0.0000 Median : 2.3511
Mean : 0.6887 Mean : 8.8842
3rd Qu.: 0.1624 3rd Qu.: 7.4017
Max. :83.2035 Max. :877.6837
fugitive_methane_emissions_MtCH4 total_operational_emissions_MtCO2e
Min. : 0.00000 Min. : 0.000
1st Qu.: 0.02168 1st Qu.: 0.752
Median : 0.08397 Median : 2.870
Mean : 0.31729 Mean : 10.553
3rd Qu.: 0.26434 3rd Qu.: 8.966
Max. :31.34585 Max. :877.684
total_emissions_MtCO2e source
Min. : 0.000 Length:15797
1st Qu.: 7.209 Class :character
Median : 25.117 Mode :character
Mean : 89.944
3rd Qu.: 72.255
Max. :8646.906
Questions
The two questions you want to answer.
Q1. How have total operational emissions (total_operational_emissions_MtCO2e) evolved over time for major parent_entity groups, and which entities account for the largest share of these emissions in different years?
Q2. Among the parent entities identified in the first chart as the largest historical polluters, which specific emission sources (e.g. product, flaring, venting, own fuel use, or fugitive methane) drive their total_operational_emissions_MtCO2e?
Analysis plan
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
The feedback we received from the other teams was very helpful. As they pointed out, this project originally attempted to answer both questions with a single graph. After the feedback from Dank Vibe showed us that this was inappropriate, we decided to change the visualization for the first question from an alluvial diagram to a more appropriate visualization genre, while keeping the alluvial diagram to answer a more refined version of our original second question, one better suited to visualization.
Regarding the feedback given to us by DANK TEA, we will attempt to normalize the data to their specifications, as we believe this may be a strong way to approach the visualization. However, we are still considering whether such processing is necessary, since the purpose of the dataset is to assign responsibility to the largest emitters, regardless of efficiency. In this regard, we may still use un-normalized data if it becomes apparent that such pre-processing obscures the message that the original dataset was built to convey.
Preparation and background for plan 1:
We plan to narrow our scope to the top 10 contributors to total operational emissions. We also plan to normalize the data to account for the size of different companies, since larger entities will naturally have higher total emissions than smaller ones.
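One way to pick the top contributors is to rank parent entities by cumulative operational emissions. The sketch below is a minimal version under that assumption; the toy rows and the entity names "Entity A" to "Entity C" are made up for illustration, and on the real data toy_df would be replaced by carbon_df and n = 2 by n = 10.

```r
library(dplyr)

# Hypothetical toy data; names and values are illustrative only
toy_df <- data.frame(
  year = rep(c(2000, 2001), times = 3),
  parent_entity = rep(c("Entity A", "Entity B", "Entity C"), each = 2),
  total_operational_emissions_MtCO2e = c(5, 6, 1, 2, 10, 12)
)

top_entities <- toy_df %>%
  group_by(parent_entity) %>%
  summarise(
    cumulative_operational = sum(total_operational_emissions_MtCO2e, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  slice_max(cumulative_operational, n = 2) %>%   # n = 10 for the proposal
  arrange(desc(cumulative_operational))

print(top_entities)
```

Ranking by the cumulative sum favors entities with long reporting histories; ranking by a recent window or by the yearly mean would be alternative choices.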
Plan 1:
We will use a multiple-line graph. The x-axis represents the year in which the emissions occurred, and the y-axis represents total operational emissions. Each parent company will be represented by a line showing the trend in its emissions through the years. The variables we will use are year, total_operational_emissions_MtCO2e, and parent_entity. We will plot the 10 companies with the highest total operational emissions. Rather than plotting every year at once, we will bin the data into discrete time periods (decades, half-decades, or another appropriate time frame) and aggregate each emitter's emissions within each period, so that trends can be compared across a manageable number of time steps.
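The binning and plotting steps could look roughly like the sketch below. It assumes decade bins and a per-decade sum; both are our choices for illustration, and the toy values and entity names are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; names and values are illustrative only
toy_df <- data.frame(
  year = c(1995, 1999, 2003, 2008, 1995, 1999, 2003, 2008),
  parent_entity = rep(c("Entity A", "Entity B"), each = 4),
  total_operational_emissions_MtCO2e = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Bin years into decades (1995 -> 1990, 2003 -> 2000) and aggregate per entity
decade_df <- toy_df %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(parent_entity, decade) %>%
  summarise(
    operational_MtCO2e = sum(total_operational_emissions_MtCO2e, na.rm = TRUE),
    .groups = "drop"
  )

# One line per parent entity
trend_plot <- ggplot(
  decade_df,
  aes(x = decade, y = operational_MtCO2e, colour = parent_entity)
) +
  geom_line() +
  geom_point() +
  labs(x = "Decade", y = "Operational emissions (MtCO2e)", colour = "Parent entity")

print(decade_df)
```

Summing within a bin penalizes decades with incomplete coverage (for example the current one), so a per-decade mean may be the fairer aggregate if the first and last bins look artificially low.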
Preparation and background for plan 2:
We will continue to use the normalized data as in plan 1, but in this case we will use yearly data. Additionally, this visualization will aim to view the graph(s) from plan 1 at a more detailed level. More specifically, we will look at the top contributing entities and at the sources of their emissions: product, flaring, venting, own fuel use, or fugitive methane. This way we can see which specific activities are potentially linked to higher total emissions and how that has changed from year to year.
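To compare emission sources per entity and year, the source columns first need to be reshaped into long format. The sketch below shows one way to do this with tidyr; the toy values, the entity name, and the new column names emission_source, emissions_MtCO2e, and share_of_year are our own assumptions. Judging by the sample rows in this proposal, product emissions appear to count toward total_emissions_MtCO2e rather than the operational total, so they may need to be dropped from source_cols depending on which total is being explained.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data for a single entity; values are illustrative only
toy_df <- data.frame(
  year = c(2000, 2001),
  parent_entity = "Entity A",
  product_emissions_MtCO2 = c(100, 110),
  flaring_emissions_MtCO2 = c(1, 2),
  venting_emissions_MtCO2 = c(0.5, 0.5),
  own_fuel_use_emissions_MtCO2 = c(0, 1),
  fugitive_methane_emissions_MtCO2e = c(3, 4)
)

source_cols <- c(
  "product_emissions_MtCO2",
  "flaring_emissions_MtCO2",
  "venting_emissions_MtCO2",
  "own_fuel_use_emissions_MtCO2",
  "fugitive_methane_emissions_MtCO2e"
)

# One row per entity, year, and emission source, plus each source's share
long_df <- toy_df %>%
  pivot_longer(
    cols = all_of(source_cols),
    names_to = "emission_source",
    values_to = "emissions_MtCO2e"
  ) %>%
  group_by(parent_entity, year) %>%
  mutate(share_of_year = emissions_MtCO2e / sum(emissions_MtCO2e)) %>%
  ungroup()

print(long_df)
```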
Plan 2:
To identify the key emission sources driving total operational emissions, we will use an alluvial diagram. Entities previously identified as historically significant polluters will be positioned on the left side, while the corresponding pollution types will be displayed on the right. The width or thickness of each flow in the diagram will represent the total operational emissions of each entity, with segments directing emissions into their respective categories. This visualization will clarify how emissions are distributed, highlighting the most problematic sources and informing future regulatory responses. Additionally, this approach will help us identify trends in industrial production, pinpointing where emissions are most concentrated.
Variables for this question include:
Categorical: parent_entity
Numerical: product_emissions_MtCO2, flaring_emissions_MtCO2, venting_emissions_MtCO2, own_fuel_use_emissions_MtCO2, fugitive_methane_emissions_MtCO2e, total_operational_emissions_MtCO2e
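A possible starting point for the alluvial diagram is sketched below. It assumes the ggalluvial package, which this proposal does not name, so treat the package choice as ours; the plot is only built if that package is installed. The flows data frame, its entity names, and its values are hypothetical stand-ins for the aggregated emissions of the top entities.

```r
library(ggplot2)

# Hypothetical aggregated flows: one row per entity and emission source
flows <- data.frame(
  parent_entity = rep(c("Entity A", "Entity B"), each = 3),
  emission_source = rep(c("flaring", "venting", "fugitive methane"), times = 2),
  emissions_MtCO2e = c(5, 2, 10, 1, 4, 8)
)

# Build the plot only when ggalluvial (an assumed dependency) is available
if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggalluvial)
  alluvial_plot <- ggplot(
    flows,
    aes(axis1 = parent_entity, axis2 = emission_source, y = emissions_MtCO2e)
  ) +
    geom_alluvium(aes(fill = parent_entity)) +   # flow width follows y
    geom_stratum() +                              # boxes for each category
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_x_discrete(limits = c("Parent entity", "Emission source")) +
    labs(y = "Emissions (MtCO2e)", fill = "Parent entity")
}
```

With entities on the left axis and emission sources on the right, the width of each flow is proportional to emissions_MtCO2e, matching the layout described in Plan 2.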
Why an alluvial flow chart:
An alluvial flow chart is ideal for communicating how emissions are distributed across entities and categories by visually mapping relationships between them. Its structured yet flexible design makes it easy to track emission flows, with width variations clearly indicating magnitude. This format enhances readability by grouping related emissions while preserving hierarchical complexity without overwhelming the viewer. By emphasizing proportional relationships, it allows for intuitive comparisons, making it easier to identify dominant polluters and major emission sources. The visual clarity helps stakeholders quickly grasp key insights, guiding data-driven decision-making. This makes it a powerful tool for effectively translating complex emission data into actionable information.