pacman::p_load(tidyverse, readxl, dplyr, tidyr) Data Preparation & Cleaning
1. Launching R Packages and Data Import
Import Package
The following code snippet imports the necessary packages used for our analysis:
Tidyverse: An aggregation of R packages designed for data science.
Dplyr and tidyr: Part of the tidyverse, these packages are instrumental in data manipulation and tidying, respectively.
Readxl: Facilitates the import of Excel files.
Import Big Mac Data
The code below imports the Big Mac dataset, which serves as the primary data source for this analysis. The dataset is in CSV format and was downloaded from the Economist’s GitHub page. Prior to import, some less relevant columns were removed for clarity.
big_mac <- read_csv("data/big_mac.csv")The dataset comprises 9 columns, encompassing 1,386 entries across 29 countries and regions from the year 2000 onward. Notably, for some countries, there are multiple records per year. The date column is initially in character format and requires conversion to a date format for accurate analysis.
The code chunk below converts the date column to a date format
big_mac$date <- dmy(big_mac$date)select only useful col
big_mac <- big_mac %>%
select(date, currency_code, name, local_price, dollar_price)The code below provides a preview of the Big Mac dataset
head(big_mac)# A tibble: 6 × 5
date currency_code name local_price dollar_price
<date> <chr> <chr> <dbl> <dbl>
1 2000-04-01 ARS Argentina 2.5 2.5
2 2000-04-01 AUD Australia 2.59 1.54
3 2000-04-01 BRL Brazil 2.95 1.65
4 2000-04-01 GBP Britain 1.9 3.00
5 2000-04-01 CAD Canada 2.85 1.94
6 2000-04-01 CLP Chile 1260 2.45
Import Complementary Data Set
complementary dataset are imported to add in additional indicators
import export data
GDP
GDP per Capita
inflation
GDP per Employed
Unemployment Rate
Human Development Index (HDI)
Population
Other then import export data is from Kaggle, all other indicators are from the World Bank
The code chunk below imports the import export data
export <- read_csv("data/export.csv")head(export)# A tibble: 6 × 33
`Partner Name` Year `Export (US$ Thousand)` `Import (US$ Thousand)`
<chr> <dbl> <dbl> <dbl>
1 Aruba 1988 3498. 328.
2 Afghanistan 1988 213030. 54460.
3 Angola 1988 375528. 370703.
4 Anguila 1988 367. 4
5 Albania 1988 30104. 47709.
6 Andorra 1988 67924. 3284.
# ℹ 29 more variables: `Export Product Share (%)` <dbl>,
# `Import Product Share (%)` <dbl>, `Revealed comparative advantage` <dbl>,
# `World Growth (%)` <dbl>, `Country Growth (%)` <dbl>,
# `AHS Simple Average (%)` <dbl>, `AHS Weighted Average (%)` <dbl>,
# `AHS Total Tariff Lines` <dbl>,
# `AHS Dutiable Tariff Lines Share (%)` <dbl>,
# `AHS Duty Free Tariff Lines Share (%)` <dbl>, …
the table is in long format
the code chunk below imports
gdp <- read_csv("data/gdp.csv")
gdp_capita <- read_csv("data/gdp_capita.csv")
gdp_employed <- read_csv("data/gdp_employed.csv")
inflation <- read_csv("data/inflation.csv")
unemployment <- read_csv("data/unemployment_rate.csv")the code chunk below display an overview of the dataframe
head(gdp)# A tibble: 6 × 27
`Series Name` `Series Code` `Country Name` `Country Code` `2000 [YR2000]`
<chr> <chr> <chr> <chr> <chr>
1 GDP (current US$) NY.GDP.MKTP.CD Afghanistan AFG ..
2 GDP (current US$) NY.GDP.MKTP.CD Albania ALB 3480355258
3 GDP (current US$) NY.GDP.MKTP.CD Algeria DZA 54790398570
4 GDP (current US$) NY.GDP.MKTP.CD American Samoa ASM ..
5 GDP (current US$) NY.GDP.MKTP.CD Andorra AND 1429047882
6 GDP (current US$) NY.GDP.MKTP.CD Angola AGO 9129594816
# ℹ 22 more variables: `2001 [YR2001]` <chr>, `2002 [YR2002]` <chr>,
# `2003 [YR2003]` <chr>, `2004 [YR2004]` <chr>, `2005 [YR2005]` <chr>,
# `2006 [YR2006]` <chr>, `2007 [YR2007]` <chr>, `2008 [YR2008]` <chr>,
# `2009 [YR2009]` <chr>, `2010 [YR2010]` <chr>, `2011 [YR2011]` <chr>,
# `2012 [YR2012]` <chr>, `2013 [YR2013]` <chr>, `2014 [YR2014]` <chr>,
# `2015 [YR2015]` <chr>, `2016 [YR2016]` <chr>, `2017 [YR2017]` <chr>,
# `2018 [YR2018]` <chr>, `2019 [YR2019]` <chr>, `2020 [YR2020]` <chr>, …
these indicators are in short format and it is necessary to convert it to long format first so that it can join with the big mac table
the function below converts the input into a long table format
to be updated
hdi is in long format
hdi <- read_csv("data/human-development-index.csv")
population <- read_csv("data/population.csv")2. Data Wrangling and Merging
BIG MAC DAT
convert date to year
big_mac$year <- year(big_mac$date)
big_mac# A tibble: 1,386 × 6
date currency_code name local_price dollar_price year
<date> <chr> <chr> <dbl> <dbl> <dbl>
1 2000-04-01 ARS Argentina 2.5 2.5 2000
2 2000-04-01 AUD Australia 2.59 1.54 2000
3 2000-04-01 BRL Brazil 2.95 1.65 2000
4 2000-04-01 GBP Britain 1.9 3.00 2000
5 2000-04-01 CAD Canada 2.85 1.94 2000
6 2000-04-01 CLP Chile 1260 2.45 2000
7 2000-04-01 CNY China 9.9 1.20 2000
8 2000-04-01 CZK Czech Republic 54.4 1.39 2000
9 2000-04-01 DKK Denmark 24.8 3.08 2000
10 2000-04-01 HKD Hong Kong 10.2 1.31 2000
# ℹ 1,376 more rows
some of country name is not consistent, rename
big_mac <- big_mac %>%
mutate(name = case_when(
name == 'United Arab Emirates' ~ 'UAE',
TRUE ~ name
))check number of records per year
entries_per_year_country <- big_mac %>%
group_by(year, name) %>%
summarize(count = n())As most countries only contains 1 record each year, the earliest record for each year will be kept for easier manipulation
first_records <- big_mac %>%
group_by(year, name) %>%
slice(1)rename some variables
big_mac_new <- first_records %>%
select(country = name,
year,
currency_code,
bmi_localprice = local_price,
bmi_usd_price = dollar_price
)local price %
big_mac_new <- big_mac_new %>%
group_by(country) %>%
arrange(country, currency_code, year) %>%
mutate(bmi_change = (bmi_localprice / lag(bmi_localprice) - 1) * 100) %>%
ungroup()
head(big_mac_new)# A tibble: 6 × 6
country year currency_code bmi_localprice bmi_usd_price bmi_change
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Argentina 2000 ARS 2.5 2.5 NA
2 Argentina 2001 ARS 2.5 2.5 0
3 Argentina 2002 ARS 2.5 0.799 0
4 Argentina 2003 ARS 4.1 1.42 64
5 Argentina 2004 ARS 4.36 1.48 6.34
6 Argentina 2005 ARS 4.75 1.64 8.94
other data
export
head(export)# A tibble: 6 × 33
`Partner Name` Year `Export (US$ Thousand)` `Import (US$ Thousand)`
<chr> <dbl> <dbl> <dbl>
1 Aruba 1988 3498. 328.
2 Afghanistan 1988 213030. 54460.
3 Angola 1988 375528. 370703.
4 Anguila 1988 367. 4
5 Albania 1988 30104. 47709.
6 Andorra 1988 67924. 3284.
# ℹ 29 more variables: `Export Product Share (%)` <dbl>,
# `Import Product Share (%)` <dbl>, `Revealed comparative advantage` <dbl>,
# `World Growth (%)` <dbl>, `Country Growth (%)` <dbl>,
# `AHS Simple Average (%)` <dbl>, `AHS Weighted Average (%)` <dbl>,
# `AHS Total Tariff Lines` <dbl>,
# `AHS Dutiable Tariff Lines Share (%)` <dbl>,
# `AHS Duty Free Tariff Lines Share (%)` <dbl>, …
rename export column and filter important columns for joining
export_data <- rename(export, country = `Partner Name`, year = Year) %>%
select(country,
year,
export_usd = `Export (US$ Thousand)`,
import_usd = `Import (US$ Thousand)`)understand net trading $
export_data <- export_data %>%
mutate(net_export = export_usd - import_usd)
head(export_data)# A tibble: 6 × 5
country year export_usd import_usd net_export
<chr> <dbl> <dbl> <dbl> <dbl>
1 Aruba 1988 3498. 328. 3170.
2 Afghanistan 1988 213030. 54460. 158571.
3 Angola 1988 375528. 370703. 4825.
4 Anguila 1988 367. 4 363.
5 Albania 1988 30104. 47709. -17606.
6 Andorra 1988 67924. 3284. 64640.
rename some of the country and location name
export_data <- export_data %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Hong Kong, China" ~ "Hong Kong",
country == "Korea, Rep." ~ "South Korea",
country == 'Egypt, Arab Rep.' ~ 'Egypt',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
TRUE ~ country
))join the big mac data with export data
joined_data <- left_join(big_mac_new, export_data, by = c("country", "year"))GDP data
reformat to long format
gdp_long_format <- gdp %>%
pivot_longer(
cols = -c(`Series Name`, `Series Code`, `Country Name`, `Country Code`),
names_to = "year",
names_pattern = "^\\d{4} \\[YR(\\d{4})\\]",
values_to = "GDP"
) %>%
select(country = `Country Name`, year, GDP)
gdp_long_format$year <- as.integer(gdp_long_format$year)convert missing data
gdp_long_format <- gdp_long_format %>%
mutate(GDP = na_if(GDP, ".."))rename columns in GDP
gdp_long_format <- gdp_long_format %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Czechia" ~ "Czech Republic",
country == "Korea, Rep." ~ "South Korea",
country == "Egypt, Arab Rep." ~ "Egypt",
country == 'Hong Kong SAR, China' ~ 'Hong Kong',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
country == 'Viet Nam' ~ 'Vietnam',
country == 'Turkiye' ~ 'Turkey',
country == 'Venezuela, RB' ~ 'Venezuela',
TRUE ~ country
))join big mac data with gdp
joined_data <- left_join(joined_data, gdp_long_format, by = c("country", "year"))GDP per Capita
reformat to long format
gdp_capita_long_format <- gdp_capita %>%
pivot_longer(
cols = -c(`Series Name`, `Series Code`, `Country Name`, `Country Code`),
names_to = "year",
names_pattern = "^\\d{4} \\[YR(\\d{4})\\]",
values_to = "gdp_per_capita"
) %>%
select(country = `Country Name`, year, gdp_per_capita)
gdp_capita_long_format$year <- as.integer(gdp_capita_long_format$year)convert missing data
gdp_capita_long_format <- gdp_capita_long_format %>%
mutate(gdp_per_capita = na_if(gdp_per_capita, ".."))rename columns
gdp_capita_long_format <- gdp_capita_long_format %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Czechia" ~ "Czech Republic",
country == "Korea, Rep." ~ "South Korea",
country == "Egypt, Arab Rep." ~ "Egypt",
country == 'Hong Kong SAR, China' ~ 'Hong Kong',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
country == 'Viet Nam' ~ 'Vietnam',
country == 'Turkiye' ~ 'Turkey',
country == 'Venezuela, RB' ~ 'Venezuela',
TRUE ~ country
))join big mac data with gdp_per_capita
joined_data <- left_join(joined_data, gdp_capita_long_format, by = c("country", "year"))inflation
reformat to long format
inflation_long_format <- inflation %>%
pivot_longer(
cols = -c(`Series Name`, `Series Code`, `Country Name`, `Country Code`),
names_to = "year",
names_pattern = "^\\d{4} \\[YR(\\d{4})\\]",
values_to = "inflation"
) %>%
select(country = `Country Name`, year, inflation)
inflation_long_format$year <- as.integer(inflation_long_format$year)convert missing data
inflation_long_format <- inflation_long_format %>%
mutate(inflation = na_if(inflation, ".."))rename columns in inflation
inflation_long_format <- inflation_long_format %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Czechia" ~ "Czech Republic",
country == "Korea, Rep." ~ "South Korea",
country == "Egypt, Arab Rep." ~ "Egypt",
country == 'Hong Kong SAR, China' ~ 'Hong Kong',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
country == 'Viet Nam' ~ 'Vietnam',
country == 'Turkiye' ~ 'Turkey',
country == 'Venezuela, RB' ~ 'Venezuela',
TRUE ~ country
))join big mac data with inflation
joined_data <- left_join(joined_data, inflation_long_format, by = c("country", "year"))unemployment
reformat to long format
unemployment_long_format <- unemployment %>%
pivot_longer(
cols = -c(`Indicator Name`, `Indicator Code`, `Country Name`, `Country Code`),
names_to = "year",
values_to = "unemployment"
) %>%
select(country = `Country Name`, year, unemployment)
unemployment_long_format$year <- as.integer(unemployment_long_format$year)rename columns in unemployment
unemployment_long_format <- unemployment_long_format %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Czechia" ~ "Czech Republic",
country == "Korea, Rep." ~ "South Korea",
country == "Egypt, Arab Rep." ~ "Egypt",
country == 'Hong Kong SAR, China' ~ 'Hong Kong',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
country == 'Viet Nam' ~ 'Vietnam',
country == 'Turkiye' ~ 'Turkey',
country == 'Venezuela, RB' ~ 'Venezuela',
TRUE ~ country
))join big mac data with unemployment
joined_data <- left_join(joined_data, unemployment_long_format, by = c("country", "year"))take a look at the dataframe
head(joined_data)# A tibble: 6 × 13
country year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Argent… 2000 ARS 2.5 2.5 NA 28181510.
2 Argent… 2001 ARS 2.5 2.5 0 22619763.
3 Argent… 2002 ARS 2.5 0.799 0 9838651.
4 Argent… 2003 ARS 4.1 1.42 64 16066847.
5 Argent… 2004 ARS 4.36 1.48 6.34 25007443.
6 Argent… 2005 ARS 4.75 1.64 8.94 30870438.
# ℹ 6 more variables: import_usd <dbl>, net_export <dbl>, GDP <chr>,
# gdp_per_capita <chr>, inflation <chr>, unemployment <dbl>
hdi
select columns
head(hdi)# A tibble: 6 × 3
country year hdi
<chr> <dbl> <dbl>
1 Afghanistan 1990 0.273
2 Afghanistan 1991 0.279
3 Afghanistan 1992 0.287
4 Afghanistan 1993 0.297
5 Afghanistan 1994 0.292
6 Afghanistan 1995 0.31
hdi <- hdi %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Czechia" ~ "Czech Republic",
country == "Korea, Rep." ~ "South Korea",
country == "Egypt, Arab Rep." ~ "Egypt",
country == 'Hong Kong SAR, China' ~ 'Hong Kong',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
country == 'Viet Nam' ~ 'Vietnam',
country == 'Turkiye' ~ 'Turkey',
country == 'Venezuela, RB' ~ 'Venezuela',
TRUE ~ country
))joined_data2 <- left_join(joined_data, hdi, by = c("country", "year"))joined_data2# A tibble: 845 × 14
country year currency_code bmi_localprice bmi_usd_price bmi_change
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Argentina 2000 ARS 2.5 2.5 NA
2 Argentina 2001 ARS 2.5 2.5 0
3 Argentina 2002 ARS 2.5 0.799 0
4 Argentina 2003 ARS 4.1 1.42 64
5 Argentina 2004 ARS 4.36 1.48 6.34
6 Argentina 2005 ARS 4.75 1.64 8.94
7 Argentina 2006 ARS 4.75 1.55 0
8 Argentina 2007 ARS 8.25 2.67 73.7
9 Argentina 2008 ARS 11 3.64 33.3
10 Argentina 2009 ARS 11.5 3.02 4.55
# ℹ 835 more rows
# ℹ 8 more variables: export_usd <dbl>, import_usd <dbl>, net_export <dbl>,
# GDP <chr>, gdp_per_capita <chr>, inflation <chr>, unemployment <dbl>,
# hdi <dbl>
population
rename columns
population <- population %>%
mutate(country = case_when(
country == "Russian Federation" ~ "Russia",
country == "United Kingdom" ~ "Britain",
country == "Czechia" ~ "Czech Republic",
country == "Korea, Rep." ~ "South Korea",
country == "Egypt, Arab Rep." ~ "Egypt",
country == 'Hong Kong SAR, China' ~ 'Hong Kong',
country == 'Slovak Republic' ~ 'Slovakia',
country == 'United Arab Emirates' ~ 'UAE',
country == 'Viet Nam' ~ 'Vietnam',
country == 'Turkiye' ~ 'Turkey',
country == 'Venezuela, RB' ~ 'Venezuela',
TRUE ~ country
))head(population)# A tibble: 6 × 3
country year population
<chr> <dbl> <dbl>
1 Afghanistan 1950 7480464
2 Afghanistan 1951 7571542
3 Afghanistan 1952 7667534
4 Afghanistan 1953 7764549
5 Afghanistan 1954 7864289
6 Afghanistan 1955 7971933
join with big mac
joined_data2 <- left_join(joined_data2, population, by = c("country", "year"))check for missing data
drop tai wan change name
big_mac_combined <- joined_data2 %>%
filter(country != "Taiwan", country != "Euro area")change column datatype
big_mac_combined <- big_mac_combined %>%
mutate(
GDP = as.numeric(GDP),
inflation = as.numeric(inflation),
gdp_per_capita = as.numeric(gdp_per_capita)
)head(big_mac_combined)# A tibble: 6 × 15
country year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Argent… 2000 ARS 2.5 2.5 NA 28181510.
2 Argent… 2001 ARS 2.5 2.5 0 22619763.
3 Argent… 2002 ARS 2.5 0.799 0 9838651.
4 Argent… 2003 ARS 4.1 1.42 64 16066847.
5 Argent… 2004 ARS 4.36 1.48 6.34 25007443.
6 Argent… 2005 ARS 4.75 1.64 8.94 30870438.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
# gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
# population <dbl>
summary(big_mac_combined) country year currency_code bmi_localprice
Length:800 Min. :2000 Length:800 Min. : 1
Class :character 1st Qu.:2006 Class :character 1st Qu.: 6
Mode :character Median :2012 Mode :character Median : 21
Mean :2012 Mean : 15913
3rd Qu.:2018 3rd Qu.: 112
Max. :2022 Max. :4000000
bmi_usd_price bmi_change export_usd import_usd
Min. :0.640 Min. :-100.000 Min. :7.416e+06 Min. :3.959e+06
1st Qu.:2.300 1st Qu.: 0.000 1st Qu.:7.120e+07 1st Qu.:6.518e+07
Median :2.978 Median : 2.565 Median :1.459e+08 Median :1.346e+08
Mean :3.218 Mean : 5.670 Mean :2.862e+08 Mean :2.609e+08
3rd Qu.:3.979 3rd Qu.: 8.065 3rd Qu.:3.285e+08 3rd Qu.:2.818e+08
Max. :8.312 Max. : 275.000 Max. :3.292e+09 Max. :3.383e+09
NA's :53 NA's :53 NA's :53
net_export GDP gdp_per_capita inflation
Min. :-963828693 Min. :1.369e+10 Min. : 739 Min. :-3.686
1st Qu.: -23167201 1st Qu.:2.180e+11 1st Qu.: 5885 1st Qu.: 1.544
Median : -804945 Median :3.908e+11 Median : 13447 Median : 2.849
Mean : 25245050 Mean :1.406e+12 Mean : 23736 Mean : 4.053
3rd Qu.: 29978044 3rd Qu.:1.103e+12 3rd Qu.: 39901 3rd Qu.: 5.062
Max. :1587781425 Max. :2.544e+13 Max. :106177 Max. :72.309
NA's :53 NA's :24
unemployment hdi population
Min. : 0.130 Min. :0.4750 Min. :1.463e+06
1st Qu.: 3.745 1st Qu.:0.7320 1st Qu.:1.006e+07
Median : 5.280 Median :0.8340 Median :3.862e+07
Mean : 6.375 Mean :0.8134 Mean :1.223e+08
3rd Qu.: 7.795 3rd Qu.:0.9080 3rd Qu.:1.044e+08
Max. :28.840 Max. :0.9620 Max. :1.426e+09
NA's :1 NA's :53 NA's :53
# Summarize data to find the range of years available for each country
year_summary <- big_mac_combined %>%
group_by(country) %>%
summarize(StartYear = min(year), EndYear = max(year)) %>%
ungroup()
# Print the summary
print(year_summary)# A tibble: 53 × 3
country StartYear EndYear
<chr> <dbl> <dbl>
1 Argentina 2000 2022
2 Australia 2000 2022
3 Azerbaijan 2021 2022
4 Bahrain 2021 2022
5 Brazil 2000 2022
6 Britain 2000 2022
7 Canada 2000 2022
8 Chile 2000 2022
9 China 2000 2022
10 Colombia 2004 2022
# ℹ 43 more rows
# Your previous code to summarize StartYear and EndYear by country
year_summary <- big_mac_combined %>%
group_by(country) %>%
summarize(StartYear = min(year), EndYear = max(year)) %>%
ungroup()
# Group by StartYear and list countries along with their EndYear
grouped_summary <- year_summary %>%
group_by(StartYear) %>%
summarize(Countries = paste(country, "(EndYear:", EndYear, ")", sep = "", collapse = "; "),
.groups = 'drop') %>%
ungroup()
# Print the grouped summary
print(grouped_summary)# A tibble: 7 × 2
StartYear Countries
<dbl> <chr>
1 2000 Argentina(EndYear:2022); Australia(EndYear:2022); Brazil(EndYear:20…
2 2001 Philippines(EndYear:2022)
3 2002 Norway(EndYear:2022); Peru(EndYear:2022); Turkey(EndYear:2022)
4 2003 Egypt(EndYear:2022)
5 2004 Colombia(EndYear:2022); Pakistan(EndYear:2022); Saudi Arabia(EndYea…
6 2011 India(EndYear:2022)
7 2021 Azerbaijan(EndYear:2022); Bahrain(EndYear:2022); Costa Rica(EndYear…
filtered_big_mac_data <- big_mac_combined %>%
filter(!(year %in% c(2022, 2023)))
# View the first few rows of the filtered dataframe to confirm
head(filtered_big_mac_data)# A tibble: 6 × 15
country year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Argent… 2000 ARS 2.5 2.5 NA 28181510.
2 Argent… 2001 ARS 2.5 2.5 0 22619763.
3 Argent… 2002 ARS 2.5 0.799 0 9838651.
4 Argent… 2003 ARS 4.1 1.42 64 16066847.
5 Argent… 2004 ARS 4.36 1.48 6.34 25007443.
6 Argent… 2005 ARS 4.75 1.64 8.94 30870438.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
# gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
# population <dbl>
Checking the missing data and highlighting the countries within the gdp_per_employed and inflation columns
missing_values_summary <- filtered_big_mac_data %>%
filter(is.na(inflation)) %>% # Adjust 'price' to the column you're interested in
group_by(country, year) %>%
summarise(MissingCount = n(), .groups = 'drop') %>%
arrange(country, year)
# Display the summary
print(missing_values_summary)# A tibble: 22 × 3
country year MissingCount
<chr> <dbl> <int>
1 Argentina 2000 1
2 Argentina 2001 1
3 Argentina 2002 1
4 Argentina 2003 1
5 Argentina 2004 1
6 Argentina 2005 1
7 Argentina 2006 1
8 Argentina 2007 1
9 Argentina 2008 1
10 Argentina 2009 1
# ℹ 12 more rows
Impute the the inflation data for Argentina from [World Data Info Link](https://www.worlddata.info/america/argentina/inflation-rates.php)
library(readxl)
arg_inflation_data <- read_excel("data/ARG_inflation_Data.xlsx")
arg_inflation_data_relevant <- arg_inflation_data %>%
select(year, inflation)
filtered_big_mac_data_imputed <- filtered_big_mac_data %>%
mutate(inflation_arg = ifelse(country == "Argentina", arg_inflation_data$inflation[match(year, arg_inflation_data$year)], NA)) %>%
mutate(inflation = ifelse(is.na(inflation) & country == "Argentina", inflation_arg, inflation)) %>%
select(-inflation_arg)
print(filtered_big_mac_data_imputed)# A tibble: 747 × 15
country year currency_code bmi_localprice bmi_usd_price bmi_change
<chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 Argentina 2000 ARS 2.5 2.5 NA
2 Argentina 2001 ARS 2.5 2.5 0
3 Argentina 2002 ARS 2.5 0.799 0
4 Argentina 2003 ARS 4.1 1.42 64
5 Argentina 2004 ARS 4.36 1.48 6.34
6 Argentina 2005 ARS 4.75 1.64 8.94
7 Argentina 2006 ARS 4.75 1.55 0
8 Argentina 2007 ARS 8.25 2.67 73.7
9 Argentina 2008 ARS 11 3.64 33.3
10 Argentina 2009 ARS 11.5 3.02 4.55
# ℹ 737 more rows
# ℹ 9 more variables: export_usd <dbl>, import_usd <dbl>, net_export <dbl>,
# GDP <dbl>, gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>,
# hdi <dbl>, population <dbl>
Will now, remove the data from Venezuela as it has both missing datas from gdp and inflation
cleaned_big_mac_data <- filtered_big_mac_data_imputed %>%
filter(country != "Venezuela")head(cleaned_big_mac_data)# A tibble: 6 × 15
country year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Argent… 2000 ARS 2.5 2.5 NA 28181510.
2 Argent… 2001 ARS 2.5 2.5 0 22619763.
3 Argent… 2002 ARS 2.5 0.799 0 9838651.
4 Argent… 2003 ARS 4.1 1.42 64 16066847.
5 Argent… 2004 ARS 4.36 1.48 6.34 25007443.
6 Argent… 2005 ARS 4.75 1.64 8.94 30870438.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
# gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
# population <dbl>
# Convert year to numeric to ensure proper filtering
cleaned_big_mac_data$year <- as.numeric(as.character(cleaned_big_mac_data$year))
# Filter for years 2002 to 2021
data_2002_2021 <- cleaned_big_mac_data %>%
filter(year >= 2002 & year <= 2021)
# Identify countries with data for each year from 2002 to 2021
countries_with_complete_data <- data_2002_2021 %>%
group_by(country) %>%
filter(n_distinct(year) == (2021 - 2002 + 1)) %>%
ungroup()
countries_with_complete_data <- countries_with_complete_data %>%
mutate(country = ifelse(country == "Britain", "United Kingdom", country),
country = ifelse(country == "Czech Republic", "Czech Rep.", country),
country = ifelse(country == "South Korea", "Korea", country),
country = ifelse(country == "Switzerland", "Switzerland", country))
# View the first few rows to confirm
head(countries_with_complete_data)# A tibble: 6 × 15
country year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 Argent… 2002 ARS 2.5 0.799 0 9838651.
2 Argent… 2003 ARS 4.1 1.42 64 16066847.
3 Argent… 2004 ARS 4.36 1.48 6.34 25007443.
4 Argent… 2005 ARS 4.75 1.64 8.94 30870438.
5 Argent… 2006 ARS 4.75 1.55 0 37603723.
6 Argent… 2007 ARS 8.25 2.67 73.7 49020947.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
# gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
# population <dbl>
combine with country
country_all <- data.frame(country = unique(countries_with_complete_data$country))map continent
continent_mapping <- data.frame(
country = c("Argentina", "Australia", "Brazil", "United Kingdom", "Canada", "Chile", "China", "Czech Rep.", "Denmark", "Hong Kong", "Hungary", "Indonesia", "Japan", "Malaysia", "Mexico", "New Zealand", "Peru", "Philippines", "Poland", "Russia", "Singapore", "South Africa", "Korea", "Sweden", "Switzerland", "Thailand", "Turkey", "United States"),
continent = c("South America", "Oceania", "South America", "Europe", "North America", "South America", "Asia", "Europe", "Europe", "Asia", "Europe", "Asia", "Asia", "Asia", "North America", "Oceania", "South America", "Asia", "Europe", "Europe", "Asia", "Africa", "Asia", "Europe", "Europe", "Asia", "Asia", "North America")
)
# Merge the continent mapping with your original DataFrame 'df'
country_all <- merge(country_all, continent_mapping, by = "country", all.x = TRUE)map g7, g20, eu
g7_countries <- c("Canada", "France", "Germany", "Italy", "United Kingdom", "United States", "Japan")
g20_countries <- c("Argentina", "Australia", "Brazil", "Canada", "China", "France", "Germany", "India", "Indonesia", "Italy", "Japan", "Mexico", "Russia", "Saudi Arabia", "South Africa", "South Korea", "Turkey", "United Kingdom", "United States", "European Union")
eu_countries <- c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Rep.", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden")
brics_countries <- c("Brazil", "Russia", "India", "China", "South Africa","Egypt","Ethiopia","Iran","United Arab Emirates")country_all <- country_all %>%
mutate(
g7 = country %in% g7_countries,
g20 = country %in% g20_countries,
eu = country %in% eu_countries,
brics = country %in% brics_countries)take a look at the countries
head(country_all) country continent g7 g20 eu brics
1 Argentina South America FALSE TRUE FALSE FALSE
2 Australia Oceania FALSE TRUE FALSE FALSE
3 Brazil South America FALSE TRUE FALSE TRUE
4 Canada North America TRUE TRUE FALSE FALSE
5 Chile South America FALSE FALSE FALSE FALSE
6 China Asia FALSE TRUE FALSE TRUE
# Save cleaned_big_mac_data dataframe to a CSV file
write.csv(countries_with_complete_data, file = "data/bmi_data.csv")
write.csv(country_all, file = "data/country_data.csv")