Data Preparation & Cleaning

Author

Jiayi, Firdaus & Victoria

Published

February 22, 2024

Modified

March 31, 2024

1. Launching R Packages and Data Import

Import Package

The following code snippet imports the necessary packages used for our analysis:

  • Tidyverse: An aggregation of R packages designed for data science.

  • Dplyr and tidyr: Part of the tidyverse, these packages are instrumental in data manipulation and tidying, respectively.

  • Readxl: Facilitates the import of Excel files.

pacman::p_load(tidyverse, readxl, dplyr, tidyr) 

Import Big Mac Data

The code below imports the Big Mac dataset, which serves as the primary data source for this analysis. The dataset is in CSV format and was downloaded from the Economist’s GitHub page. Prior to import, some less relevant columns were removed for clarity.

big_mac <- read_csv("data/big_mac.csv")

The dataset comprises 9 columns, encompassing 1,386 entries across 29 countries and regions from the year 2000 onward. Notably, for some countries, there are multiple records per year. The date column is initially in character format and requires conversion to a date format for accurate analysis.

The code chunk below converts the date column to a date format

big_mac$date <- dmy(big_mac$date)

select only useful col

big_mac <- big_mac %>%
  select(date, currency_code, name, local_price, dollar_price)

The code below provides a preview of the Big Mac dataset

head(big_mac)
# A tibble: 6 × 5
  date       currency_code name      local_price dollar_price
  <date>     <chr>         <chr>           <dbl>        <dbl>
1 2000-04-01 ARS           Argentina        2.5          2.5 
2 2000-04-01 AUD           Australia        2.59         1.54
3 2000-04-01 BRL           Brazil           2.95         1.65
4 2000-04-01 GBP           Britain          1.9          3.00
5 2000-04-01 CAD           Canada           2.85         1.94
6 2000-04-01 CLP           Chile         1260            2.45

Import Complementary Data Set

complementary dataset are imported to add in additional indicators

  • import export data

  • GDP

  • GDP per Capita

  • inflation

  • GDP per Employed

  • Unemployment Rate

  • Human Development Index (HDI)

  • Population

Other then import export data is from Kaggle, all other indicators are from the World Bank

The code chunk below imports the import export data

export <- read_csv("data/export.csv")
head(export)
# A tibble: 6 × 33
  `Partner Name`  Year `Export (US$ Thousand)` `Import (US$ Thousand)`
  <chr>          <dbl>                   <dbl>                   <dbl>
1 Aruba           1988                   3498.                    328.
2 Afghanistan     1988                 213030.                  54460.
3 Angola          1988                 375528.                 370703.
4 Anguila         1988                    367.                      4 
5 Albania         1988                  30104.                  47709.
6 Andorra         1988                  67924.                   3284.
# ℹ 29 more variables: `Export Product Share (%)` <dbl>,
#   `Import Product Share (%)` <dbl>, `Revealed comparative advantage` <dbl>,
#   `World Growth (%)` <dbl>, `Country Growth (%)` <dbl>,
#   `AHS Simple Average (%)` <dbl>, `AHS Weighted Average (%)` <dbl>,
#   `AHS Total Tariff Lines` <dbl>,
#   `AHS Dutiable Tariff Lines Share (%)` <dbl>,
#   `AHS Duty Free Tariff Lines Share (%)` <dbl>, …

the table is in long format

the code chunk below imports

gdp <- read_csv("data/gdp.csv")
gdp_capita <- read_csv("data/gdp_capita.csv")
gdp_employed <- read_csv("data/gdp_employed.csv")
inflation <- read_csv("data/inflation.csv")
unemployment <- read_csv("data/unemployment_rate.csv")

the code chunk below display an overview of the dataframe

head(gdp)
# A tibble: 6 × 27
  `Series Name`     `Series Code`  `Country Name` `Country Code` `2000 [YR2000]`
  <chr>             <chr>          <chr>          <chr>          <chr>          
1 GDP (current US$) NY.GDP.MKTP.CD Afghanistan    AFG            ..             
2 GDP (current US$) NY.GDP.MKTP.CD Albania        ALB            3480355258     
3 GDP (current US$) NY.GDP.MKTP.CD Algeria        DZA            54790398570    
4 GDP (current US$) NY.GDP.MKTP.CD American Samoa ASM            ..             
5 GDP (current US$) NY.GDP.MKTP.CD Andorra        AND            1429047882     
6 GDP (current US$) NY.GDP.MKTP.CD Angola         AGO            9129594816     
# ℹ 22 more variables: `2001 [YR2001]` <chr>, `2002 [YR2002]` <chr>,
#   `2003 [YR2003]` <chr>, `2004 [YR2004]` <chr>, `2005 [YR2005]` <chr>,
#   `2006 [YR2006]` <chr>, `2007 [YR2007]` <chr>, `2008 [YR2008]` <chr>,
#   `2009 [YR2009]` <chr>, `2010 [YR2010]` <chr>, `2011 [YR2011]` <chr>,
#   `2012 [YR2012]` <chr>, `2013 [YR2013]` <chr>, `2014 [YR2014]` <chr>,
#   `2015 [YR2015]` <chr>, `2016 [YR2016]` <chr>, `2017 [YR2017]` <chr>,
#   `2018 [YR2018]` <chr>, `2019 [YR2019]` <chr>, `2020 [YR2020]` <chr>, …

these indicators are in short format and it is necessary to convert it to long format first so that it can join with the big mac table

the function below converts the input into a long table format

to be updated

hdi is in long format

hdi <- read_csv("data/human-development-index.csv")
population <- read_csv("data/population.csv")

2. Data Wrangling and Merging

BIG MAC DAT

convert date to year

big_mac$year <- year(big_mac$date)
big_mac
# A tibble: 1,386 × 6
   date       currency_code name           local_price dollar_price  year
   <date>     <chr>         <chr>                <dbl>        <dbl> <dbl>
 1 2000-04-01 ARS           Argentina             2.5          2.5   2000
 2 2000-04-01 AUD           Australia             2.59         1.54  2000
 3 2000-04-01 BRL           Brazil                2.95         1.65  2000
 4 2000-04-01 GBP           Britain               1.9          3.00  2000
 5 2000-04-01 CAD           Canada                2.85         1.94  2000
 6 2000-04-01 CLP           Chile              1260            2.45  2000
 7 2000-04-01 CNY           China                 9.9          1.20  2000
 8 2000-04-01 CZK           Czech Republic       54.4          1.39  2000
 9 2000-04-01 DKK           Denmark              24.8          3.08  2000
10 2000-04-01 HKD           Hong Kong            10.2          1.31  2000
# ℹ 1,376 more rows

some of country name is not consistent, rename

big_mac <- big_mac %>%
  mutate(name = case_when(
    name == 'United Arab Emirates' ~ 'UAE',
    TRUE ~ name 
  ))

check number of records per year

entries_per_year_country <- big_mac %>%
  group_by(year, name) %>%
  summarize(count = n())

As most countries only contains 1 record each year, the earliest record for each year will be kept for easier manipulation

first_records <- big_mac %>%
  group_by(year, name) %>%
  slice(1)

rename some variables

big_mac_new <- first_records %>%
  select(country = name, 
         year, 
         currency_code, 
         bmi_localprice = local_price, 
         bmi_usd_price = dollar_price
    
  )

local price %

big_mac_new <- big_mac_new %>%
  group_by(country) %>%  
  arrange(country, currency_code, year) %>%  
  mutate(bmi_change = (bmi_localprice / lag(bmi_localprice) - 1) * 100) %>%
  ungroup()  
head(big_mac_new)
# A tibble: 6 × 6
  country    year currency_code bmi_localprice bmi_usd_price bmi_change
  <chr>     <dbl> <chr>                  <dbl>         <dbl>      <dbl>
1 Argentina  2000 ARS                     2.5          2.5        NA   
2 Argentina  2001 ARS                     2.5          2.5         0   
3 Argentina  2002 ARS                     2.5          0.799       0   
4 Argentina  2003 ARS                     4.1          1.42       64   
5 Argentina  2004 ARS                     4.36         1.48        6.34
6 Argentina  2005 ARS                     4.75         1.64        8.94

other data

export

head(export)
# A tibble: 6 × 33
  `Partner Name`  Year `Export (US$ Thousand)` `Import (US$ Thousand)`
  <chr>          <dbl>                   <dbl>                   <dbl>
1 Aruba           1988                   3498.                    328.
2 Afghanistan     1988                 213030.                  54460.
3 Angola          1988                 375528.                 370703.
4 Anguila         1988                    367.                      4 
5 Albania         1988                  30104.                  47709.
6 Andorra         1988                  67924.                   3284.
# ℹ 29 more variables: `Export Product Share (%)` <dbl>,
#   `Import Product Share (%)` <dbl>, `Revealed comparative advantage` <dbl>,
#   `World Growth (%)` <dbl>, `Country Growth (%)` <dbl>,
#   `AHS Simple Average (%)` <dbl>, `AHS Weighted Average (%)` <dbl>,
#   `AHS Total Tariff Lines` <dbl>,
#   `AHS Dutiable Tariff Lines Share (%)` <dbl>,
#   `AHS Duty Free Tariff Lines Share (%)` <dbl>, …

rename export column and filter important columns for joining

export_data <- rename(export, country = `Partner Name`, year = Year) %>%
  select(country, 
         year,
         export_usd = `Export (US$ Thousand)`,
         import_usd = `Import (US$ Thousand)`)

understand net trading $

export_data <- export_data %>%
  mutate(net_export = export_usd - import_usd)

head(export_data)
# A tibble: 6 × 5
  country      year export_usd import_usd net_export
  <chr>       <dbl>      <dbl>      <dbl>      <dbl>
1 Aruba        1988      3498.       328.      3170.
2 Afghanistan  1988    213030.     54460.    158571.
3 Angola       1988    375528.    370703.      4825.
4 Anguila      1988       367.         4        363.
5 Albania      1988     30104.     47709.    -17606.
6 Andorra      1988     67924.      3284.     64640.

rename some of the country and location name

export_data <- export_data %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Hong Kong, China" ~ "Hong Kong",
    country == "Korea, Rep." ~ "South Korea",
    country == 'Egypt, Arab Rep.' ~ 'Egypt',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    TRUE ~ country  
  ))

join the big mac data with export data

joined_data <- left_join(big_mac_new, export_data, by = c("country", "year"))

GDP data

reformat to long format

gdp_long_format <- gdp %>%
  pivot_longer(
    cols = -c(`Series Name`, `Series Code`, `Country Name`, `Country Code`),  
    names_to = "year",
    names_pattern = "^\\d{4} \\[YR(\\d{4})\\]",  
    values_to = "GDP"
  ) %>%
  select(country = `Country Name`, year, GDP)

gdp_long_format$year <- as.integer(gdp_long_format$year)

convert missing data

gdp_long_format <- gdp_long_format %>%
  mutate(GDP = na_if(GDP, ".."))

rename columns in GDP

gdp_long_format <- gdp_long_format %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Czechia" ~ "Czech Republic",
    country == "Korea, Rep." ~ "South Korea",
    country == "Egypt, Arab Rep." ~ "Egypt",
    country == 'Hong Kong SAR, China' ~ 'Hong Kong',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    country == 'Viet Nam' ~ 'Vietnam',
    country == 'Turkiye' ~ 'Turkey',
    country == 'Venezuela, RB' ~ 'Venezuela',
    TRUE ~ country  
  ))

join big mac data with gdp

joined_data <- left_join(joined_data, gdp_long_format, by = c("country", "year"))

GDP per Capita

reformat to long format

gdp_capita_long_format <- gdp_capita %>%
  pivot_longer(
    cols = -c(`Series Name`, `Series Code`, `Country Name`, `Country Code`),  
    names_to = "year",
    names_pattern = "^\\d{4} \\[YR(\\d{4})\\]",  
    values_to = "gdp_per_capita"
  ) %>%
  select(country = `Country Name`, year, gdp_per_capita)

gdp_capita_long_format$year <- as.integer(gdp_capita_long_format$year)

convert missing data

gdp_capita_long_format <- gdp_capita_long_format %>%
  mutate(gdp_per_capita = na_if(gdp_per_capita, ".."))

rename columns

gdp_capita_long_format <- gdp_capita_long_format %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Czechia" ~ "Czech Republic",
    country == "Korea, Rep." ~ "South Korea",
    country == "Egypt, Arab Rep." ~ "Egypt",
    country == 'Hong Kong SAR, China' ~ 'Hong Kong',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    country == 'Viet Nam' ~ 'Vietnam',
    country == 'Turkiye' ~ 'Turkey',
    country == 'Venezuela, RB' ~ 'Venezuela',
    TRUE ~ country  
  ))

join big mac data with gdp_per_capita

joined_data <- left_join(joined_data, gdp_capita_long_format, by = c("country", "year"))

inflation

reformat to long format

inflation_long_format <- inflation %>%
  pivot_longer(
    cols = -c(`Series Name`, `Series Code`, `Country Name`, `Country Code`),  
    names_to = "year",
    names_pattern = "^\\d{4} \\[YR(\\d{4})\\]",  
    values_to = "inflation"
  ) %>%
  select(country = `Country Name`, year, inflation)

inflation_long_format$year <- as.integer(inflation_long_format$year)

convert missing data

inflation_long_format <- inflation_long_format %>%
  mutate(inflation = na_if(inflation, ".."))

rename columns in inflation

inflation_long_format <- inflation_long_format %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Czechia" ~ "Czech Republic",
    country == "Korea, Rep." ~ "South Korea",
    country == "Egypt, Arab Rep." ~ "Egypt",
    country == 'Hong Kong SAR, China' ~ 'Hong Kong',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    country == 'Viet Nam' ~ 'Vietnam',
    country == 'Turkiye' ~ 'Turkey',
    country == 'Venezuela, RB' ~ 'Venezuela',
    TRUE ~ country  
  ))

join big mac data with inflation

joined_data <- left_join(joined_data, inflation_long_format, by = c("country", "year"))

unemployment

reformat to long format

unemployment_long_format <- unemployment %>%
  pivot_longer(
    cols = -c(`Indicator Name`, `Indicator Code`, `Country Name`, `Country Code`),  
    names_to = "year",
    values_to = "unemployment"
  ) %>%
  select(country = `Country Name`, year, unemployment)

unemployment_long_format$year <- as.integer(unemployment_long_format$year)

rename columns in unemployment

unemployment_long_format <- unemployment_long_format %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Czechia" ~ "Czech Republic",
    country == "Korea, Rep." ~ "South Korea",
    country == "Egypt, Arab Rep." ~ "Egypt",
    country == 'Hong Kong SAR, China' ~ 'Hong Kong',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    country == 'Viet Nam' ~ 'Vietnam',
    country == 'Turkiye' ~ 'Turkey',
    country == 'Venezuela, RB' ~ 'Venezuela',
    TRUE ~ country  
  ))

join big mac data with unemployment

joined_data <- left_join(joined_data, unemployment_long_format, by = c("country", "year"))

take a look at the dataframe

head(joined_data)
# A tibble: 6 × 13
  country  year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
  <chr>   <dbl> <chr>                  <dbl>         <dbl>      <dbl>      <dbl>
1 Argent…  2000 ARS                     2.5          2.5        NA     28181510.
2 Argent…  2001 ARS                     2.5          2.5         0     22619763.
3 Argent…  2002 ARS                     2.5          0.799       0      9838651.
4 Argent…  2003 ARS                     4.1          1.42       64     16066847.
5 Argent…  2004 ARS                     4.36         1.48        6.34  25007443.
6 Argent…  2005 ARS                     4.75         1.64        8.94  30870438.
# ℹ 6 more variables: import_usd <dbl>, net_export <dbl>, GDP <chr>,
#   gdp_per_capita <chr>, inflation <chr>, unemployment <dbl>

hdi

select columns

head(hdi)
# A tibble: 6 × 3
  country      year   hdi
  <chr>       <dbl> <dbl>
1 Afghanistan  1990 0.273
2 Afghanistan  1991 0.279
3 Afghanistan  1992 0.287
4 Afghanistan  1993 0.297
5 Afghanistan  1994 0.292
6 Afghanistan  1995 0.31 
hdi <- hdi %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Czechia" ~ "Czech Republic",
    country == "Korea, Rep." ~ "South Korea",
    country == "Egypt, Arab Rep." ~ "Egypt",
    country == 'Hong Kong SAR, China' ~ 'Hong Kong',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    country == 'Viet Nam' ~ 'Vietnam',
    country == 'Turkiye' ~ 'Turkey',
    country == 'Venezuela, RB' ~ 'Venezuela',
    TRUE ~ country  
  ))
joined_data2 <- left_join(joined_data, hdi, by = c("country", "year"))
joined_data2
# A tibble: 845 × 14
   country    year currency_code bmi_localprice bmi_usd_price bmi_change
   <chr>     <dbl> <chr>                  <dbl>         <dbl>      <dbl>
 1 Argentina  2000 ARS                     2.5          2.5        NA   
 2 Argentina  2001 ARS                     2.5          2.5         0   
 3 Argentina  2002 ARS                     2.5          0.799       0   
 4 Argentina  2003 ARS                     4.1          1.42       64   
 5 Argentina  2004 ARS                     4.36         1.48        6.34
 6 Argentina  2005 ARS                     4.75         1.64        8.94
 7 Argentina  2006 ARS                     4.75         1.55        0   
 8 Argentina  2007 ARS                     8.25         2.67       73.7 
 9 Argentina  2008 ARS                    11            3.64       33.3 
10 Argentina  2009 ARS                    11.5          3.02        4.55
# ℹ 835 more rows
# ℹ 8 more variables: export_usd <dbl>, import_usd <dbl>, net_export <dbl>,
#   GDP <chr>, gdp_per_capita <chr>, inflation <chr>, unemployment <dbl>,
#   hdi <dbl>

population

rename columns

population <- population %>%
  mutate(country = case_when(
    country == "Russian Federation" ~ "Russia",
    country == "United Kingdom" ~ "Britain",
    country == "Czechia" ~ "Czech Republic",
    country == "Korea, Rep." ~ "South Korea",
    country == "Egypt, Arab Rep." ~ "Egypt",
    country == 'Hong Kong SAR, China' ~ 'Hong Kong',
    country == 'Slovak Republic' ~ 'Slovakia',
    country == 'United Arab Emirates' ~ 'UAE',
    country == 'Viet Nam' ~ 'Vietnam',
    country == 'Turkiye' ~ 'Turkey',
    country == 'Venezuela, RB' ~ 'Venezuela',
    TRUE ~ country  
  ))
head(population)
# A tibble: 6 × 3
  country      year population
  <chr>       <dbl>      <dbl>
1 Afghanistan  1950    7480464
2 Afghanistan  1951    7571542
3 Afghanistan  1952    7667534
4 Afghanistan  1953    7764549
5 Afghanistan  1954    7864289
6 Afghanistan  1955    7971933

join with big mac

joined_data2 <- left_join(joined_data2, population, by = c("country", "year"))

check for missing data

drop tai wan change name

big_mac_combined <- joined_data2 %>%
  filter(country != "Taiwan", country != "Euro area")

change column datatype

big_mac_combined <- big_mac_combined %>%
  mutate(
    GDP = as.numeric(GDP),
    inflation = as.numeric(inflation),
    gdp_per_capita = as.numeric(gdp_per_capita)
  )
head(big_mac_combined)
# A tibble: 6 × 15
  country  year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
  <chr>   <dbl> <chr>                  <dbl>         <dbl>      <dbl>      <dbl>
1 Argent…  2000 ARS                     2.5          2.5        NA     28181510.
2 Argent…  2001 ARS                     2.5          2.5         0     22619763.
3 Argent…  2002 ARS                     2.5          0.799       0      9838651.
4 Argent…  2003 ARS                     4.1          1.42       64     16066847.
5 Argent…  2004 ARS                     4.36         1.48        6.34  25007443.
6 Argent…  2005 ARS                     4.75         1.64        8.94  30870438.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
#   gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
#   population <dbl>
summary(big_mac_combined)
   country               year      currency_code      bmi_localprice   
 Length:800         Min.   :2000   Length:800         Min.   :      1  
 Class :character   1st Qu.:2006   Class :character   1st Qu.:      6  
 Mode  :character   Median :2012   Mode  :character   Median :     21  
                    Mean   :2012                      Mean   :  15913  
                    3rd Qu.:2018                      3rd Qu.:    112  
                    Max.   :2022                      Max.   :4000000  
                                                                       
 bmi_usd_price     bmi_change         export_usd          import_usd       
 Min.   :0.640   Min.   :-100.000   Min.   :7.416e+06   Min.   :3.959e+06  
 1st Qu.:2.300   1st Qu.:   0.000   1st Qu.:7.120e+07   1st Qu.:6.518e+07  
 Median :2.978   Median :   2.565   Median :1.459e+08   Median :1.346e+08  
 Mean   :3.218   Mean   :   5.670   Mean   :2.862e+08   Mean   :2.609e+08  
 3rd Qu.:3.979   3rd Qu.:   8.065   3rd Qu.:3.285e+08   3rd Qu.:2.818e+08  
 Max.   :8.312   Max.   : 275.000   Max.   :3.292e+09   Max.   :3.383e+09  
                 NA's   :53         NA's   :53          NA's   :53         
   net_export              GDP            gdp_per_capita     inflation     
 Min.   :-963828693   Min.   :1.369e+10   Min.   :   739   Min.   :-3.686  
 1st Qu.: -23167201   1st Qu.:2.180e+11   1st Qu.:  5885   1st Qu.: 1.544  
 Median :   -804945   Median :3.908e+11   Median : 13447   Median : 2.849  
 Mean   :  25245050   Mean   :1.406e+12   Mean   : 23736   Mean   : 4.053  
 3rd Qu.:  29978044   3rd Qu.:1.103e+12   3rd Qu.: 39901   3rd Qu.: 5.062  
 Max.   :1587781425   Max.   :2.544e+13   Max.   :106177   Max.   :72.309  
 NA's   :53                                                NA's   :24      
  unemployment         hdi           population       
 Min.   : 0.130   Min.   :0.4750   Min.   :1.463e+06  
 1st Qu.: 3.745   1st Qu.:0.7320   1st Qu.:1.006e+07  
 Median : 5.280   Median :0.8340   Median :3.862e+07  
 Mean   : 6.375   Mean   :0.8134   Mean   :1.223e+08  
 3rd Qu.: 7.795   3rd Qu.:0.9080   3rd Qu.:1.044e+08  
 Max.   :28.840   Max.   :0.9620   Max.   :1.426e+09  
 NA's   :1        NA's   :53       NA's   :53         
# Summarize data to find the range of years available for each country
year_summary <- big_mac_combined %>%
  group_by(country) %>%
  summarize(StartYear = min(year), EndYear = max(year)) %>%
  ungroup()

# Print the summary
print(year_summary)
# A tibble: 53 × 3
   country    StartYear EndYear
   <chr>          <dbl>   <dbl>
 1 Argentina       2000    2022
 2 Australia       2000    2022
 3 Azerbaijan      2021    2022
 4 Bahrain         2021    2022
 5 Brazil          2000    2022
 6 Britain         2000    2022
 7 Canada          2000    2022
 8 Chile           2000    2022
 9 China           2000    2022
10 Colombia        2004    2022
# ℹ 43 more rows
# Your previous code to summarize StartYear and EndYear by country
year_summary <- big_mac_combined %>%
  group_by(country) %>%
  summarize(StartYear = min(year), EndYear = max(year)) %>%
  ungroup()

# Group by StartYear and list countries along with their EndYear
grouped_summary <- year_summary %>%
  group_by(StartYear) %>%
  summarize(Countries = paste(country, "(EndYear:", EndYear, ")", sep = "", collapse = "; "),
            .groups = 'drop') %>%
  ungroup()

# Print the grouped summary
print(grouped_summary)
# A tibble: 7 × 2
  StartYear Countries                                                           
      <dbl> <chr>                                                               
1      2000 Argentina(EndYear:2022); Australia(EndYear:2022); Brazil(EndYear:20…
2      2001 Philippines(EndYear:2022)                                           
3      2002 Norway(EndYear:2022); Peru(EndYear:2022); Turkey(EndYear:2022)      
4      2003 Egypt(EndYear:2022)                                                 
5      2004 Colombia(EndYear:2022); Pakistan(EndYear:2022); Saudi Arabia(EndYea…
6      2011 India(EndYear:2022)                                                 
7      2021 Azerbaijan(EndYear:2022); Bahrain(EndYear:2022); Costa Rica(EndYear…
filtered_big_mac_data <- big_mac_combined %>%
  filter(!(year %in% c(2022, 2023)))

# View the first few rows of the filtered dataframe to confirm
head(filtered_big_mac_data)
# A tibble: 6 × 15
  country  year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
  <chr>   <dbl> <chr>                  <dbl>         <dbl>      <dbl>      <dbl>
1 Argent…  2000 ARS                     2.5          2.5        NA     28181510.
2 Argent…  2001 ARS                     2.5          2.5         0     22619763.
3 Argent…  2002 ARS                     2.5          0.799       0      9838651.
4 Argent…  2003 ARS                     4.1          1.42       64     16066847.
5 Argent…  2004 ARS                     4.36         1.48        6.34  25007443.
6 Argent…  2005 ARS                     4.75         1.64        8.94  30870438.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
#   gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
#   population <dbl>

Checking the missing data and highlighting the countries within the gdp_per_employed and inflation columns

missing_values_summary <- filtered_big_mac_data %>%
  filter(is.na(inflation)) %>% # Adjust 'price' to the column you're interested in
  group_by(country, year) %>%
  summarise(MissingCount = n(), .groups = 'drop') %>%
  arrange(country, year)

# Display the summary
print(missing_values_summary)
# A tibble: 22 × 3
   country    year MissingCount
   <chr>     <dbl>        <int>
 1 Argentina  2000            1
 2 Argentina  2001            1
 3 Argentina  2002            1
 4 Argentina  2003            1
 5 Argentina  2004            1
 6 Argentina  2005            1
 7 Argentina  2006            1
 8 Argentina  2007            1
 9 Argentina  2008            1
10 Argentina  2009            1
# ℹ 12 more rows

Impute the the inflation data for Argentina from [World Data Info Link](https://www.worlddata.info/america/argentina/inflation-rates.php)

library(readxl)
arg_inflation_data <- read_excel("data/ARG_inflation_Data.xlsx")


arg_inflation_data_relevant <- arg_inflation_data %>%
  select(year, inflation)


filtered_big_mac_data_imputed <- filtered_big_mac_data %>%
  
  mutate(inflation_arg = ifelse(country == "Argentina", arg_inflation_data$inflation[match(year, arg_inflation_data$year)], NA)) %>%
  mutate(inflation = ifelse(is.na(inflation) & country == "Argentina", inflation_arg, inflation)) %>%
  select(-inflation_arg)

print(filtered_big_mac_data_imputed)
# A tibble: 747 × 15
   country    year currency_code bmi_localprice bmi_usd_price bmi_change
   <chr>     <dbl> <chr>                  <dbl>         <dbl>      <dbl>
 1 Argentina  2000 ARS                     2.5          2.5        NA   
 2 Argentina  2001 ARS                     2.5          2.5         0   
 3 Argentina  2002 ARS                     2.5          0.799       0   
 4 Argentina  2003 ARS                     4.1          1.42       64   
 5 Argentina  2004 ARS                     4.36         1.48        6.34
 6 Argentina  2005 ARS                     4.75         1.64        8.94
 7 Argentina  2006 ARS                     4.75         1.55        0   
 8 Argentina  2007 ARS                     8.25         2.67       73.7 
 9 Argentina  2008 ARS                    11            3.64       33.3 
10 Argentina  2009 ARS                    11.5          3.02        4.55
# ℹ 737 more rows
# ℹ 9 more variables: export_usd <dbl>, import_usd <dbl>, net_export <dbl>,
#   GDP <dbl>, gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>,
#   hdi <dbl>, population <dbl>

Will now, remove the data from Venezuela as it has both missing datas from gdp and inflation

cleaned_big_mac_data <- filtered_big_mac_data_imputed %>%
  filter(country != "Venezuela")
head(cleaned_big_mac_data)
# A tibble: 6 × 15
  country  year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
  <chr>   <dbl> <chr>                  <dbl>         <dbl>      <dbl>      <dbl>
1 Argent…  2000 ARS                     2.5          2.5        NA     28181510.
2 Argent…  2001 ARS                     2.5          2.5         0     22619763.
3 Argent…  2002 ARS                     2.5          0.799       0      9838651.
4 Argent…  2003 ARS                     4.1          1.42       64     16066847.
5 Argent…  2004 ARS                     4.36         1.48        6.34  25007443.
6 Argent…  2005 ARS                     4.75         1.64        8.94  30870438.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
#   gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
#   population <dbl>
# Convert year to numeric to ensure proper filtering
cleaned_big_mac_data$year <- as.numeric(as.character(cleaned_big_mac_data$year))

# Filter for years 2002 to 2021
data_2002_2021 <- cleaned_big_mac_data %>%
  filter(year >= 2002 & year <= 2021)

# Identify countries with data for each year from 2002 to 2021
countries_with_complete_data <- data_2002_2021 %>%
  group_by(country) %>%
  filter(n_distinct(year) == (2021 - 2002 + 1)) %>%
  ungroup()

countries_with_complete_data <- countries_with_complete_data %>%
  mutate(country = ifelse(country == "Britain", "United Kingdom", country),
         country = ifelse(country == "Czech Republic", "Czech Rep.", country),
         country = ifelse(country == "South Korea", "Korea", country),
         country = ifelse(country == "Switzerland", "Switzerland", country))

# View the first few rows to confirm
head(countries_with_complete_data)
# A tibble: 6 × 15
  country  year currency_code bmi_localprice bmi_usd_price bmi_change export_usd
  <chr>   <dbl> <chr>                  <dbl>         <dbl>      <dbl>      <dbl>
1 Argent…  2002 ARS                     2.5          0.799       0      9838651.
2 Argent…  2003 ARS                     4.1          1.42       64     16066847.
3 Argent…  2004 ARS                     4.36         1.48        6.34  25007443.
4 Argent…  2005 ARS                     4.75         1.64        8.94  30870438.
5 Argent…  2006 ARS                     4.75         1.55        0     37603723.
6 Argent…  2007 ARS                     8.25         2.67       73.7   49020947.
# ℹ 8 more variables: import_usd <dbl>, net_export <dbl>, GDP <dbl>,
#   gdp_per_capita <dbl>, inflation <dbl>, unemployment <dbl>, hdi <dbl>,
#   population <dbl>

combine with country

country_all <- data.frame(country = unique(countries_with_complete_data$country))

map continent

continent_mapping <- data.frame(
  country = c("Argentina", "Australia", "Brazil", "United Kingdom", "Canada", "Chile", "China", "Czech Rep.", "Denmark", "Hong Kong", "Hungary", "Indonesia", "Japan", "Malaysia", "Mexico", "New Zealand", "Peru", "Philippines", "Poland", "Russia", "Singapore", "South Africa", "Korea", "Sweden", "Switzerland", "Thailand", "Turkey", "United States"),
  continent = c("South America", "Oceania", "South America", "Europe", "North America", "South America", "Asia", "Europe", "Europe", "Asia", "Europe", "Asia", "Asia", "Asia", "North America", "Oceania", "South America", "Asia", "Europe", "Europe", "Asia", "Africa", "Asia", "Europe", "Europe", "Asia", "Asia", "North America")
)

# Merge the continent mapping with your original DataFrame 'df'
country_all <- merge(country_all, continent_mapping, by = "country", all.x = TRUE)

map g7, g20, eu

g7_countries <- c("Canada", "France", "Germany", "Italy", "United Kingdom", "United States", "Japan")

g20_countries <- c("Argentina", "Australia", "Brazil", "Canada", "China", "France", "Germany", "India", "Indonesia", "Italy", "Japan", "Mexico", "Russia", "Saudi Arabia", "South Africa", "South Korea", "Turkey", "United Kingdom", "United States", "European Union")

eu_countries <- c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Rep.", "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden")

brics_countries <- c("Brazil", "Russia", "India", "China", "South Africa","Egypt","Ethiopia","Iran","United Arab Emirates")
country_all <- country_all %>%
  mutate(
    g7 = country %in% g7_countries,
    g20 = country %in% g20_countries,
    eu = country %in% eu_countries,
    brics = country %in% brics_countries)

take a look at the countries

head(country_all)
    country     continent    g7   g20    eu brics
1 Argentina South America FALSE  TRUE FALSE FALSE
2 Australia       Oceania FALSE  TRUE FALSE FALSE
3    Brazil South America FALSE  TRUE FALSE  TRUE
4    Canada North America  TRUE  TRUE FALSE FALSE
5     Chile South America FALSE FALSE FALSE FALSE
6     China          Asia FALSE  TRUE FALSE  TRUE
# Save cleaned_big_mac_data dataframe to a CSV file
write.csv(countries_with_complete_data, file = "data/bmi_data.csv")
write.csv(country_all, file = "data/country_data.csv")