This page documents how the HDI table was cleaned and formatted into a usable dataset. We remove the first 8 rows, remove specific columns (4, 6, 8, 10, 12, 13, 14, 15), keep only complete-country rows, and found the means for key variables.The code begins by loading the required packages. The readxl package is used to import the Excel file, while dplyr and tidyr are used for data manipulation tasks such as renaming variables, converting data types, summarising values, and removing missing observations.
Next, the Excel file is read from the data/input folder. Keeping the raw input file separate from outputs ensures the workflow is organized and reproducible. After importing the data, the first eight rows are removed because they contain metadata and formatting text rather than actual country-level observations.
The dataset also includes several unnecessary columns created by the way the Excel table is structured. These columns are removed by position so that only the six required variables remain. Once the correct columns are selected, they are renamed to clear and consistent variable names that reflect the assignment requirements.
Because Excel often imports numeric values as text, the code explicitly converts all numeric variables—including HDI rank, life expectancy, schooling measures, and GNI per capita—into numeric format. This step is essential for performing calculations correctly.
Countries with missing values are then removed using drop_na(). This ensures that the final dataset contains only complete country observations, as required by the assignment.
After cleaning, the final dataset is saved as hdi_clean.csv in the data/output folder. To confirm that the cleaning process was successful, the code calculates the mean of life expectancy at birth, expected years of schooling, mean years of schooling, and GNI per capita. These summary statistics are displayed and also saved as means.csv, providing a clear test that the data is properly formatted and numeric.
library(readxl)
library(dplyr)
library(tidyr)
library(here)
raw_df <- read_excel(here::here("data", "input", "HDR25_Statistical_Annex_HDI_Table.xlsx"))
## New names:
## • `` -> `...1`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
dim(raw_df)
## [1] 278 15
df <- raw_df[-c(1:8), ]
dim(df)
## [1] 270 15
df <- df[, -c(3, 4, 6, 8, 10, 12, 13, 14, 15)]
dim(df)
## [1] 270 6
colnames(df) <- c(
"hdi_rank",
"country",
"life_expectancy",
"expected_years_schooling",
"mean_years_schooling",
"gni_per_capita"
)
head(df)
## # A tibble: 6 × 6
## hdi_rank country life_expectancy expected_years_schooling mean_years_schooling gni_per_capita
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2 Norway 83.308000000000007 18.792850489999999 13.117962179999999 112710.0211
## 2 2 Switzerland 83.953999999999994 16.667530060000001 13.94912109 81948.901769999997
## 3 4 Denmark 81.933000000000007 18.704010010000001 13.027320599999999 76007.856690000001
## 4 5 Germany 81.378 17.30921936 14.296371629999999 64053.221239999999
## 5 5 Sweden 83.262 18.991470339999999 12.740325739999999 66102.085949999993
## 6 7 Australia 83.923000000000002 20.654779430000001 12.869999890000001 58276.876429999997
df <- df %>%
mutate(
hdi_rank = as.numeric(hdi_rank),
life_expectancy = as.numeric(life_expectancy),
expected_years_schooling = as.numeric(expected_years_schooling),
mean_years_schooling = as.numeric(mean_years_schooling),
gni_per_capita = as.numeric(gni_per_capita)
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `mean_years_schooling = as.numeric(mean_years_schooling)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
df <- df %>% drop_na()
dim(df)
## [1] 192 6
dir.create("data/output", recursive = TRUE, showWarnings = FALSE)
write.csv(df, "data/output/hdi_clean.csv", row.names = FALSE)
means <- df %>%
summarise(
mean_life_expectancy = mean(life_expectancy),
mean_expected_years_schooling = mean(expected_years_schooling),
mean_mean_years_schooling = mean(mean_years_schooling),
mean_gni_per_capita = mean(gni_per_capita)
)
write.csv(means, "data/output/means.csv", row.names = FALSE)