Goal

This page documents how the HDI table was cleaned and formatted into a usable dataset. We remove the first 8 rows, remove specific columns (4, 6, 8, 10, 12, 13, 14, 15), keep only complete-country rows, and found the means for key variables.The code begins by loading the required packages. The readxl package is used to import the Excel file, while dplyr and tidyr are used for data manipulation tasks such as renaming variables, converting data types, summarising values, and removing missing observations.

Next, the Excel file is read from the data/input folder. Keeping the raw input file separate from outputs ensures the workflow is organized and reproducible. After importing the data, the first eight rows are removed because they contain metadata and formatting text rather than actual country-level observations.

The dataset also includes several unnecessary columns created by the way the Excel table is structured. These columns are removed by position so that only the six required variables remain. Once the correct columns are selected, they are renamed to clear and consistent variable names that reflect the assignment requirements.

Because Excel often imports numeric values as text, the code explicitly converts all numeric variables—including HDI rank, life expectancy, schooling measures, and GNI per capita—into numeric format. This step is essential for performing calculations correctly.

Countries with missing values are then removed using drop_na(). This ensures that the final dataset contains only complete country observations, as required by the assignment.

After cleaning, the final dataset is saved as hdi_clean.csv in the data/output folder. To confirm that the cleaning process was successful, the code calculates the mean of life expectancy at birth, expected years of schooling, mean years of schooling, and GNI per capita. These summary statistics are displayed and also saved as means.csv, providing a clear test that the data is properly formatted and numeric.

Setup

library(readxl)
library(dplyr)
library(tidyr)
library(here)
raw_df <- read_excel(here::here("data", "input", "HDR25_Statistical_Annex_HDI_Table.xlsx"))
## New names:
## • `` -> `...1`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
## • `` -> `...10`
## • `` -> `...11`
## • `` -> `...12`
## • `` -> `...13`
## • `` -> `...14`
## • `` -> `...15`
dim(raw_df)
## [1] 278  15
df <- raw_df[-c(1:8), ]
dim(df)
## [1] 270  15
df <- df[, -c(3, 4, 6, 8, 10, 12, 13, 14, 15)]
dim(df)
## [1] 270   6
colnames(df) <- c(
  "hdi_rank",
  "country",
  "life_expectancy",
  "expected_years_schooling",
  "mean_years_schooling",
  "gni_per_capita"
)

head(df)
## # A tibble: 6 × 6
##   hdi_rank country     life_expectancy    expected_years_schooling mean_years_schooling gni_per_capita    
##   <chr>    <chr>       <chr>              <chr>                    <chr>                <chr>             
## 1 2        Norway      83.308000000000007 18.792850489999999       13.117962179999999   112710.0211       
## 2 2        Switzerland 83.953999999999994 16.667530060000001       13.94912109          81948.901769999997
## 3 4        Denmark     81.933000000000007 18.704010010000001       13.027320599999999   76007.856690000001
## 4 5        Germany     81.378             17.30921936              14.296371629999999   64053.221239999999
## 5 5        Sweden      83.262             18.991470339999999       12.740325739999999   66102.085949999993
## 6 7        Australia   83.923000000000002 20.654779430000001       12.869999890000001   58276.876429999997
df <- df %>%
  mutate(
    hdi_rank = as.numeric(hdi_rank),
    life_expectancy = as.numeric(life_expectancy),
    expected_years_schooling = as.numeric(expected_years_schooling),
    mean_years_schooling = as.numeric(mean_years_schooling),
    gni_per_capita = as.numeric(gni_per_capita)
  )
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `mean_years_schooling = as.numeric(mean_years_schooling)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
df <- df %>% drop_na()
dim(df)
## [1] 192   6
dir.create("data/output", recursive = TRUE, showWarnings = FALSE)
write.csv(df, "data/output/hdi_clean.csv", row.names = FALSE)
means <- df %>%
  summarise(
    mean_life_expectancy = mean(life_expectancy),
    mean_expected_years_schooling = mean(expected_years_schooling),
    mean_mean_years_schooling = mean(mean_years_schooling),
    mean_gni_per_capita = mean(gni_per_capita)
  )

write.csv(means, "data/output/means.csv", row.names = FALSE)