library(dplyr)
library(dbplyr)
library(DBI)
library(janitor)
library(lubridate)
library(tibble)
library(tidyr)
library(purrr)
library(ggplot2)
library(stats)
library(infer)
library(stringr)
library(styler)
library(tidylog)
library(lintr)
library(ggmap)
library(leaflet)
library(remedy)
library(tidygeocoder)
library(readr)
library(knitr)
library(rmarkdown)

Load the data from SQL

To load the data to R from SQL I first have to create a connection with the database “zh_politicians.db” thanks to dbConnect(). Then I need to see and extract every table, in this case there are four.

zh_politicians <- dbConnect(RSQLite::SQLite(), "zh_politicians.db")
dbListTables(zh_politicians)
## [1] "ADDRESSES"    "AFFILIATIONS" "MANDATES"     "PERSONS"

I have to create a variable for each table. I will start with “addresses”. Since the database is relatively small and I feel more at ease working with {dplyr} I use collect() immediately after loading each table into R Studio.

addresses <- tbl(zh_politicians, "addresses") %>% 
  collect() %>% 
  as_tibble() %>% 
  clean_names()

I create the “affiliation” tibble from the database.

affiliations <- tbl(zh_politicians, "affiliations") %>% 
  collect() %>% 
  as_tibble() %>% 
  clean_names()

Create a tibble for “mandates”.

mandates <- tbl(zh_politicians, "mandates") %>% 
  collect() %>% 
  as_tibble() %>% 
  clean_names()

Create the tibble for “persons”.

persons <- tbl(zh_politicians, "persons") %>% 
  collect() %>% 
  as_tibble() %>% 
  clean_names()

I can close the connection with the database now.

dbDisconnect(zh_politicians)

Exploring the assembly composition

Part 1

Instructions

On a first stage, let’s look at the evolution of the active mandates in each of the assemblies. The MANDATES table has information about when people were elected, when their mandate ended and on which assembly they sat.

Explanation

I need to know the years the politicians have been active, this information are all the years between the “mandate_start_year” and “mandate_end_year”. I first want to exclude the empty cells from “mandate_start_year” with filter. I also want to consider the mandates without a “mandate_end_year” value, by replacing 0 by the current year. So I assume all on-going mandates have ended in 2025.

Then I have to use the function seq() which allows to extract the in-between values but doesn’t work with collections. That’s why it has to be used in combination with map2() that can apply a function to a collection, and in this case I have 2 collections/arguments: the start and end date.

I use map2() to apply a custom function to each pair of mandate_start_year and mandate_end_year. Since I want to generate a sequence of years for each mandate, but treat end_year == 0 as the current year, I need to write a function that includes this condition. The tilde ~ introduces an anonymous function, where .x refers to the start year and .y to the end year. This function returns a sequence of active years for each mandate, even when the end year is missing or ongoing.

~ {end <- if (.y==0) current_year else .y

The result returns a list within the “active_year” column, therefore, I have to use unnest_long() and now have a row for each year a politician has been active.

Once the “active_years” column is unnested, I need to obtain the number of active mandates per years per assembly, since each assembly will be represented by a line. To do so, I group by the active years and the assembly. The number of active mandates equals the number of rows per active years, since each row is an active mandate by a politician for a given year.

current_year <- 2025

mandates1 <- mandates %>%
  filter(mandate_start_year != "0") %>% 
  mutate(active_years = map2(mandate_start_year, mandate_end_year, ~ {end <- if (.y==0) current_year else .y
  seq(.x, end)})) %>% 
  unnest_longer(active_years)

Creating a line plot

Once the “active_years” column is unnested, I need to obtain the number of active mandates per years per assembly, since each assembly will be represented by a line. To do so, I group by the active years and the assembly. The number of active mandates equals the number of rows per active years, since each row is an active mandate by a politician for a given year.

After this step, I have to create the line plot to represent the number of active mandates per assemblies throughout the years.

mandates1 %>% 
  group_by(active_years, assembly) %>%
  summarise(n_active_year = n()) %>% 
  ungroup() %>% 
  ggplot(aes(x=active_years, y= n_active_year, colour = assembly))+
  geom_line() +
  theme_minimal()+
  labs(title = "Number of active mandates each year",
       subtitle = "The 'peaks' are caused by election years when multiple mandates were active for the same seat",
       x = "years",
       y= "n")

Comment

This plot shows that the data is incomplete: the period of time of available information for the active mandates for the different assemblies vary. Some periods of time are not covered by assemblies related data, as if there was no active mandates for the Small Council from 1900 to 2000. More precisely, there is no available data for the number of active mandates for the Small Council after 1850, and the available data for the Grand Council stops after 1870. This means that the start and end dates for politicians elected for those assemblies is not available, or that there a no members. On terms of number of active mandates, the highest numbers are for the Cantonal Council and Grand Council, which fluctuate between 200 and 400 active mandates. The number of active mandates for the Executive Council is very stable from 1850 to 2025 and also very low, maybe around 10, which means that there are not many active politicians per year.

Part 2

Instructions

Let’s explore the dynamics of the gender composition of these assemblies throughout the years. We will create a facet chart with one chart per assembly in each facet, one line for the active mandates held by women and another line for those held by men.

How does the gender composition of these active mandates evolve? Write down your remarks.

Explanation

I have to add the gender variable which is in another tibble called “persons”, therefore I will need left_join(). I’d like to check if I will need to discard NA in the “gender” column before proceeding.

persons %>% 
  tabyl(gender) %>% 
  kable()
gender n percent
5 0.0011644
m 4040 0.9408477
w 249 0.0579879

Comment

The percentage of empty cells for the “gender” variable is under 1% but I observe that the percentage of women is also significantly lower than the male percentage, with only around 6% for the whole data set. This will impact the analysis.

Explanation

First I will create a subset of the “persons” tibble with only the “id” and “gender” column so I don’t overburden my “mandates1” tibble from the previous exercise with unnecessary data. I see that there are some missing values in the “gender” column: some ids don’t have any matching gender. I will exclude them given their low proportion.

gender <- persons %>% 
  select(id, gender) %>% 
  filter(gender != "")

gender %>% 
distinct(gender)

Secondly, I will merge the two tibbles: the one I just created and “mandates1” from the previous exercise which contains the “active_year column” with “gender”. I will also change the gender values from “m” to “male” and “w” to “female” and drop any NA. Next, I group by the active years, the assembly and gender and create a new column with mutate() with the number of rows . This will allow to see the number of active mandates per year, gender and assembly.

mandates2 <- mandates1 %>% 
  left_join(gender, by = c("person_id" = "id")) %>% 
  drop_na(gender) %>% 
  mutate(gender = case_when(
    gender == "w" ~ "female",
    gender == "m" ~ "male",
    TRUE ~ gender
  )) %>%
  group_by(active_years, assembly, gender) %>%
  mutate(n_active_year = n()) %>% 
  ungroup()

Creating a line plot

Thirdly I will create a line plot, using facet_wrap() for each assembly, the x axis represents the years and the y axis represents the number of active mandates.

mandates2 %>%
  ggplot(mapping= aes(x = active_years,
                      y = n_active_year,
                      colour = gender,
                      group = gender))+
  geom_line() +
  facet_wrap(vars(assembly), scales = "free") +
  labs(title = "Gender composition of the assemblies from 1803 to 2025",
       subtitle= "Using facets for each assembly",
       x = " years",
       y = "n")

Comment

As shown in the first chart, the Small Council has little data points, it only had 2 active mandates un 1928 or 1929 and 1 active mandate from 1800 to 1850. In addition, the proportion of women is so small that they only appear on the Cantonal Council facet, where they have up to around 70 active mandates from 1900 to 1975, and Executive Council, where they have 1 to 3 active mandates from 2000 onward. As seen before, the Executive Council counts fewer active mandates than the Grand Council and the Cantonal Council.

Part 3

Instructions

Now it’s time to analyse the Party composition of the assemblies. Create a pie chart showing the percentage of active politicians per party in the year 2000. Since we want to show this by assembly, use facet charts.

Explanation

Before I proceed to use the party affiliation tibble, I want to inspect the column “party”.

affiliations %>% 
  tabyl(party) %>% 
  paged_table()

Comment

Unfortunately, the “affiliations” tibble has many missing values, especially for the affiliation party, which is the information that interest me. The NA actually represent 51% of the data for the party variable. There is no way for me to infer a party to these politicians and given the importance of the missing values I can’t exclude them from my analysis. I also want to see how many rows there are per “person_id”.

duplicates <- affiliations %>% 
  group_by(person_id)%>% 
  filter(n()>1)

paged_table(duplicates)

Comment

I can see there are many duplicates in the affiliation tibble, due either of party changes and/or dates of affiliation changes.

Explanation

I understand that the indicator of the start and end year of affiliation doesn’t necessary mean that the politicians were indeed active during a mandate. So the right indicator should be the mandate and not the affiliation years. So I will take the tibble I had created for the previous exercise and merge it with a simplified tibble only containing the parties and id, called “party”. The empty cells in the “party” column will be replaced by “NA”.

party <- affiliations %>% 
  mutate(party = if_else(party == "", "NA", party)) %>% 
  select(mandate_id, person_id, party)

In the following code, I’m merging the “party” tibble with the “mandate2” tibble containing the gender variable. I’m also filtering the active years column to only get the mandates actives in 2000.

mandates3 <- mandates2 %>% 
  left_join(party, by = c("id" = "mandate_id", "person_id")) %>% 
  filter(active_years == "2000")

Comment

There are only 2 assemblies in this tibble. I will continue with this data and count the number of active mandates per assembly in the column “n_assembly”.

count_assembly <- mandates3 %>% 
  group_by(assembly) %>% 
  mutate(n_assembly = n()) %>% 
  ungroup()

count_assembly %>% 
  distinct(n_assembly)

Comment

There are 213 politicians with active mandates in the Cantonal Council and 9 in the Cantonal Council. As a consequence, I can’t use the same total when computing the percentage of parties per assembly.

In the following code, I will create a new column “party_assembly” which includes the number of entries per party per assembly and will be useful to compute the percentage of party per assembly.

count_party_assembly <- count_assembly %>% 
  group_by(assembly, party) %>% 
  mutate(party_assembly = n()) %>% 
  ungroup()

I can now compute the percentage because I have the total politicians per assembly in “n_assembly” and the “party_assembly” values.

mandates_percentage <- count_party_assembly %>% 
  group_by(assembly, party) %>% 
  mutate(percentage = round((100 * party_assembly)/n_assembly, 2)) %>% 
  ungroup()

Creating a pie chart

In order to have a clean chart, I have to keep the unique entries per party, assembly and percentage.

mandates_percentage %>% 
  distinct(party, assembly, percentage) %>% 
  ggplot(aes(x="", y= percentage, fill = party)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  facet_wrap(vars(assembly)) +
  labs(title = "Pie Chart of the Party composition of the assemblies in 2000 in Zurich")

Comment

The Cantonal Council is more fragmented in terms of parties, than the Executive Council, where only 3 parties are represented: the FPS, the SVP and SP. I don’t have a pie chart for the other two councils since there was no available data.

Grouping the small parties

To make the pie charts more easy to read I will group all parties with a percentage lower than 2% under the label “small parties”. Which means I have to take my original tibble with the percentages “mandates_percentage”, group all parties with the percentage < 2 and call them “small parties”, count again the entries of the parties per assemblies, and calculate the new percentages which will affect the Cantonal Council.

mandates_percentage_small <- mandates_percentage %>% 
  mutate(party = if_else(percentage < 2, "small parties", party)) %>% 
  group_by(party, assembly) %>% 
  mutate(party_assembly = n()) %>% 
  ungroup() %>% 
  group_by(assembly) %>% 
  mutate(percentage = round((100 * party_assembly)/n_assembly, 2)) %>% 
  ungroup()

Creating a second pie chart with the small parties

mandates_percentage_small %>%
  distinct(party, assembly, percentage) %>% 
  ggplot(aes(x="", y= percentage, fill = party)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  facet_wrap(vars(assembly))+
  labs(title = "Pie Chart of the Party composition of the assemblies in 2000 in Zurich",
       subtitle = "The partie with less than 2% have been grouped into small parties")

Comment

The Cantonal Council is composed by, in order of importance, the SVP, the SP and FDP parties. These 3 parties are the same represented in the Executive Council, expect the FDP represents 50% versus less than 20% in the Cantonal Council. The rest of the parties represented in the Cantonal Council are CVP, small parties, EVP, Grüne and GP.

Creating a table with kableExtra

This table displays the percentage per party per assembly.

mandates_table <- mandates_percentage_small %>% 
  distinct(party, percentage, assembly) %>% 
  arrange(desc(assembly))

library(kableExtra)
mandates_table %>% 
  arrange(desc(assembly)) %>% 
  kbl(caption = "Party affiliations in percentage in 2000") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Party affiliations in percentage in 2000
party percentage assembly
FDP 50.00 Executive Council
SVP 25.00 Executive Council
SP 25.00 Executive Council
SP 24.87 Cantonal Council
CVP 6.74 Cantonal Council
FDP 19.17 Cantonal Council
SVP 32.12 Cantonal Council
EVP 5.18 Cantonal Council
GP 2.07 Cantonal Council
Grüne 3.63 Cantonal Council
small parties 6.22 Cantonal Council

Part 4

Instructions

Let’s take a look at the composition of the assemblies, not restricting to the year 2000, but in the whole data set. This time, let’s use a line plot to show how the percentages of party affiliations for active politicians have evolved over the years.

Explanation

For this exercise I focus on the affiliation tibble. And I want to identify the years during which the politicians’ affiliations were active. To to so I have to use the same logic as before with the active mandates.

I will apply the map2() and seq() method again on a merged tibble I have created “mandates_aff” which was a joint tibble from “mandates” and “affiliations”.

mandates_aff <- affiliations %>% 
  left_join(mandates, by = c("person_id", "mandate_id"= "id"))

I also replace the empty cells with “parteilos”, since I don’t want to discard them, or consider them as NA. I decide to consider them without party.

mandates_clean <- mandates_aff %>%
  filter(mandate_start_year != "0") %>% 
  mutate(active_years = map2(mandate_start_year, mandate_end_year, ~ {end <- if (.y==0) current_year else .y
  seq(.x, end)})) %>% 
  mutate(party = if_else(party == "", "parteilos", party)) %>% 
  unnest_longer(active_years)

I see one first problem with this exercise. I know from the data that I won’t have the same number of politicians per year. Therefore the total for each year will vary and the years won’t be comparable if I don’t isolate the number of politician per year and party. Therefore I have to count the number of politician per party and active years.

party_counts <- mandates_clean %>%
  distinct(active_years, person_id, party) %>%  # one row per person-year-party
  group_by(active_years, party) %>%
  summarise(n_party = n()) %>% 
  ungroup()

Now I have to calculate the percentage based on the number of entries of politicians per party per year. The total here is the sum of “n_party” per active year.

party_trends <- party_counts %>%
  group_by(active_years) %>%
  mutate(total = sum(n_party),
         percentage = 100 * n_party / total) %>%
  ungroup()

And finally I have to group the parties which have a percentage below 2% into “small parties”.

party_trends_grouped <- party_trends %>%
  mutate(party_grouped = if_else(percentage < 2, "small parties", party)) %>%
  group_by(active_years, party_grouped) %>%
  summarise(n_party = sum(n_party)) %>%
  ungroup() %>% 
  group_by(active_years) %>%
  mutate(total = sum(n_party),
         percentage = round(100 * n_party / total, 2)) %>%
  ungroup()

Creating line plots

This line plot represents the evolution of party affiliation in percentage in Zurich over the years. To make the chart less cluttered due to the number of parties, I create another tibble “top_parties” to focus the line plot on. On the downside it means I ignore the very small parties. The top parties should have the maximum percentage on average throughout the years.

top_parties <- party_trends_grouped %>%
  group_by(party_grouped) %>%
  summarise(avg = mean(percentage, na.rm = TRUE)) %>%
  slice_max(avg, n = 30) %>%
  pull(party_grouped)

party_trends_grouped %>% 
  filter(party_grouped %in% top_parties) %>%
  ggplot(mapping =aes(
  x = active_years, 
  y = percentage, 
  color = party_grouped)) +
  geom_line() +
  labs(title = "Party Affiliation Over Time in Zurich",
       subtitle = "Parties under 2% grouped as 'Small parties'",
       x = "Year",
       y = "Percentage of Active Politicians",
       color = " parties") +
  theme_minimal()

Comment

The “parteilos” trend reaches 100% in the early year because I replaced the NAs with “parteilos” and all affiliation data for this period is missing. Around 1900 the percentage of party affiliation for different parties rises and the “parteilos” proportion decreases.

To determine the evolution of the top 3 parties I created a second plot.

top3 <- party_trends_grouped %>%
  filter(party_grouped %in% c("SVP", "SP", "FDP")) 

top3 %>% 
ggplot(mapping= aes(
  x = active_years, 
  y = percentage, 
  color = party_grouped)) +
  geom_line() +
  labs(title = "Party Affiliation Over Time in Zurich",
       subtitle = "Evolution of the top 3 parties",
       x = "Year",
       y = "Percentage of Active Politicians",
       color = "parties") + 
  theme_minimal()

Comment

Both the SP and FDP followed the same trend around 1960 and reached 22% in 1980. After this time, the SVP gained affiliations and outgrew both SP and FDP a little before 2000. Nowadays, the first party is SVP at 30%, then SP at 20% and FDP around 17%.

Analysing the Lifespans of Politicians

Let’s analyse some aspects of the lives of our politicians. For the politicians that have a YEAR_OF_DEATH in the data, compute their life span.

Part 5

Instructions

Does the lifespan change if the politician has any TITLE ? Draw an appropriate plot to answer the question.

Then, run a statistical test to assess if the average life span is different for politicians who have a TITLE compared to those without any TITLE.

Explanation

I understand that I need a logical column or at lease a binary column called “title” with only 2 values (yes/no). However, the first difficulty is that I need to compute the lifespan from the politicians by subtracting the year of death and year of birth.

I start by exploring the data to understand the data types I’m working with.

glimpse(persons)
## Rows: 4,294
## Columns: 8
## $ id            <int> 6, 13, 28, 31, 32, 38, 39, 45, 46, 52, 53, 54, 55, 56, 5…
## $ lastname      <chr> "Zollinger", "Petermann", "Rappold", "Frick", "Gubser", …
## $ firstname     <chr> "Martin", "Albert L.", "Niklaus", "Robert", "Albert", "M…
## $ title         <chr> "Dr. iur.", "", "Dr.", "Dr.", "Dr. med.", "Dr.", "Dr.", …
## $ gender        <chr> "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "m", "…
## $ year_of_birth <int> 1944, 1937, 1901, 1925, 1901, 1930, 1910, 1894, 1914, 19…
## $ year_of_death <chr> "", "", "", "", "", "", "", "", "", "", "", "1978", "", …
## $ activity      <chr> "Bankdirektor Zürich", "Gewerbesekretär", "Rechtsanwalt"…

Comment

I can see that I could transform “year_of_death” into an integer. I decide to clean the tibble by replacing all the empty cells by NA so I can get a clear view of the missing values for the columns “id”, “title”, “year of birth” and years of death”

persons_na <- persons %>% 
  mutate(title = if_else(title == "", NA_character_, title)) %>% 
  mutate(year_of_birth = if_else(year_of_birth == 0, NA_integer_, year_of_birth)) %>% 
  mutate(year_of_death = if_else(year_of_death == "", NA_character_, year_of_death)) %>% 
  mutate(id = if_else(id == "", NA_integer_, id)) %>% 
  mutate(gender = if_else(gender == "", NA_character_, gender)) %>% 
  mutate(year_of_death = as.numeric(year_of_death))
  
persons_na %>% 
  visdat::vis_miss(
    sort_miss = TRUE,
    cluster = TRUE)

Comment

There are 82% of missing values under the title column, which reflects an unbalanced situation and for the next exercises, I will consider the missing values as “no title”.

The percentage of missing values for the “year_of_death” is of 87% and 21% for the “year_of_birth”. There is also a very small portion of NA for gender. I should either find a way to impute a value for the year of birth or discard those values. Regarding the year of death, there are two solutions: either the politician is still alive, either it is indeed a missing data point that should be replaced with a substituted value. I have to be careful I don’t discard too many missing rows from year of death because I would have a very small sample for my statistical analysis. But on the other hand, replacing many missing values by one value, like the mean or median would influence the results significantly and reduce the variance. It looks like a part of the missing values for the year of birth also correspond to missing values for the year of death. I’d like to check the other percentage of missing values for year of death and understand if this might because the politician are alive, i.e. observe the group with no year of death but with a year of birth. I consider that individuals born before 1940 might be deceased and I want to keep the group which might be alive.

persons_na %>% 
  filter(is.na(year_of_death), !(is.na(year_of_birth))) %>% 
  arrange(year_of_birth) %>% 
  filter(year_of_birth >1940)

Comment

Assuming that politicians born after 1940 (they would have 85 years old which is the average life expectancy at birth for women, according to the Federal Office of Statistics from Switzerland) are alive, this part of the data only represents less than half of the missing values for the year of death (from the first filter, 2,824 rows were remaining, and after the second filter 666 rows were remaining), so the factor of still alive politicians is not the only explanation for missing years if death.

Next, I want to check the earliest year of birth to understand the range I’m working for.

persons_na %>% 
  pull(year_of_birth) %>% 
  min(na.rm = TRUE)
## [1] 1732

The earliest year of birth from this tibble is 1732. At that time, the lifespan might have been lower than it is now or in 1970.

I could also note that there is usually a difference of life span between men and women, as men had typically a lower life expectancy especially. The difference was 3 years in 2023, but was 5 years in 2003 (according to the Swiss statistics) so we can assume it was even bigger decades and centuries. I observed previously that the proportion of women in this data set was of 6%. So I will not factor gender in my analysis and data imputation.

Before proceeding I want to make sure there are no id duplicates.

persons_na %>% 
  group_by(id) %>% 
  filter(n()>1) %>% 
  pull()
## group_by: one grouping variable (id)
## filter (grouped): removed all rows (100%)
## character(0)

Comment

There are fortunately no duplicates in the “id” column.

Explanation

I decided I can’t infer the values for the year of death even when the data on the year of birth is available. This mean I have to discard them even if this will reduce my sample for further analysis.

I also want to make sure I get rid of the potential still alive politicians, therefore the year of birth should be lower or equal to 2025 minus 85 years as it the average life expectancy in Switzerland for women (since it’s higher than the average for men, I keep this value to include as many results as possible).

persons_na <- persons_na %>%
  filter(year_of_birth <= current_year - 85)

I notice that there are still many missing values in the year of death that will unfortunately be discarded.

persons_a <- persons_na %>% 
  filter(!is.na(year_of_death)) %>% 
  filter(!is.na(year_of_birth))

I compute the “life_span” column by offsetting the year of birth minus the year of death.

persons_lifespan <- persons_a %>% 
  mutate(life_span = year_of_death - year_of_birth)

I create a column “status_title” based on the values from “title”; the NAs categorized as having “no” title and the rest of the values categorized as “yes”.

persons_title <- persons_lifespan %>% 
  mutate(status_title = if_else(is.na(title), "no", "yes"))

Creating a first violin plot and boxplot

This plot reflects the lifespan distribution of politicians with title and without.

persons_title %>%   
  ggplot(aes(x= status_title,
             y = life_span))+
  geom_boxplot(fill = "lightblue", alpha = 0.6) +
  geom_violin(fill = "lightpink", alpha = 0.4) +
  labs(title = "Life span by title status of politicians from Zurich",
       x = "Has title",
       y = "Life Span")+
  theme_minimal()

Comment

There are a few anomalies in this data: 2 politicians seem to be under 5 years and one more than 150 years old. In the next step, I can discard lifespans below 30 and above 100.

Creating a second violin plot and boxplot without abnormal values

persons_title_clean <- persons_title %>% 
  filter(life_span > 30) %>% 
  filter(life_span < 100)

persons_title_clean %>%   
  ggplot(aes(x= status_title,
             y = life_span))+
  geom_boxplot(width = 0.6, fill = "lightblue", alpha = 0.7) +
  geom_violin(fill = "lightpink", alpha = 0.4) +
  labs(title = "Life span by title status of politicians from Zurich",
       x = "Has title",
       y = "Life Span")+
  theme_minimal()

Comment

Generally, politicians without title tend to have a lower lifespan. The median of lifespan without title is around 65, whereas, the median for the politicians with title is closer to 75. There seems to be an important lifespan difference between the group with title, therefore, the t-test might very well be significant and allow to reject the null hypothesis of “Having a title doesn’t affect the lifespan of politicians”.

Descriptive statistics

persons_title_clean %>%
  group_by(status_title) %>% 
  summarise(mean = mean(life_span),
            st_var = sd(life_span),
            min = min(life_span),
            max = max(life_span),
            group_size = n(),
            median = median(life_span)) %>% 
  kable()
status_title mean st_var min max group_size median
no 68.19943 12.93412 32 99 351 66
yes 73.96809 12.49766 35 97 188 75

Comment

I notice that there is a 9 points difference in the median; individuals with title have a higher median. The mean is also higher for the group with title. The min and max values are however very close.

persons_title_clean %>% 
  t_test(life_span ~ status_title,
         order = c("no", "yes"))

Comment

With a 95% confidence interval, in this t-test, the p-value is statistically significant since it is lower than 0.05. The value is too extreme to confirm the null hypothesis. Therefore, the reverse hypothesis that the title status affects the lifespan of politicians is possible.

Part 6

Instructions

To refine your analysis, assign the politicians in two groups. First, those who were born before 1918, and a second group with the politicians who were born after 1918 .

Perform the analysis of the previous point.

Explanation

To answer to this part I need to create a column that will separate the politicians born before 1918 and after 1918.

persons_birth <- persons_title_clean %>% 
  mutate(b_1918= if_else(year_of_birth < 1918, "born before 1918", "born after 1918"))

Next, I will create a violin plot to see the difference in life span by title and based on the 1918 birth threshold.

Creating a violin plot

persons_birth %>% 
    ggplot(aes(x= status_title,
             y = life_span))+
  geom_boxplot(fill = "lightblue", alpha = 0.6) +
  geom_violin(fill = "lightgrey", alpha = 0.4, trim = FALSE) +
  facet_wrap(vars(b_1918)) +
  labs(title = "Life span by title status of politicians from Zurich",
       x = "Has title",
       y = "Life Span")+
  theme_minimal()

Comment

Overall I see that having a title makes even a bigger difference when the individuals are born after 1918. The median lifespan for both groups (with and without a title) for politicians born after 1918 lies around 80, whereas the politicians born before 1918 have a lifespan way below 80. In addition, the difference in lifespan for title holders and non-title holders is more visible for politicians born before 1918. The median difference is almost of 10 years for politicians born before 1918. The t-test might be significant to determine a lifespan difference between depending on whether the politician has title, specifically for the before 1918 cohort.

persons_birth %>% 
  filter(b_1918 == "born before 1918") %>% 
  t_test(life_span ~ status_title,
         order = c("no", "yes"))

Comment

The p-value is lower than 0.05. Which means that we can reject the null hypothesis. The hypothesis that the average lifespan for politicians born before 1918 is the same whether or not they have a title is too unlikely. The reverse hypothesis to assume that the title status makes a difference on the lifespan for this group can’t be rejected. This t-test is statistically significant.

persons_birth %>% 
  filter(b_1918 == "born after 1918") %>% 
  t_test(life_span ~ status_title,
         order = c("no", "yes"))

Comment

The p-value is higher than 0.05. Which means that we cannot reject the null hypothesis. The hypothesis that the average lifespan for politicians born after 1918 is the same whether or not they have a title is too likely to assume that the title status makes a difference on the lifespan for this group. This t-test is not statistically significant.

Finding some facts about the Politicians

Part 7

Instructions

Over time, some politicians have been elected many times and therefore have held many mandates. Let’s make a top 10 of the politicians who held multiple mandates and display it as a chart!

Explanation

As a first step, I have to identify the politician’s names with the person_id by joining the “persons” tibble and the “mandates” tibble.

persons_fullname <- persons %>% 
  mutate(full_name = str_glue("{firstname} {lastname}"))

mandates_p <- mandates %>% 
  left_join(persons_fullname, by=c("person_id"="id")) 

As a second step, I have to create a column which count the number of mandates per individual.

mandates_count <-mandates_p %>%
  group_by(person_id) %>% 
  mutate(n_mandates= n()) %>% 
  ungroup()

As a third step, I have to take the top 10 with slice_max().

top_10_mandates <- mandates_count %>% 
  distinct(person_id, n_mandates, full_name) %>% 
  slice_max(order_by = n_mandates, n= 10) 

kable(top_10_mandates)
person_id n_mandates full_name
9468 7 Willy Trostel
11456 6 Johann Jakob Spinner
13358 6 Johann Jakob Schenk
13623 6 Heinrich Spiller
14184 6 Herman Greulich
14595 6 Konrad Arbenz
14945 6 Sigmund Spöndli
15011 6 Heinrich Weiss
15145 6 Johann Jakob Hatt
15232 6 Eduard Sulzer
15317 6 Hans Ulrich Toggenburger
15340 6 Melchior Friedrich Sulzer

Creating a bar plot

This plot represents the number of mandates per politician id.

top_10_mandates %>% 
  ggplot(aes(x= full_name,
             y= n_mandates))+
  geom_col(alpha= 0.7)+
  coord_flip()+
  theme_minimal()+
  labs(title = "Top 10 politicians with most mandates in Zurich over the years",
       x = "politician name",
       y = "n")

Comment

Willy Trostel had the most mandates over the year with 7 mandates. The remaining 9 politicians had 6 mandates.

Part 8

Instructions

That’s a great chart! But it only tells part of the story. Some politicians might have held many mandates at the same time. In some cases this could be due to them having some overlap between mandates, or when the political system allows holding mandates across assemblies.

What politicians have held multiple mandates at the same time? Again, draw the top 10 in a chart.

Explanation

At the moment there is no column which contains the start and end date of a mandate. There are columns for the days, months and years. Therefore, I have to create a new column which will be the result of three columns glued together. Before I glue them and parse the data as a date, I have to add zeroes with str_pad() because if the month or day has only one digit it won’t be recognized as a date.

mandates_start <- mandates_count %>% 
  mutate(mandate_start_day = str_pad(mandate_start_day, width = 2, pad = "0")) %>% 
  mutate(mandate_start_month = str_pad(mandate_start_month, width = 2, pad = "0")) %>% 
  mutate(start_date = str_glue("{mandate_start_day}-{mandate_start_month}-{mandate_start_year}")) %>% 
  mutate(start_date = dmy(start_date))

Comment

After this operation, the 4 missing values are:

mandates_start %>% 
  filter(is.na(start_date)) %>% 
  select(id, person_id, mandate_start_day, mandate_start_month, mandate_start_year, start_date) %>%
  kable()
id person_id mandate_start_day mandate_start_month mandate_start_year start_date
2639 9525 30 02 1907 NA
5609 15150 00 00 0 NA
5846 15340 00 00 1831 NA
5888 15380 00 00 1815 NA

The issue with id 2639 is that the 30th of February doesn’t exist.

Let’s repeat the same operation with the end of the year mandate date.

mandates_start_end <- mandates_start %>% 
  mutate(mandate_end_day = str_pad(mandate_end_day, width = 2, pad = "0")) %>% 
  mutate(mandate_end_month = str_pad(mandate_end_month, width = 2, pad = "0")) %>% 
  mutate(end_date = str_glue("{mandate_end_day}-{mandate_end_month}-{mandate_end_year}")) %>% 
  mutate(end_date = dmy(end_date))

The missing values are more numerous for the end dates this can because some politicians didn’t finish their mandates at the time this data base was made.

mandates_start_end %>% 
  filter(is.na(end_date)) %>% 
  select(id, person_id, mandate_end_day, mandate_end_month, mandate_end_year, end_date) 
paged_table(mandates_start_end)

Comment

I could try to impute the missing values with the median or average length of a mandate, for a given assembly, but I think there are too many variables that could influence the length, such as gender, the party affiliation or the decade to help me predict the value.

mandates_start_end %>% 
  visdat::vis_miss(sort_miss = TRUE,
                   cluster = TRUE)

Comment

Given that the missing values represent 3% percent of the column, I will discard them, as I don’t think it will affect the overall analysis.

Explanation

Let’s discard the NAs from the “end_date” column and create a column “interval” that will represent the period of time between the start and end date of a mandate. To do so, I use the function interval() on the start and end columns. I will also create a “mandate_id” column to have a key to identify the different mandates. Each row will be ranked and become the mandates id thanks to row_number().

mandates_start_end <- mandates_start_end %>% 
  filter(!is.na(end_date)) %>% 
  mutate(interval = interval(start_date, end_date)) %>% 
  mutate(mandate_id = row_number())

To assess if the period of mandates or intervals from one politician overlap with one of their other mandates - if there was more than one mandate - I have to compare the mandates intervals to other intervals by politicians. One way of doing this is to join the “mandates_start_end” with itself so R can compare two columns rather than comparing one column. Therefore, I use inner_join() using the person id as a key. Then I have to filter the duplicates created by the tibble fusion. I need the values from “mandates_id.x” to be different than “mandates_id.y” to avoid comparison between same mandates intervals. Next, I can create a new column “overlap” thanks to the int_overlaps() function that determines if there is an overlap between the intervals and returns boolean values. Since I’m only interested into the overlapping intervals, I filter the tibble to keep them.

overlaps <- mandates_start_end %>% 
  inner_join(mandates_start_end, by = c("person_id")) %>% 
  filter(mandate_id.x !=mandate_id.y) %>% 
  mutate(overlap = int_overlaps(interval.x, interval.y)) %>% 
  filter(overlap == TRUE)

As a last step I have to keep the top 10 politicians with overlapping mandates intervals and count the number of overlapping mandates per politician. I also need their name, therefore I join the tibble “persons_fullname”.

top_10_overlaps_m <- overlaps %>% 
  group_by(person_id) %>% 
  summarise(n_overlap = n()) %>% 
  ungroup() %>% 
  slice_max(order_by = n_overlap, n= 10, with_ties = FALSE) %>% 
  left_join(persons_fullname, by=c("person_id" ="id")) %>% 
  select(person_id, n_overlap, full_name)

  kable(top_10_overlaps_m)
person_id n_overlap full_name
15232 8 Eduard Sulzer
13830 6 Johann Jakob Fierz
14715 6 Johann Jakob Streuli
14803 6 Hans Caspar Hirzel
14945 6 Sigmund Spöndli
15411 6 Heinrich Hüni
14149 4 H. Rudolf Bollier
14944 4 Johann Heinrich Emanuel Mousson-Wyss
15210 4 Ferdinand Meyer
15248 4 Hans Jakob Huber

Creating a bar plot

top_10_overlaps_m %>% 
  ggplot(aes(x= reorder(factor(person_id), n_overlap),
             y= n_overlap,
             fill= full_name))+
  geom_col()+
  coord_flip()+
  theme_minimal()+
  labs(title = "Top 10 politicians with most overlapping mandates in Zurich",
       x = "politician name",
       y = "number of overlapping mandates",
       fill = "politician name")

Comment

Eduard Sulzer has the highest number of overlapping mandates with 8 mandates. Five politicians had 6 overlapping mandates and the last four politicians had 4 overlapping mandates.

Part 8 corrected

Explanation

When using int_overlaps() I unfortunately compare every mandates for a given person_id against every other mandates (even if I filter out the self-comparisons, I still get the same comparison combination twice), which creates double counting. Also I have to be careful because int_overlaps() treats touching intervals as TRUE and I would want to avoid that.

overlaps %>% 
  filter(full_name.x=="Sigmund Spöndli") %>% 
  select(full_name.x, mandate_id.x, mandate_id.y, interval.x, overlap) %>% 
  kable()
full_name.x mandate_id.x mandate_id.y interval.x overlap
Sigmund Spöndli 5250 5251 1839-09-19 UTC–1842-05-17 UTC TRUE
Sigmund Spöndli 5251 5250 1839-10-01 UTC–1845-11-10 UTC TRUE
Sigmund Spöndli 5251 5255 1839-10-01 UTC–1845-11-10 UTC TRUE
Sigmund Spöndli 5253 5254 1827-12-18 UTC–1832-03-19 UTC TRUE
Sigmund Spöndli 5254 5253 1814-06-22 UTC–1838-02-15 UTC TRUE
Sigmund Spöndli 5255 5251 1842-05-18 UTC–1845-11-10 UTC TRUE

Comment

In the example above, which is a subset from the “overlaps” table, I isolated the politician who according to my previous operations is in the top 10 with most overlapping mandates. I notice that although I get 6 TRUE values in the “overlap” column (obtained by int_overlaps) only mandates_id.x 5250, 5251 and 5254 overlap, so I should have a result of 3 overlapping mandates instead of 6. In addition, there is a duplicate for mandate number 5251.

Self-joining the tibble also has the risk of creating further confusion and unnecessary duplicates. In this example, mandate_id.y and mandate_id.x are not equivalent since I excluded the self-comparison filter(mandate_id.x != mandate_id.y) so I remove the comparison 5250,5250 when self-joining, yet I can’t control for duplicates and the comparison 5250-5251 and 5250-5251 is counted twice.

One work around is to replace filter(mandate_id.x != mandate_id.y) by filter(mandate_id.x < mandate_id.y) but I still can’t control for the duplicates.

Explanation

In the following code, the logic differs in :

  1. instead of discarding missing end dates, replacing them with the most recent date in the dataset.

  2. after arranging by “person_id” and “start_date”, using lag() to create a new column “previous_mandate_end” useful for comparison, instead of self-joining. This first operation allows to compare on one row the start date of a mandate with the end of the previous one for each politician.

  3. to compute if there is a difference with the start and end dates of successive mandates, one can create another column “overlapping_days” where one subtracts :

    start_date- previous_mandate_end

    Since the values are recognized as dates by R thanks to the {lubridate} package, this operation returns realistic results. If the result is lower than 0, that means the dates are overlapping because it means the next mandate started when the last mandate was still active. The delta is a negative value because the end dates are “higher”/“more recent” than the start dates.

    This operation replaces the int_overlaps() function more accurately. We avoid the concept of interval overlapping and rely on the start and end dates differences.

  4. finally, after filtering the below zero values, count() is used to compute the number of rows per politician.

# here I extract the latest date in the dataset
last_day <- mandates %>% 
  pull(mandate_start_day) %>% 
  max()

last_month <- mandates %>% 
  pull(mandate_start_month) %>% 
  max()

last_year <- mandates %>% 
  pull(mandate_start_year) %>% 
  max()

# I create the dates
library(lubridate)
overlapping_mandates <- 
  mandates %>% 
  filter(mandate_start_day>0) %>% 
  mutate(
    start_date = str_glue("{mandate_start_day}-{mandate_start_month}-{mandate_start_year}"), 
    end_date = str_glue("{mandate_end_day}-{mandate_end_month}-{mandate_end_year}"), 
    ) %>% 
  mutate(
    start_date = dmy(start_date), 
    end_date = if_else(end_date == "0-0-0", str_glue("{last_day}-{last_month}-{last_year}"), end_date) %>% dmy()
  )

# I compute the top 10 politcians with most overlapping mandates i.e. mandates where the start date is below the end of the previous mandate end
top_10_overlapping_mandates <- 
  overlapping_mandates %>% 
  arrange(person_id, start_date) %>% 
  group_by(person_id) %>% 
  mutate(previous_mandate_end = lag(end_date), 
    overlapping_days = start_date- previous_mandate_end) %>% 
  ungroup() %>%
  filter(overlapping_days<0) %>% 
  count(person_id, sort = TRUE) %>% 
  slice_max(n , n =10) %>% 
  left_join(persons, by = c("person_id" = "id")) %>% 
  mutate(full_name = paste(firstname, lastname))

# plotting 
top_10_overlapping_mandates %>% 
  ggplot(aes(x = forcats::fct_reorder(full_name, n), y = n)) + # orders a categorical variable by values in n
  geom_col(alpha = 0.7) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Politicians with Most Mandates",
       x = "",
       y = "Number of Mandates")  

Comment

Johann Jakob Fierz and Sigmund Spöndli have the highest number of overlapping mandates with 3 mandates. Ten politicians had 2 overlapping mandates and since they are ex-aequos they are shown on the chart.

This latter method avoids creating duplicates and double counting. Some names from the previous attempt are the same but the number of mandates is, as expected, lower by half.

Part 9

Instructions

Finally, let’s find the politicians who have changed ranks over the years. What politicians been affiliated to different parties over the years? Draw a chart with the top 10!

Explanation

I know there are some empty cells in the “party” column, so I will discard them by assigning them to NA and then filter them out so I only keep the politicians affiliated to specific parties. I also decide to create a column “start_date” as a date data type, by gluing the day, month and affiliation year columns. Like the previous case with the “mandates” tibble I have to add the 0 digit so the numbers once glued are recognized as dates when using dmy(). This column allows to order the mandates chronologically.

affiliations_clean <- affiliations %>% 
  mutate(party = if_else(party == "", NA_character_, party)) %>% 
  filter(!is.na(party)) %>% 
  mutate(affiliation_start_day = str_pad(affiliation_start_day, width = 2, pad = "0")) %>% 
  mutate(affiliation_start_month = str_pad(affiliation_start_month, width = 2, pad = "0")) %>% 
  mutate(start_date = str_glue("{affiliation_start_day}-{affiliation_start_month}-{affiliation_start_year}")) %>% 
  mutate(start_date = dmy(start_date))

This time I do not use inner_join to compare values inside the same column. I order the rows by person id and date since I want to observe the changes of party affiliation over time. Based on this order, I can use mutate() and lag() to create a new column “prev_party”. This operation fills the column with the party from the above (n+1) row. The next step consists in creating another logical column “party_change” to track which politicians changed parties. With mutate() I create a column with 2 conditions: any value from the party column different from the “prev_party” column (this shows the change of party over time) and any non NA value in “prev_party” to avoid the politicians who didn’t change parties across mandates. Finally I keep all the politicians who changed parties.

party_changes <- affiliations_clean %>% 
  arrange(person_id, start_date) %>% 
  group_by(person_id) %>% 
  mutate(prev_party = lag(party)) %>% 
  ungroup() %>% 
  mutate(party_change = party != prev_party & !is.na(prev_party)) %>% 
  filter(party_change == TRUE)

I extract the top 10 by creating a new tibble and counting the number of rows per id.

top_10_party_changes <- party_changes %>% 
  group_by(person_id) %>% 
  summarise(n_p_changes = n()) %>% 
  ungroup() %>% 
  slice_max(order_by = n_p_changes, n= 10) %>% 
  left_join(persons_fullname, by=c("person_id" ="id")) %>% 
  select(person_id, n_p_changes, full_name)

  kable(top_10_party_changes)
person_id n_p_changes full_name
3885 4 Johannes Lienhard
9233 4 Ulrich Gysler-Esslinger
10537 4 Johann Jakob Dünki
2744 3 Jean Studer
4771 3 Alfred Illi
5216 3 Rudolf Reichling
9172 3 Johann Glättli
9545 3 Albert Neukom
9922 3 Jakob Bachmann
10196 3 Ernst Hägi
13985 3 Emil Staub

Creating a bar plot

top_10_party_changes %>% 
  ggplot(aes(x = forcats::fct_reorder(full_name, n_p_changes), 
             y = n_p_changes)) +
  geom_col(fill = "steelblue", alpha=0.7)+
  coord_flip()+
  theme_minimal()+
  labs(title = "Top 10 politicians with most party changes in Zurich",
       x = "politician name",
       y = "number of party changes")

Comment

Three politicians: Johannes Lienhard, Ulrich Gysler-Esslinger and Johann Jakob Dünki changed party affiliations 4 times. The other seven politicians changed ranks 3 times.

Part 10

Instructions

Take a sample of 20 politicians with a listed address and plot them on a map with {leaflet} or {ggmap}. You will need to use an API that converts the addresses to geocoordinates.

Explanation

I need to build a “location” column with the street, house number, city and postal code columns. Before doing so, I need to clean the “postal_code” from the brackets. Then, I discard the empty cells from the columns I am about to glue together.

addresses_new <- addresses %>% 
  mutate(postal_code_n = str_remove(postal_code, "\\s*\\([^\\)]+\\)")) %>% 
  filter(street !="") %>% 
  filter(house_number !="") %>% 
  filter(city != "") %>% 
  mutate(location = str_glue("{house_number} {street}, {city} {postal_code_n}"))

The next step is to load the map from Zurich.

Then I need to determine the coordinates for the 20 selected addresses thanks to {tidygeocoder}.

addresses_20 <- addresses_new %>% 
  sample_n(20) %>% 
  tidygeocoder::geocode(address = location)

Finally I can create the plot on the map from Zurich.

ggmap(zh_map) +
  geom_point(data= addresses_20,
            aes(x = long, y = lat, alpha =0.5)) +
  labs(title= "Politicians addresses",
       x = "Longitude",
       y = "Latitude",
       alpha= "politician")