What factor has the biggest impact on divorce rates in Xalapa, Mexico?
For this research question, the factors we could consider are age gap, income, number of children, length of marriage, place of birth, and level of education. We can perform statistical analysis on each of these factors and determine which one has the strongest correlation with divorce rates in Xalapa, Mexico.
Data collection and cleaning
The data were downloaded from Kaggle, using the version in which the column names were translated into English. We then kept only the columns we need: type of divorce, nationalities, ages, monthly incomes, occupations, levels of education, number of children, employment statuses, and marriage duration.
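The column-selection step can be sketched roughly as follows. This is a minimal illustration on a toy data frame, not our actual cleaning code: only the `Age_partner_*` and `Nationality_partner_*` column names appear elsewhere in this report, while `divorces_raw`, `Num_children`, `Registry_office`, and all values are hypothetical stand-ins.

```r
library(dplyr)

# Toy stand-in for the raw Kaggle file. Only the Age_* and Nationality_*
# column names come from our analysis code; everything else is hypothetical.
divorces_raw <- tibble(
  Age_partner_man = c(34L, 51L),
  Age_partner_woman = c(31L, 47L),
  Nationality_partner_man = c("MEXICANA", "MEXICANA"),
  Nationality_partner_woman = c("MEXICANA", "ESTADOUNIDENSE"),
  Num_children = c(2L, NA),        # hypothetical column name
  Registry_office = c("A", "B")    # hypothetical column we do not need
)

# Keep only the columns needed for the analysis
divorces <- divorces_raw |>
  select(
    Age_partner_man, Age_partner_woman,
    Nationality_partner_man, Nationality_partner_woman,
    Num_children
  )
```

Listing the kept columns explicitly, rather than dropping unwanted ones, makes the analysis-ready data set easier to audit.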
The observations/rows represent each individual divorce documented.
The columns are characteristics of each partner in the divorce and of the divorce as a whole. For example, we have the age, occupation, and level of education of each partner. Additionally, variables such as marriage duration, number of children, and type of divorce describe the divorce as a whole.
Although the official website of the Mexican government does not explicitly state the reasoning behind the creation of the data set, we can infer that it was to study trends in divorce in Xalapa, Mexico.
Moreover, the creation of this data set was funded by the Mexican government.
The data set does not include any divorces that went undocumented, and it does not record couples who separated but did not divorce. Hence, the data set is limited to divorces that completed the legal divorce process through the government.
The data we are using came from scanning divorce documents and recording the characteristics of each divorce. It is not stated whether the people whose divorces are listed in the data were aware of its collection.
Data limitations
There are NA values in the integer columns that need to be replaced with 0s.
Since the data are almost 10 years old, the trends they represent may be out of date.
Most of the string/char variables have Spanish responses that we will need to translate in our final data set.
Finally, there are many columns that are not needed for our analysis that we need to filter out.
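Two of the cleaning steps above, replacing NA values with 0 and translating Spanish responses, could look something like the sketch below. It runs on a toy data frame: `divorces_toy`, `Num_children`, `Employment_status_man`, the lookup table, and all values are hypothetical, and the real set of Spanish responses would need to be read off the data first (for example with `count()`).

```r
library(dplyr)

# Toy rows; column names and values here are hypothetical.
divorces_toy <- tibble(
  Age_partner_man = c(34L, 51L, 28L),
  Num_children = c(2L, NA, NA),
  Employment_status_man = c("EMPLEADO", "DESEMPLEADO", "EMPLEADO")
)

# Hypothetical Spanish-to-English lookup table
employment_lookup <- c(
  EMPLEADO = "Employed",
  DESEMPLEADO = "Unemployed"
)

divorces_clean <- divorces_toy |>
  mutate(
    # Replace missing child counts with 0
    Num_children = replace(Num_children, is.na(Num_children), 0L),
    # Translate by name lookup; responses missing from the table become NA
    Employment_status_man = unname(employment_lookup[Employment_status_man])
  )
```

The same replacement could be applied to every integer column with `across(where(is.integer), ...)`, though it is worth checking first whether 0 is a sensible stand-in for a missing age or income.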
Exploratory data analysis
In our initial exploratory data analysis, we first mutate the data set to prepare it for analysis. This includes translating columns and combining the two partners' columns (for example, by subtraction) into variables that describe the divorce as a whole.
We then look for a general trend between each chosen factor and the frequency of divorce. In the code chunks below, we examine these trends using a bar plot of each variable against the number of divorces.
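library(dplyr)
library(ggplot2)

# Bar plot of divorce counts by absolute age gap between partners
divorces |>
  mutate(age_gap = abs(Age_partner_man - Age_partner_woman)) |>
  ggplot(mapping = aes(x = age_gap)) +
  geom_bar() +
  labs(x = "Age Gap of Partners", y = "Frequency")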
From the plot above, couples with smaller age gaps appear to divorce more often. However, we cannot conclude this, because most marriages may simply have a small age gap to begin with. It may therefore be misleading to analyze age gap alone with respect to divorce tendencies.
# Bar plot of divorce counts by whether partners share a nationality
divorces |>
  mutate(
    nationality_same = ifelse(
      Nationality_partner_man == Nationality_partner_woman,
      "Same",
      "Different"
    )
  ) |>
  ggplot(mapping = aes(x = nationality_same, fill = nationality_same)) +
  geom_bar() +
  labs(
    x = "Whether partners have the same or different nationalities",
    y = "Frequency"
  )
As in the first analysis, couples with the same nationality appear to divorce more frequently than couples with different nationalities. Again, we cannot conclude this without further analysis, since same-nationality marriages may simply be more common in general.
Questions for reviewers
What is your overall impression of the project and its presentation?
Was the research question clearly defined and does the analysis adequately address it?
Did the data collection and cleaning process appear thorough and accurate?
Were any assumptions or limitations of the analysis clearly stated and addressed?
What areas could be improved or further developed in future research?