Early Marriage in Vietnam from a Socio-Geographic Perspective

This analysis is a part of my work for the IBM Data Science Capstone.

Photo by Hisu lee on Unsplash

1. Introduction

In Vietnam, millennials are postponing marriage. The average age at first marriage increased from 24.5 to 25.5 over the past ten years. This trend has accelerated the aging of the population, raising serious concern in one of Asia’s fastest-growing economies. According to data from the Ministry of Planning and Investment, Vietnamese aged 15 to 64 account for 68% of the population. Those 65 and older, meanwhile, represent the fastest-growing segment: this age group made up 7.7% of the population in 2019, up from 6.4% in 2009.

Image by Author

As Vietnamese officials worry about low fertility rates in some parts of the country, young people are being encouraged to marry and have children before turning 30. The government’s latest announcement gives little information about specific strategies, policies, or guidance for local authorities. Before turning to solutions, this analysis sets out to answer two questions:

  • Which factors might have driven the trend toward later marriage in Vietnam?
  • Does the age at first marriage differ statistically across geographical areas?

Answering the first question gives a clearer picture of the marriage situation in Vietnam, while answering the second supports localizing any measures at the province/city level (if needed) during the implementation phase.

2. Data

2.1. Data understanding and collection

Below are all the sources used in this analysis:

  • Statistics from the General Statistics Office of Vietnam, covering three areas: (1) Population and Employment; (2) Education; and (3) Health, Culture, and Living Standard (May 29, 2020).
  • Employment working-hour data from the Report on Labor Force Survey 2018 by the General Statistics Office of Vietnam.
  • Geospatial coordinates: polygon datasets containing the boundaries of Vietnam’s provincial administrative units, from the Open Development Mekong database. These JSON-format data are the input for the map visualization of Vietnam.
  • List of Vietnam’s administrative divisions at the provincial/city level, obtained from the General Statistics Office (GSO) of Vietnam (May 29, 2020).

2.2. Data cleaning

Since all the data were obtained from government agencies in a structured format, no major cleaning or processing was needed. However, the data were not centralized and came in different forms. For example, the living-standards statistics, with four primary criteria, were spread across four separate forms. It was therefore important to review each dataset before merging them into the master dataset used for the analysis. The minor cleaning tasks included:

  • Removed the Vietnamese accents from the data for later classification
  • Changed province/city names to match the geospatial coordinates (polygon datasets in JSON format)
  • Checked the data type of each column
  • Removed duplicate records
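
The accent-removal and de-duplication steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the analysis; the function names `strip_accents` and `clean_names` and the sample province names are assumptions made for the example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with Vietnamese diacritics removed (e.g. 'Hà Nội' -> 'Ha Noi')."""
    # 'đ' and 'Đ' are separate letters rather than a base letter plus a
    # combining mark, so Unicode decomposition leaves them as-is; map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category 'Mn') left over after decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def clean_names(names: list[str]) -> list[str]:
    """Strip accents, collapse extra whitespace, and drop duplicates (order kept)."""
    seen = set()
    cleaned = []
    for name in names:
        key = " ".join(strip_accents(name).split())
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


# Sample input chosen for illustration only.
print(clean_names(["Hà Nội", "Đà Nẵng", "Ha  Noi", "Bình Dương"]))
# -> ['Ha Noi', 'Da Nang', 'Binh Duong']
```

Normalizing names this way makes it easier to join the statistics tables with the polygon dataset, since both sides can be compared on the same unaccented key.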

The first five rows of the cleaned data are shown below:

Image by Author
Image by Author

2.3. Exploratory data analysis – EDA

EDA refers to performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

First, the distribution of each factor was examined to understand its underlying structure:

Image by Author

Next, scatter plots were used to examine the correlations between variables and to select features for the clustering model.

Image by Author

These two steps led to the selection of the eight variables most relevant to the clustering model used in the rest of the analysis.

Image by Author

Validating with a scatter plot: in the case below, there is a positive linear relationship between monthly income and age at first marriage. This suggests that monthly income differs across regions with different ages at first marriage. Meanwhile, the cost of living index, which had been considered before the analysis, was dropped from the model.

Image by Author
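
A linear relationship like the one in the scatter plot can also be quantified with the Pearson correlation coefficient. The sketch below is a minimal standard-library version for illustration; the function name `pearson_r` and the sample values are assumptions, and the numbers are placeholders rather than figures from the dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Placeholder values for illustration only, NOT figures from the dataset.
monthly_income = [3.1, 3.8, 4.5, 5.2, 6.9]        # hypothetical, one value per province
age_at_marriage = [22.5, 23.9, 24.8, 25.1, 26.0]  # hypothetical, same provinces

r = pearson_r(monthly_income, age_at_marriage)
print(f"Pearson r = {r:.2f}")  # values near +1 indicate a strong positive linear relationship
```

A coefficient close to zero, as reported later for monthly income and the cost of living, would indicate little or no linear relationship.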

3. Clustering analysis

3.1. Methodology

Because this is an unsupervised machine learning problem, K-means clustering is applied to profile Vietnam’s provinces/cities based on economic, social, and census statistics. The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset. A cluster refers to a collection of data points aggregated together because of certain similarities.

The first step of the K-means analysis is to normalize the features, rescaling the numeric columns to a common scale without distorting the relative differences between values. This is required because our features have very different ranges, and K-means relies on distances that would otherwise be dominated by the features with the largest scale.
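
As a rough illustration of this step, the sketch below applies z-score standardization, one common way to put features on a common scale. The article does not say which scaling method was used (min-max scaling would be another option), so treat the method, the function name `standardize`, and the feature names and values as assumptions.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]


# Hypothetical feature columns on very different scales (placeholder values).
features = {
    "monthly_income": [3.1, 4.5, 6.9],
    "population_density": [95.0, 480.0, 2400.0],
}

scaled = {name: standardize(column) for name, column in features.items()}
for name, column in scaled.items():
    print(name, [round(v, 2) for v in column])
```

After scaling, each feature contributes on a comparable footing to the Euclidean distances that K-means uses.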

Next, we identify the optimal number of clusters by evaluating the relationship between the number of clusters and the within-cluster sum of squares (WCSS). The ideal number of clusters is the point where the decrease in WCSS begins to level off (the elbow method).
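
To make the elbow method concrete, here is a bare-bones sketch of K-means (Lloyd's algorithm) that returns the WCSS for each candidate k. It is not the code used in the analysis, which would more typically rely on a library implementation; the function names, the random initialization, and the toy points are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    # WCSS: total squared distance from each point to its cluster centroid.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


def elbow_curve(points, max_k):
    """Return [(k, wcss)] for k = 1..max_k."""
    return [(k, kmeans(points, k)[2]) for k in range(1, max_k + 1)]


# Toy standardized points forming three loose groups (placeholder values).
points = [
    (-1.2, -1.0), (-1.0, -1.3), (-0.9, -1.1),
    (0.0, 0.1), (0.2, -0.1), (0.1, 0.2),
    (1.1, 1.0), (1.3, 1.2), (0.9, 1.4),
]

for k, wcss in elbow_curve(points, 5):
    print(k, round(wcss, 3))
```

Plotting WCSS against k and looking for the point where the curve flattens gives the "elbow". Because the result of K-means depends on the initial centroids, it is common to repeat each run with several seeds and keep the lowest WCSS.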

3.2. Clustering

Below are the variables selected for the clustering analysis:

Image by Author

Using the data exploration techniques and the elbow method described above, we arrived at three as the optimal number of clusters.

Image by Author

Running the clustering analysis with three clusters gives three groups of provinces/cities in Vietnam, and their locations on the map reflect the clusters themselves.

Image by Author

From a statistical viewpoint, the three clusters are summarized below:

Image by Author
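
A per-cluster summary of this kind amounts to averaging each feature within each cluster. The sketch below shows the idea with the standard library; the function name `summarize_clusters`, the feature names, and the values are hypothetical and do not reproduce the table above.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(rows, labels):
    """Return {cluster_label: {feature: mean value}} for dict-shaped rows."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {feature: mean(r[feature] for r in members) for feature in members[0]}
        for label, members in sorted(grouped.items())
    }


# Placeholder rows and cluster labels for illustration only.
rows = [
    {"age_at_marriage": 22.0, "monthly_income": 4.0},
    {"age_at_marriage": 24.0, "monthly_income": 6.0},
    {"age_at_marriage": 26.0, "monthly_income": 10.0},
]
labels = [0, 0, 1]

print(summarize_clusters(rows, labels))
# -> {0: {'age_at_marriage': 23.0, 'monthly_income': 5.0},
#     1: {'age_at_marriage': 26.0, 'monthly_income': 10.0}}
```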

To validate the analysis, two graphs were used, with the clusters shown in three colors. The first shows how the three clusters separate in terms of monthly income and age at marriage. The second shows how they separate by population density: people tend to marry earlier in low-density provinces/cities.

Image by Author

3.3. Insights

Based on the clustering analysis above, we label the three clusters as three groups of provinces/cities with the following features:

  • Group 1 (Cluster 0): Remote and underdeveloped areas (Average Age getting married – 22.8)

This group records the earliest marriages in Vietnam. Located in remote and less densely populated areas, the provinces in this group are underdeveloped both economically and socially. Low investment in education and training may contribute to this group having the lowest share of people in the workforce among the three groups.

  • Group 2 (Cluster 1): Moderately populated and developing areas (Average Age getting married – 25.6)

This is the largest group, with 42 cities/provinces, many of which are regional connecting hubs. Regarded as the next tier below Group 3 (Cluster 2), these areas are well located (Mekong River Delta, Red River Delta, coastal areas). More people have access to higher education, a key to entering the labor market.

  • Group 3 (Cluster 2): Urban, centralized, and fastest-growing areas (Average Age getting married – 25.8)

This group has the highest average age at marriage. It includes Hanoi, Da Nang, and Ho Chi Minh City, the three biggest cities in Vietnam. The others in the group (Hai Phong, Bac Ninh, Binh Duong, and Dong Nai) are also among the largest economic and industrial centers. People in this group have the highest monthly income, and most of them are in the workforce. These areas also receive a growing number of migrants each year.

The summary table above also yields some other findings: (1) The female/male ratio is 100.8 in Group 1, where people marry early, but falls to 95.8 in Group 3, where delaying marriage is more common. (2) Weekly working hours are about 10% higher in Group 3 than in Group 1, which may indicate that people in this group prioritize career development over starting a family. (3) There is a very weak correlation between monthly income and the cost of living. (4) Binh Duong has the highest net migration rate at 47.9, while the others average -2.1.

4. Implications, conclusion, and limitations

Based on the analysis, here is a summary of my findings on the two research questions: (1) Which factors might have driven the trend toward later marriage in Vietnam? (2) Does the age at first marriage differ statistically across geographical areas?

  • There is a positive correlation between income and age at marriage. Many urban residents wait longer to marry as educational and economic transformations create more career opportunities away from their hometowns.
  • Early marriage is more common in rural areas and small towns than in cities. As the booming economy turns more remote, poor regions into urban areas, the social environment and living standards change, and people become less likely to marry early.
  • Mobility is correlated with age at marriage. People tend to marry later in developed areas, where mobility and migration are high.

In conclusion, encouraging young people to marry before 30 may prove challenging for Vietnamese officials. Since delayed marriage is widely regarded as an inevitable trend in modern developing societies, addressing the aging population may require long-term approaches at both the national and the province/city level, such as investing in education and training in poor regions to bring more people into the workforce, expanding economic centers and networks to reduce unnecessary migration, or balancing living conditions across geographical areas.

I’d love to spend more time developing this analysis in the future, as it is currently quite simple. It has the following limitations:

  • All the data are from 2018, so the analysis ignores trends over time.
  • The social and economic data were not broken down by gender. Figures such as the rise in women’s employment and the rapid improvement in women’s educational attainment might have revealed a stronger link between changes in women’s circumstances and the decline in marriage.
  • Some provinces/cities use different features to measure their educational, social, and economic development. The resulting missing data reduced the effectiveness of the elbow method in finding the optimal number of clusters.

You can find more details of my analysis on my GitHub.

