How to: Avoid Gender Data Bias at Each Stage of the Data Lifecycle

Feb 2, 2022

This series explores the phases at which gender data bias may be introduced or exacerbated in the simplified data lifecycle. The series contains four parts: requirements gathering, data collection, data cleansing, and analysis and use. This is part two: data collection.

Stage 2, Data Collection

Data collection is the first action step in the data lifecycle and therefore the first step at which gender bias can creep into a dataset. Without intervention, data collected without gender considerations can end up in ML models that inform institutions such as higher education, government, and big business. However, it is possible to identify areas where gender bias could creep in before the process begins and to eliminate these biases by improving the collection process.

Methods

There are many methods by which data is collected. We will explore collecting data from scratch, through studies and surveys, as well as purchasing it.

From Scratch

Collecting data from scratch offers a unique opportunity to reduce gender data bias. Researchers can identify areas where bias might arise and mitigate at every step of the collection process.

When selecting a sample, it is best that the sample is small and controlled rather than large and uncontrolled. To select the sample, researchers can choose between two methods: probability (random) sampling and non-probability sampling. Probability sampling involves choosing individuals at random, while non-probability sampling involves selecting individuals to represent the larger target population. For example, if the target population is 60% female and 40% male, then researchers would ensure that 60% of the selected sample is female and 40% is male. Non-probability sampling ensures that the survey is balanced; however, random sampling is better at avoiding gender bias.
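To make the difference between the two approaches concrete, here is a minimal sketch in Python. It assumes a made-up population of 1,000 records with a `gender` field; the function names, the record layout, and the 60/40 quotas are illustrative only and do not come from any particular survey tool.

```python
import random


def simple_random_sample(population, n, seed=0):
    """Probability sampling: every individual has the same chance of selection."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(population, n)


def quota_sample(population, n, quotas, key="gender"):
    """Non-probability (quota) sampling: take individuals in the order they
    are encountered until each group's quota is filled."""
    targets = {group: round(n * share) for group, share in quotas.items()}
    counts = {group: 0 for group in targets}
    picked = []
    for person in population:
        group = person[key]
        if group in counts and counts[group] < targets[group]:
            picked.append(person)
            counts[group] += 1
        if len(picked) == sum(targets.values()):
            break
    return picked


# Hypothetical population: 600 women and 400 men.
population = [
    {"id": i, "gender": "female" if i < 600 else "male"} for i in range(1000)
]

random_sample = simple_random_sample(population, 100)
quota = quota_sample(population, 100, {"female": 0.6, "male": 0.4})
```

The quota sample always lands on exactly 60 women and 40 men, but who gets picked depends on the order in which people are encountered. The random sample will only be close to 60/40 on average, yet nobody is favored by their position in the list.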

Random sampling, however, can be expensive as it requires up-to-date data on the target population. If non-probability sampling cannot be avoided, weighting can be used to bring the data into balance with the target population. If the target population is evenly split by gender but data is collected from 400 individuals, 300 of whom are male and 100 female, men could be "down-weighted" by a factor of 2/3 to go from 75% to 50%, while women are "up-weighted" by a factor of 2 to go from 25% to 50%. Now the sample data is weighted evenly male and female to match the target population.
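The weighting arithmetic in the example above can be checked with a few lines of Python. This is only a sketch of the general idea (each group's weight is its target share divided by its sample share); the function name is made up for illustration.

```python
def group_weights(sample_counts, target_shares):
    """Weight for each group = target share / share observed in the sample."""
    total = sum(sample_counts.values())
    return {
        group: target_shares[group] * total / count
        for group, count in sample_counts.items()
    }


# Numbers from the example: 400 respondents, 300 male and 100 female,
# with a target population that is split evenly by gender.
sample_counts = {"male": 300, "female": 100}
target_shares = {"male": 0.5, "female": 0.5}

weights = group_weights(sample_counts, target_shares)
# The male weight works out to 2/3 and the female weight to 2.

# Effective (weighted) group sizes: about 200 each, i.e. 50% / 50%.
weighted = {g: sample_counts[g] * weights[g] for g in sample_counts}
```

Note that the 100 women now each count twice, so any quirks in that smaller group are amplified as well. Weighting restores balance, but it does not add information.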

Purchased

Organizations that buy data have a great amount of purchasing power. Buying from third-party providers that offer gender-disaggregated data is one of the best ways to encourage market intelligence companies to start incorporating time grain and gender binary fields into their schemas.

Beyond a few anomalies, the only datasets with the gender binary and time grain fields are those for which gender equality and/or empowerment is identified as a primary outcome. It is difficult to find gender-disaggregated data that may be identified as a potential source of power for an organization.

Alternative Collection Methods

Ultimately, most datasets are not gender-disaggregated. When gendered data is needed but the schema does not provide it, advanced analytics can be used to reduce bias. Via a predictive model, a gender variable can be introduced into the data. While not entirely accurate, a predictive model allows women to be identified, row by row, with relatively strong accuracy.

Automated gender and time grain classification has become increasingly prevalent, particularly with the rise of social media and targeted marketing. Researchers have come up with various algorithms using classification and ML concepts to identify gender.

The simplest approach to determining gender is via name, using reference table mapping. For example, Sarah is likely a female name and Rob is likely a male name. Tables provide the relative probability that the name is male or female. Gender may also be determined using more advanced ML models.
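A reference-table lookup of this kind might look like the sketch below. The table contents, the probabilities, and the 0.9 threshold are all invented for illustration; a real table would be built from a source such as census or registry name counts.

```python
# Hypothetical reference table: name -> probability that the name is female.
# These values are placeholders, not real statistics.
NAME_TABLE = {
    "sarah": 0.99,
    "rob": 0.01,
    "jordan": 0.45,
}


def infer_gender(name, table=NAME_TABLE, threshold=0.9):
    """Return 'female', 'male', or 'unknown' based on the table probability."""
    p_female = table.get(name.strip().lower())
    if p_female is None:
        return "unknown"  # name is not in the reference table
    if p_female >= threshold:
        return "female"
    if p_female <= 1 - threshold:
        return "male"
    return "unknown"  # ambiguous name: better to leave blank than to guess


labels = [infer_gender(n) for n in ["Sarah", "Rob", "Jordan", "Zyx"]]
```

Returning "unknown" for missing or ambiguous names keeps the uncertainty visible in the data instead of hiding it behind a forced binary guess.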

Advanced Analytics at AARP uses a Random Forest Model and an Auto ML Pipeline to predict gender in its member demographic records. Gender is an important factor in these records as it impacts profiling and targeting of programs. The process assesses gender using a 7-step model based on letter frequency, position, length, and age. The model displays 90% accuracy on a mixed name list with names already existing in AARP member data. While this is sufficient to improve AARP targeting, it is worth considering that the model can only impute a binary gender variable and that on purely new, distinct names the model is only 76% accurate in predicting gender.
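To give a rough sense of what inputs like letter frequency, position, length, and age could look like in practice, here is a small feature-extraction sketch. It is not AARP's pipeline, whose details are not described here; the function name and feature names are assumptions, and the resulting dictionary is simply the kind of row that could be fed to a classifier such as a random forest.

```python
from collections import Counter
import string


def name_features(name, age=None):
    """Turn a first name (and optional age) into simple model features."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    counts = Counter(letters)

    # Letter frequency: how often each letter a-z appears in the name.
    features = {f"count_{c}": counts.get(c, 0) for c in string.ascii_lowercase}

    # Length and position: name length plus first and last letters.
    features["length"] = len(letters)
    features["first_letter"] = letters[0] if letters else ""
    features["last_letter"] = letters[-1] if letters else ""

    # Age is passed through unchanged; None means it is missing.
    features["age"] = age
    return features


row = name_features("Sarah", age=67)
```

Because features like these are learned from names already in the training data, it is not surprising that accuracy drops on names the model has never seen.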

Even without apparent data points, ML can be used to predict gender. In one particular research study, Timo Koch analyzed 308,632 WhatsApp messages from 199 participants using machine learning algorithms to predict participants’ age and gender. Rather than name, Koch studied language, emoji, and emoticon use. Koch found that “female participants used emojis more often, used a broader range of emoji, and incorporated more function words, especially personal pronouns in 1st person singular, in their messages than men.” Interestingly, the trained algorithm was able to predict participants’ gender significantly better than the baseline Random Forest Model used by AARP (Source).

Pitfalls and Risks

Although these models show promise in cleansing historical data with certain characteristics, there are a number of risks worth heeding in any endeavor to gender data.

Classification algorithms must be trained on existing datasets.
Bias can arise in the dataset as well as in specific data selection.
Only a binary gender variable can be imputed.
Predictive models are not 100% accurate.
Modeling gender requires a deep understanding of the data source and bias considerations.
There are steep initial fixed costs associated with implementing data analytics.