A FUN PROJECT: Determine the Gender of a Name by Building a Web Application in 40 Lines of Python Code

dharmawan.raharjo
5 min read · Dec 18, 2020

If I have seen further it is by standing on the shoulders of Giants — Isaac Newton

In this article, I would like to share a fun project in which we build a model to determine the gender of a name. In short, the article covers how to reuse a previous project (NLP in the oil industry, linked in my profile), save the model, and deploy it to a web application with ease, using the streamlit module and fewer than 40 lines of Python code.

Web application using streamlit

Introduction: what we can build from a name dataset and how it may benefit our business

I can think of two possible uses for a name dataset that contains name, gender, and origin.

1. We can use “name” to predict gender information for creating a better recommendation model

In my opinion, having a gender feature in a business dataset is beneficial because it lets us study consumer behavior from a gender perspective. For example, with gender data we can describe statistically what men and women purchase on an e-commerce site and how often. With those statistics, we can build a recommendation model that uses demographics more specifically. This is one case that shows why gender data matters. There could be many other reasons, but that is the one that comes to mind right now.

2. We can use “name” to predict a person’s origin

If you hear the names “agam”, “teuku”, or “cut”, “Aceh” will probably come to mind right away, because we think of those names as intrinsically tied to that region. Likewise, a certain clan name may lead us to think of Batak, Padang, or Manado. We build this cognitive ability through exposure: whenever we meet people and learn their names and hometowns, our brain gradually forms the association. If we have this data, I think it is possible to predict where a name comes from, since a name may carry inherent, latent geographical information.

So, a name dataset is another “hidden treasure”. Today, we will play with such a dataset to create an ML classifier that predicts someone’s gender from his or her name. So stay tuned!!

How to build: 3 simple steps to follow

1. Build the dataset

First of all, we need to prepare the data that we’ll feed to the computer. The dataset looks like this.

Name-gender dataset

For this example, I used web scraping to collect names and genders from a public website. I’ll include the dataset on my GitHub page if you want to play around. However, this is a very small dataset from a very specific region, my hometown Bengkulu. If the model consistently predicts incorrectly, it means you need to enrich the dataset with more names from different regions.
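Before training, it can help to normalize the scraped rows. The following is only a minimal sketch of that clean-up step: the "name,gender" column layout, the "m"/"f" labels, and the sample rows are assumptions made for illustration, not the actual scraped data.

```python
import csv
import io

# Hypothetical sample in "name,gender" form; the real scraped dataset is larger.
RAW_CSV = """name,gender
Budi,m
 siti   aminah ,f
BUDI,m
Agus,m
"""

def load_name_gender(csv_text):
    """Return de-duplicated (name, gender) pairs with normalized names."""
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Lowercase the name and collapse repeated or surrounding whitespace.
        name = " ".join(row["name"].lower().split())
        gender = row["gender"].strip().lower()
        if name and (name, gender) not in seen:
            seen.add((name, gender))
            rows.append((name, gender))
    return rows

dataset = load_name_gender(RAW_CSV)
print(dataset)
# [('budi', 'm'), ('siti aminah', 'f'), ('agus', 'm')]
```

Dropping exact duplicates keeps a frequently scraped name from dominating such a small dataset, although whether that is desirable depends on how the names were collected.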

2. Train the model

To build the model, we’ll repurpose code from the previous project, which uses fastText. FastText is a library created by Facebook for efficient learning of word representations and sentence classification.

We use fastText to create vector representations of our text data, which we’ll then use to train an ML classifier. FastText treats each word as a bag of character n-grams, so the vector of a word is built as the sum of the vectors of its character n-grams. For example, with n-grams of length 3 to 5, the word “phone” is broken into “pho”, “hon”, “one”, “phon”, “hone”, and “phone”. This feature lets us construct a vector even for a word outside the vocabulary the model was trained on.
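To make the n-gram idea concrete, here is a small sketch that lists the character n-grams of a word. The function name char_ngrams and its parameters are my own, not part of the fastText API. The real library also wraps each word in the boundary markers "<" and ">" before extracting n-grams, which the optional boundaries flag imitates.

```python
def char_ngrams(word, min_n=3, max_n=5, boundaries=False):
    """List the character n-grams of a word, shortest n first.

    With boundaries=True the word is wrapped in "<" and ">" first,
    imitating the boundary markers that fastText adds.
    """
    token = f"<{word}>" if boundaries else word
    return [
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    ]

print(char_ngrams("phone"))
# ['pho', 'hon', 'one', 'phon', 'hone', 'phone']

print(char_ngrams("phone", min_n=3, max_n=3, boundaries=True))
# ['<ph', 'pho', 'hon', 'one', 'ne>']
```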

Knowing this, we can expect the model to perform better than TF-IDF on a small dataset. Why? Because it can create a vector for a name it has never seen before, as long as the name shares n-grams with the training data. The fastText model file I generated is about 1 GB. I can’t say for sure what is inside it, but most likely the bulk is the n-gram table: fastText hashes character n-grams into a large fixed number of buckets (2,000,000 by default) and stores a vector for each one. More importantly, if we use TF-IDF, a small dataset will not work well, since its bag-of-words vocabulary will probably not contain the new names we want to classify. That’s why I think fastText n-grams perform best in the name-to-gender classification task.
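The out-of-vocabulary argument can be illustrated without any library. In this sketch, the training names are hypothetical and the helper char_ngrams is my own. A word-level vocabulary (the kind a default TF-IDF setup builds) has no entry for the unseen name, while the n-gram vocabulary already covers every one of its pieces.

```python
def char_ngrams(word, min_n=3, max_n=5):
    """Return the set of character n-grams of a word."""
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }

# Hypothetical training names, for illustration only.
train_names = ["rahmat", "rahmawati", "sulastri"]

word_vocab = set(train_names)  # whole-word vocabulary
ngram_vocab = set().union(*(char_ngrams(n) for n in train_names))

new_name = "rahma"  # never seen during training
known_as_word = new_name in word_vocab
shared = char_ngrams(new_name) & ngram_vocab

print(known_as_word)
# False
print(sorted(shared))
# ['ahm', 'ahma', 'hma', 'rah', 'rahm', 'rahma']
```

A whole-word model therefore has nothing to say about “rahma”, whereas an n-gram model can still assemble a vector from the six pieces it has already seen.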

After training the fastText model, we use it to create a vector (of length 128) that represents each name in the dataset.
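A rough sketch of this step is shown below. It assumes the gensim implementation of fastText (gensim 4.0 or later), which may not be the exact library used in the original project. The helper names, the window and epochs values, and the choice to average the vectors of the parts of a multi-word name are all assumptions for illustration. Only the vector length of 128 comes from the article.

```python
def tokenize(full_name):
    """Lowercase a full name and split it into its parts."""
    return full_name.lower().split()

def train_name_vectors(names, vector_size=128):
    """Train a fastText model on a list of names (assumes gensim >= 4.0)."""
    # Imported here so tokenize() can be used without gensim installed.
    from gensim.models import FastText

    sentences = [tokenize(name) for name in names]
    return FastText(
        sentences=sentences,
        vector_size=vector_size,  # length of each name vector
        window=2,                 # assumption: names are short
        min_count=1,              # keep every name part, even rare ones
        min_n=3,                  # shortest character n-gram
        max_n=5,                  # longest character n-gram
        epochs=20,                # assumption: a few extra passes on a small dataset
    )

def name_to_vector(model, full_name):
    """Average the vectors of the parts of a name (an assumed pooling choice)."""
    import numpy as np

    tokens = tokenize(full_name)
    if not tokens:
        raise ValueError("name must contain at least one non-space character")
    # model.wv[token] also works for unseen tokens, via their n-grams.
    return np.mean([model.wv[token] for token in tokens], axis=0)
```

With these helpers, model = train_name_vectors(list_of_names) trains the model, model.save("fasttext_names.model") stores it under a hypothetical file name, and name_to_vector(model, "suzy bae") returns one vector of length 128.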

Name to vector representation

Since we now have a well-structured, tabular, numerical dataset with a target, we can simply feed this data to an ML classifier, in this case an SVC (support vector classifier), and perform hyperparameter tuning. We then save both the fastText word-vector model and the SVC classifier, which we will use to create our web app.
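The tuning and saving step might look like the sketch below. It uses plain scikit-learn (GridSearchCV) and joblib rather than whatever helper the original project relied on. The parameter grid, the three-fold cross-validation, and the file name svc_gender.pkl are assumptions chosen for illustration.

```python
def build_param_grid():
    """Return a small, assumed search space for an RBF-kernel SVC."""
    return {
        "C": [0.1, 1, 10, 100],
        "gamma": ["scale", 0.01, 0.1],
        "kernel": ["rbf"],
    }

def train_and_save(X, y, path="svc_gender.pkl"):
    """Tune an SVC with grid search, save the best model, and return it.

    X is the table of name vectors and y the matching gender labels.
    """
    # Imported here so build_param_grid() works without scikit-learn installed.
    import joblib
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    search = GridSearchCV(SVC(), build_param_grid(), cv=3, scoring="accuracy")
    search.fit(X, y)
    joblib.dump(search.best_estimator_, path)  # reload later with joblib.load(path)
    return search.best_estimator_, search.best_score_
```

Calling train_and_save(X, y) returns the best classifier together with its mean cross-validated accuracy, which gives a first hint of how well the small dataset generalizes.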

3. Create web apps using streamlit

I know of several frameworks for deploying data science work, namely Django, Flask, and Streamlit. I have never used Django, so I can’t say much about it. I did once use Flask to deploy my search engine model. It’s quite simple and straightforward, but you need some prior knowledge of HTML to develop the app. If you don’t have that background, don’t worry. Thanks to the developers of streamlit, we can now build a web application directly from a Python script without knowing web development.

Simply put, in streamlit the elements of a web page (such as buttons, titles, bars, etc.) are created with simple function calls whose results we can use as ordinary Python variables. You can build a simple web app in fewer than 40 lines of code. It’s extremely easy and very straightforward, with the drawback of limited customization. However, I think this framework is a real game changer for those who want to deploy their ML models fast. And given its potential, I believe more features will be added in the future.
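As a rough sketch of what such an app can look like (this is not the original script), the code below loads the two saved models and predicts the gender of a typed name. The file names fasttext_names.model and svc_gender.pkl, the use of gensim and joblib for loading, the averaging of name parts, and st.cache_resource (available in recent Streamlit releases) are all assumptions.

```python
def load_models(ft_path, clf_path):
    """Load the saved fastText model and classifier (assumes gensim and joblib)."""
    import joblib
    from gensim.models import FastText

    return FastText.load(ft_path), joblib.load(clf_path)

def predict_gender(name, embed, classifier):
    """Return the predicted label for a name, or None if the name is blank."""
    name = name.strip().lower()
    if not name:
        return None
    return classifier.predict([embed(name)])[0]

def main():
    import numpy as np
    import streamlit as st

    # Cache the loader so reruns of the script reuse the models already in memory.
    cached_load = st.cache_resource(load_models)

    st.title("Guess the gender of a name")
    name = st.text_input("Enter a name")

    if st.button("Predict") and name.strip():
        # Hypothetical file names; use whatever you saved during training.
        ft_model, classifier = cached_load("fasttext_names.model", "svc_gender.pkl")

        def embed(clean_name):
            # Assumed pooling: average the vectors of the parts of the name.
            return np.mean([ft_model.wv[part] for part in clean_name.split()], axis=0)

        st.success(f"Predicted gender: {predict_gender(name, embed, classifier)}")

if __name__ == "__main__":
    main()
```

Streamlit reruns the whole script on every interaction, which is why the model loading is cached and only happens once a name has been submitted.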

To run the script, run the following command in the Anaconda Prompt:

streamlit run any_name_of_web_apps_script.py

Looking closely at the result, the app wrongly classifies “Suzy bae” as a male name, since our dataset contains no Korean names. To improve the model, we can add more data to our dataset.

To sum up, libraries like scikit-learn, streamlit, fastText, pandas, numpy, and jcopml have truly democratized data science, to the point that anyone can now build an ML model with ease. Thank you to all the developers!!
