The Internet’s Domain Name System (DNS) converts domain names to numeric Internet Protocol (IP) addresses [1]. This makes navigating the web far easier for humans, who are better at remembering medium.com than a string of seemingly random numbers. But as with almost all activity in cyberspace, cyber threats have exploited DNS and will continue to do so [2].
Because of this, cyber security-focused data scientists often find themselves handling DNS data. Typically, this consists of a list of domains with not much else; finding unusual or malicious domains can prove challenging. However, understanding domain structures and basic feature engineering allows data scientists to enrich domain name data for deeper analysis.
What’s in a (Domain) Name?
Domain names pack quite a bit of information. Consider the example of Wikipedia's English-language site, en.wikipedia.org:

The top-level domain (TLD) and second-level domain (SLD) combine to form what internet users typically remember: the root domain. TLDs can convey additional information, such as geographic location for country-code TLDs (ccTLDs). Some sites have subdomains; in the example above, Wikipedia has different languages associated with various subdomains. Wikipedia uses "en" to denote the English language subdomain, while fr.wikipedia.org would provide the French language subdomain [3]. Other common subdomain uses include blogs, shops, and support sites.
When it comes to DNS and cyber security data analysis, a list of domains such as medium.com, google.com, yahoo.com, etc., may not seem to offer much value. This is where data preparation with TLDextract [4] and concepts from feature engineering [5] come into play:
Feature engineering is the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data.
Domain Data
First, we need some data. The Majestic Million tracks the top million domains based on referring subnets to gauge website importance or relevance [6]. The site offers CSV data exports under a Creative Commons Attribution license.
This walkthrough uses a sample of the top 100 domains from the Majestic Million available at this GitHub page along with the full Jupyter Notebook used in the analysis. Download the notebook and data to follow along. Here’s a sample of what the raw data looks like:

TLDextract
Extracting the TLDs, subdomains, and so on from a domain may seem as simple as splitting the string at the periods. However, some TLDs have multiple parts, as with many country-code domains (for example, co.uk). TLDextract accounts for this, which makes it a more reliable tool than a homebuilt function that simply splits strings on periods [7].
Start by installing TLDextract:
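To see why a naive split falls short, the sketch below compares splitting on periods with a longest-suffix match against a tiny, hand-picked suffix set. The names `naive_tld`, `split_domain`, and `KNOWN_SUFFIXES` are illustrative assumptions, not part of TLDextract, which relies on the full Public Suffix List rather than a hard-coded set.

```python
# Hypothetical, hand-picked suffix set for illustration only.
KNOWN_SUFFIXES = {"com", "org", "uk", "co.uk"}


def naive_tld(domain):
    """Treat whatever follows the last period as the TLD."""
    return domain.rsplit(".", 1)[-1]


def split_domain(domain, suffixes=KNOWN_SUFFIXES):
    """Return (subdomain, sld, tld) using the longest matching suffix."""
    labels = domain.lower().split(".")
    # Try the longest candidate suffix first so "co.uk" wins over "uk".
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    return "", domain, ""  # no known suffix found


print(naive_tld("bbc.co.uk"))            # uk
print(split_domain("bbc.co.uk"))         # ('', 'bbc', 'co.uk')
print(split_domain("en.wikipedia.org"))  # ('en', 'wikipedia', 'org')
```

The naive version reports "uk" as the TLD and would treat "co" as the second-level domain, which is exactly the mistake a suffix-aware tool avoids.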
pip install tldextract
And then import the library:
import tldextract
1. Strip the Second-Level Domain
To take a domain, such as medium.com, and extract "medium", run the following code:
df['SLD'] = [tldextract.extract(i).domain for i in df.Domain]
The resultant dataframe sample is:

2. Strip the Top-Level Domain
To extract a TLD, such as "com" from medium.com, run the following code:
df['TLD'] = [tldextract.extract(i).suffix for i in df.Domain]
The resultant dataframe sample is:

3. Extract Subdomains
To extract a subdomain, such as the "en" from en.wikipedia.org, run the following code:
df['subdomain'] = [tldextract.extract(i).subdomain for i in df.Domain]
Not all domains have a subdomain, so sort the dataframe in descending order to bring the rows with subdomains to the top:
df.sort_values(by='subdomain', ascending=False).head(3)
The resultant dataframe sample, sorted to view subdomains, is:

Note how the use of TLDextract has transformed a single column of domains into several columns of data. Looking at subdomains and TLDs can prove useful for various types of cyber security analysis. For example:
- Spamhaus tracks the Top 10 Worst TLDs for abuse, and offers a TLD Check function [8]. Extracting TLDs provides a way to identify the presence of commonly abused TLDs in a dataset.
- Subdomains pose an opportunity for hijacking, and are a known vulnerability according to the MITRE ATT&CK Framework [9].
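As a minimal sketch of the first use case, extracted TLDs can be compared against a watchlist. Everything here is a stand-in: `WATCHLIST` uses reserved TLDs (RFC 2606) as placeholders rather than any real abuse ranking, and `rows` is hypothetical data in place of the dataframe's Domain and TLD columns.

```python
from collections import Counter

# Placeholder watchlist built from reserved TLDs (RFC 2606); substitute a
# real, current list of abused TLDs before using this for analysis.
WATCHLIST = {"test", "invalid"}

# Hypothetical (domain, TLD) pairs standing in for the dataframe's
# Domain and TLD columns.
rows = [
    ("medium.com", "com"),
    ("login.test", "test"),
    ("wikipedia.org", "org"),
    ("promo.invalid", "invalid"),
    ("shop.test", "test"),
]

# Domains whose TLD appears on the watchlist, in input order.
flagged = [domain for domain, tld in rows if tld in WATCHLIST]
# How often each watchlisted TLD occurs.
counts = Counter(tld for _, tld in rows if tld in WATCHLIST)

print(flagged)               # ['login.test', 'promo.invalid', 'shop.test']
print(counts.most_common())  # [('test', 2), ('invalid', 1)]
```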
Next Steps: Feature Engineering Concepts
While TLDextract provides a wealth of information, the domain names themselves contain more data worth extracting:
1. Letter Ratios
Domain names are not made of letters alone: they can also contain digits, hyphens, and the periods that separate their sections. The ratio of letters to total characters can provide insight into the characteristics of a domain. The following Python function extracts the letter ratio:
def LetterRatio(domain):
    # Fraction of characters in the domain that are letters
    if len(domain) > 0:
        return len(list(filter(str.isalpha, domain))) / len(domain)
    # Empty input: return a sentinel instead of dividing by zero
    return "No Domain"
Try running a sample domain, such as medium.com:
LetterRatio("medium.com")
This should return 0.9, since there are 10 characters, of which 9 are letters and one is a period. Apply this to the dataframe with the following code:
df['LetterRatio'] = df['Domain'].apply(LetterRatio)
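One design consideration: the function above returns the string "No Domain" for empty input, which would mix strings and floats in the LetterRatio column. A possible variant, sketched here under the assumption that a numeric missing value is preferable, returns NaN instead. The name `letter_ratio` and the sample domain `a1b2c3.example` are illustrative.

```python
import math


def letter_ratio(domain):
    """Share of characters in `domain` that are letters; NaN if empty."""
    if not domain:
        return math.nan
    return sum(ch.isalpha() for ch in domain) / len(domain)


print(letter_ratio("medium.com"))      # 0.9
print(letter_ratio("a1b2c3.example"))  # 10 letters out of 14 characters
print(math.isnan(letter_ratio("")))    # True
```

Keeping the column numeric means later steps such as summary statistics or plotting do not need to filter out string values first.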

2. Domain Sections
The domain sections feature counts the period-separated sections in a domain. For example, medium.com has two sections, separated by a single period. The function for extracting this is:
def DomainSections(domainName):
    # Split on periods and count the resulting sections
    array1 = domainName.split(".")
    return len(array1)
Apply this to the dataframe with the following code:
df['DomainSections'] = df['Domain'].apply(DomainSections)
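One caveat worth illustrating: a plain split also counts empty strings, so a fully qualified name with a trailing period, such as "medium.com.", would be reported as three sections. The variant below is a sketch that ignores empty pieces; the name `count_labels` is an assumption, and whether trailing periods appear in a given dataset depends on its source.

```python
def count_labels(domain_name):
    """Count non-empty, period-separated labels in a domain name."""
    return len([label for label in domain_name.strip().split(".") if label])


print(len("medium.com.".split(".")))     # 3, because of the trailing ''
print(count_labels("medium.com."))       # 2
print(count_labels("en.wikipedia.org"))  # 3
```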

3. Character Count
This last one is a simple count of characters in the domain name. Apply it to the dataframe with the following code:
df['DomainCharacters'] = df['Domain'].str.len()

The Prepared Dataframe and its Uses
What was once a single column of domain names is now a dataframe with six additional columns. The derived features can support future work, such as identifying the presence of abused TLDs or detecting anomalies using letter ratios and character counts. One basic visualization for identifying outliers is a boxplot of letter ratios:
Note that in this sample, any domain with a letter ratio below 0.81 is an outlier.
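Boxplots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically, assuming this convention and using a small hypothetical sample, so its fences will differ from the threshold quoted above for the notebook's data. The helper `letter_ratio` and the domain `a1b2c3.example` are illustrative.

```python
import statistics


def letter_ratio(domain):
    """Share of characters in `domain` that are letters."""
    return sum(ch.isalpha() for ch in domain) / len(domain)


# Hypothetical sample; "a1b2c3.example" is a made-up, digit-heavy name.
domains = [
    "medium.com", "google.com", "yahoo.com", "wikipedia.org",
    "example.com", "github.com", "mozilla.org", "a1b2c3.example",
]
ratios = {d: letter_ratio(d) for d in domains}

# Quartiles with linear interpolation between data points.
q1, _, q3 = statistics.quantiles(ratios.values(), n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Domains whose ratio falls outside the whisker fences.
outliers = [d for d, r in ratios.items() if r < lower or r > upper]
print(outliers)  # ['a1b2c3.example']
```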
This is a simple starting point for domain name analysis, and the extracted features can inform deeper, more advanced follow-on analyses. Feel free to use the notebook available at this GitHub page for your own domain name data preparation.
References
[1] Cloudflare, What is DNS? | How DNS works (2021), Cloudflare.
[2] MITRE | ATT&CK, Application Layer Protocol: DNS, Sub-technique T1071.004 (2021), MITRE.
[3] Wikipedia: The Free Encyclopedia, Wikipedia (2021).
[4] J. Kurkowski, GitHub – john-kurkowski/tldextract (2021), GitHub.
[5] Wikipedia, Feature engineering (2021).
[6] Majestic, The Majestic Million (2021).
[7] J. Kurkowski, GitHub – john-kurkowski/tldextract (2021), GitHub.
[8] The Spamhaus Project, The Top 10 Most Abused TLDs (2021), Spamhaus.
[9] MITRE | ATT&CK, Compromise Infrastructure: Domains (2021), MITRE.