Geospatial Solutions Expert: Generating fake or dummy data using Python Faker library

There are many legitimate reasons why you might want access to fake/dummy data sets. Some of these reasons are listed below:-
1) They are very useful when you are just starting to build an app and don't have any data yet.
2) They are useful for testing or for filling databases with dummy data.
3) They protect data privacy when security or other constraints prevent you from using real data.

There are times when you need access to a large amount of real-world data to test an app you are developing. Then you suddenly realize that you don't have such a data set available, so you won't be able to put your app through a real-world test before its final launch.

If you find yourself in the scenario above, you are not alone. On this page, I will introduce you to a Python module that will help you generate a good amount of dummy or fake data that looks just like the real thing, so you can test-run your application.

The name of the module is: faker

Faker library
With this Python package, you can generate test data without infringing on people's privacy. For example, you can generate realistic names, addresses, latitude/longitude coordinates, phone numbers, fax numbers, occupations, profile titles, email addresses, website addresses, job titles, text data, random numbers, currencies, words, birthdates, hashes and UUIDs, dates/times etc.

Let's assume we need to test a banking database. Due to the sensitive nature of this kind of data, we cannot use real production data. So we need dummy/fake bank details to test-run the app database.

This is where the Python Faker library comes in handy. Let's see how it is used.

The first step is to install it using:

pip install Faker

Next, import the module and initialize a Faker generator like this:

from faker import Faker
fake = Faker()

Now we can use the fake object to generate all sorts of data type attributes as follows:-

# Numerical data type
fake.pybool()
fake.pydecimal(left_digits=5, right_digits=3, positive=True, min_value=None, max_value=None)
fake.pyfloat(left_digits=3, right_digits=3, positive=False, min_value=None, max_value=None)
fake.pyint(min_value=0, max_value=9999, step=1)
fake.latitude()
fake.longitude()


# String data type
fake.name()
fake.address()
fake.text()
fake.word()
fake.sentence()
fake.job()
fake.currency()
fake.currency_name()
fake.currency_code()
fake.country()
fake.user_name()
fake.first_name()
fake.last_name()
fake.phone_number()
fake.street_address()
fake.city()
fake.state()
fake.zipcode()
fake.company()
fake.catch_phrase()
fake.color_name()
fake.name_female()
fake.name_male()

# Internet related strings...
fake.md5()
fake.sha1()
fake.sha256()
fake.uuid4()

fake.email()
fake.safe_email()
fake.free_email()
fake.company_email()
fake.hostname()
fake.domain_name()
fake.domain_word()
fake.tld()
fake.ipv4()
fake.ipv6()
fake.ipv4_private()
fake.mac_address()
fake.slug()
fake.image_url()

# Date/Time ....
fake.date_of_birth(minimum_age=30)
fake.century()
fake.year()
fake.month()
fake.month_name()
fake.day_of_week()
fake.day_of_month()
fake.timezone()
fake.am_pm()


# Other data types/structures
fake.random_int(0, 100) # signature: fake.random_int(min=0, max=9999, step=1)
fake.random_digit()
fake.profile()
fake.pystr(min_chars=None, max_chars=10)
fake.pylist(5, False, 'str') # (nb_elements=5, variable_nb_elements=False, *value_types='str')
fake.pytuple(10, True, 'str') # (nb_elements=10, variable_nb_elements=True, *value_types='str')
fake.pydict(10, True, 'url') # (nb_elements=10, variable_nb_elements=True, *value_types='url')
fake.pyiterable(10, True, 'date') # (nb_elements=10, variable_nb_elements=True, *value_types='date')
fake.pyset(10, True, 'list') # (nb_elements=10, variable_nb_elements=True, *value_types='list')
fake.pystruct(10, 'float') # (count=10, value_types='float')
# NOTE: *value_types can be any of the datatypes: int, float, str, url, date, list, tuple, dict, set

# If you noticed the issue with *, then see this link: https://github.com/FactoryBoy/factory_boy/issues/387

Banking App Database

As you can see, there are a lot of attributes we can generate data for. You can look up more on the Faker documentation page. For now, let's generate some data for our banking app database.

Assume that for our bank management system's customer data we need hundreds, thousands or millions of records with these attributes: Title, Account_name, Account_number, Account_type, Account_PIN, Card_number, Open_date, Account_balance, Date_of_birth, Occupation, Street_Address, City, Zip, Email and Phone_number.

Obviously, these attributes are sensitive and we cannot use production data like this for development. Hence, we need a fake/dummy dataset to work with. Let's use the faker module to get these datasets.

Treating each attribute one at a time:-

1) Title
This column should contain honorific titles such as Mr., Mrs., Miss, Ms., Mx., Sir., Dr., Rev., Lady, Lord, Mr President, General, Captain, Father, Pastor etc.

Unfortunately, at the time of writing, I couldn't find such a function in the module. When I tried something like fake.title(), I got an AttributeError.

So, we can make up for this by using a list and randomly picking from it like this:-

import random

fake_title = ['Mr.', 'Mrs.', 'Miss', 'Ms.', 'Mx.', 'Sir.', 'Dr.', 'Rev.', 'Lady', 'Lord', 'Mr President', 'General', 'Captain', 'Father', 'Pastor']
random.choice(fake_title)

2) Account_name
This column is the full name of the account holder. Usually it will be a person's name, so we use the function fake.name(), which returns a random name each time it is called.

3) Account_number
Here we need a large number of more than five digits, so we use fake.pyint(min_value=2000000000, max_value=9000000000, step=10). Note that I specified the minimum and maximum values so that it returns something close to what is expected of a real-world account number.

4) Account_type
Just like 'Title' above, I couldn't find a Faker function for the account type, so we will use a list like so:-

import random
account_type = ['Savings', 'Checking', 'Current', 'Joint', 'Personal']
random.choice(account_type)

5) Account_PIN
fake.pyint(min_value=1000, max_value=9999, step=1)

6) Card_number
fake.pyint(min_value=100000000000, max_value=999999999999, step=1)

7) Open_date
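from datetime import datetime
fake.date(pattern='%d-%m-%Y', end_datetime=datetime(2020, 12, 31))  # a bare integer such as 2020 is not interpreted as a calendar year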

8) Account_balance
fake.pyint(min_value=100, max_value=9999, step=1)

9) Date_of_birth
fake.date_of_birth(minimum_age=18, maximum_age=105)

10) Occupation
fake.job()

11) Street_Address
fake.street_address()

12) City
fake.city()

13) Zip
fake.zipcode()

14) Email
fake.email()

15) Phone_number
fake.phone_number()

Putting it all together

We will take advantage of the fact that a dictionary can be used to generate dataframe columns, as seen below:-

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Smith', 'Hassan', 'Tim'],
    'Age': [28, 49, 34, 25],
    'Acct Type': ['Savings', 'Checking', 'Current', 'Joint']
})

df

The only difference here is that we will generate the lists dynamically. Let's say we want to generate 100 customer records; then we will use the range(100) function.

import random
from datetime import datetime

import pandas as pd
from faker import Faker

fake = Faker()

fake_title = ['Mr.', 'Mrs.', 'Miss', 'Ms.', 'Mx.', 'Sir.', 'Dr.', 'Rev.', 'Lady', 'Lord', 'Mr President', 'General', 'Captain', 'Father', 'Pastor']
account_type = ['Savings', 'Checking', 'Current', 'Joint', 'Personal']

# create dataframe with all the 15 columns...
df = pd.DataFrame({'Title': [random.choice(fake_title) for _ in range(100)],
                   'Name': [fake.name() for _ in range(100)],
                   'Account Number': [fake.pyint(min_value=2000000000, max_value=9000000000, step=10) for _ in range(100)],
                   'Account Type': [random.choice(account_type) for _ in range(100)],
                   'Account PIN': [fake.pyint(min_value=1000, max_value=9999, step=1) for _ in range(100)],
                   'Card Number': [fake.pyint(min_value=100000000000, max_value=999999999999, step=1) for _ in range(100)],
                   # end_datetime takes a datetime; a bare integer such as 2020 is not read as a year
                   'Open Date': [fake.date(pattern='%d-%m-%Y', end_datetime=datetime(2020, 12, 31)) for _ in range(100)],
                   'Balance': ['₦'+ str(fake.pyint(min_value=100, max_value=9999, step=1)) for _ in range(100)],
                   'Date of Birth': [fake.date_of_birth(minimum_age=18, maximum_age=105) for _ in range(100)],
                   'Occupation': [fake.job() for _ in range(100)],
                   'Street Address': [fake.street_address() for _ in range(100)],
                   'City': [fake.city() for _ in range(100)],
                   'Zip Code': [fake.zipcode() for _ in range(100)],
                   'Email': [fake.email() for _ in range(100)],
                   'Phone Number': [fake.phone_number() for _ in range(100)],
                  })

# writing .xlsx files requires an Excel engine such as openpyxl to be installed
df.to_excel('BankDB.xlsx', index=False)

df

If you look closely, this is definitely not realistic data, in the sense that some entries don't match the other attributes in the same row. For example, we may have "Mr." in the Title column next to a name that is obviously female.

There are possible workarounds for these minor issues, as explained in this article: How to generate realistic test data with Faker. However, for our use case here, it doesn't really matter. This is a good sample data set to work with.

Note: if you prefer a particular data type for any column, you can check the current types with df.dtypes and set them after creating the dataframe with df.astype({'column': 'type'}).
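
A minimal sketch of checking and changing column data types with pandas follows. The tiny dataframe here is made up purely for illustration; the same df.dtypes and df.astype() calls apply to the banking dataframe.

```python
import pandas as pd

df = pd.DataFrame({
    'Account Number': [2000000010, 2000000020],
    'Zip Code': ['00501', '90210'],
    'Open Date': ['31-08-2020', '01-01-2019'],
})

# Check the current data type of every column
print(df.dtypes)

# Store account numbers as text rather than integers
df = df.astype({'Account Number': 'str'})

# Parse the day-month-year strings into real datetime values
df['Open Date'] = pd.to_datetime(df['Open Date'], format='%d-%m-%Y')

print(df.dtypes)
```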

The end!

Monday, August 31, 2020
