Wednesday, October 9, 2019

Calculating the total size of zip code maps


On this US printable zip codes maps page, there is a list of all the US states zip code maps with their respective sizes in braces like this "Alabama ZIP Code Map (3.59MB)" as seen below...



Lets calculate the total size of all the maps using python scripting!

Off course, there are several or even better ways to get this done. But here we want to test our python skills on this, let us stick to using python 😏.

Some other reasons it is good idea we use python is that we can easily use our python skill to:-
1) make HTTP request to scrape/download the map data
2) generate the download links on the fly
3) create a bot to monitor change in map size (which could indicate the map has been updated).
4) visualization of the string including map/geographic visualization.

The list can go on and on, but I will keep it simple here to just calculate the total sum the map sizes.

Step 1:
First thing is to get the string/text off the web page into our python environment. There are several ways to do this as I have mentioned above, but I will just select, copy and paste it in a CSV file as seen below.



Step 2:
Read the CSV file in python. Here I will use the pandas module to read the CSV file, could have also used the CSV module to do this.





Now that we have the text in python, it is time to find our way of getting just the numerical part of the text/string that represent each file size in MB. So, basically we need to clean each string to get its size then we can sum the entire sizes.

For each row that look like this "Alabama ZIP Code Map (3.59MB)", we need it to be like this "3.59".

Step 3: Lets split each record at '('. Here I use a lambda function that takes in each row and apply the split function on it as seen below...

z1 = zip_df['Maps'].map(lambda x: x.split('('))



Now if you check the second element on each row, the value should look like this: '3.59MB)'. We are close, we just need to remove the trailing ')' from each row.


Step 4: Let's take a deeper look at the series 'z1' above. Each row in the series is just a list with two items. The first part of the item is the state name followed by the string ' ZIP Code Map ', then the second part is file size.

We can now make a two columns dataframe one for the state names and the other for the sizes as follow. In the process, we will strip out ')' using the strip(')') function, we could also strip out the repeated string ' ZIP Code Map ' on the state names.


state_list, size_list = [], []

i = 0
for ele in z1:
    state_list.append(z1[i][0])
    size_list.append(z1[i][1].strip(')'))
    
    i = i+1
    
new_df = pd.DataFrame(list(zip(state_list, size_list)), columns=['State', 'Size'])
new_df




Step 5: Now we can focus on the list 'size_list' or the column 'new_df['Size']' to get the total size of the zip codes, here I will use the 'size_list' list.

If you take a look at the list, it looks like this:-
size_list = ['3.59MB', '0.96MB', '2.12MB', '3.95MB', '3.72MB', '2.32MB', '3.63MB', '1.19MB', '2.94MB', '4.57MB', '702.01KB', '1.25MB', '3.59MB', '3.44MB', '3.12MB', '2.24MB', '2.58MB', '3.17MB', '1.86MB', '2.47MB', '3.05MB', '3.03MB', '2.73MB', '2.66MB', '3.97MB', '1.54MB', '1.73MB', '933.34KB', '1.95MB', '3.76MB', '2.30MB', '4.13MB', '2.37MB', '1.65MB', '5.71MB', '1.90MB', '2.13MB', '5.80MB', '1.05MB', '1.76MB', '3.35MB', '1.49MB', '1.90MB', '4.42MB', '1.71MB', '2.33MB', '2.44MB', '2.37MB', '1.06MB', '3.87MB', '3.35MB', '1.60MB']
If you look at the list closely, you will notice that is contain some entries in megabytes (MB) and some in kilobytes (KB), we need to calculate them separately, convert and sum the total size.

For the conversion, here below is nice visual explanation by TechSpot.com.


In our case, we just need to convert from kilobytes to megabytes which means we will divide by 1,024. Or if we decide to convert from megabytes to kilobytes, then we will multiply by 1,024.


Step 6: Lets separate the sizes list into MB and KB. Then remove the alphabetical strings and sum them before doing the conversion.

# Separate MB from KB...
mb_list, kb_list = [], []

for s in size_list:
    if 'MB' in s:
        mb_list.append(float(s.strip('MB')))
    if 'KB' in s:
        kb_list.append(float(s.strip('KB')))
        

Step 7: Sum the content of each list, then either convert from kilobytes to megabytes or from megabytes to kilobytes and then sum the results to get the total size.

# Sum and Convert... 
total_mb = sum(mb_list)
total_kb = sum(kb_list)

# To convert from kilobytes to megabytes, divide by 1,024
total_kb = total_kb/1024

total_size = total_mb + total_kb
print(total_size)

The total size is: 137.417MB


Putting it all together, the final code is as seen below:-

import pandas as pd

# Read the spreadsheet file...
zip_df = pd.read_csv(r"C:\Users\Yusuf_08039508010\Desktop\GIS Data Processing Scripts\US_ZipMap_Size.csv")

# Split each row at '('...
z1 = zip_df['Maps'].map(lambda x: x.split('('))

# Make df of the splited parts...
state_list, size_list = [], []

i = 0
for ele in z1:
    state_list.append(z1[i][0])
    size_list.append(z1[i][1].strip(')'))
    
    i = i+1
    
new_df = pd.DataFrame(list(zip(state_list, size_list)), columns=['State', 'Size'])


# Separate MB from KB...
mb_list, kb_list = [], []

for s in size_list:
    if 'MB' in s:
        mb_list.append(float(s.strip('MB')))
    if 'KB' in s:
        kb_list.append(float(s.strip('KB')))
        

# Sum and Convert... 
total_mb = sum(mb_list)
total_kb = sum(kb_list)

# To convert from kilobytes to megabytes, divide by 1,024
total_kb = total_kb/1024

total_size = total_mb + total_kb
print(total_size)


That is it!

No comments:

Post a Comment