Wednesday, November 4, 2020

Split dataframe into chunks

The dataset we will use for this demo is the sample 'World Cities Database' from Simplemaps.com. Download a copy from the given url above.



Upon inspecting the file, you will see that there are 15493 rows and 11 columns. 

There several ways to split a dataframe into parts or chunks. How it is split depends on how to dataframe want to be used. In this post we will take a look at three ways a dataframe can be split into parts. Specifically, we will look at:-

1) Split dataframe into chunks of n files

This will split dataframe into given number of files. One way to achieve the splitting of a dataframe into chunks of evenly specified number is to use the numpy array_split() function. Using the default axis=0 in numpy array_split() function, splits the dataframe by rows. To split it by column, set axis=1

Lets say we want to split the dataframe into 11 files, the code will be:-

import pandas as pd

# Read the file into a df...
world_df = pd.read_excel(r"C:\Users\Yusuf_08039508010\Desktop\worldcities.xlsx")
world_df.sample(10)

# Check how many rows and columns are in the df...
world_df.shape

# Split dataframe into chuncks of n files...
i = 1
for x in np.array_split(world_df, 11, axis=0):
    print('Processing df... ', i)
    
    x.to_excel('worldcities_CHUNK-N-FILES'+str(i)+'.xlsx', index=None)
    
    i += 1
If on the other hand, we want to split the dataframe by its columns (that is to have each column in a separate file). All we have to do is set the keyword argument to: axis=1



2) Split dataframe into chunks of n rows

This will split dataframe into given number of rows. So, lets say our client want us to split the dataframe into chunks of 2000 rows per file. That means we will have 7 files of 2000 rows each and 1 file of less that 2000 rows. In total, we will have 8 files.

import math
import pandas as pd

# Read the file into a df...
world_df = pd.read_excel(r"C:\Users\Yusuf_08039508010\Desktop\worldcities.xlsx")

# Get number of parts/chunks... by dividing the df total number of rows by expected number of rows plus 1
expected_rows = 2000
chunks = math.floor(len(world_df['country'])/expected_rows + 1)

# Slice the dataframe...
df_list = []

i = 0
j = expected_rows
for x in range(chunks):
    df_sliced = world_df[i:j]
    
    # df_list.append(df_sliced)
    df_sliced.to_excel('worldcities_CHUNK-BY-ROWS_'+str(i)+'.xlsx', index=None)
    
    i += expected_rows
    j += expected_rows

Note that we used math.floor() to round down the calculated chunk number. We could have also used math.trunc() or simply wrap int() function on the calculation.



3) Split dataframe into chunks group of column items

This will split dataframe into given groups found in a column. Lets split on the country column. Since there 223 countries, there will be 223 group which mean there will also be 223 files to be generated.

The code is below;-

import pandas as pd

# Read the file into a df...
world_df = pd.read_excel(r"C:\Users\Yusuf_08039508010\Desktop\worldcities.xlsx")

# Group by country column...
gb = world_df.groupby('country') #['country'].value_counts()

# Get each grouped df into list...
dfs = [gb.get_group(x) for x in gb.groups]
# len(dfs)

i = 1
for df in dfs:
    print('Processing df... ', i)
    
    df.to_excel('worldcities_CHUNK-COLUMN-GROUP_'+str(i)+'.xlsx', index=None)
    
    i += 1
Note: we could have used the names of the countries accessed from the groupby object to name the files accordingly like this:

# Access the keys/values of the groups...
groups_value = dict(list(gb))
groups_value.keys()


Another approach that save the files with their respective country names:-
# Group the dataframe by country column using groupby() method
group_df = world_df.groupby('country')
# Generate list of the countries (groupby keys)
group_keys = list(group_df.groups.keys()) # Loop over the keys and save each group to excel file for s in group_keys: # save_df = group_df.get_group('Abia') save_df = group_df.get_group(s) # make the file name, e.g: "Abia state.xlsx" s_name = s + ' state.xlsx' save_df.to_excel(s_name, index=None)

That is it!

No comments:

Post a Comment