Sunday, January 22, 2017

Combining multiple tabular data files together (excel, csv, html)

Introduction


If you work with any of these files "excel, csv, or html" regularly, you have probably encounter a situation where you have data in same file type say excel for instance and you needed to merge or combine those data from multiple excel files into a one big master file.

The traditional approach to such task is to open up all the files and then copy and paste their contents into the master file. This approach isn't efficient as it is slow and prone to errors.

Lets use the tools in python programming that is much more powerful than the manual copying and pasting most people do in excel.


The Problem


The basic concept is that there is a periodic need to combine data from multiple excel/csv/html files into a master file. I will combine some same files format in a directory into one big master file using python and its libraries (glob and pandas).

There are different possible scenarios, but for clear explanation and am going to use a very simple case here. Depending on what the situation is, any of these word (merge, join or concatenate) can used to explain what is aimed to be achieved.



Pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join/merge-type operations.

Here, I have six csv files with the same number of columns that required to be merged together into one master csv file.

The Solution


The first step to solving this problem is to collect all the files into one directory/folder as I have done above. Then proceed as follow:-
1) import the glob and pandas module
2) collect the names of the csv files into a list
3) loop through the list of csv file names, make each a dataframe and append into a list
4) put all the dataframes together into a new dataframe using concat() function
5) change the order of dataFrame columns by re-assign the dataframe with a list of the columns, re-arranged as needed
6) save the new dataframe to a master csv file

Now the final code will look like below:-



The above explanation is applicable for other file types such as excel and html.

That is it!
Thanks for reading.


Thanks for reading.

No comments:

Post a Comment