Monday, April 10, 2017

Cleaning big data CSV file with Python

Hello there,

I know I can open a CSV file in a spreadsheet software like MS Excel or a text editor like NotePad++ and edit/clean its contents (by removing unwanted contents).

In a situation where by the CSV file is very big (big data - it contains millions of rows and columns), if you try to open such file in a spreadsheet software or a text editor there is possibility of memory error from the software you are using as seen in the image below. This excel error says: "File not loaded completely", the file appeared to be beyond MS excel capabilities.

This is where using a data oriented language such as Python comes in handy. Python pandas module can efficiently load/open and process such heavy duty big data files without much difficulties.


In this case study, I have a big CSV file contain polygon coordinates for the boundary of all states' LGA in Nigeria. Which I needed to format to fit a certain XML format for further geo processing.

I will need to replace over 200,000+ "spaces" with "commas" and another over 200,000+ "commas" with "spaces" within the polygon coordinates geometry column structure.


1) Import the big csv file into python pandas dataframe

2) Write the algorithm that will replace the "spaces" with "commas" and vise-versa.

3) At the end, save the cleaned dataset into a new CVS file

That is it!
Thanks for reading

No comments:

Post a Comment