On the web, it is very common to find dataset displayed in column-wise manner as seen below.
As you will already noticed, each new record is separated by the company name in bold capital letters. So, lets add a sign to separate one record from the other. I used this "------------------------" sign, but you can use anything as long as it is unique and not part of the records itself.
So, the working data copied from the website is like this:-
ACCESS CREDIT MANAGEMENT, INC. Tim Cullen, Attorney 11225 Huron Ln Ste 222 Little Rock, AR 72211-1861 United States Phone: (501) 664-2922 Fax: (501) 664-3207 MAP Attorney ------------------------ CREDIT CONTROL CO., INC. Bill Caldwell, President Bill Caldwell, Ethics Contact 10201 W Markham St Ste 104 Little Rock, AR 72205-2180 United States Phone: (501) 225-2050 Fax: (501) 225-2135 ACA Member since 1982 Line of Business: Third Party Collections ------------------------ THE MCHUGHES LAW FIRM, PLLC Becky A. McHughes Esq., Attorney at Law 10810 Executive Center Dr Danville Bldg Ste 312 Little Rock, AR 72211 United States Phone: (501) 376-9131 Fax: (501) 374-9332 http://www.mchugheslaw.com ACA Member since 2013 Line of Business: Law Firm Line of Business: Third Party Collections ------------------------ THE MCHUGHES LAW FIRM, PLLC Becky A. McHughes Esq., Attorney at Law 10809 Executive Center Dr Danville Bldg Ste 312 Little Rock, AR 72204 United States Phone: (501) 376-9131 Fax: (501) 374-9332 MAP Attorney Lowell ------------------------ CENTRAL RESEARCH, INC. Karena Holt, Vice President of Operations Karena Holt, Ethics Contact 122 N. Bloominton Ste 1 Lowell, AR 72745 United States Phone: (479) 419-5456 Fax: (479) 419-5460 http://www.central-research.com ACA Member since 2016 Line of Business: Third Party Collections ------------------------ CENTRAL RESEARCH, INC. Shane Taylor 106 N Bloomington Ste S Lowell, AR 72745-8988 United States Phone: (479) 419-5456 MAP Attorney Mabelvale ------------------------ FIRST COLLECTION SERVICES Chris Dunkum, President Chris Dunkum, Ethics Contact 10925 Otter Creek East Blvd Mabelvale, AR 72103-1661 United States Phone: (501) 455-1658 http://www.FCScollects.com ACA Member since 1983 Line of Business: Outsourced First Party or Billing Company Line of Business: Third Party Collections
What we really want is something like this:-
ACCESS CREDIT MANAGEMENT, INC. :: Tim Cullen, Attorney :: 11225 Huron Ln Ste 222 :: Little Rock, AR 72211-1861 :: United States :: Phone: (501) 664-2922 :: Fax: (501) 664-3207 :: MAP Attorney ------------------------ CREDIT CONTROL CO., INC. :: Bill Caldwell, President :: Bill Caldwell, Ethics Contact :: 10201 W Markham St Ste 104 :: Little Rock, AR 72205-2180 :: United States :: Phone: (501) 225-2050 :: Fax: (501) 225-2135 :: ACA Member since 1982 :: Line of Business: Third Party Collections ------------------------ THE MCHUGHES LAW FIRM, PLLC Becky A. McHughes Esq., Attorney at Law :: 10810 Executive Center Dr :: Danville Bldg Ste 312 :: Little Rock, AR 72211 :: United States :: Phone: (501) 376-9131 :: Fax: (501) 374-9332 :: http://www.mchugheslaw.com :: ACA Member since 2013 :: Line of Business: Law Firm :: Line of Business: Third Party Collections ------------------------ THE MCHUGHES LAW FIRM, PLLC :: Becky A. McHughes Esq., Attorney at Law :: 10809 Executive Center Dr :: Danville Bldg Ste 312 :: Little Rock, AR 72204 :: United States :: Phone: (501) 376-9131 :: Fax: (501) 374-9332 :: MAP Attorney :: Lowell ------------------------ CENTRAL RESEARCH, INC. :: Karena Holt, Vice President of Operations :: Karena Holt, Ethics Contact :: 122 N. Bloominton Ste 1 :: Lowell, AR 72745 :: United States :: Phone: (479) 419-5456 :: Fax: (479) 419-5460 :: http://www.central-research.com :: ACA Member since 2016 :: Line of Business: Third Party Collections
From vertical arrangement to horizontal arrangement. The horizontal (row-wise) arrangement, works best in spreadsheet. We will have common column for the same records.
What we have (vertical/column-wise arrangement)
What we want (horizontal/row-wise arrangement)
The Code
There are several ways to do this. Here, we will see two approaches, the first will split the data into chucks of list based on the separator string. While the second approach will make use of the groupby method from itertools package.
Read the data into a list
f = r"C:\Users\Yusuf_08039508010\Desktop\All in One.xlsx" df = pd.read_excel(f, sheet_name='Sheet3') # Convert the column-wise data to list... asign the separator string to a variable mylist = df['Data'].to_list() #.split('=======') mysepstring = '------------------------' df
Approach 1:
# The separator string "mysepstring" most be repeated at the end... def list_splitz(baseList, sepString): group = [] for x in baseList: if x != sepString: group.append(x) elif group: yield group group = [] # print(list(list_splitz(mylist, mysepstring))) df_list1 = list(list_splitz(mylist, mysepstring)) display(len(df_list1)) display(df_list1)
[['ACCESS CREDIT MANAGEMENT, INC.', 'Tim Cullen, Attorney', '11225 Huron Ln Ste 222', 'Little Rock, AR 72211-1861', 'United States', 'Phone: (501) 664-2922', 'Fax: (501) 664-3207', 'MAP Attorney'], ['CREDIT CONTROL CO., INC.', 'Bill Caldwell, President', 'Bill Caldwell, Ethics Contact', '10201 W Markham St Ste 104', 'Little Rock, AR 72205-2180', 'United States', 'Phone: (501) 225-2050', 'Fax: (501) 225-2135', 'ACA Member since 1982', 'Line of Business: Third Party Collections'], ['THE MCHUGHES LAW FIRM, PLLC', 'Becky A. McHughes Esq., Attorney at Law', '10810 Executive Center Dr', 'Danville Bldg Ste 312', 'Little Rock, AR 72211', 'United States', 'Phone: (501) 376-9131', 'Fax: (501) 374-9332', 'http://www.mchugheslaw.com', 'ACA Member since 2013', 'Line of Business: Law Firm', 'Line of Business: Third Party Collections'], ['THE MCHUGHES LAW FIRM, PLLC', 'Becky A. McHughes Esq., Attorney at Law', '10809 Executive Center Dr', 'Danville Bldg Ste 312', 'Little Rock, AR 72204', 'United States', 'Phone: (501) 376-9131', 'Fax: (501) 374-9332', 'MAP Attorney', 'Lowell'], ['CENTRAL RESEARCH, INC.', 'Karena Holt, Vice President of Operations', 'Karena Holt, Ethics Contact', '122 N. Bloominton Ste 1', 'Lowell, AR 72745', 'United States', 'Phone: (479) 419-5456', 'Fax: (479) 419-5460', 'http://www.central-research.com', 'ACA Member since 2016', 'Line of Business: Third Party Collections'], ['CENTRAL RESEARCH, INC.', 'Shane Taylor', '106 N Bloomington', 'Ste S', 'Lowell, AR 72745-8988', 'United States', 'Phone: (479) 419-5456', 'MAP Attorney', 'Mabelvale'], ['FIRST COLLECTION SERVICES', 'Chris Dunkum, President', 'Chris Dunkum, Ethics Contact', '10925 Otter Creek East Blvd', 'Mabelvale, AR 72103-1661', 'United States', 'Phone: (501) 455-1658', 'http://www.FCScollects.com', 'ACA Member since 1983', 'Line of Business: Outsourced First Party or Billing Company', 'Line of Business: Third Party Collections']]
Now, we can convert the list to a dataframe and subsequently to spreadsheet file.
df1 = pd.DataFrame(df_list1) df1.to_excel('CleanData.xlsx', index=None)
Approach 2:
# No need for the separator string "mysepstring" to be repeated at the end... from itertools import groupby df_list2 = [list(g) for k, g in groupby(mylist, lambda x: x != mysepstring) if k] len(df_list2) df2 = pd.DataFrame(df_list2) df2.to_excel('CleanData2.xlsx', index=None) df2
That is it!
No comments:
Post a Comment