Friday, July 23, 2021

Parsing XML data in python

 Extensible Markup Language (XML) is one of the data exchange formats. In the post, we will work through using python's 'xml.etree.ElementTree' module to parse its data content.

An XML data can be available in one of the three forms, namely:-

  • XML file
  • XML URL
  • XML String

Depending on how your XML in made available to you, there is slightly different approaches of reading it use the 'xml.etree.ElementTree' module. Lets see how below.

The sample XML I will use is the CD Catalog by W3Schools. I have copied the data into the three formats listed above as seen below;-


# XML file...
xml_file = r"C:\Users\Yusuf_08039508010\Documents\w3s_sample.xml"

# XML URL...
xml_url = 'https://www.w3schools.com/xml/cd_catalog.xml'

# XML String...
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Greatest Hits</TITLE>
    <ARTIST>Dolly Parton</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>RCA</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1982</YEAR>
  </CD>
  <CD>
    <TITLE>Still got the blues</TITLE>
    <ARTIST>Gary Moore</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Virgin records</COMPANY>
    <PRICE>10.20</PRICE>
    <YEAR>1990</YEAR>
  </CD>
  <CD>
    <TITLE>Eros</TITLE>
    <ARTIST>Eros Ramazzotti</ARTIST>
    <COUNTRY>EU</COUNTRY>
    <COMPANY>BMG</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1997</YEAR>
  </CD>
  <CD>
    <TITLE>One night only</TITLE>
    <ARTIST>Bee Gees</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Polydor</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1998</YEAR>
  </CD>
  <CD>
    <TITLE>Sylvias Mother</TITLE>
    <ARTIST>Dr.Hook</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS</COMPANY>
    <PRICE>8.10</PRICE>
    <YEAR>1973</YEAR>
  </CD>
  <CD>
    <TITLE>Maggie May</TITLE>
    <ARTIST>Rod Stewart</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Pickwick</COMPANY>
    <PRICE>8.50</PRICE>
    <YEAR>1990</YEAR>
  </CD>
  <CD>
    <TITLE>Romanza</TITLE>
    <ARTIST>Andrea Bocelli</ARTIST>
    <COUNTRY>EU</COUNTRY>
    <COMPANY>Polydor</COMPANY>
    <PRICE>10.80</PRICE>
    <YEAR>1996</YEAR>
  </CD>
  <CD>
    <TITLE>When a man loves a woman</TITLE>
    <ARTIST>Percy Sledge</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Atlantic</COMPANY>
    <PRICE>8.70</PRICE>
    <YEAR>1987</YEAR>
  </CD>
  <CD>
    <TITLE>Black angel</TITLE>
    <ARTIST>Savage Rose</ARTIST>
    <COUNTRY>EU</COUNTRY>
    <COMPANY>Mega</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1995</YEAR>
  </CD>
  <CD>
    <TITLE>1999 Grammy Nominees</TITLE>
    <ARTIST>Many</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Grammy</COMPANY>
    <PRICE>10.20</PRICE>
    <YEAR>1999</YEAR>
  </CD>
  <CD>
    <TITLE>For the good times</TITLE>
    <ARTIST>Kenny Rogers</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Mucik Master</COMPANY>
    <PRICE>8.70</PRICE>
    <YEAR>1995</YEAR>
  </CD>
  <CD>
    <TITLE>Big Willie style</TITLE>
    <ARTIST>Will Smith</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1997</YEAR>
  </CD>
  <CD>
    <TITLE>Tupelo Honey</TITLE>
    <ARTIST>Van Morrison</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Polydor</COMPANY>
    <PRICE>8.20</PRICE>
    <YEAR>1971</YEAR>
  </CD>
  <CD>
    <TITLE>Soulsville</TITLE>
    <ARTIST>Jorn Hoel</ARTIST>
    <COUNTRY>Norway</COUNTRY>
    <COMPANY>WEA</COMPANY>
    <PRICE>7.90</PRICE>
    <YEAR>1996</YEAR>
  </CD>
  <CD>
    <TITLE>The very best of</TITLE>
    <ARTIST>Cat Stevens</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Island</COMPANY>
    <PRICE>8.90</PRICE>
    <YEAR>1990</YEAR>
  </CD>
  <CD>
    <TITLE>Stop</TITLE>
    <ARTIST>Sam Brown</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>A and M</COMPANY>
    <PRICE>8.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
  <CD>
    <TITLE>Bridge of Spies</TITLE>
    <ARTIST>T'Pau</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Siren</COMPANY>
    <PRICE>7.90</PRICE>
    <YEAR>1987</YEAR>
  </CD>
  <CD>
    <TITLE>Private Dancer</TITLE>
    <ARTIST>Tina Turner</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>Capitol</COMPANY>
    <PRICE>8.90</PRICE>
    <YEAR>1983</YEAR>
  </CD>
  <CD>
    <TITLE>Midt om natten</TITLE>
    <ARTIST>Kim Larsen</ARTIST>
    <COUNTRY>EU</COUNTRY>
    <COMPANY>Medley</COMPANY>
    <PRICE>7.80</PRICE>
    <YEAR>1983</YEAR>
  </CD>
  <CD>
    <TITLE>Pavarotti Gala Concert</TITLE>
    <ARTIST>Luciano Pavarotti</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>DECCA</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1991</YEAR>
  </CD>
  <CD>
    <TITLE>The dock of the bay</TITLE>
    <ARTIST>Otis Redding</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Stax Records</COMPANY>
    <PRICE>7.90</PRICE>
    <YEAR>1968</YEAR>
  </CD>
  <CD>
    <TITLE>Picture book</TITLE>
    <ARTIST>Simply Red</ARTIST>
    <COUNTRY>EU</COUNTRY>
    <COMPANY>Elektra</COMPANY>
    <PRICE>7.20</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Red</TITLE>
    <ARTIST>The Communards</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>London</COMPANY>
    <PRICE>7.80</PRICE>
    <YEAR>1987</YEAR>
  </CD>
  <CD>
    <TITLE>Unchain my heart</TITLE>
    <ARTIST>Joe Cocker</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>EMI</COMPANY>
    <PRICE>8.20</PRICE>
    <YEAR>1987</YEAR>
  </CD>
</CATALOG>
'''

Lets read the data...
import requests
import pandas as pd
import xml.etree.ElementTree as ET


# Read XML from URL...
xml_response = requests.get(xml_url, timeout=5)
my_xml_root = ET.fromstring(xml_response.text)
    

# Parse XML from str...
my_xml_root = ET.fromstring(xml_data) # to read from file: ET.parse('sample.xml').getroot()

# Parse XML from file...
my_xml_root = ET.parse(xml_file).getroot()


my_xml_root


The three options above will return the xml Element at different memory location like this: <Element 'CATALOG' at 0x000001C0E04EA360>, <Element 'CATALOG' at 0x000001C0E03DBD10>, <Element 'CATALOG' at 0x000001C0E04E5B80>

From this point onward, every other property or function called on the xml root element is the same irrespective of the source xml data coming from a file, url or a string.

If you do dir(my_xml_root), you will see all the available properties or functions that can be called on the XML object. You could also check the help documentation like this help(my_xml_root) for more.

print(dir(my_xml_root))
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'attrib', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']

Before you start using the property or function, it is a good practice to use an XML viewer to view how the XML data is structured. I will use this free online XML viewer to inspect/explore the structure.



From the xml tree we can see that the XML data has a root element named 'CATALOG' which one sub-element named 'CD' which has twenty six sub-elements with each containing six items as seen above.


Using the property or function, we can now parse the data in python for more manipulations.

To access the least child/item of the xml tree, the template is as follow;-
rootElement >> sub-element >> sub-element >> ... >> items
'CATALOG' >> 'CD' >> items (childs)
In this case there are just three layers as seen above. Some complex xml can have many sub-elements before getting to the last child/item.




# The root element is only one in this case and it has no attributes as seen below...
my_xml_root.tag # 'CATALOG'
my_xml_root.attrib # empty dict {}


# The first child...
my_xml_root[0]

for x in my_xml_root[0]:
    print(x.tag, x.attrib, x.text)
# =========================================

    
# Now lets get all children as well as there text attributes
all_children = list(my_xml_root) # Old way is: my_xml_root.getchildren()

for child in all_children:
    print(child.find('TITLE').text)
    print(child.find('ARTIST').text)
    print(child.find('COUNTRY').text)
    print(child.find('COMPANY').text)
    print(child.find('PRICE').text)
    print(child.find('YEAR').text)
    
    break # only print the first record
# =========================================



# Now we can collect the records into list/dictionary to construct a friend dataframe structure
data_list = []
for child in all_children:
    data_dict = {}
    
    data_dict['Title'] = child.find('TITLE').text
    data_dict['Artist'] = child.find('ARTIST').text
    data_dict['Country'] = child.find('COUNTRY').text
    data_dict['Company'] = child.find('COMPANY').text
    data_dict['Price'] = child.find('PRICE').text
    data_dict['Year'] = child.find('YEAR').text
    
    data_list.append(data_dict)
# =========================================

    

# Generate the df...
data_list_df = pd.DataFrame(data_list)
data_list_df



The code above can simply be use as XML to CSV Spreadsheet converter.

Note that, the code above may note work for another XML with different data structure. This is because of varying root elements and child names, example find('TITLE') won't work with this XML because it doesn't have such child name.

Similarly, it won't work for any of these XMLs:-
https://www.w3schools.com/xml/plant_catalog.xml
https://www.w3schools.com/xml/simple.xml

Instead what will work will look like this for this this XML data:-



As you can see, the structure of this new XML data isn't the same as the former from w3Schools, so the code can't be the same.
new_xml_url = 'https://spreadsheets.google.com/feeds/list/17beU3LK_ZJEp4m_WKb2D8xDyX-6_MmmgU8BAnWf4NM0/o8b7v7h/public/values/ciyn3'

# Read XML from URL...
xml_response = requests.get(new_xml_url, timeout=5)
my_xml_root = ET.fromstring(xml_response.text)
# ------------------------

my_xml_root.attrib
my_xml_root.tag

# Get all children...
all_children = list(my_xml_root)

# the items we wanted are in the fifth child...
all_children[4].tag
all_children[4].attrib

all_children[4].text



Wrap up
As an exercise, try converting these XMLs ("https://www.w3schools.com/xml/plant_catalog.xml", "https://www.w3schools.com/xml/simple.xml" and "https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ms762271(v=vs.85)" ) to CVS spreadsheet. See the solutions on this Jupyter Notebook.

On a final note, you can modify or update an XML data and write/save it to file. I won't talk about this here, if you are interested in learning how to modify XML, check the following web pages:-

1) https://www.edureka.co/blog/python-xml-parser-tutorial/amp/
2) https://stackabuse.com/reading-and-writing-xml-files-in-python

That is it!

No comments:

Post a Comment