Extensible Markup Language (XML) is one of the data exchange formats. In the post, we will work through using python's 'xml.etree.ElementTree' module to parse its data content.
An XML data can be available in one of the three forms, namely:-
- XML file
- XML URL
- XML String
Depending on how your XML in made available to you, there is slightly different approaches of reading it use the 'xml.etree.ElementTree' module. Lets see how below.
The sample XML I will use is the CD Catalog by W3Schools. I have copied the data into the three formats listed above as seen below;-
# XML file...
xml_file = r"C:\Users\Yusuf_08039508010\Documents\w3s_sample.xml"
# XML URL...
xml_url = 'https://www.w3schools.com/xml/cd_catalog.xml'
# XML String...
xml_string = '''<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Hide your heart</TITLE>
<ARTIST>Bonnie Tyler</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS Records</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1988</YEAR>
</CD>
<CD>
<TITLE>Greatest Hits</TITLE>
<ARTIST>Dolly Parton</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>RCA</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1982</YEAR>
</CD>
<CD>
<TITLE>Still got the blues</TITLE>
<ARTIST>Gary Moore</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Virgin records</COMPANY>
<PRICE>10.20</PRICE>
<YEAR>1990</YEAR>
</CD>
<CD>
<TITLE>Eros</TITLE>
<ARTIST>Eros Ramazzotti</ARTIST>
<COUNTRY>EU</COUNTRY>
<COMPANY>BMG</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1997</YEAR>
</CD>
<CD>
<TITLE>One night only</TITLE>
<ARTIST>Bee Gees</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Polydor</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1998</YEAR>
</CD>
<CD>
<TITLE>Sylvias Mother</TITLE>
<ARTIST>Dr.Hook</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>CBS</COMPANY>
<PRICE>8.10</PRICE>
<YEAR>1973</YEAR>
</CD>
<CD>
<TITLE>Maggie May</TITLE>
<ARTIST>Rod Stewart</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Pickwick</COMPANY>
<PRICE>8.50</PRICE>
<YEAR>1990</YEAR>
</CD>
<CD>
<TITLE>Romanza</TITLE>
<ARTIST>Andrea Bocelli</ARTIST>
<COUNTRY>EU</COUNTRY>
<COMPANY>Polydor</COMPANY>
<PRICE>10.80</PRICE>
<YEAR>1996</YEAR>
</CD>
<CD>
<TITLE>When a man loves a woman</TITLE>
<ARTIST>Percy Sledge</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Atlantic</COMPANY>
<PRICE>8.70</PRICE>
<YEAR>1987</YEAR>
</CD>
<CD>
<TITLE>Black angel</TITLE>
<ARTIST>Savage Rose</ARTIST>
<COUNTRY>EU</COUNTRY>
<COMPANY>Mega</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1995</YEAR>
</CD>
<CD>
<TITLE>1999 Grammy Nominees</TITLE>
<ARTIST>Many</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Grammy</COMPANY>
<PRICE>10.20</PRICE>
<YEAR>1999</YEAR>
</CD>
<CD>
<TITLE>For the good times</TITLE>
<ARTIST>Kenny Rogers</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Mucik Master</COMPANY>
<PRICE>8.70</PRICE>
<YEAR>1995</YEAR>
</CD>
<CD>
<TITLE>Big Willie style</TITLE>
<ARTIST>Will Smith</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1997</YEAR>
</CD>
<CD>
<TITLE>Tupelo Honey</TITLE>
<ARTIST>Van Morrison</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Polydor</COMPANY>
<PRICE>8.20</PRICE>
<YEAR>1971</YEAR>
</CD>
<CD>
<TITLE>Soulsville</TITLE>
<ARTIST>Jorn Hoel</ARTIST>
<COUNTRY>Norway</COUNTRY>
<COMPANY>WEA</COMPANY>
<PRICE>7.90</PRICE>
<YEAR>1996</YEAR>
</CD>
<CD>
<TITLE>The very best of</TITLE>
<ARTIST>Cat Stevens</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Island</COMPANY>
<PRICE>8.90</PRICE>
<YEAR>1990</YEAR>
</CD>
<CD>
<TITLE>Stop</TITLE>
<ARTIST>Sam Brown</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>A and M</COMPANY>
<PRICE>8.90</PRICE>
<YEAR>1988</YEAR>
</CD>
<CD>
<TITLE>Bridge of Spies</TITLE>
<ARTIST>T'Pau</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Siren</COMPANY>
<PRICE>7.90</PRICE>
<YEAR>1987</YEAR>
</CD>
<CD>
<TITLE>Private Dancer</TITLE>
<ARTIST>Tina Turner</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>Capitol</COMPANY>
<PRICE>8.90</PRICE>
<YEAR>1983</YEAR>
</CD>
<CD>
<TITLE>Midt om natten</TITLE>
<ARTIST>Kim Larsen</ARTIST>
<COUNTRY>EU</COUNTRY>
<COMPANY>Medley</COMPANY>
<PRICE>7.80</PRICE>
<YEAR>1983</YEAR>
</CD>
<CD>
<TITLE>Pavarotti Gala Concert</TITLE>
<ARTIST>Luciano Pavarotti</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>DECCA</COMPANY>
<PRICE>9.90</PRICE>
<YEAR>1991</YEAR>
</CD>
<CD>
<TITLE>The dock of the bay</TITLE>
<ARTIST>Otis Redding</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Stax Records</COMPANY>
<PRICE>7.90</PRICE>
<YEAR>1968</YEAR>
</CD>
<CD>
<TITLE>Picture book</TITLE>
<ARTIST>Simply Red</ARTIST>
<COUNTRY>EU</COUNTRY>
<COMPANY>Elektra</COMPANY>
<PRICE>7.20</PRICE>
<YEAR>1985</YEAR>
</CD>
<CD>
<TITLE>Red</TITLE>
<ARTIST>The Communards</ARTIST>
<COUNTRY>UK</COUNTRY>
<COMPANY>London</COMPANY>
<PRICE>7.80</PRICE>
<YEAR>1987</YEAR>
</CD>
<CD>
<TITLE>Unchain my heart</TITLE>
<ARTIST>Joe Cocker</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>EMI</COMPANY>
<PRICE>8.20</PRICE>
<YEAR>1987</YEAR>
</CD>
</CATALOG>
'''
import requests
import pandas as pd
import xml.etree.ElementTree as ET
# Read XML from URL...
xml_response = requests.get(xml_url, timeout=5)
my_xml_root = ET.fromstring(xml_response.text)
# Parse XML from str...
my_xml_root = ET.fromstring(xml_data) # to read from file: ET.parse('sample.xml').getroot()
# Parse XML from file...
my_xml_root = ET.parse(xml_file).getroot()
my_xml_root
The three options above will return the xml Element at different memory location like this: <Element 'CATALOG' at 0x000001C0E04EA360>, <Element 'CATALOG' at 0x000001C0E03DBD10>, <Element 'CATALOG' at 0x000001C0E04E5B80>
From this point onward, every other property or function called on the xml root element is the same irrespective of the source xml data coming from a file, url or a string.
If you do dir(my_xml_root), you will see all the available properties or functions that can be called on the XML object. You could also check the help documentation like this help(my_xml_root) for more.
print(dir(my_xml_root))
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'attrib', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']
Before you start using the property or function, it is a good practice to use an XML viewer to view how the XML data is structured. I will use this free online XML viewer to inspect/explore the structure.
rootElement >> sub-element >> sub-element >> ... >> items
'CATALOG' >> 'CD' >> items (childs)
# The root element is only one in this case and it has no attributes as seen below...
my_xml_root.tag # 'CATALOG'
my_xml_root.attrib # empty dict {}
# The first child...
my_xml_root[0]
for x in my_xml_root[0]:
print(x.tag, x.attrib, x.text)
# =========================================
# Now lets get all children as well as there text attributes
all_children = list(my_xml_root) # Old way is: my_xml_root.getchildren()
for child in all_children:
print(child.find('TITLE').text)
print(child.find('ARTIST').text)
print(child.find('COUNTRY').text)
print(child.find('COMPANY').text)
print(child.find('PRICE').text)
print(child.find('YEAR').text)
break # only print the first record
# =========================================
# Now we can collect the records into list/dictionary to construct a friend dataframe structure
data_list = []
for child in all_children:
data_dict = {}
data_dict['Title'] = child.find('TITLE').text
data_dict['Artist'] = child.find('ARTIST').text
data_dict['Country'] = child.find('COUNTRY').text
data_dict['Company'] = child.find('COMPANY').text
data_dict['Price'] = child.find('PRICE').text
data_dict['Year'] = child.find('YEAR').text
data_list.append(data_dict)
# =========================================
# Generate the df...
data_list_df = pd.DataFrame(data_list)
data_list_df
new_xml_url = 'https://spreadsheets.google.com/feeds/list/17beU3LK_ZJEp4m_WKb2D8xDyX-6_MmmgU8BAnWf4NM0/o8b7v7h/public/values/ciyn3'
# Read XML from URL...
xml_response = requests.get(new_xml_url, timeout=5)
my_xml_root = ET.fromstring(xml_response.text)
# ------------------------
my_xml_root.attrib
my_xml_root.tag
# Get all children...
all_children = list(my_xml_root)
# the items we wanted are in the fifth child...
all_children[4].tag
all_children[4].attrib
all_children[4].text
No comments:
Post a Comment