Extensible Markup Language (XML) is a widely used format for storing and exchanging structured data. XML files are commonly used to represent hierarchical data, such as configuration files, data interchange formats, web service responses, and web sitemaps.
Parsing XML files in Python is a common task, especially for automating manual processes like processing data retrieved from web APIs or web scraping.
In this article, you’ll learn about some of the libraries that you can use to parse XML in Python, including the ElementTree module, the lxml library, minidom, the Simple API for XML (SAX), and untangle.
Key Concepts of an XML File
Before you learn how to parse XML in Python, you must understand what XML Schema Definition (XSD) is and what elements make up an XML file. This understanding can help you select the appropriate Python library for your parsing task.
XSD is a schema specification that defines the structure, content, and data types allowed in an XML document. It serves as a syntax for validating the structure and content of XML files against a predefined set of rules.
An XML file usually includes a namespace, a root, attributes, elements, and text content, which collectively represent structured data:

- A namespace allows elements and attributes in XML documents to be uniquely identified. Namespaces help avoid naming conflicts and enable interoperability between XML documents.
- The root is the top-level element in an XML document. It serves as the starting point for navigating the XML structure and contains all other elements as its children.
- Attributes provide additional information about an element. They’re specified within the start tag of an element and consist of name-value pairs.
- Elements are the building blocks of an XML document and represent the data or structure being described. Elements can be nested within other elements to create a hierarchical structure.
- Text content refers to the textual data enclosed within an element’s start and end tags. It can include plaintext, numbers, or other characters.
For example, the Bright Data sitemap has the following XML structure:

- urlset is the root element.
- <urlset xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"> is the namespace declaration specific to the urlset element, implying that this declaration’s rules extend to the urlset element. All elements under it must conform to the schema outlined by this namespace.
- url is the first child of the root element.
- loc is a child element of the url element.
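To make these concepts concrete, here is a minimal sketch using Python’s built-in ElementTree module on a small, invented sitemap-like document; the URL, the priority attribute, and the document itself are hypothetical and exist only to illustrate the root, namespace, attributes, and text content:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up XML document illustrating the concepts above.
xml_data = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url priority="0.8">
        <loc>https://example.com/page-1</loc>
    </url>
</urlset>"""

root = ET.fromstring(xml_data)

# The root element's tag carries its namespace in Clark notation.
print(root.tag)  # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset

# A prefix-to-URI mapping lets you query namespaced elements.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.find("sm:url", ns)
print(url.get("priority"))           # attribute value: 0.8
print(url.find("sm:loc", ns).text)   # text content: https://example.com/page-1
```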
Now that you know a little more about XSD and XML file elements, let’s use that information to help parse an XML file with a few libraries.
Various Ways to Parse XML in Python
For demonstration purposes, you’ll use the Bright Data sitemap for this tutorial, which is available in XML format. In the following examples, the Bright Data sitemap content is fetched using the Python requests library.
The Python requests library is not built-in, so you need to install it before proceeding. You can do so using the following command:
```shell
pip install requests
```
ElementTree
The ElementTree XML API provides a simple and intuitive API for parsing and creating XML data in Python. It’s a built-in module in Python’s standard library, which means you don’t need to install anything explicitly.
For example, you can use the findall() method to find all the url elements from the root and print the text value of the loc element, like this:

```python
import xml.etree.ElementTree as ET
import requests

url = 'https://brightdata.com/post-sitemap.xml'
response = requests.get(url)

if response.status_code == 200:
    root = ET.fromstring(response.content)
    for url_element in root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
        loc_element = url_element.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
        if loc_element is not None:
            print(loc_element.text)
else:
    print("Failed to retrieve XML file from the URL.")
```
All the URLs in the sitemap are printed in the output:
```
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
```
ElementTree is a user-friendly way to parse XML data in Python, featuring a straightforward API that makes it easy to navigate and manipulate XML structures. However, ElementTree does have its limitations; it lacks robust support for schema validation and is not ideal if you need to ensure strict adherence to a schema specification before parsing.
If you have a small script that reads an RSS feed, the user-friendly API of ElementTree would be a useful tool for extracting titles, descriptions, and links from each feed item. However, if you have a use case with complex validation or massive files, it would be better to consider another library like lxml.
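A minimal sketch of that RSS scenario might look like the following; the feed content is inlined as a hypothetical string rather than fetched from a real URL, and the item titles and links are invented:

```python
import xml.etree.ElementTree as ET

# A hypothetical RSS 2.0 feed, inlined for illustration.
rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>An example item</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>Another example item</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)

# Collect title, link, and description from each feed item.
items = []
for item in root.iter("item"):
    items.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "description": item.findtext("description"),
    })

for entry in items:
    print(entry["title"], "->", entry["link"])
```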
lxml
lxml is a fast, easy-to-use, and feature-rich API for parsing XML files in Python; however, it’s not part of Python’s standard library. While some Linux and Mac platforms have the lxml package already installed, other platforms need manual installation.
lxml is distributed via PyPI, and you can install it using the following pip command:

```shell
pip install lxml
```
Once installed, you can use lxml to parse XML files using various API methods, such as find(), findall(), findtext(), get(), and get_element_by_id().
For instance, you can use the findall() method to iterate over the url elements, find their loc elements (which are child elements of the url element), and then print the location text using the following code:

```python
from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    root = etree.fromstring(response.content)
    for url in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}url"):
        loc = url.find("{http://www.sitemaps.org/schemas/sitemap/0.9}loc").text.strip()
        print(loc)
else:
    print("Failed to retrieve XML file from the URL.")
```
The output displays all the URLs found in the sitemap:
```
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
```
So far, you’ve learned how to find elements and print their value. Now, let’s explore schema validation before parsing the XML. This process ensures that the file conforms to the specified structure defined by the schema.
The XSD for the sitemap looks like this:
<?xml version="1.0"?><xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" elementFormDefault="qualified" xmlns:xhtml="http://www.w3.org/1999/xhtml"> <xs:element name="urlset"> <xs:complexType> <xs:sequence> <xs:element ref="url" minOccurs="0" maxOccurs="unbounded"/> </xs:sequence> </xs:complexType> </xs:element> <xs:element name="url"> <xs:complexType> <xs:sequence> <xs:element name="loc" type="xs:anyURI"/> </xs:sequence> </xs:complexType> </xs:element></xs:schema>
To use this schema for validation, copy it manually into a file named schema.xsd.
To validate the XML file using this XSD, use the following code:
```python
from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    root = etree.fromstring(response.content)
    try:
        print("Schema Validation:")
        schema_doc = etree.parse("schema.xsd")
        schema = etree.XMLSchema(schema_doc)
        schema.assertValid(root)
        print("XML is valid according to the schema.")
    except etree.DocumentInvalid as e:
        print("XML validation error:", e)
else:
    print("Failed to retrieve XML file from the URL.")
```
Here, you parse the XSD file using the etree.parse() method. Then you create an XMLSchema object from the parsed XSD document. Finally, you validate the XML root document against the schema using the assertValid() method. If the schema validation passes, your output includes a message like XML is valid according to the schema. Otherwise, a DocumentInvalid exception is raised.
Your output should look like this:
```
Schema Validation:
XML is valid according to the schema.
```
Now, let’s read the XML file using the xpath() method, which locates elements by their path. To read the elements using the xpath() method, use the following code:
```python
from lxml import etree
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    root = etree.fromstring(response.content)
    print("XPath Support:")
    namespaces = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    for url in root.xpath(".//ns:url/ns:loc", namespaces=namespaces):
        print(url.text.strip())
else:
    print("Failed to retrieve XML file from the URL.")
```
In this code, you register the namespace prefix ns and map it to the namespace URI http://www.sitemaps.org/schemas/sitemap/0.9. In the XPath expression, you use the ns prefix to specify elements in the namespace. Finally, the expression .//ns:url/ns:loc selects all loc elements that are children of url elements in the namespace.
Your output will look like this:
```
XPath Support:
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
```
Note that the find() and findall() methods are generally faster than the xpath() method because xpath() collects all the results in memory before returning them. It’s recommended that you use find() unless there is a specific reason for using XPath queries.
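As a quick sketch of that equivalence, here is the same query written both ways on a small, invented two-URL document; both approaches return the same loc values, but xpath() builds the full result list before returning:

```python
from lxml import etree

# A tiny, made-up sitemap for illustration.
xml_data = b"""<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/a</loc></url>
    <url><loc>https://example.com/b</loc></url>
</urlset>"""

root = etree.fromstring(xml_data)
ns = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# findall() with a prefix-to-namespace mapping...
via_find = [loc.text for loc in root.findall(".//ns:url/ns:loc", namespaces=ns)]

# ...and the same query via xpath().
via_xpath = [loc.text for loc in root.xpath(".//ns:url/ns:loc", namespaces=ns)]

print(via_find == via_xpath)  # True
```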
lxml offers powerful features for parsing and manipulating XML and HTML. It supports complex queries using XPath expressions, validates documents against schemas, and even allows for eXtensible Stylesheet Language Transformations (XSLT). This makes it ideal for scenarios where performance and advanced functionality are crucial. However, keep in mind that lxml requires a separate installation as it’s not part of the core Python package.
If you’re dealing with large or complex XML data that requires both high performance and advanced manipulation, you should consider using lxml. For instance, if you’re processing financial data feeds in XML format, you might need to use XPath expressions to extract specific elements like stock prices, validate the data against a financial schema to ensure accuracy, and potentially transform the data using XSLT for further analysis.
minidom
minidom is a lightweight and simple XML parsing library that’s included in Python’s standard library. While it’s not as feature-rich or efficient as lxml, it offers a straightforward way to parse and manipulate XML data in Python.
You can use the various methods available on the DOM object to access elements. For example, you can use the getElementsByTagName() method to retrieve elements by their tag name.

The following example demonstrates how to use the minidom library to parse an XML file and fetch the elements using their tag names:

```python
import requests
import xml.dom.minidom

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    dom = xml.dom.minidom.parseString(response.content)
    urlset = dom.getElementsByTagName("urlset")[0]
    for url in urlset.getElementsByTagName("url"):
        loc = url.getElementsByTagName("loc")[0].firstChild.nodeValue.strip()
        print(loc)
else:
    print("Failed to retrieve XML file from the URL.")
```
Your output would look like this:
```
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
```
minidom works with XML data by representing it as a DOM tree. This tree structure makes it easy to navigate and manipulate data, and it’s best suited for basic tasks such as reading, changing, or building simple XML structures.
If your program involves reading default settings from an XML file, the DOM approach of minidom allows you to easily access specific settings within the XML file using methods for finding child nodes or attributes. With minidom, you can easily retrieve a specific setting, such as the font-size node, and utilize its value within your application.
SAX Parser
The SAX parser is an event-driven XML parsing approach in Python that processes XML documents sequentially and generates events as it encounters various parts of the document. Unlike DOM-based parsers that construct a tree structure representing the entire XML document in memory, SAX parsers do not build a complete representation of the document. Instead, they emit events such as start tags, end tags, and text content as they parse through the document.
SAX parsers are good for processing large XML files or streams where memory efficiency is a concern as they operate on XML data incrementally without loading the entire document into memory.
When using the SAX parser, you need to define event handlers that respond to specific XML events, such as the startElement and endElement events emitted by the parser. These event handlers can be customized to perform actions based on the structure and content of the XML document.
The following example demonstrates how to parse an XML file using the SAX parser by defining the startElement and endElement handlers and retrieving the URL information from the sitemap file:

```python
import requests
import xml.sax.handler
from io import BytesIO

class MyContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.in_url = False
        self.in_loc = False
        self.url = ""

    def startElement(self, name, attrs):
        if name == "url":
            self.in_url = True
        elif name == "loc" and self.in_url:
            self.in_loc = True

    def characters(self, content):
        if self.in_loc:
            self.url += content

    def endElement(self, name):
        if name == "url":
            print(self.url.strip())
            self.url = ""
            self.in_url = False
        elif name == "loc":
            self.in_loc = False

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    xml_content = BytesIO(response.content)
    content_handler = MyContentHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(content_handler)
    parser.parse(xml_content)
else:
    print("Failed to retrieve XML file from the URL.")
```
Your output would look like this:
```
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
```
Unlike other parsers that load the entire file into memory, SAX processes files incrementally, conserving memory and enhancing performance. However, SAX necessitates writing more code to manage each data segment dynamically. Additionally, it cannot revisit and analyze specific parts of the data later on.
If you need to scan a large XML file (e.g., a log file containing various events) to extract specific information (e.g., error messages), SAX can help you efficiently navigate through the file. However, if your analysis requires understanding the relationships between different data segments, SAX may not be the best choice.
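A minimal sketch of that log-scanning scenario might look like this; the XML log format, its level attribute, and the event messages are all invented for illustration:

```python
import xml.sax
import xml.sax.handler

# A hypothetical XML event log, inlined for illustration.
log_xml = b"""<?xml version="1.0"?>
<log>
    <event level="info">Service started</event>
    <event level="error">Disk quota exceeded</event>
    <event level="error">Connection refused</event>
</log>"""

class ErrorCollector(xml.sax.handler.ContentHandler):
    """Collects the text of <event level="error"> elements only."""
    def __init__(self):
        super().__init__()
        self.in_error = False
        self.buffer = ""
        self.errors = []

    def startElement(self, name, attrs):
        if name == "event" and attrs.get("level") == "error":
            self.in_error = True
            self.buffer = ""

    def characters(self, content):
        if self.in_error:
            self.buffer += content

    def endElement(self, name):
        if name == "event" and self.in_error:
            self.errors.append(self.buffer.strip())
            self.in_error = False

handler = ErrorCollector()
xml.sax.parseString(log_xml, handler)
print(handler.errors)  # ['Disk quota exceeded', 'Connection refused']
```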
untangle
untangleis a lightweight XML parsing library for Python that simplifies the process of extracting data from XML documents. Unlike traditional XML parsers that require navigating through hierarchical structures, untangle lets you access XML elements and attributes directly as Python objects.
With untangle, you can convert XML documents into nested Python objects, where XML elements become object attributes, and their XML attributes and text content are exposed as corresponding values. This approach makes it easy to access and manipulate XML data using Python data structures.
untangle is not available by default in Python; it’s distributed via PyPI and needs to be installed using the following pip command:

```shell
pip install untangle
```
The following example demonstrates how to parse the XML file using the untangle library and access the XML elements:

```python
import untangle
import requests

url = "https://brightdata.com/post-sitemap.xml"
response = requests.get(url)

if response.status_code == 200:
    obj = untangle.parse(response.text)
    for url in obj.urlset.url:
        print(url.loc.cdata.strip())
else:
    print("Failed to retrieve XML file from the URL.")
```
Your output will look like this:
```
https://brightdata.com/case-studies/powerdrop-case-study
https://brightdata.com/case-studies/addressing-brand-protection-from-every-angle
https://brightdata.com/case-studies/taking-control-of-the-digital-shelf-with-public-online-data
https://brightdata.com/case-studies/the-seo-transformation
https://brightdata.com/case-studies/data-driven-automated-e-commerce-tools
https://brightdata.com/case-studies/highly-targeted-influencer-marketing
https://brightdata.com/case-studies/data-driven-products-for-smarter-shopping-solutions
https://brightdata.com/case-studies/workplace-diversity-facilitated-by-online-data
https://brightdata.com/case-studies/alternative-travel-solutions-enabled-by-online-data-railofy
https://brightdata.com/case-studies/data-intensive-analytical-solutions
https://brightdata.com/case-studies/canopy-advantage-solutions
https://brightdata.com/case-studies/seamless-digital-automations
```
untangle offers a user-friendly approach to working with XML data in Python. It simplifies the parsing process with clear syntax and automatically converts the XML structure into easy-to-use Python objects, eliminating the need for complex navigation techniques. However, keep in mind that untangle requires separate installation as it’s not part of the core Python package.
You should consider using untangle if you have a well-formed XML file and need to quickly convert it into Python objects for further processing. For example, if you have a program that downloads weather data in XML format, untangle could be a good fit to parse the XML and create Python objects representing the current temperature, humidity, and forecast. These objects could then be easily manipulated and displayed within your application.
Conclusion
In this article, you learned all about XML files and the various methods for parsing XML files in Python.
Whether you’re working with small configuration files, parsing large web service responses, or extracting data from extensive sitemaps, Python offers versatile libraries to automate and streamline your XML parsing tasks. However, when accessing files from the web using the requests library without proxy management, you may encounter quota exceptions and throttling issues. Bright Data is an award-winning proxy network that provides reliable and efficient proxy solutions to ensure seamless data retrieval and parsing. With Bright Data, you can tackle XML parsing tasks without worrying about limitations or disruptions. Contact our sales team to learn more.
Want to skip the whole scraping and parsing process? Try our dataset marketplace for free!
No credit card required