Introduction to XPath
Hey there! Let’s talk about XPath, a powerful tool for web scraping. XPath is a language that helps you navigate through XML and HTML documents. It’s especially useful when you’re using Scrapy, a popular web scraping framework.
Why use XPath?
- It’s more flexible than CSS selectors
- You can extract data based on text content, not just page structure
- It’s a lifesaver for hard-to-scrape websites
The Basics
Imagine HTML as a tree. The root node isn’t part of the document itself, but it’s the parent of the <html> element. Let’s look at a simple HTML document:
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
In XPath, we have different types of nodes:
- Element nodes: HTML tags like <h2> or <p>
- Attribute nodes: attributes inside tags, like href in <a href="#">
- Comment nodes: comments like <!-- this is the end -->
- Text nodes: the text inside elements, like "This is the first paragraph."
Basic XPath Expressions:
- Full path:
/html/head/title
This starts from the root and follows each element down the tree.
- Anywhere in the document:
//title
This finds any ‘title’ element anywhere in the document.
- Child elements:
//h2/a
This finds ‘a’ elements that are direct children of ‘h2’ elements.
Node Tests:
- Select comments:
//comment()
- Select any node:
//node()
- Select text nodes:
//text()
- Select all elements:
//*
Combining tests:
//p/text()
This selects text nodes inside ‘p’ elements.
Filtering with Predicates
Let’s look at a new HTML snippet:
<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>
To select specific elements, we use predicates (conditions in square brackets):
- First ‘li’:
//li[1]
or //li[position() = 1]
- Even positioned ‘li’:
//li[position() mod 2 = 0]
- ‘li’ with ‘a’ inside:
//li[a]
- ‘li’ with ‘a’ or ‘h2’:
//li[a or h2]
- ‘li’ with specific text:
//li[a[text() = "link"]]
- Last ‘li’:
//li[last()]
Combining expressions:
//a | //h2
This selects all ‘a’ and ‘h2’ elements.
Working with Attributes
New HTML example:
<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>
- Select ‘a’ with HTTPS links:
//a[starts-with(@href, "https")]
- Select specific ‘a’:
//a[@href="https://scrapy.org"]
- Select all href values:
//a/@href
- Select ‘li’ with id:
//li[@id]
More on Axes
Axes define the direction to look for nodes. Let’s use this HTML:
<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>
- First paragraph after each title:
//h1/following-sibling::p[1]
- Text before footer:
//div[@id='footer']/preceding-sibling::text()[1]
- Parent of footer text:
//p[text()="Footer text"]/..
or //*[p/text()="Footer text"]
Remember, practice makes perfect! Try these examples in your browser’s developer tools or in an XPath playground. As you get more comfortable, you’ll be able to use XPath effectively in your web scraping projects.
Happy scraping!