Introduction to XPath
Hey there! Let’s talk about XPath, a powerful tool for web scraping. XPath is a language that helps you navigate through XML and HTML documents. It’s especially useful when you’re using Scrapy, a popular web scraping framework.
Why use XPath?
- It’s more flexible than CSS selectors
- You can extract data based on text content, not just page structure
- It’s a lifesaver for hard-to-scrape websites
The Basics
Imagine HTML as a tree. The root node isn’t part of the document itself, but it’s the parent of the <html> element. Let’s look at a simple HTML document:
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
In XPath, we have different types of nodes:
- Element nodes: HTML tags like <h2> or <p>
- Attribute nodes: attributes inside tags, like href in <a href="#">
- Comment nodes: comments like <!-- this is the end -->
- Text nodes: the text inside elements, like "This is the first paragraph."
Basic XPath Expressions:
- Full path:
/html/head/title
This starts from the root and follows each element down the tree.
- Anywhere in the document:
//title
This finds any ‘title’ element anywhere in the document.
- Child elements:
//h2/a
This finds ‘a’ elements that are direct children of ‘h2’ elements.
Node Tests:
- Select comments:
//comment()
- Select any node:
//node()
- Select text nodes:
//text()
- Select all elements:
//*
Combining tests:
//p/text()
This selects text nodes inside ‘p’ elements.
Filtering with Predicates
Let’s look at a new HTML snippet:
<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>
To select specific elements, we use predicates (conditions in square brackets):
- First ‘li’:
//li[1]
or //li[position() = 1]
- Even positioned ‘li’:
//li[position() mod 2 = 0]
- ‘li’ with ‘a’ inside:
//li[a]
- ‘li’ with ‘a’ or ‘h2’:
//li[a or h2]
- ‘li’ with specific text:
//li[a[text() = "link"]]
- Last ‘li’:
//li[last()]
Combining expressions:
//a | //h2
This selects all ‘a’ and ‘h2’ elements.
Working with Attributes
New HTML example:
<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>
- Select ‘a’ with HTTPS links:
//a[starts-with(@href, "https")]
- Select specific ‘a’:
//a[@href="https://scrapy.org"]
- Select all href values:
//a/@href
- Select ‘li’ with id:
//li[@id]
More on Axes
Axes define the direction to look for nodes. Let’s use this HTML:
<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>
- First paragraph after each title:
//h1/following-sibling::p[1]
- Text before footer:
//div[@id='footer']/preceding-sibling::text()[1]
- Parent of footer text:
//p[text()="Footer text"]/..
or //*[p/text()="Footer text"]
Remember, practice makes perfect! Try these examples in your browser’s developer tools or in an XPath playground. As you get more comfortable, you’ll be able to use XPath effectively in your web scraping projects.
Happy scraping!