Get element by XPath: Tutorial for Web Scraping

Introduction to XPath

Hey there! Let’s talk about XPath, a powerful tool for web scraping. XPath is a language that helps you navigate through XML and HTML documents. It’s especially useful when you’re using Scrapy, a popular web scraping framework.

Why use XPath?

  1. It’s more flexible than CSS selectors
  2. You can extract data based on text content, not just page structure
  3. It’s a lifesaver for hard-to-scrape websites

The Basics

Imagine HTML as a tree. The root node isn’t an element itself, but it is the parent of the <html> element. Let’s look at a simple HTML document:

<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>

In XPath, we have different types of nodes:

  1. Element nodes: HTML tags like <html>, <body>, <h2> or <p>
  2. Attribute nodes: Attributes inside tags, like the href in <a href="#">
  3. Comment nodes: Comments like <!-- this is the end -->
  4. Text nodes: The text inside elements, like "This is the first paragraph."

Basic XPath Expressions:

  1. Full path: /html/head/title
    This starts from the root and follows each element.
  2. Anywhere in the document: //title
    This finds any ‘title’ element anywhere in the document.
  3. Child elements: //h2/a
    This finds ‘a’ elements that are direct children of ‘h2’ elements.
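
You can try these expressions straight from Python. Below is a minimal sketch using parsel, the selector library that Scrapy is built on (pip install parsel); the variable names are just for illustration.

from parsel import Selector

html = """
<html>
  <head><title>My page</title></head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
"""

sel = Selector(text=html)

# 1. Full path from the root
print(sel.xpath("/html/head/title").get())   # <title>My page</title>

# 2. Anywhere in the document
print(sel.xpath("//title").get())            # <title>My page</title>

# 3. Direct children: 'a' elements inside 'h2'
print(sel.xpath("//h2/a").get())             # <a href="#">page</a>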

Node Tests:

  1. Select comments: //comment()
  2. Select any node: //node()
  3. Select text nodes: //text()
  4. Select all elements: //*

Combining tests: //p/text() selects text nodes inside ‘p’ elements.
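
Here is a quick sketch of those node tests against the same sample document, again using parsel (just an illustration, not the only way to run XPath from Python).

from parsel import Selector

html = """
<html>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
"""

sel = Selector(text=html)

print(sel.xpath("//comment()").get())    # <!-- this is the end -->
print(sel.xpath("//p/text()").get())     # This is the first paragraph.
print(sel.xpath("//text()").getall())    # every text node, whitespace included
print(len(sel.xpath("//*")))             # how many element nodes there are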

Filtering with Predicates

Let’s look at a new HTML snippet:

<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>

To select specific elements, we use predicates (conditions in square brackets):

  1. First ‘li’: //li[1] or //li[position() = 1]
  2. Even positioned ‘li’: //li[position() mod 2 = 0]
  3. ‘li’ with ‘a’ inside: //li[a]
  4. ‘li’ with ‘a’ or ‘h2’: //li[a or h2]
  5. ‘li’ with specific text: //li[a[text() = "link"]]
  6. Last ‘li’: //li[last()]

Combining expressions: //a | //h2 selects all ‘a’ and ‘h2’ elements.
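
Here is how those predicates might look in practice with parsel; the HTML string is the snippet from above, and the comments note what you’d expect under these assumptions.

from parsel import Selector

html = """
<html>
  <body>
    <ul>
      <li>Quote 1</li>
      <li>Quote 2 with <a href="...">link</a></li>
      <li>Quote 3 with <a href="...">another link</a></li>
      <li><h2>Quote 4 title</h2> ...</li>
    </ul>
  </body>
</html>
"""

sel = Selector(text=html)

print(sel.xpath("//li[1]/text()").get())            # Quote 1
print(len(sel.xpath("//li[a]")))                    # 2 -- the items containing a link
print(sel.xpath('//li[a[text() = "link"]]').get())  # the second <li>
print(sel.xpath("//li[last()]").get())              # the fourth <li>
print(len(sel.xpath("//a | //h2")))                 # 3 -- two links plus one heading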

Working with Attributes

New HTML example:

<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>

  1. Select ‘a’ with HTTPS links: //a[starts-with(@href, "https")]
  2. Select specific ‘a’: //a[@href="https://scrapy.org"]
  3. Select all href values: //a/@href
  4. Select ‘li’ with id: //li[@id]
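
A short sketch of those attribute selections with parsel; the expected values in the comments assume the snippet above.

from parsel import Selector

html = """
<html>
  <body>
    <ul>
      <li id="begin"><a href="https://scrapy.org">Scrapy</a></li>
      <li><a href="https://scrapinghub.com">Scrapinghub</a></li>
      <li><a href="https://blog.scrapinghub.com">Scrapinghub Blog</a></li>
      <li id="end"><a href="http://quotes.toscrape.com">Quotes To Scrape</a></li>
    </ul>
  </body>
</html>
"""

sel = Selector(text=html)

# Every href value in the document
print(sel.xpath("//a/@href").getall())

# Only the links whose href starts with https
print(sel.xpath('//a[starts-with(@href, "https")]/@href').getall())

# Text of the links inside <li> elements that carry an id
print(sel.xpath("//li[@id]/a/text()").getall())   # ['Scrapy', 'Quotes To Scrape']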

More on Axes

Axes define the direction to look for nodes. Let’s use this HTML:

<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>

  1. First paragraph after each title: //h1/following-sibling::p[1]
  2. Text before footer: //div[@id='footer']/preceding-sibling::text()[1]
  3. Parent of footer text: //p[text()="Footer text"]/.. or //*[p/text()="Footer text"]
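
And a final sketch showing those axis expressions against the snippet above, once more with parsel.

from parsel import Selector

html = """
<html>
  <body>
    <p>Intro paragraph</p>
    <h1>Title #1</h1>
    <p>A random paragraph #1</p>
    <h1>Title #2</h1>
    <p>A random paragraph #2</p>
    <p>Another one #2</p>
    A single paragraph, with no markup
    <div id="footer"><p>Footer text</p></div>
  </body>
</html>
"""

sel = Selector(text=html)

# First <p> after each <h1>
print(sel.xpath("//h1/following-sibling::p[1]/text()").getall())
# ['A random paragraph #1', 'A random paragraph #2']

# The bare text sitting right before the footer
print(sel.xpath("//div[@id='footer']/preceding-sibling::text()[1]").get().strip())
# A single paragraph, with no markup

# The parent of the paragraph that says "Footer text"
print(sel.xpath('//p[text()="Footer text"]/..').get())
# the <div id="footer"> element, serialized with its children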

Remember, practice makes perfect! Try these examples in your browser’s developer tools or in an XPath playground. As you get more comfortable, you’ll be able to use XPath effectively in your web scraping projects.

Happy scraping!
