BeautifulSoup Cheat Sheet

Summary: Use urllib.parse.urljoin() to join the base URL and each scraped relative path into a complete (absolute) URL. You can also concatenate the base URL and the relative path manually to derive the absolute URL, but in that case take care of error cases such as an extra forward slash.

The SoupStrainer class (from bs4 import SoupStrainer) allows you to choose which parts of an incoming document are parsed. For example, SoupStrainer('a') parses only the <a> tags and SoupStrainer(id='link2') parses only the tags whose id is 'link2'. A condition can also be a function, such as one that checks the length of a string.


Problem Formulation

Problem: How to extract all the absolute URLs from an HTML page?

Example: Consider the following webpage which has numerous links:

Now, when you try to scrape the links highlighted above, you find that only the relative links/paths are extracted instead of the complete absolute URLs. Let us have a look at the code given below, which demonstrates what happens when you extract the ‘href’ attributes normally.
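A minimal sketch of this behaviour, assuming a small hypothetical HTML snippet in place of the real page (the HTML string and its links are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical page content; it stands in for the page discussed above.
HTML = """
<html><body>
  <a href="/blog/">Blog</a>
  <a href="/about/">About</a>
  <a href="contact.html">Contact</a>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Collect the raw href attribute of every <a> tag that has one.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

for href in hrefs:
    print(href)
# /blog/
# /about/
# contact.html
```

The values come back exactly as written in the markup, so relative links stay relative.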

Output:

The above output is not what you desired. You wanted to extract the absolute paths as shown below:

Hence, without further delay let us go ahead and try to extract the absolute paths instead of the relative paths.

Method 1: Using urllib.parse.urljoin()

The easiest solution to our problem is to use the urllib.parse.urljoin() function.

According to the Python documentation, urllib.parse.urljoin() constructs a full (absolute) URL by combining a “base URL” with another URL. The advantage of urljoin() is that it resolves the relative path properly whether BASE_URL is just the domain or the absolute URL of the webpage.
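As a small illustration of that behaviour, the sketch below joins a few relative paths against two hypothetical base URLs (example.com and the paths are assumptions, not taken from the page above):

```python
from urllib.parse import urljoin

# Hypothetical base URLs: the bare domain and a full page URL.
BASE_DOMAIN = "https://example.com"
BASE_PAGE = "https://example.com/blog/index.html"

# A root-relative path resolves the same way from either base.
from_domain = urljoin(BASE_DOMAIN, "/about/")
from_page = urljoin(BASE_PAGE, "/about/")

# A document-relative path resolves against the page's directory.
sibling = urljoin(BASE_PAGE, "post-1.html")

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(BASE_PAGE, "https://other.example/x")

print(from_domain)       # https://example.com/about/
print(from_page)         # https://example.com/about/
print(sibling)           # https://example.com/blog/post-1.html
print(already_absolute)  # https://other.example/x
```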

Output:

Now that we have an idea about urljoin, let us have a look at the following code which successfully resolves our problem and helps us to extract the complete/absolute paths from the HTML page.

Solution:
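One way this could look, sketched under the assumption of a hypothetical base URL and an inline HTML string rather than the real page (absolute_links is a helper name introduced here for illustration):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and page content, used only for illustration.
BASE_URL = "https://example.com/"
HTML = """
<html><body>
  <a href="/blog/">Blog</a>
  <a href="/about/">About</a>
  <a href="contact.html">Contact</a>
</body></html>
"""


def absolute_links(html, base_url):
    """Return every link in the HTML, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute at all.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


links = absolute_links(HTML, BASE_URL)
for link in links:
    print(link)
# https://example.com/blog/
# https://example.com/about/
# https://example.com/contact.html
```

When scraping a live page, the URL the page was fetched from would normally serve as the base URL.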

Output:

Method 2: Concatenate The Base URL And Relative URL Manually

Another workaround is to concatenate the base part of the URL and the relative URLs manually, just like two ordinary strings. The problem in this case is that joining the strings by hand can introduce subtle errors such as a doubled forward slash (spot the extra / below):
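A short sketch of the pitfall, using a hypothetical base URL and relative path:

```python
# Hypothetical values: the base ends with "/" and the path starts with "/".
BASE_URL = "https://example.com/"
relative = "/about/"

# Plain string concatenation keeps both slashes.
naive = BASE_URL + relative
print(naive)  # https://example.com//about/
```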

Therefore, to ensure proper concatenation, you have to modify your code so that any extra character that might lead to errors is removed. Let us have a look at the following code, which concatenates the base and the relative paths without any extra forward slash.

Solution:
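A minimal sketch of such a concatenation, assuming the base URL is the site root and every href is relative to that root (join_manually is a hypothetical helper name):

```python
def join_manually(base_url, relative):
    """Join two URL parts with exactly one "/" between them."""
    # Strip the slashes at the seam, then add a single one back.
    return base_url.rstrip("/") + "/" + relative.lstrip("/")


# Hypothetical inputs covering each slash combination at the seam.
print(join_manually("https://example.com/", "/about/"))  # https://example.com/about/
print(join_manually("https://example.com", "/about/"))   # https://example.com/about/
print(join_manually("https://example.com/", "about/"))   # https://example.com/about/
print(join_manually("https://example.com", "about/"))    # https://example.com/about/
```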

Output:

⚠️ Caution: This is not the recommended way of extracting absolute URLs from an HTML page. If you have an automated script that needs to resolve URLs, but you do not know at the time of writing which website it will visit, this method will not serve your purpose, and your go-to method should be urljoin(). Nevertheless, this method deserves a mention because in our case it serves the purpose and extracts the absolute URLs.
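To make the caution concrete, the sketch below compares a manual join with urljoin() on two hrefs that plain concatenation cannot handle. The page URL, the hrefs, and the join_manually helper are all assumptions for illustration:

```python
from urllib.parse import urljoin


def join_manually(base_url, relative):
    """Hypothetical manual join: exactly one "/" between the two parts."""
    return base_url.rstrip("/") + "/" + relative.lstrip("/")


# Hypothetical page URL that is not the site root.
PAGE_URL = "https://example.com/blog/2021/post.html"

# A parent-directory reference: urljoin resolves it, concatenation does not.
manual_parent = join_manually(PAGE_URL, "../archive.html")
joined_parent = urljoin(PAGE_URL, "../archive.html")
print(manual_parent)  # https://example.com/blog/2021/post.html/../archive.html
print(joined_parent)  # https://example.com/blog/archive.html

# An href that is already absolute: urljoin leaves it alone.
manual_abs = join_manually(PAGE_URL, "https://other.example/x")
joined_abs = urljoin(PAGE_URL, "https://other.example/x")
print(manual_abs)  # https://example.com/blog/2021/post.html/https://other.example/x
print(joined_abs)  # https://other.example/x
```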

Conclusion

In this article, we learned how to extract the absolute links from a given HTML page using BeautifulSoup. If you want to master Python's BeautifulSoup library and explore it in depth through examples and video lessons, please have a look at the following link and work through the articles one by one; they explain every aspect of BeautifulSoup in great detail.