YouTube Video Scraping Python With Beautiful Soup

If you want to learn how to scrape data from YouTube, you are in the right place. In this post, we will see how you can scrape a website for video links with Beautiful Soup and Python. We chose YouTube video scraping because it’s easy and will give you confidence as well. Do note, we are only processing the first page of search results. Let us go ahead with our first tutorial on scraping YouTube search results with Python and Beautiful Soup.

Disclaimer

This article is for Educational Purposes only. Please check the laws for web scraping for your country and the website you are scraping. We are not responsible for companies suing you or law enforcement, intelligence, or secret services knocking at your door.

Source

You can find the source code for the Python script here.

Let’s Dive In

In this example, we will scrape YouTube based on a search term we provide. You will need to know basic HTML tags. We will be using Beautiful Soup, a Python library for pulling the data we want out of HTML and XML files. As this is our first YouTube web scraping example, we decided to choose an easy one.

We need to import two Python libraries into our code.

import requests
from bs4 import BeautifulSoup

If you haven’t installed these libraries you can find steps on how to do that here.

Step 1 – Open Youtube.com in your Browser

We go to youtube.com in our browser. We prefer Firefox as it’s easier for inspecting the page’s HTML later on, but you can also use Chrome.

The Python code to open youtube.com and get its HTML is

sb_get = requests.get("https://www.youtube.com")

To see the HTML, we need to use

sb_get.content

In this case, YouTube doesn’t know which browser the request is coming from, so we may be blocked. Hence we define a variable with the details of Firefox and send it along with the request (the general term for this is headers). You can choose the header for any browser you want.

mozhdr = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'}

sb_get = requests.get("https://www.youtube.com", headers = mozhdr)
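As a minimal sketch of the idea above, the request and its headers can be wrapped in a small helper. The function name fetch_html, the timeout value and the raise_for_status() check are our own additions for illustration, not part of the original script. Preparing the request without sending it lets us confirm which headers would go out.

```python
import requests

# User-Agent string copied from the article; any browser's string would do.
MOZ_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) "
                  "Gecko/2008092417 Firefox/3.0.3"
}


def fetch_html(url, headers=MOZ_HEADERS, timeout=10):
    """Hypothetical helper: return the page HTML, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Build the request without sending it, to see what would be sent.
prepared = requests.Request(
    "GET", "https://www.youtube.com", headers=MOZ_HEADERS
).prepare()
print(prepared.headers["User-Agent"])
```

Calling fetch_html("https://www.youtube.com") would perform the actual download. The prepared request above makes no network call.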

Step 2 – Enter Search Term

As we are Game of Thrones fans, we enter the search term Game of Thrones and then click the search icon. A new page opens up and we get a list of results. The important thing to note here is how the URL changes from

https://www.youtube.com

to

https://www.youtube.com/results?search_query=game+of+thrones

Let us now search for Breaking Bad (another one of our favorite TV Shows). The URL changes to

https://www.youtube.com/results?search_query=breaking+bad

So, we know that each space is replaced by a + sign (URLs cannot contain spaces) and that the search term is appended to the URL below.

https://www.youtube.com/results?search_query=

Equipped with all this knowledge, we can define three variables in our Python code.

scrape_url="https://www.youtube.com"
search_url="/results?search_query="
search_hardcode = "game+of+thrones"

We can now combine all the above terms, and we will get our search URL.

sb_url = scrape_url + search_url + search_hardcode
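If the search term comes from a user rather than being hard-coded, the + replacement can be done for us. The sketch below uses quote_plus from the standard library. The helper name build_search_url is hypothetical and only here for illustration.

```python
from urllib.parse import quote_plus

SCRAPE_URL = "https://www.youtube.com"
SEARCH_PATH = "/results?search_query="


def build_search_url(term):
    """Hypothetical helper: return the search URL for a plain-text term."""
    # quote_plus turns spaces into + and percent-encodes other unsafe characters
    return SCRAPE_URL + SEARCH_PATH + quote_plus(term)


print(build_search_url("game of thrones"))
# https://www.youtube.com/results?search_query=game+of+thrones
```

This produces the same URL we built by hand, and it also copes with characters such as & or ? inside the search term.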

Now, if we want the HTML for the search URL page, we will give the below command.

sb_get = requests.get(sb_url, headers = mozhdr)

sb_get now holds the response from youtube.com (status code 200 if all is good), and we can find the HTML in

sb_get.content

The HTML source has a lot of stuff; what we need is the link to each video. This is where HTML basics come into the picture. All links on a page are enclosed in an <a> tag. We suggest you learn more about the <a> tag from the link below.

HTML a tag Explained

We go into Firefox while we are on the search results page and enable Inspector.

Tools – Web Developer – Inspector

Now, when we hover the mouse cursor over the link to a video, we see the details of its <a> tag.

Notice, the <a> tag has all the details we need

  • the link to the video (in href)
  • the Title of the video (in title)

Step 4 – Filter out <a> Tags with Videos

Our page has a lot of <a> tags, but we only need the ones that point to videos. So, we need a way to identify these <a> tags and pick them out from the rest. We need to look for something common to all of them. It is easy for YouTube, but for some websites you need to filter on the <div> tag that encloses the <a> tag. Again, this varies from site to site.

When we move our mouse over the video links, we get the details of each <a> tag. All the video links we need share the same class on the <a> tag: yt-uix-tile-link.

This is what we are going to use to pull out all the <a> tags we need from the HTML source.

Step 5 – Beautiful Soup and find_all

To use Beautiful Soup functions, we first need to parse the HTML into an object that Beautiful Soup can work with. The command below takes care of it.

soupeddata = BeautifulSoup(sb_get.content, "html.parser")

The variable soupeddata now holds the parsed HTML.

We now need to find all <a> tags with a specific class, as those are the <a> tags of interest to us.

yt_links = soupeddata.find_all("a", class_ = "yt-uix-tile-link")

We use the find_all function to get all <a> tags that have the class yt-uix-tile-link. These <a> tags are stored in a variable called yt_links, which behaves like a Python list.
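To see the class filter in isolation, here is a small sketch that runs find_all on a hand-written HTML snippet. The snippet is our own stand-in, not real YouTube output, and the markup of a live page may differ from what this article describes.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a results page (an assumption, not real output)
sample_html = """
<div>
  <a href="/watch?v=aaa111" class="yt-uix-tile-link spf-link" title="First video">First video</a>
  <a href="/channel/xyz" class="channel-link">Some channel</a>
  <a href="/watch?v=bbb222" class="yt-uix-tile-link" title="Second video">Second video</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any tag whose class list contains this value,
# even when the tag carries other classes as well
video_links = soup.find_all("a", class_="yt-uix-tile-link")

print(len(video_links))  # 2
print([a.get("href") for a in video_links])
```

The channel link is skipped because it does not carry the class, while the first link matches even though it has a second class next to it.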

We now have a Python list of <a> tags that holds all the information we need. We still have to narrow it down, as we only need the URL and the title. So, we need to get the href and title attributes of each <a> tag. Since yt_links is a list, we use a Python loop to process it and grab the href and title attributes.

for x in yt_links:
    yt_href = x.get("href")
    yt_title = x.get("title")

Let’s pick one of the <a> tags we filtered out.

<a href="/watch?v=pE2wcBeyNdk" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link " data-sessionlink="itct=CIwBENwwGAEiEwiIhOnNt8rVAhXZxFUKHe9uDzMo9CRSD2dhbWUgb2YgdGhyb25lcw" title="Game of Thrones: The Loot Train Attack (HBO)" aria-describedby="description-id-568845" rel="spf-prefetch" dir="ltr">Game of Thrones: The Loot Train Attack (HBO)</a>

In this case

yt_href will be /watch?v=pE2wcBeyNdk

and

yt_title will be Game of Thrones: The Loot Train Attack (HBO)
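The same values can be checked by parsing that one tag on its own. This is a minimal sketch with the tag shortened to the attributes we care about. The attribute name data-not-there is made up to show what get returns when an attribute is missing.

```python
from bs4 import BeautifulSoup

# The example tag from above, shortened to the attributes we use
tag_html = (
    '<a href="/watch?v=pE2wcBeyNdk" '
    'class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link " '
    'title="Game of Thrones: The Loot Train Attack (HBO)" dir="ltr">'
    'Game of Thrones: The Loot Train Attack (HBO)</a>'
)

link = BeautifulSoup(tag_html, "html.parser").find("a")

yt_href = link.get("href")
yt_title = link.get("title")
missing = link.get("data-not-there")  # get returns None instead of raising

print(yt_href)
print(yt_title)
print(missing)
```

Because get returns None for a missing attribute, it is worth checking yt_href before building a URL from it.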

Our video URL is still not complete, but all we need to do is add

https://www.youtube.com

before it, which we already have in the variable scrape_url.

yt_final = scrape_url + yt_href

This gives us the complete link in the variable yt_final.
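Plain string concatenation works here because every href starts with a slash. As an alternative sketch (not what the original script does), urljoin from the standard library builds the same link and also leaves an already complete URL untouched.

```python
from urllib.parse import urljoin

scrape_url = "https://www.youtube.com"
yt_href = "/watch?v=pE2wcBeyNdk"

# Relative link: joined onto the site address
yt_final = urljoin(scrape_url, yt_href)
print(yt_final)
# https://www.youtube.com/watch?v=pE2wcBeyNdk

# Already absolute link: returned unchanged
print(urljoin(scrape_url, "https://example.com/video"))
# https://example.com/video
```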

Step 7 – Done

That’s it peeps, we now have the YouTube link and the title of the video. If you execute the code in IDLE and print the variables yt_final and yt_title, you should see a complete video link together with its title.

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In this guide, you have learned how to build a YouTube video scraper. I hope this guide was helpful. If you liked it, share it with your friends.

Razi Haider

Hi! I'm Razi Haider, a Professional Electronic Engineer and part-time blogger. Being an avid cinephile and other content streamers myself, I started TheFirestickTV.com in 2020 to help others to access and stream good content on any platform. Here I do share all the tutorials and compilations of best services, write how-to articles, and cover all the major and minor updates regarding streaming, firestick, Kodi, Roku and other devices and platforms.