If you want to know how to scrape data from YouTube, you are in the right place. In this post, we will see how you can scrape a website for video links with Beautiful Soup and Python. We will scrape YouTube videos because it's easy and will give you confidence as well. Do note that we only process the first page of search results. Let us go ahead and look at our first tutorial on scraping YouTube search results with Python and Beautiful Soup.
YouTube Video Scraping Python With Beautiful Soup
Disclaimer
This article is for Educational Purposes only. Please check the laws for web scraping for your country and the website you are scraping. We are not responsible for companies suing you or law enforcement, intelligence, or secret services knocking at your door.
Source
You can find the source code for the Python script here.
Let’s Dive In
In this example, we will scrape YouTube based on a search term that we provide. You need to know basic HTML tags. We will use Beautiful Soup, a Python library for pulling the data we want out of HTML and XML files or sources. As this is our first YouTube web scraping example, we decided to choose an easy one.
We need to import two Python libraries into our code.
import requests
from bs4 import BeautifulSoup
If you haven’t installed these libraries you can find steps on how to do that here.
Step 1 – Open Youtube.com in your Browser
We go to youtube.com in our browser. We prefer Firefox as it's easier for what we do in Step 3, but you can also use Chrome.
The Python code to open youtube.com and get its HTML is
sb_get = requests.get("https://www.youtube.com")
To see the HTML, we need to use
sb_get.content
In this case, YouTube doesn't know which browser the request is coming from, so we may be blocked. Hence we define a variable holding the User-Agent details of Firefox and pass it along with the request above (these extra details are generally called headers). You can use the header of any browser you want.
mozhdr = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'}
requests.get("https://www.youtube.com", headers = mozhdr)
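To see what the header actually changes, here is a small sketch that builds the request without sending it. It relies on requests.Request(...).prepare() and requests.utils.default_user_agent(), which are not used in the article's script; they appear here only to inspect the outgoing headers offline.

```python
import requests

# The Firefox User-Agent string used in this article.
mozhdr = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) "
                  "Gecko/2008092417 Firefox/3.0.3"
}

# Without a custom header, requests identifies itself as "python-requests/<version>".
default_agent = requests.utils.default_user_agent()

# Build (but do not send) the request, so the final headers can be inspected.
prepared = requests.Request("GET", "https://www.youtube.com", headers=mozhdr).prepare()

print("Default User-Agent:", default_agent)
print("User-Agent we send:", prepared.headers["User-Agent"])
```

The second line printed is the Firefox string, which is what the site sees instead of the library default.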
Step 2 – Enter Search Term
As we are Game of Thrones fans, we enter the search term Game of Thrones and then click the search icon. A new page opens up and we get a list of results. The important thing to note here is how the URL changes from
https://www.youtube.com
to
https://www.youtube.com/results?search_query=game+of+thrones
Let us now search for Breaking Bad (another one of our favorite TV Shows). The URL changes to
https://www.youtube.com/results?search_query=breaking+bad
So, we know that the search term is appended to the URL below, and that each space in it is replaced by a + sign, because URLs cannot contain spaces.
https://www.youtube.com/results?search_query=
Equipped with all this knowledge, we can define three variables in our Python code.
scrape_url = "https://www.youtube.com"
search_url = "/results?search_query="
search_hardcode = "game+of+thrones"
We can now combine all the above terms, and we will get our search URL.
sb_url = scrape_url + search_url + search_hardcode
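If you would rather not replace the spaces by hand, the same URL can be built from a plain search term. The sketch below uses quote_plus from Python's standard urllib.parse module; the helper name build_search_url is our own and is not part of the original script.

```python
from urllib.parse import quote_plus

scrape_url = "https://www.youtube.com"
search_url = "/results?search_query="


def build_search_url(term):
    """Return the search URL for a plain-text term (hypothetical helper)."""
    # quote_plus turns spaces into "+" and escapes characters such as "&".
    return scrape_url + search_url + quote_plus(term)


print(build_search_url("game of thrones"))
print(build_search_url("breaking bad"))
```

For "game of thrones" this produces the same URL we saw in the browser, and it also copes with search terms that contain characters such as & or ?.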
Now, if we want the HTML for the search URL page, we will give the below command.
sb_get = requests.get(sb_url, headers = mozhdr)
sb_get now holds the response from youtube.com (its status code is 200 if all is good), and we can find the HTML in
sb_get.content
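Rather than assuming the request worked, you may want to check it. The following is a minimal sketch, not part of the original script: fetch_html is a hypothetical helper, and its session parameter and 10-second timeout are our own assumptions. It raises an error for any 4xx or 5xx response and otherwise returns the HTML.

```python
import requests

mozhdr = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) "
                  "Gecko/2008092417 Firefox/3.0.3"
}


def fetch_html(url, session=None, timeout=10):
    """Return the raw HTML for url, or raise if the request failed."""
    # requests and requests.Session both expose a compatible get().
    http = session if session is not None else requests
    response = http.get(url, headers=mozhdr, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.content
```

A call such as fetch_html(sb_url) would then replace the requests.get line above. Passing a requests.Session is optional; it only becomes useful when you make several requests in a row.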
Step 3 – Identify Video link from HTML Source
The HTML source contains a lot of markup, but all we need is the link to each video. This is where HTML basics come into the picture: every link on a page is enclosed in an <a> tag. We suggest you learn more about the <a> tag from the below link.
We go into Firefox while we are on the search results page and enable Inspector.
Tools – Web Developer – Inspector
Now, when we hover the mouse cursor over the link to a video, we get the below details.
Notice, the <a> tag has all the details we need
- the link to the video (in href)
- the Title of the video (in title)
Step 4 – Filter out <a> Tags with Videos
Our page has a lot of <a> tags, but we only need the ones that point to videos. So, we need a way to identify these <a> tags and pick them out from the rest. To do that, we look for something that all of them have in common. This is easy for YouTube, but for some websites you need to filter on the <div> tag that encloses the <a> tag. Again, this varies from site to site.
When we move our mouse over the video links, we get the details of the <a> tag. All the video links we need share the <a> tag details marked in green below.
The details marked in the green rectangle are the class of the <a> tag.
This is what we are going to use to pull out all the <a> tags we need from the HTML source.
Step 5 – Beautiful Soup and find_all
To use Beautiful Soup functions we first need to ensure that the HTML we have is in a format recognized by Beautiful Soup. The below command takes care of it.
soupeddata = BeautifulSoup(sb_get.content, "html.parser")
The variable soupeddata has HTML content in a format that is recognized by Beautiful Soup.
We now need to find all <a> tags with a specific class, as those are the <a> tags of interest to us.
yt_links = soupeddata.find_all("a", class_ = "yt-uix-tile-link")
We use the find_all function to get all <a> tags that have the class yt-uix-tile-link. All these <a> tags are stored in a variable called yt_links, which behaves like a Python list.
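To try find_all without touching the network, here is a small sketch that runs it on a hand-written HTML snippet. The first <a> tag is shortened from the example used in this article; the second video link and the footer link are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample: two video links and one unrelated link.
SAMPLE_HTML = """
<div>
  <a href="/watch?v=pE2wcBeyNdk" class="yt-uix-tile-link yt-ui-ellipsis spf-link"
     title="Game of Thrones: The Loot Train Attack (HBO)">Game of Thrones: The Loot Train Attack (HBO)</a>
  <a href="/watch?v=hypothetical01" class="yt-uix-tile-link"
     title="Hypothetical second result">Hypothetical second result</a>
  <a href="/about" class="footer-link">About</a>
</div>
"""

soupeddata = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches any tag whose class list contains this value,
# even when the tag carries several other classes as well.
yt_links = soupeddata.find_all("a", class_="yt-uix-tile-link")

print(len(yt_links))  # 2 (the footer link is skipped)
for link in yt_links:
    print(link["class"])
```

If the page you download uses different class names from the ones shown here, find_all simply returns an empty list, so it is worth printing len(yt_links) before going further.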
Step 6 – Grab Links and Title From <a> Tag
We now have a Python list of <a> tags which has all the information we need. We still have some information to filter out as we only need the URL and title. So, we need to get href and title attributes of the <a> tag. Since yt_links is a list, we use a Python loop to process the list and grab the href and title attributes.
for x in yt_links:
    yt_href = x.get("href")
    yt_title = x.get("title")
Let’s pick one of the <a> tags we filtered out.
<a href="/watch?v=pE2wcBeyNdk" class="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link " data-sessionlink="itct=CIwBENwwGAEiEwiIhOnNt8rVAhXZxFUKHe9uDzMo9CRSD2dhbWUgb2YgdGhyb25lcw" title="Game of Thrones: The Loot Train Attack (HBO)" aria-describedby="description-id-568845" rel="spf-prefetch" dir="ltr">Game of Thrones: The Loot Train Attack (HBO)</a>
In this case
yt_href will be /watch?v=pE2wcBeyNdk
and
yt_title will be Game of Thrones: The Loot Train Attack (HBO)
Our video URL is still not complete, but all we need to do is add
https://www.youtube.com
before it, which we have in variable scrape_url
yt_final = scrape_url + yt_href
This gives us the complete link in the variable yt_final.
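Putting Steps 5 and 6 together, the sketch below collects the title and full URL of every matching link into a list. The function name extract_videos is hypothetical, and urljoin from the standard library is used here in place of plain string concatenation; for hrefs such as /watch?v=... both give the same result.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

scrape_url = "https://www.youtube.com"


def extract_videos(html):
    """Return a list of {"title", "url"} dicts for each video link (hypothetical helper)."""
    soupeddata = BeautifulSoup(html, "html.parser")
    videos = []
    for x in soupeddata.find_all("a", class_="yt-uix-tile-link"):
        yt_href = x.get("href")
        if not yt_href:
            continue  # skip <a> tags that carry no link
        videos.append({
            "title": x.get("title"),
            "url": urljoin(scrape_url, yt_href),
        })
    return videos


# The sample <a> tag from this article, shortened to the attributes we use.
sample = (
    '<a href="/watch?v=pE2wcBeyNdk" class="yt-uix-tile-link spf-link" '
    'title="Game of Thrones: The Loot Train Attack (HBO)">'
    'Game of Thrones: The Loot Train Attack (HBO)</a>'
)

for video in extract_videos(sample):
    print(video["title"], "->", video["url"])
```

Returning a list of dictionaries keeps every result, whereas the loop above overwrites yt_href and yt_title on each pass.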
Step 7 – Done
That's it peeps, we now have the YouTube link and the title of the video. If you execute the code in IDLE and print the variables yt_final and yt_title, you should get an output similar to below.
References
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
In this guide, you have learned how to build a simple YouTube video scraper. I hope this guide was helpful. If you liked it, share it with your friends.