
Web Scraping with Python: A Step-by-Step Guide

Web scraping is the process of extracting information from websites and using it for specific use cases.

Suppose you want to extract a table from a web page, convert it to a JSON file, and then use that JSON file to build an internal tool. Web scraping allows you to target specific elements within a web page and extract the data you need. Web scraping using Python is a very popular choice as Python provides multiple libraries such as BeautifulSoup and Scrapy to extract data effectively.


Having the skills to extract data efficiently is also very important as a developer or data scientist. This article will help you understand how to scrape a website effectively and retrieve the content you need. This tutorial uses BeautifulSoup, a popular package for scraping data in Python.


Why use Python for web scraping?

Python is the first choice for many developers when building web scrapers. There are many reasons why Python should be your first choice, but in this article, we will discuss the three main reasons why Python is used for data scraping.

Library and community support: There are some great libraries, such as BeautifulSoup, Scrapy, and Selenium, that provide powerful functionality for effectively scraping web pages. Together, they form a great ecosystem for web scraping. Plus, many developers around the world already use Python, so you can get help quickly if you get stuck.

Automation: Python is famous for its automation capabilities. If you’re looking to create complex tools that rely on scraping, you’ll need more than just web scraping. For example, if you want to build a tool to track the prices of products in your online store, you’ll need to add automation so you can track daily prices and add them to your database. Python makes it easy to automate such processes.

Data visualization: Web scraping is frequently used by data scientists, who often need to extract data from web pages. With libraries like Pandas, Python makes it easier to visualize and analyze raw scraped data.
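As a quick sketch of that last point, scraped rows (shaped like the list of dictionaries built later in this tutorial) can be loaded into a Pandas DataFrame for analysis. The sample values below are illustrative, not real market data:

```python
import pandas as pd

# Hypothetical scraped rows, shaped like the table data built later in this guide.
rows = [
    {"Date": "2022-11-27", "Open": "$1,205.66"},
    {"Date": "2022-11-26", "Open": "$1,198.98"},
]

df = pd.DataFrame(rows)

# Strip the currency formatting so the column can be treated as numeric.
df["Open"] = df["Open"].str.replace("[$,]", "", regex=True).astype(float)

print(df["Open"].mean())
```

From here, plotting or aggregating the data is a one-liner with Pandas' built-in methods.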


Libraries for web scraping in Python

In Python, there are several libraries available to make web scraping easier. Here we will discuss three of the most popular libraries.

#1. BeautifulSoup

BeautifulSoup is one of the most popular libraries for web scraping and has been helping developers scrape web pages since 2004. It provides simple ways to navigate, search, and modify the parse tree. BeautifulSoup also handles encoding for incoming and outgoing data. It is well maintained and has a great community.
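To give a feel for the library before the full tutorial below, here is a minimal example of navigating and searching a parse tree with BeautifulSoup (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

# A small HTML snippet to demonstrate searching the parse tree.
html = "<html><body><h1>Prices</h1><p class='price'>$1,205.66</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element; attrs narrows the search.
heading = soup.find("h1").text
price = soup.find("p", attrs={"class": "price"}).text

print(heading, price)
```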

#2. Scrapy

Scrapy is another popular framework for data extraction, with over 43,000 stars on GitHub. It can also be used to collect data from APIs, and it ships with some interesting built-in support, such as sending emails.

#3. Selenium

Selenium is not primarily a web scraping library; it is a browser automation package. However, its functionality is easily extended to scrape web pages. Selenium uses the WebDriver protocol to control different browsers and has been on the market for nearly 20 years, making it easy to automate browsers and collect data from web pages.

Python web scraping challenges

When trying to collect data from a website, you may face many challenges, including slow networks, anti-scraping tools, IP-based blocking, and CAPTCHA blocks. These issues can cause major problems when trying to scrape a website.

However, you can avoid these problems effectively by following a few practices. For example, in most cases a website will block an IP address if it receives more than a certain number of requests within a certain time interval. To avoid IP blocking, you should code your scraper to cool down between requests.
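A minimal sketch of that cool-down idea, wrapping requests.get with a pause between calls (the URLs and the delay value are assumptions; tune the delay to the target site's tolerance):

```python
import time

import requests

# Hypothetical list of pages to scrape.
urls = ["https://example.com/page/1", "https://example.com/page/2"]


def polite_get(url, delay=2.0, timeout=10):
    """Fetch a URL, then pause so consecutive requests are spaced out."""
    response = requests.get(url, timeout=timeout)
    time.sleep(delay)  # cool-down between requests to avoid IP-based blocks
    return response


# Usage: for url in urls: r = polite_get(url)
```

More elaborate schemes, such as exponential backoff on 429 responses or rotating proxies, build on the same principle.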


Website developers also tend to set up honeypot traps for scrapers. These traps are usually invisible to the human eye but can be picked up by a scraper. If you are scraping a website that places such honeypot traps, you will need to code your scraper accordingly.
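One common honeypot pattern is a link hidden with inline CSS; a sketch of filtering those out is below. The HTML is made up, and real pages may hide traps in other ways (CSS classes, zero-size elements), so adapt the check to the target site:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one visible link and one honeypot hidden with inline CSS.
html = """
<a href="/real">Real link</a>
<a href="/trap" style="display: none">Hidden trap</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that are not styled as invisible.
visible_links = [
    a["href"]
    for a in soup.find_all("a")
    if "display:none" not in a.get("style", "").replace(" ", "")
]

print(visible_links)
```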

CAPTCHAs are another serious problem for scrapers. Most websites now use CAPTCHAs to protect their pages from being accessed by bots. In these cases, you may need to use a CAPTCHA solver.

Scrape websites using Python

As mentioned earlier, we will use BeautifulSoup to scrape the website. In this tutorial, we will scrape historical Ethereum data from Coingecko and save the table data as a JSON file. Let's move on to building the scraper.

The first step is to install BeautifulSoup and Requests. This tutorial uses Pipenv, a virtual environment manager for Python. You can use venv if you want, but I prefer Pipenv. A discussion of Pipenv is beyond the scope of this tutorial; if you want to know how to use Pipenv, follow this guide. Or, if you want to understand Python virtual environments in general, follow this guide.

Run the command pipenv shell in your project directory to launch a subshell inside the virtual environment. Next, run the following command to install BeautifulSoup:

 pipenv install beautifulsoup4

Similarly, install Requests by running the following command:

 pipenv install requests

Once the installation is complete, import the required packages into the main file. Create a file called main.py and import the packages as shown below.

 from bs4 import BeautifulSoup
import requests
import json

The next step is to retrieve the contents of the historical data pages and parse them using the HTML parser available in BeautifulSoup.

 r = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data#panel')

soup = BeautifulSoup(r.content, 'html.parser')

The above code uses the get method available in the Requests library to access the page. The parsed content is saved in a variable called soup.

This is where the actual scraping begins. First, we need to correctly identify the table in the DOM. If you open this page and inspect it using your browser's developer tools, you will see that the table has the class table table-striped text-sm text-lg-normal.

Coingecko Ethereum historical data table

To properly target this table, you can use the find method.

 table = soup.find('table', attrs={'class': 'table table-striped text-sm text-lg-normal'})

table_data = table.find_all('tr')

table_headings = []

for th in table_data[0].find_all('th'):
    table_headings.append(th.text)

In the above code, the table is first located using the soup.find method, and then all tr elements in the table are collected using the find_all method. These tr elements are stored in a variable called table_data. The table header contains several th elements, so a new variable called table_headings is initialized to hold the headings as a list.

Then a for loop runs over the first row of the table. It finds all th elements in that row and appends their text values to the table_headings list; the text is extracted via the text attribute. If you now print the table_headings variable, you will see the following output:

 ['Date', 'Market Cap', 'Volume', 'Open', 'Close']

The next step is to get the remaining elements, generate a dictionary for each row, and add the rows to the list.

 table_details = []

for tr in table_data:
    th = tr.find_all('th')
    td = tr.find_all('td')

    data = {}

    for i in range(len(td)):
        data.update({table_headings[0]: th[0].text})
        data.update({table_headings[i+1]: td[i].text.replace('\n', '')})

    if len(data) > 0:
        table_details.append(data)

This is an important part of the code. For each tr in the table_data variable, the th elements are found first; the th element holds the date shown in the table. These th elements are stored in the variable th. Similarly, all td elements are stored in the variable td.

An empty dictionary data is initialized. After initialization, we loop over the range of td elements. For each row, the first field of the dictionary is updated with the first th item: the code table_headings[0]: th[0].text assigns the date as a key-value pair from the first th element.

After the first element, the remaining elements are assigned using data.update({table_headings[i+1]: td[i].text.replace('\n', '')}). Here, the text of each td element is extracted first, and then every \n is removed using the replace method. The value is assigned to the (i+1)-th element of the table_headings list because the date element has already been assigned.

Then, if the length of the data dictionary is greater than zero, it is appended to the table_details list. You can check the result by printing the table_details list, but here we will write the values to a JSON file instead. Let's take a look at this code.

 with open('table.json', 'w') as f:
    json.dump(table_details, f, indent=2)
    print('Data saved to json file...')

Here we use the json.dump method to write the values to a JSON file called table.json. Once the writing is complete, Data saved to json file... is printed to the console.
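To sanity-check that the data round-trips, the file can be read back with json.load. This small standalone sketch writes a sample row (made up for illustration) and reads it back, using the same filename as this guide:

```python
import json

# A sample row shaped like the scraped data, written and read back to
# verify the round trip.
sample = [{"Date": "2022-11-27", "Open": "$1,205.66"}]

with open('table.json', 'w') as f:
    json.dump(sample, f, indent=2)

with open('table.json') as f:
    loaded = json.load(f)

print(loaded[0]["Date"])
```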

Then run the file using the following command:

 python main.py

After a while, you should see the text "Data saved to json file..." in the console. A new file called table.json will also appear in the working directory. The file should look like the following JSON:

 [
  {
    "Date": "2022-11-27",
    "Market Cap": "$145,222,050,633",
    "Volume": "$5,271,100,860",
    "Open": "$1,205.66",
    "Close": "N/A"
  },
  {
    "Date": "2022-11-26",
    "Market Cap": "$144,810,246,845",
    "Volume": "$5,823,202,533",
    "Open": "$1,198.98",
    "Close": "$1,205.66"
  },
  {
    "Date": "2022-11-25",
    "Market Cap": "$145,091,739,838",
    "Volume": "$6,955,523,718",
    "Open": "$1,204.21",
    "Close": "$1,198.98"
  },
// ...
// ... 
]

You have successfully implemented a web scraper using Python. To view the complete code, please visit this GitHub repository.

Conclusion

In this article, we learned how to implement a simple web scraper in Python. We discussed how to use BeautifulSoup to quickly scrape data from websites, covered other available libraries, and looked at why Python is the first choice for many developers when scraping websites.

See also these web scraping frameworks.
