I tried to write a parser to download pictures from artstation.com for myself. I picked a random profile; practically all the content there is loaded via JSON. I found the GET request, and in the browser it opens fine, but through requests.get it returns 403. Everyone on Google advises setting the User-Agent and Cookie headers. I used requests.Session and set a User-Agent, but the picture is still the same. What am I doing wrong?
import requests
url = 'https://www.artstation.com/users/kuvshinov_ilya'
json_url = 'https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1'
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',}
session = requests.Session()
r = session.get(url, headers=header)
json_r = session.get(json_url, headers=header)
print(json_r)
> Response [403]
The 403 is caused by Cloudflare.
cfscrape helped me get around it:
import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'}
    return cfscrape.create_scraper(sess=session)

session = get_session()  # From here on, work with it like a regular requests.Session
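For example, a minimal usage sketch (not part of the original answer; the URL is the one from the question):
resp = session.get('https://www.artstation.com/users/kuvshinov_ilya/projects.json', params={'page': 1})
print(resp.status_code)  # should be 200 once the Cloudflare check passes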
A bit of code for pulling direct links to the high-res images:
import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'Host': 'www.artstation.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'}
    return cfscrape.create_scraper(sess=session)

def artstation():
    url = 'https://www.artstation.com/kyuyongeom'
    page_url = 'https://www.artstation.com/users/kyuyongeom/projects.json'
    post_pattern = 'https://www.artstation.com/projects/{}.json'
    session = get_session()
    absolute_links = []

    response = session.get(page_url, params={'page': 1}).json()
    pages, modulo = divmod(response['total_count'], 50)  # 50 projects per page
    if modulo:
        pages += 1

    for page in range(1, pages + 1):
        if page != 1:
            response = session.get(page_url, params={'page': page}).json()
        for post in response['data']:
            shortcode = post['permalink'].split('/')[-1]
            inner_resp = session.get(post_pattern.format(shortcode)).json()
            for img in inner_resp['assets']:
                if img['asset_type'] == 'image':
                    absolute_links.append(img['image_url'])

    with open('links.txt', 'w') as file:
        file.write('\n'.join(absolute_links))

if __name__ == '__main__':
    artstation()
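A possible follow-up step, not in the original answer: once links.txt has been written, the same cfscrape session can be reused to download the files. The output directory name and helper below are illustrative.
import os

def download_images(links_file='links.txt', out_dir='images'):
    session = get_session()
    os.makedirs(out_dir, exist_ok=True)
    with open(links_file) as file:
        links = [line.strip() for line in file if line.strip()]
    for link in links:
        # Derive a file name from the URL, dropping any query string
        filename = os.path.join(out_dir, link.split('/')[-1].split('?')[0])
        resp = session.get(link)
        if resp.status_code == 200:
            with open(filename, 'wb') as img:
                img.write(resp.content)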
You need more fields in the headers.
I put in everything Chrome sends and got a result, see:
import requests
url = 'https://www.artstation.com/users/kuvshinov_ilya'
json_url = 'https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1'
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
session = requests.Session()
session.headers = header
r = session.get(url)
if r.status_code == 200:
    json_r = session.get(json_url)
    if json_r.status_code == 200:
        print(json_r.text)
    else:
        print(json_r.status_code)
It makes sense to specify all the header fields, not just User-Agent.
Because, buddy, the AJAX request has different headers.
The server doesn't care who you are, but you're trying to fetch an AJAX endpoint with a non-AJAX client, and that's where the 403 comes from.
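A sketch of that idea: mimic the headers the browser's XHR request actually sends. The header values below are assumptions; check the real request in the browser's Network tab and copy what it sends.
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',  # ask for JSON, as an XHR client would
    'X-Requested-With': 'XMLHttpRequest',           # marks the request as AJAX
    'Referer': 'https://www.artstation.com/users/kuvshinov_ilya',
})
resp = session.get('https://www.artstation.com/users/kuvshinov_ilya/projects.json', params={'page': 1})
print(resp.status_code)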
I was trying to scrape a website for practice, but I kept on getting the HTTP Error 403 (does it think I’m a bot)?
Here is my code:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')
row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)
print(len(row_array))
iterator = []
The error I get is:
File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 479, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 591, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 517, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
HTTP Error 403 is a common error encountered while web scraping using Python 3. It indicates that the server is refusing to fulfill the request made by the client, as the request lacks sufficient authorization or the server considers the request to be invalid. This error can be encountered for a variety of reasons, including the presence of IP blocking, CAPTCHAs, or rate limiting restrictions. In order to resolve the issue, there are several methods that can be implemented, including changing the User Agent, using proxies, and implementing wait time between requests.
Method 1: Changing the User Agent
If you encounter HTTP error 403 while web scraping with Python 3, it means that the server is denying you access to the webpage. One common solution to this problem is to change the user agent of your web scraper. The user agent is a string that identifies the web scraper to the server. By changing the user agent, you can make your web scraper appear as a regular web browser to the server.
Here is an example code that shows how to change the user agent of your web scraper using the requests library:
import requests
url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
In this example, we set the User-Agent header to a string that mimics the user agent of the Google Chrome web browser. You can find the user agent string of your favorite web browser by searching for "my user agent" on Google.
By setting the User-Agent header, we can make our web scraper appear as a regular web browser to the server. This can help us bypass HTTP error 403 and access the webpage we want to scrape.
That’s it! By changing the user agent of your web scraper, you should be able to fix the problem of HTTP error 403 in Python 3 web scraping.
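One quick way to check what the server actually receives is to ask an echo service such as httpbin.org (a small sketch, assuming that service is reachable):
import requests

# Default User-Agent that requests sends (something like python-requests/2.x)
print(requests.get('https://httpbin.org/user-agent').json())

# The same request with a spoofed browser User-Agent
print(requests.get('https://httpbin.org/user-agent',
                   headers={'User-Agent': 'Mozilla/5.0'}).json())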
Method 2: Using Proxies
If you are encountering HTTP error 403 while web scraping with Python 3, it is likely that the website is blocking your IP address due to frequent requests. One way to solve this problem is by using proxies. Proxies allow you to make requests to the website from different IP addresses, making it difficult for the website to block your requests. Here is how you can fix HTTP error 403 in Python 3 web scraping with proxies:
Step 1: Install Required Libraries
You need to install the requests and bs4 libraries to make HTTP requests and parse HTML respectively. You can install them using pip:
pip install requests
pip install bs4
Step 2: Get a List of Proxies
You need to get a list of proxies that you can use to make requests to the website. There are many websites that provide free proxies, such as https://free-proxy-list.net/. You can scrape the website to get a list of proxies:
import requests
from bs4 import BeautifulSoup
url = 'https://free-proxy-list.net/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id': 'proxylisttable'})
rows = table.tbody.find_all('tr')
proxies = []
for row in rows:
    cols = row.find_all('td')
    if cols[6].text == 'yes':
        proxy = cols[0].text + ':' + cols[1].text
        proxies.append(proxy)
This code scrapes the website and gets a list of HTTP proxies that support HTTPS. The proxies are stored in the proxies list.
Step 3: Make Requests with Proxies
You can use the requests library to make requests to the website with a proxy. Here is an example code that makes a request to https://www.example.com with a random proxy from the proxies list:
import random
import requests

url = 'https://www.example.com'
proxy = random.choice(proxies)
response = requests.get(url, proxies={'https': proxy})
if response.status_code == 200:
    print(response.text)
else:
    print('Request failed with status code:', response.status_code)
This code selects a random proxy from the proxies list and makes a request to https://www.example.com with the proxy. If the request is successful, it prints the response text. Otherwise, it prints the status code of the failed request.
Step 4: Handle Exceptions
You need to handle exceptions that may occur while making requests with proxies. Here is an example code that handles exceptions and retries the request with a different proxy:
import requests
import random
from requests.exceptions import ProxyError, ConnectionError, Timeout
url = 'https://www.example.com'
while True:
    proxy = random.choice(proxies)
    try:
        response = requests.get(url, proxies={'https': proxy}, timeout=5)
        if response.status_code == 200:
            print(response.text)
            break
        else:
            print('Request failed with status code:', response.status_code)
    except (ProxyError, ConnectionError, Timeout):
        print('Proxy error. Retrying with a different proxy...')
This code uses a while loop to keep retrying the request with a different proxy until it succeeds. It handles ProxyError, ConnectionError, and Timeout exceptions that may occur while making requests with proxies.
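A possible refinement, not in the original article: drop a proxy from the pool once it fails, so it is not picked again.
import random
import requests
from requests.exceptions import ProxyError, ConnectionError, Timeout

url = 'https://www.example.com'
pool = list(proxies)  # assumes the proxies list built in Step 2
while pool:
    proxy = random.choice(pool)
    try:
        response = requests.get(url, proxies={'https': proxy}, timeout=5)
        if response.status_code == 200:
            print(response.text)
            break
        print('Request failed with status code:', response.status_code)
    except (ProxyError, ConnectionError, Timeout):
        pool.remove(proxy)  # this proxy looks dead, stop using it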
Method 3: Implementing Wait Time between Requests
When you are scraping a website, you might encounter an HTTP error 403, which means that the server is denying your request. This can happen when the server detects that you are sending too many requests in a short period of time, and it wants to protect itself from being overloaded.
One way to fix this problem is to implement wait time between requests. This means that you will wait a certain amount of time before sending the next request, which will give the server time to process the previous request and prevent it from being overloaded.
Here is an example code that shows how to implement wait time between requests using the time module:
import requests
import time
url = 'https://example.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
for i in range(5):
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(1)
In this example, we are sending a request to https://example.com with headers that mimic a browser request. We are then using a for loop to send 5 requests with a 1-second delay between requests using the time.sleep() function.
Another way to implement wait time between requests is to use a random delay. This will make your requests less predictable and less likely to be detected as automated. Here is an example code that shows how to implement a random delay using the random module:
import requests
import random
import time
url = 'https://example.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
for i in range(5):
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(random.randint(1, 5))
In this example, we are sending a request to https://example.com with headers that mimic a browser request. We are then using a for loop to send 5 requests with a random delay between 1 and 5 seconds using the random.randint() function.
By implementing wait time between requests, you can prevent HTTP error 403 and ensure that your web scraping code runs smoothly.
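A related variant, not from the original article: back off exponentially when the server answers 403 or 429 instead of sleeping for a fixed interval.
import time
import requests

url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

delay = 1
for attempt in range(5):
    response = requests.get(url, headers=headers)
    if response.status_code not in (403, 429):
        print(response.status_code)
        break
    # Blocked or rate limited: wait, then retry with a doubled delay
    time.sleep(delay)
    delay *= 2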
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason…
It's definitely blocking because of your use of urllib, based on the user agent. This same thing is happening to me with OfferUp. You can create a new class called AppURLopener which overrides the user-agent with Mozilla.
import urllib.request
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
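FancyURLopener is deprecated in current Python versions; a rough equivalent, offered here as a sketch rather than part of the original answer, uses build_opener:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # override the default Python-urllib agent
response = opener.open('http://httpbin.org/user-agent')
print(response.read().decode('utf-8'))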
"This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected)", as already mentioned by Stefano Sanfilippo.
from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
The web_byte is a byte object returned by the server, and the content type of the page is mostly utf-8, so you need to decode web_byte using the decode method.
This solved the whole problem I was having while trying to scrape a website from PyCharm.
P.S. I use Python 3.4.
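Instead of assuming utf-8, the charset can usually be read from the response headers; a small sketch along the lines of the answer above:
from urllib.request import Request, urlopen

req = Request('https://stackoverflow.com/search?q=html+error+403',
              headers={'User-Agent': 'Mozilla/5.0'})
resp = urlopen(req)
charset = resp.headers.get_content_charset() or 'utf-8'  # fall back if the server does not declare one
webpage = resp.read().decode(charset)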
Based on the previous answer,
from urllib.request import Request, urlopen
#specify url
url = 'https://xyz/xyz'
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
response = urlopen(req, timeout=20).read()
This worked for me by extending the timeout.
Based on previous answers this has worked for me with Python 3.7
from urllib.request import Request, urlopen
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()
print(webpage)
Since the page works in the browser but not when called from a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.
Demonstration:
curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'
...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>
and the content in r.txt has status line:
HTTP/1.1 403 Forbidden
Try sending a 'User-Agent' header that fakes a web client.
NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page or simply use a browser debugger (like Firebug's Net tab) to see which URL you need to call to get the table's content.
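Once that XHR endpoint is found in the debugger's network view, it can be called directly. The URL below is a placeholder, not the real CME endpoint:
import requests

ajax_url = 'https://www.cmegroup.com/path/to/table-data.json'  # hypothetical endpoint spotted in the Net tab
headers = {'User-Agent': 'Mozilla/5.0'}
data = requests.get(ajax_url, headers=headers).json()
print(len(data))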
You can try it in two ways:
1) Via pip
pip install --upgrade certifi
2) If it doesn't work, try running the Install Certificates.command that comes bundled with Python 3.* on macOS (go to your Python installation location and double-click the file):
open /Applications/Python\ 3.*/Install\ Certificates.command
If you feel guilty about faking the user-agent as Mozilla (comment in the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:
import urllib.request as urlrequest

req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
urlrequest.urlopen(req, timeout=10).read()
My application is to test validity by scraping specific links that I refer to, in my articles. Not a generic scraper.
I'll say right away that I rarely use the urllib/urllib3 library. However, I tried the scrapy shell command in the terminal, and also the requests library without any user agent, and got a 200 response.
I noticed that you didn't declare the parser type when creating the soup:
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
Although I find the scrapy parser much more convenient, even though it's heavier, if I remember correctly you should declare the parser type, for example:
soup = BeautifulSoup(resp, "lxml")
Bitto Benny-chan says he managed to get a 200 response with urllib.request, so try his changes; it was just a matter of supplying a full user agent name.
I suggest using the requests library. I think it would be a fairly simple change.
from bs4 import BeautifulSoup
import requests
listoflinks = ['https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/7/?s=band']
getarticles = []
for i in listoflinks:
    resp = requests.get(i)
    soup = BeautifulSoup(resp.content, "lxml")
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
The getarticles list printed this:
'https://www.spectatornews.com/category/showcase/',
'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/',
'https://www.spectatornews.com/category/showcase/',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/?s=band',
'https://www.spectatornews.com/page/2/?s=band',
'https://www.spectatornews.com/page/3/?s=band',
'https://www.spectatornews.com/page/4/?s=band',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/page/7/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com/page/9/?s=band',
'https://www.spectatornews.com/page/127/?s=band',
'https://www.spectatornews.com/page/7/?s=band',
'https://www.spectatornews.com',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'/',
'https://snosites.com/why-sno/',
'http://snosites.com',
'https://www.spectatornews.com/wp-login.php',
'#top',
'/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'/',
'https://www.spectatornews.com/about/',
'https://www.spectatornews.com/about/editorial-policy/',
'https://www.spectatornews.com/about/correction-policy/',
'https://www.spectatornews.com/about/bylaws/',
'https://www.spectatornews.com/advertise/',
'https://www.spectatornews.com/contact/',
'https://www.spectatornews.com/staff/',
'https://www.spectatornews.com/submit-a-letter/',
'https://www.spectatornews.com/submit-a-news-tip/',
'/',
'https://www.spectatornews.com',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'/',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'https://www.spectatornews.com/campus-news/2002/05/09/late-night-bus-service-idea-abandoned-due-to-expense/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/opinion/2002/03/21/yates-deserved-what-she-got-husband-also-to-blame/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/opinion/2001/11/29/air-force-concert-band-inspires-zorn-arena-audience/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/campus-news/2001/10/25/goth-style-bands-will-entertain-at-halloween-costume-concert/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/campus-news/2001/04/19/campus-group-will-host-hemp-event-with-bands-information/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
'https://www.spectatornews.com/staff/?writer=Alanna%20Huggett',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/geekcon/',
'https://www.spectatornews.com/tag/tv10/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
'https://www.spectatornews.com/staff/?writer=Kar%20Wei%20Cheng',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/tag/accessories/',
'https://www.spectatornews.com/tag/fashion/',
'https://www.spectatornews.com/tag/multimedia/',
'https://www.spectatornews.com/tag/winter/',
'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
'https://www.spectatornews.com/staff/?writer=Julia%20Van%20Allen',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/tag/dancing/',
'https://www.spectatornews.com/tag/harry-potter/',
'https://www.spectatornews.com/tag/smom/',
'https://www.spectatornews.com/tag/student-ministry-of-magic/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/tag/yule/',
'https://www.spectatornews.com/tag/yule-ball/',
'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
'https://www.spectatornews.com/staff/?writer=Madeline%20Fuerstenberg',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/tag/1950/',
'https://www.spectatornews.com/tag/1975/',
'https://www.spectatornews.com/tag/2000/',
'https://www.spectatornews.com/tag/articles/',
'https://www.spectatornews.com/tag/spectator/',
'https://www.spectatornews.com/tag/throwback/',
'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
'https://www.spectatornews.com/staff/?writer=Taylor%20Reisdorf',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/altoona/',
'https://www.spectatornews.com/tag/boss-women/',
'https://www.spectatornews.com/tag/business-women/',
'https://www.spectatornews.com/tag/cherish-woodford/',
'https://www.spectatornews.com/tag/crossfit/',
'https://www.spectatornews.com/tag/crossfit-river-prairie/',
'https://www.spectatornews.com/tag/eau-claire/',
'https://www.spectatornews.com/tag/fitness/',
'https://www.spectatornews.com/tag/gym/',
'https://www.spectatornews.com/tag/local/',
'https://www.spectatornews.com/tag/nicole-randall/',
'https://www.spectatornews.com/tag/river-prairie/',
'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
'https://www.spectatornews.com/staff/?writer=Lea%20Kopke',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/bad-art/',
'https://www.spectatornews.com/tag/fmdown/',
'https://www.spectatornews.com/tag/ghosts-of-the-sun/',
'https://www.spectatornews.com/tag/music/',
'https://www.spectatornews.com/tag/pablo-center/',
'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
'https://www.spectatornews.com/staff/?writer=Stephanie%20Janssen',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/tag/satire/',
'https://www.spectatornews.com/tag/sleepy/',
'https://www.spectatornews.com/tag/tator/',
'https://www.spectatornews.com/tag/uw-eau-claire/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/page/6/?s=band',
'https://www.spectatornews.com/?s=band',
'https://www.spectatornews.com/page/2/?s=band',
'https://www.spectatornews.com/page/3/?s=band',
'https://www.spectatornews.com/page/4/?s=band',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/page/6/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com/page/9/?s=band',
'https://www.spectatornews.com/page/10/?s=band',
'https://www.spectatornews.com/page/127/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'/',
'https://snosites.com/why-sno/',
'http://snosites.com',
'https://www.spectatornews.com/wp-login.php',
'#top',
'/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/']
Python is an easy-to-learn yet powerful language, used in a variety of applications ranging from AI and machine learning all the way to something as simple as web scraping bots.
That said, random bugs and glitches are still the order of the day in a Python programmer's life. In this article, we're talking about the "urllib.error.HTTPError: HTTP Error 403: Forbidden" error you can hit when trying to scrape sites with Python, and what you can do to fix the problem.
Why does this happen?
While the error can be triggered by anything from a simple runtime error in the script to server issues on the website, the most likely reason is the presence of some sort of server security feature to prevent bots or spiders from crawling the site. In this case, the security feature might be blocking urllib, a library used to send requests to websites.
How to fix this?
Here are two fixes you can try out.
Work around mod_security or equivalent security features
As mentioned before, server-side security features can cause problems with web scrapers. Try setting your browser agent as follows to see if you can avoid the issue.
from urllib.request import Request, urlopen
req = Request(
url='enter request URL here',
headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()
With a correctly defined browser agent, you should be able to scrape data from most sites that only filter on the user agent.
Set a timeout
If you aren't getting a response, try setting a timeout to prevent the server from mistaking your bot for a DDoS attack and blocking all your requests altogether.
from urllib.request import Request, urlopen

req = Request('enter request URL here', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()
The example above sets a 10-second timeout on the request, so the script gives up instead of hanging indefinitely if the server does not respond.
helhel20 (15.06.2021, 10:41):
Hello, I need to write a parser for a site. I want to get a page's HTML, but the request returns a 403 error; as I understand it, the site has some kind of protection. How do I get around it?
Welemir1 (15.06.2021, 10:50):
helhel20, almost none of those headers are needed (they aren't required). A 403 may mean missing authorization; don't you need to log in on the site first?
helhel20 (15.06.2021, 11:47):
No, no login is needed.
Welemir1 (15.06.2021, 14:06):
helhel20, I'm getting out my crystal ball. So: either you're building the request wrong, or sending it to the wrong place, or adding the wrong headers, or not adding the cookies it needs.
helhel20 (15.06.2021, 17:43):
What other headers might be needed? And which cookies should I add?
Welemir1 (15.06.2021, 17:45):
The ones this particular request needs, of course. Look in the browser at what gets sent and where, and repeat the request exactly as it is there.
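A sketch of what that advice looks like with requests; the header and cookie values below are placeholders to be replaced with whatever the browser's Network tab actually shows:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0',        # copy the browser's real value here
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer': 'https://example.com/',  # placeholder
})
session.cookies.set('session_id', 'value-from-the-browser')  # placeholder cookie
resp = session.get('https://example.com/page')               # placeholder URL
print(resp.status_code)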
helhel20 (15.06.2021, 20:00):
Am I sending the cookies correctly? Something isn't working; I've already tried everything.
Welemir1 (16.06.2021, 12:09):
helhel20, the site isn't simple, it's very heavily protected. If you want to parse its content, I'd recommend switching to Selenium right away; there's a lot of dynamic content and lazy loading here.
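A minimal Selenium sketch along the lines of that advice (assumes the selenium package and a Chrome driver are installed; the URL is a placeholder for the site in question):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL
html = driver.page_source          # fully rendered page, JavaScript included
driver.quit()
print(len(html))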
АмигоСП (16.06.2021, 15:10), marked as the solution:
helhel20, as the respected Welemir1 wrote, the site really isn't simple. If you're only just learning parsing, it will be hard. And one more question while we're at it: do you actually need the main page? It isn't all that informative. Usually people pull the final product information from the individual sections.
helhel20 (16.06.2021, 19:46):
Thanks for the help, Selenium did the trick.