I tried to write a parser to download pictures from artstation.com for myself. I picked a random profile; practically all the content there is loaded via JSON. I found the GET request, and in the browser it opens fine, but through requests.get it returns 403. Everyone on Google advises setting the User-Agent and Cookie headers. I used requests.Session and set a User-Agent, but the picture is still the same. What am I doing wrong?
import requests
url = 'https://www.artstation.com/users/kuvshinov_ilya'
json_url = 'https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1'
header = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0',}
session = requests.Session()
r = session.get(url, headers=header)
json_r = session.get(json_url, headers=header)
print(json_r)
> Response [403]
The 403 is caused by Cloudflare.
cfscrape helped me get around it:
import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'}
    return cfscrape.create_scraper(sess=session)

session = get_session()  # From here on, work with it like a regular requests.Session
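For example, a minimal usage sketch (not part of the original answer; the URL is the one from the question):
resp = session.get('https://www.artstation.com/users/kuvshinov_ilya/projects.json', params={'page': 1})
print(resp.status_code)  # should be 200 once the Cloudflare check passes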
A bit of code for pulling direct links to the high-res images:
import requests
import cfscrape

def get_session():
    session = requests.Session()
    session.headers = {
        'Host': 'www.artstation.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'ru,en-US;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'}
    return cfscrape.create_scraper(sess=session)

def artstation():
    url = 'https://www.artstation.com/kyuyongeom'
    page_url = 'https://www.artstation.com/users/kyuyongeom/projects.json'
    post_pattern = 'https://www.artstation.com/projects/{}.json'
    session = get_session()
    absolute_links = []

    response = session.get(page_url, params={'page': 1}).json()
    pages, modulo = divmod(response['total_count'], 50)  # 50 projects per page
    if modulo:
        pages += 1

    for page in range(1, pages + 1):
        if page != 1:
            response = session.get(page_url, params={'page': page}).json()
        for post in response['data']:
            shortcode = post['permalink'].split('/')[-1]
            inner_resp = session.get(post_pattern.format(shortcode)).json()
            for img in inner_resp['assets']:
                if img['asset_type'] == 'image':
                    absolute_links.append(img['image_url'])

    with open('links.txt', 'w') as file:
        file.write('\n'.join(absolute_links))

if __name__ == '__main__':
    artstation()
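A possible follow-up step, not in the original answer: once links.txt has been written, the same cfscrape session can be reused to download the files. The output directory name and helper below are illustrative.
import os

def download_images(links_file='links.txt', out_dir='images'):
    session = get_session()
    os.makedirs(out_dir, exist_ok=True)
    with open(links_file) as file:
        links = [line.strip() for line in file if line.strip()]
    for link in links:
        # Derive a file name from the URL, dropping any query string
        filename = os.path.join(out_dir, link.split('/')[-1].split('?')[0])
        resp = session.get(link)
        if resp.status_code == 200:
            with open(filename, 'wb') as img:
                img.write(resp.content)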
You need more fields in the headers.
I put in everything Chrome sends and got a result, see:
import requests
url = 'https://www.artstation.com/users/kuvshinov_ilya'
json_url = 'https://www.artstation.com/users/kuvshinov_ilya/projects.json?page=1'
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
session = requests.Session()
session.headers = header
r = session.get(url)
if r.status_code == 200:
    json_r = session.get(json_url)
    if json_r.status_code == 200:
        print(json_r.text)
    else:
        print(json_r.status_code)
It makes sense to specify all the header fields, not just User-Agent.
Because, buddy, the AJAX request has different headers.
The server doesn't care who you are, but you're trying to fetch an AJAX endpoint with a non-AJAX client, and that's where the 403 comes from.
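A sketch of that idea: mimic the headers the browser's XHR request actually sends. The header values below are assumptions; check the real request in the browser's Network tab and copy what it sends.
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',  # ask for JSON, as an XHR client would
    'X-Requested-With': 'XMLHttpRequest',           # marks the request as AJAX
    'Referer': 'https://www.artstation.com/users/kuvshinov_ilya',
})
resp = session.get('https://www.artstation.com/users/kuvshinov_ilya/projects.json', params={'page': 1})
print(resp.status_code)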
I was trying to scrape a website for practice, but I kept on getting the HTTP Error 403 (does it think I’m a bot)?
Here is my code:
#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re
webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')
row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)
print(len(row_array))
iterator = []
The error I get is:
File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\lib\urllib\request.py", line 479, in open
response = meth(req, response)
File "C:\Python33\lib\urllib\request.py", line 591, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python33\lib\urllib\request.py", line 517, in error
return self._call_chain(*args)
File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
result = func(*args)
File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
HTTP Error 403 is a common error encountered while web scraping using Python 3. It indicates that the server is refusing to fulfill the request made by the client, as the request lacks sufficient authorization or the server considers the request to be invalid. This error can be encountered for a variety of reasons, including the presence of IP blocking, CAPTCHAs, or rate limiting restrictions. In order to resolve the issue, there are several methods that can be implemented, including changing the User Agent, using proxies, and implementing wait time between requests.
Method 1: Changing the User Agent
If you encounter HTTP error 403 while web scraping with Python 3, it means that the server is denying you access to the webpage. One common solution to this problem is to change the user agent of your web scraper. The user agent is a string that identifies the web scraper to the server. By changing the user agent, you can make your web scraper appear as a regular web browser to the server.
Here is an example code that shows how to change the user agent of your web scraper using the requests library:
import requests
url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
In this example, we set the User-Agent header to a string that mimics the user agent of the Google Chrome web browser. You can find the user agent string of your favorite web browser by searching for "my user agent" on Google.
By setting the User-Agent header, we can make our web scraper appear as a regular web browser to the server. This can help us bypass HTTP error 403 and access the webpage we want to scrape.
That’s it! By changing the user agent of your web scraper, you should be able to fix the problem of HTTP error 403 in Python 3 web scraping.
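One quick way to check what the server actually receives is to ask an echo service such as httpbin.org (a small sketch, assuming that service is reachable):
import requests

# Default User-Agent that requests sends (something like python-requests/2.x)
print(requests.get('https://httpbin.org/user-agent').json())

# The same request with a spoofed browser User-Agent
print(requests.get('https://httpbin.org/user-agent',
                   headers={'User-Agent': 'Mozilla/5.0'}).json())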
Method 2: Using Proxies
If you are encountering HTTP error 403 while web scraping with Python 3, it is likely that the website is blocking your IP address due to frequent requests. One way to solve this problem is by using proxies. Proxies allow you to make requests to the website from different IP addresses, making it difficult for the website to block your requests. Here is how you can fix HTTP error 403 in Python 3 web scraping with proxies:
Step 1: Install Required Libraries
You need to install the requests and bs4 libraries to make HTTP requests and parse HTML respectively. You can install them using pip:
pip install requests
pip install bs4
Step 2: Get a List of Proxies
You need to get a list of proxies that you can use to make requests to the website. There are many websites that provide free proxies, such as https://free-proxy-list.net/. You can scrape the website to get a list of proxies:
import requests
from bs4 import BeautifulSoup
url = 'https://free-proxy-list.net/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table', {'id': 'proxylisttable'})
rows = table.tbody.find_all('tr')
proxies = []
for row in rows:
    cols = row.find_all('td')
    if cols[6].text == 'yes':
        proxy = cols[0].text + ':' + cols[1].text
        proxies.append(proxy)
This code scrapes the website and gets a list of HTTP proxies that support HTTPS. The proxies are stored in the proxies list.
Step 3: Make Requests with Proxies
You can use the requests library to make requests to the website with a proxy. Here is an example code that makes a request to https://www.example.com with a random proxy from the proxies list:
import random
import requests

url = 'https://www.example.com'
proxy = random.choice(proxies)
response = requests.get(url, proxies={'https': proxy})
if response.status_code == 200:
    print(response.text)
else:
    print('Request failed with status code:', response.status_code)
This code selects a random proxy from the proxies list and makes a request to https://www.example.com with the proxy. If the request is successful, it prints the response text. Otherwise, it prints the status code of the failed request.
Step 4: Handle Exceptions
You need to handle exceptions that may occur while making requests with proxies. Here is an example code that handles exceptions and retries the request with a different proxy:
import requests
import random
from requests.exceptions import ProxyError, ConnectionError, Timeout
url = 'https://www.example.com'
while True:
    proxy = random.choice(proxies)
    try:
        response = requests.get(url, proxies={'https': proxy}, timeout=5)
        if response.status_code == 200:
            print(response.text)
            break
        else:
            print('Request failed with status code:', response.status_code)
    except (ProxyError, ConnectionError, Timeout):
        print('Proxy error. Retrying with a different proxy...')
This code uses a while loop to keep retrying the request with a different proxy until it succeeds. It handles ProxyError, ConnectionError, and Timeout exceptions that may occur while making requests with proxies.
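A possible refinement, not in the original article: drop a proxy from the pool once it fails, so it is not picked again.
import random
import requests
from requests.exceptions import ProxyError, ConnectionError, Timeout

url = 'https://www.example.com'
pool = list(proxies)  # assumes the proxies list built in Step 2
while pool:
    proxy = random.choice(pool)
    try:
        response = requests.get(url, proxies={'https': proxy}, timeout=5)
        if response.status_code == 200:
            print(response.text)
            break
        print('Request failed with status code:', response.status_code)
    except (ProxyError, ConnectionError, Timeout):
        pool.remove(proxy)  # this proxy looks dead, stop using it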
Method 3: Implementing Wait Time between Requests
When you are scraping a website, you might encounter an HTTP error 403, which means that the server is denying your request. This can happen when the server detects that you are sending too many requests in a short period of time, and it wants to protect itself from being overloaded.
One way to fix this problem is to implement wait time between requests. This means that you will wait a certain amount of time before sending the next request, which will give the server time to process the previous request and prevent it from being overloaded.
Here is an example code that shows how to implement wait time between requests using the time module:
import requests
import time
url = 'https://example.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
for i in range(5):
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(1)
In this example, we are sending a request to https://example.com with headers that mimic a browser request. We are then using a for loop to send 5 requests with a 1-second delay between requests using the time.sleep() function.
Another way to implement wait time between requests is to use a random delay. This will make your requests less predictable and less likely to be detected as automated. Here is an example code that shows how to implement a random delay using the random module:
import requests
import random
import time
url = 'https://example.com'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
for i in range(5):
    response = requests.get(url, headers=headers)
    print(response.status_code)
    time.sleep(random.randint(1, 5))
In this example, we are sending a request to https://example.com with headers that mimic a browser request. We are then using a for loop to send 5 requests with a random delay between 1 and 5 seconds using the random.randint() function.
By implementing wait time between requests, you can prevent HTTP error 403 and ensure that your web scraping code runs smoothly.
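A related variant, not from the original article: back off exponentially when the server answers 403 or 429 instead of sleeping for a fixed interval.
import time
import requests

url = 'https://example.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}

delay = 1
for attempt in range(5):
    response = requests.get(url, headers=headers)
    if response.status_code not in (403, 429):
        print(response.status_code)
        break
    # Blocked or rate limited: wait, then retry with a doubled delay
    time.sleep(delay)
    delay *= 2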
This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected). Try setting a known browser user agent with:
from urllib.request import Request, urlopen
req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
This works for me.
By the way, in your code you are missing the () after .read in the urlopen line, but I think that it's a typo.
TIP: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason…
It's definitely blocking because of your use of urllib, based on the user agent. This same thing is happening to me with OfferUp. You can create a new class called AppURLopener which overrides the user-agent with Mozilla.
import urllib.request
class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

opener = AppURLopener()
response = opener.open('http://httpbin.org/user-agent')
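FancyURLopener is deprecated in current Python versions; a rough equivalent, offered here as a sketch rather than part of the original answer, uses build_opener:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]  # override the default Python-urllib agent
response = opener.open('http://httpbin.org/user-agent')
print(response.read().decode('utf-8'))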
"This is probably because of mod_security or some similar server security feature which blocks known spider/bot user agents (urllib uses something like python urllib/3.3.0, it's easily detected)", as already mentioned by Stefano Sanfilippo.
from urllib.request import Request, urlopen
url="https://stackoverflow.com/search?q=html+error+403"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
The web_byte is a byte object returned by the server, and the content type of the page is mostly utf-8, so you need to decode web_byte using the decode method.
This solved the whole problem I was having while trying to scrape a website from PyCharm.
P.S. I use Python 3.4.
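Instead of assuming utf-8, the charset can usually be read from the response headers; a small sketch along the lines of the answer above:
from urllib.request import Request, urlopen

req = Request('https://stackoverflow.com/search?q=html+error+403',
              headers={'User-Agent': 'Mozilla/5.0'})
resp = urlopen(req)
charset = resp.headers.get_content_charset() or 'utf-8'  # fall back if the server does not declare one
webpage = resp.read().decode(charset)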
Based on the previous answer,
from urllib.request import Request, urlopen
#specify url
url = 'https://xyz/xyz'
req = Request(url, headers={'User-Agent': 'XYZ/3.0'})
response = urlopen(req, timeout=20).read()
This worked for me by extending the timeout.
Based on previous answers this has worked for me with Python 3.7
from urllib.request import Request, urlopen
req = Request('Url_Link', headers={'User-Agent': 'XYZ/3.0'})
webpage = urlopen(req, timeout=10).read()
print(webpage)
Since the page works in the browser but not when called from a Python program, it seems that the web app serving that URL recognizes that the content is not being requested by a browser.
Demonstration:
curl --dump-header r.txt 'http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1'
...
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>
You don't have permission to access ...
</HTML>
and the content in r.txt has status line:
HTTP/1.1 403 Forbidden
Try sending a 'User-Agent' header that fakes a web client.
NOTE: The page contains an Ajax call that creates the table you probably want to parse. You'll need to check the JavaScript logic of the page or simply use a browser debugger (like Firebug's Net tab) to see which URL you need to call to get the table's content.
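Once that XHR endpoint is found in the debugger's network view, it can be called directly. The URL below is a placeholder, not the real CME endpoint:
import requests

ajax_url = 'https://www.cmegroup.com/path/to/table-data.json'  # hypothetical endpoint spotted in the Net tab
headers = {'User-Agent': 'Mozilla/5.0'}
data = requests.get(ajax_url, headers=headers).json()
print(len(data))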
You can try it in two ways:
1) Via pip
pip install --upgrade certifi
2) If it doesn't work, try running the Install Certificates.command that comes bundled with Python 3.* on macOS (go to your Python installation location and double-click the file):
open /Applications/Python\ 3.*/Install\ Certificates.command
If you feel guilty about faking the user-agent as Mozilla (comment in the top answer from Stefano), it could work with a non-urllib User-Agent as well. This worked for the sites I reference:
import urllib.request as urlrequest

req = urlrequest.Request(link, headers={'User-Agent': 'XYZ/3.0'})
urlrequest.urlopen(req, timeout=10).read()
My application is to test validity by scraping specific links that I refer to, in my articles. Not a generic scraper.
I'll say right away that I rarely use the urllib/urllib3 library. However, I tried the scrapy shell command in the terminal, and also the requests library without any user agent, and got a 200 response.
I noticed that you didn't declare the parser type when creating the soup:
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
Although I find the scrapy parser much more convenient, even though it's heavier, if I remember correctly you should declare the parser type, for example:
soup = BeautifulSoup(resp, "lxml")
Bitto Benny-chan says he managed to get a 200 response with urllib.request, so try his changes; it was just a matter of supplying a full user agent name.
I suggest using the requests library. I think it would be a fairly simple change.
from bs4 import BeautifulSoup
import requests
listoflinks = ['https://www.spectatornews.com/page/6/?s=band', 'https://www.spectatornews.com/page/7/?s=band']
getarticles = []
for i in listoflinks:
    resp = requests.get(i)
    soup = BeautifulSoup(resp.content, "lxml")
    for link in soup.find_all('a', href=True):
        getarticles.append(link['href'])
The getarticles list printed this:
'https://www.spectatornews.com/category/showcase/',
'https://www.spectatornews.com/showcase/2003/02/06/minneapolis-band-trips-into-eau-claire/',
'https://www.spectatornews.com/category/showcase/',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/?s=band',
'https://www.spectatornews.com/page/2/?s=band',
'https://www.spectatornews.com/page/3/?s=band',
'https://www.spectatornews.com/page/4/?s=band',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/page/7/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com/page/9/?s=band',
'https://www.spectatornews.com/page/127/?s=band',
'https://www.spectatornews.com/page/7/?s=band',
'https://www.spectatornews.com',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'/',
'https://snosites.com/why-sno/',
'http://snosites.com',
'https://www.spectatornews.com/wp-login.php',
'#top',
'/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/ads/banner-advertise-with-the-spectator/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'/',
'https://www.spectatornews.com/about/',
'https://www.spectatornews.com/about/editorial-policy/',
'https://www.spectatornews.com/about/correction-policy/',
'https://www.spectatornews.com/about/bylaws/',
'https://www.spectatornews.com/advertise/',
'https://www.spectatornews.com/contact/',
'https://www.spectatornews.com/staff/',
'https://www.spectatornews.com/submit-a-letter/',
'https://www.spectatornews.com/submit-a-news-tip/',
'/',
'https://www.spectatornews.com',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/',
'/',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'https://www.spectatornews.com/campus-news/2002/05/09/late-night-bus-service-idea-abandoned-due-to-expense/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/opinion/2002/03/21/yates-deserved-what-she-got-husband-also-to-blame/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/opinion/2001/11/29/air-force-concert-band-inspires-zorn-arena-audience/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/campus-news/2001/10/25/goth-style-bands-will-entertain-at-halloween-costume-concert/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/campus-news/2001/04/19/campus-group-will-host-hemp-event-with-bands-information/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
'https://www.spectatornews.com/currents/2018/12/10/geekin-out/',
'https://www.spectatornews.com/staff/?writer=Alanna%20Huggett',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/geekcon/',
'https://www.spectatornews.com/tag/tv10/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
'https://www.spectatornews.com/opinion/2018/12/07/keeping-up-with-the-kar-fashions-11/',
'https://www.spectatornews.com/staff/?writer=Kar%20Wei%20Cheng',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/tag/accessories/',
'https://www.spectatornews.com/tag/fashion/',
'https://www.spectatornews.com/tag/multimedia/',
'https://www.spectatornews.com/tag/winter/',
'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
'https://www.spectatornews.com/multimedia-2/2018/12/07/a-magical-night/',
'https://www.spectatornews.com/staff/?writer=Julia%20Van%20Allen',
'https://www.spectatornews.com/category/multimedia-2/',
'https://www.spectatornews.com/tag/dancing/',
'https://www.spectatornews.com/tag/harry-potter/',
'https://www.spectatornews.com/tag/smom/',
'https://www.spectatornews.com/tag/student-ministry-of-magic/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/tag/yule/',
'https://www.spectatornews.com/tag/yule-ball/',
'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
'https://www.spectatornews.com/campus-news/2018/11/26/old-news-5/',
'https://www.spectatornews.com/staff/?writer=Madeline%20Fuerstenberg',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/tag/1950/',
'https://www.spectatornews.com/tag/1975/',
'https://www.spectatornews.com/tag/2000/',
'https://www.spectatornews.com/tag/articles/',
'https://www.spectatornews.com/tag/spectator/',
'https://www.spectatornews.com/tag/throwback/',
'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
'https://www.spectatornews.com/currents/2018/11/21/boss-women-highlighting-businesswomen-in-eau-claire-6/',
'https://www.spectatornews.com/staff/?writer=Taylor%20Reisdorf',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/altoona/',
'https://www.spectatornews.com/tag/boss-women/',
'https://www.spectatornews.com/tag/business-women/',
'https://www.spectatornews.com/tag/cherish-woodford/',
'https://www.spectatornews.com/tag/crossfit/',
'https://www.spectatornews.com/tag/crossfit-river-prairie/',
'https://www.spectatornews.com/tag/eau-claire/',
'https://www.spectatornews.com/tag/fitness/',
'https://www.spectatornews.com/tag/gym/',
'https://www.spectatornews.com/tag/local/',
'https://www.spectatornews.com/tag/nicole-randall/',
'https://www.spectatornews.com/tag/river-prairie/',
'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
'https://www.spectatornews.com/currents/2018/11/20/bad-art-good-music/',
'https://www.spectatornews.com/staff/?writer=Lea%20Kopke',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/tag/bad-art/',
'https://www.spectatornews.com/tag/fmdown/',
'https://www.spectatornews.com/tag/ghosts-of-the-sun/',
'https://www.spectatornews.com/tag/music/',
'https://www.spectatornews.com/tag/pablo-center/',
'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
'https://www.spectatornews.com/opinion/2018/11/14/the-tator-21/',
'https://www.spectatornews.com/staff/?writer=Stephanie%20Janssen',
'https://www.spectatornews.com/category/column-2/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/tag/satire/',
'https://www.spectatornews.com/tag/sleepy/',
'https://www.spectatornews.com/tag/tator/',
'https://www.spectatornews.com/tag/uw-eau-claire/',
'https://www.spectatornews.com/tag/uwec/',
'https://www.spectatornews.com/page/6/?s=band',
'https://www.spectatornews.com/?s=band',
'https://www.spectatornews.com/page/2/?s=band',
'https://www.spectatornews.com/page/3/?s=band',
'https://www.spectatornews.com/page/4/?s=band',
'https://www.spectatornews.com/page/5/?s=band',
'https://www.spectatornews.com/page/6/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com/page/9/?s=band',
'https://www.spectatornews.com/page/10/?s=band',
'https://www.spectatornews.com/page/127/?s=band',
'https://www.spectatornews.com/page/8/?s=band',
'https://www.spectatornews.com',
'https://www.spectatornews.com/feed/rss/',
'#',
'https://www.youtube.com/channel/UC1SM8q3lk_fQS1KuY77bDgQ',
'https://www.snapchat.com/add/spectator news',
'https://www.instagram.com/spectatornews/',
'http://twitter.com/spectatornews',
'http://facebook.com/spectatornews',
'/',
'https://snosites.com/why-sno/',
'http://snosites.com',
'https://www.spectatornews.com/wp-login.php',
'#top',
'/',
'https://www.spectatornews.com/category/campus-news/',
'https://www.spectatornews.com/category/currents/',
'https://www.spectatornews.com/category/sports/',
'https://www.spectatornews.com/category/opinion/',
'https://www.spectatornews.com/category/multimedia-2/']
Python is an easy-to-learn yet powerful language, used in a variety of applications ranging from AI and machine learning all the way to something as simple as web scraping bots.
That said, random bugs and glitches are still the order of the day in a Python programmer's life. In this article, we're talking about the "urllib.error.HTTPError: HTTP Error 403: Forbidden" error you can hit when trying to scrape sites with Python, and what you can do to fix the problem.
Why does this happen?
While the error can be triggered by anything from a simple runtime error in the script to server issues on the website, the most likely reason is the presence of some sort of server security feature to prevent bots or spiders from crawling the site. In this case, the security feature might be blocking urllib, a library used to send requests to websites.
How to fix this?
Here are two fixes you can try out.
Work around mod_security or equivalent security features
As mentioned before, server-side security features can cause problems with web scrapers. Try setting your browser agent as follows to see if you can avoid the issue.
from urllib.request import Request, urlopen
req = Request(
url='enter request URL here',
headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()
With a correctly defined browser agent, you should be able to scrape data from most sites that only filter on the user agent.
Set a timeout
If you aren't getting a response, try setting a timeout to prevent the server from mistaking your bot for a DDoS attack and blocking all your requests altogether.
from urllib.request import Request, urlopen

req = Request('enter request URL here', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req, timeout=10).read()
The example above sets a 10-second timeout on the request, so the script gives up instead of hanging indefinitely if the server does not respond.
helhel20 (15.06.2021, 10:41):
Hello, I need to write a parser for a site. I want to get a page's HTML, but the request returns a 403 error; as I understand it, the site has some kind of protection. How do I get around it?
Welemir1 (15.06.2021, 10:50):
helhel20, almost none of those headers are needed (they aren't required). A 403 may mean missing authorization; don't you need to log in on the site first?
helhel20 (15.06.2021, 11:47):
No, no login is needed.
Welemir1 (15.06.2021, 14:06):
helhel20, I'm getting out my crystal ball. So: either you're building the request wrong, or sending it to the wrong place, or adding the wrong headers, or not adding the cookies it needs.
helhel20 (15.06.2021, 17:43):
What other headers might be needed? And which cookies should I add?
Welemir1 (15.06.2021, 17:45):
The ones this particular request needs, of course. Look in the browser at what gets sent and where, and repeat the request exactly as it is there.
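A sketch of what that advice looks like with requests; the header and cookie values below are placeholders to be replaced with whatever the browser's Network tab actually shows:
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0',        # copy the browser's real value here
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Referer': 'https://example.com/',  # placeholder
})
session.cookies.set('session_id', 'value-from-the-browser')  # placeholder cookie
resp = session.get('https://example.com/page')               # placeholder URL
print(resp.status_code)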
helhel20 (15.06.2021, 20:00):
Am I sending the cookies correctly? Something isn't working; I've already tried everything.
Welemir1 (16.06.2021, 12:09):
helhel20, the site isn't simple, it's very heavily protected. If you want to parse its content, I'd recommend switching to Selenium right away; there's a lot of dynamic content and lazy loading here.
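A minimal Selenium sketch along the lines of that advice (assumes the selenium package and a Chrome driver are installed; the URL is a placeholder for the site in question):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL
html = driver.page_source          # fully rendered page, JavaScript included
driver.quit()
print(len(html))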
АмигоСП (16.06.2021, 15:10), marked as the solution:
helhel20, as the respected Welemir1 wrote, the site really isn't simple. If you're only just learning parsing, it will be hard. And one more question while we're at it: do you actually need the main page? It isn't all that informative. Usually people pull the final product information from the individual sections.
helhel20 (16.06.2021, 19:46):
Thanks for the help, Selenium did the trick.