    soup = BeautifulSoup(urls.text, "html5lib")
    content = soup.find("div", {"class": "tt_article_useless_p_margin"})
    images = content.findAll('img')
    for img in images:
        img_url = img['src'] + "?original"
        print(img_url, file=im_link)

def get_links():
    count = 1
    for line in tw_link:
        print(line, count)
        count += 1
        get_images(line)

get_links()
What I have tried:
The code seems to work fine when using a single link, but when I pass the URLs to the function I'm getting the following error:

AttributeError                            Traceback (most recent call last)
in ()
     23         count+=1
     24         get_images(line)
---> 25 get_links()

1 frames
in get_links()
     22         print(line,count)
     23         count+=1
---> 24         get_images(line)
     25 get_links()

in get_images(urli)
     12     print(soup.prettify())
     13     content = soup.find("div", {"class": "tt_article_useless_p_margin"})
---> 14     images = content.findAll('img')
     15     for img in images:
     16         img_url = img['src']+"?original"

AttributeError: 'NoneType' object has no attribute 'findAll'
My guess is that I'm triggering some sort of bot detection, because the page loaded here is different from the one loaded when I pass a single link. Is there any way to bypass that? I've tried using time.sleep(5), but that didn't work either.
import time

import requests
from bs4 import BeautifulSoup

tw_link = open("TW_Links.txt", "r", encoding='utf-8')
im_link = open("DCDN_Links.txt", "w+")
kak_link = open("KCDN_Links.txt", "w+")


def get_images(urlset):
    for x in urlset:
        rs = requests.Session()
        urls = rs.get(x)
        soup = BeautifulSoup(urls.text, "html5lib")
        content = soup.find("div", {"class": "tt_article_useless_p_margin"})
        images = content.findAll('img')
        for img in images:
            img_url = img['src'] + "?original"
            if "blog" in img_url:
                print(img_url, file=kak_link)
                print(img_url)
            print(img_url, file=im_link)
            print(img_url)
        # Pause between pages.
        time.sleep(2)


def get_links():
    count = 1
    linklist = []
    for line in tw_link:
        # Strip the trailing newline so it is not sent as part of the URL.
        line = line.replace("\n", "")
        linklist.append(line)
    get_images(linklist)


get_links()
For those waiting for a solution, it was pretty simple. I was doubtful of the requests module, so I intercepted the program's traffic through a proxy, and voila: it turned out the request also included the EOL symbol read from each line of the file. That might have worked with most sites, but this particular site redirected to the 404 page, so simply removing "\n" from the lines read did the trick.
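As a rough illustration of the fix described above, the sketch below shows that iterating over a file yields lines that still end in "\n", and one way to clean them before making requests. The helper name read_urls and the example.com URLs are placeholders made up for this example, and io.StringIO stands in for TW_Links.txt so the snippet runs without any files or network access.

```python
import io


def read_urls(handle):
    """Return the non-empty lines of a file-like object, stripped of surrounding whitespace."""
    urls = []
    for line in handle:
        url = line.strip()  # removes the trailing "\n" (and "\r" from Windows line endings)
        if url:  # skip blank lines
            urls.append(url)
    return urls


# Stand-in for TW_Links.txt; these URLs are hypothetical placeholders.
sample = io.StringIO("https://example.com/a\nhttps://example.com/b\n\n")

raw = list(sample)  # what a plain "for line in file" loop sees
sample.seek(0)
cleaned = read_urls(sample)

print(repr(raw[0]))  # repr() makes the trailing newline visible
print(cleaned)
```

Using str.strip() instead of replace("\n", "") is a design choice here, not part of the original answer: it also drops stray spaces and "\r", which can appear if the link file was saved on Windows. Printing repr(line) is a quick way to spot this kind of invisible character without setting up a proxy.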