I recently wrote a post about using BeautifulSoup and urllib2 to scrape HTML off webpages and parse it into useful text. The only issue was that it was easy to get banned while using it.
This modification to the code does not make you ban-proof, and the same warning applies.
from bs4 import BeautifulSoup
import urllib2
import random
import time

# URLs to be scraped
urls = ["http://www.hunterthornsberry.com", "http://huntert.me/"]

# User agents to rotate through
user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
]

for url in urls:
    # Pick a fresh random user agent for every request
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', random.choice(user_agents))]
    response = opener.open(url)
    the_page = response.read()
    soup = BeautifulSoup(the_page, "html.parser")

    # Search criteria (an HTML tag). Example: <p>, <body>, <h1>, etc.
    # Print the text of every matching tag
    for tag in soup.find_all("body"):
        print tag.text
    print "--End--"

    # Wait a random 1-30 seconds before the next request
    time.sleep(random.randint(1, 30))
What I've done here is take a list of common user agents and randomly select one to be passed with each HTTP request. This makes our requests look as if they are coming from different browsers. On top of that, I've added a random wait period (1-30 seconds) after each request.
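For anyone on Python 3, where urllib2 no longer exists, the same two ideas (a random user agent per request and a random pause after each one) can be sketched roughly as follows. This is only a sketch under some assumptions: it uses `urllib.request` in place of urllib2, the helper names `build_request` and `fetch_all` are made up for illustration, and the user-agent list is shortened to two entries from the list above. It returns the raw page bytes and leaves the BeautifulSoup parsing out.

```python
import random
import time
import urllib.request

# Shortened list for illustration; reuse the full list from the post in practice
USER_AGENTS = [
    'Opera/9.25 (Windows NT 5.1; U; en)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
]


def build_request(url, agents=USER_AGENTS):
    """Hypothetical helper: build a request carrying a randomly chosen user agent."""
    return urllib.request.Request(url, headers={'User-Agent': random.choice(agents)})


def fetch_all(urls, min_wait=1, max_wait=30,
              opener=urllib.request.urlopen, sleep=time.sleep):
    """Hypothetical helper: fetch each URL, pausing a random time after each request.

    The opener and sleep parameters exist only so the function can be
    exercised without touching the network or actually waiting.
    """
    pages = []
    for url in urls:
        # A new user agent is picked for every request
        with opener(build_request(url)) as response:
            pages.append(response.read())
        # Random pause (in seconds) after each request, as in the post
        sleep(random.randint(min_wait, max_wait))
    return pages
```

Calling `fetch_all(["http://huntert.me/"])` would return a list with the raw bytes of that page, which can then be handed to BeautifulSoup as before. As with the original script, this does not make you ban-proof.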