I recently saw a program that gathered the 10 most common words on a webpage and displayed them in a window along with their word count. I decided to build my own using Python and some code I had written before to scrape data from webpages.
    # Gives a list of the most common words
    # Hunter Thornsberry - email@example.com

    from BeautifulSoup import BeautifulSoup
    import urllib2
    import random
    import time

    # limit on the number of top words we want to know the count of
    limit = 10

    # random integer to select user agent
    randomint = random.randint(0, 7)

    # random integer to select sleep time
    randomtime = random.randint(1, 30)

    # urls to be scraped
    urls = ["http://raw.adventuresintechland.com/freedom.html"]

    # user agents
    user_agents = [
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:18.104.22.168) Gecko/20071127 Firefox/22.214.171.124',
        'Opera/9.25 (Windows NT 5.1; U; en)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
        'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0.1) Gecko/20100101 Firefox/8.0.1',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19'
    ]

    # all scraped text is collected in one string so it can be split later
    words = ""

    index = 0
    while len(urls) > index:
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', user_agents[randomint])]
        response = opener.open(urls[index])
        the_page = response.read()
        soup = BeautifulSoup(the_page)

        # Search criteria (is an html tag). Example <p>, <body>, <h1>, etc.
        text = soup.findAll("body")

        # Runs until it hits an index out of range error and breaks,
        # so every matching tag is collected
        while True:
            try:
                i = 0
                while True:
                    # print text[i].text
                    words += text[i].text + " "
                    i = i + 1
            except IndexError:
                print "--End--"
                break
        index = index + 1

    # split the text into lower-case words
    words = words.split(" ")
    words = [element.lower() for element in words]

    # build one "count word" string per unique word
    sort = []
    for word in set(words):
        sort.append(str(words.count(word)) + " " + word)

    # print the entries in reverse sorted order until x reaches limit
    x = 0
    for item in sorted(sort, reverse=True):
        print item
        if x == limit:
            break
        x = x + 1
This code comes in two parts. The first part gets the data from the webpage; I've got a whole blog post dedicated just to that.
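For readers on Python 3, where urllib2 and the old BeautifulSoup import are not available, here is a rough sketch of that first part using only the standard library. It is not the code from this post: `fetch`, `BodyTextParser`, `body_text`, and the sample HTML are names and data I made up for illustration, and `fetch` is defined but not called so the example runs without a network connection.

```python
# Python 3 sketch (assumption: standard library only, no BeautifulSoup).
import urllib.request
from html.parser import HTMLParser


def fetch(url, user_agent):
    """Download a page while sending the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


class BodyTextParser(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside <body> and outside script/style.
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text(html):
    """Return the body text of an HTML string, joined with single spaces."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(chunk.strip() for chunk in parser.chunks if chunk.strip())


# Hypothetical sample page, used instead of a live request.
sample_html = (
    "<html><head><title>T</title></head>"
    "<body><h1>Free Code</h1><p>Programmers write code.</p></body></html>"
)
print(body_text(sample_html))
```

With a live connection, something like `body_text(fetch(url, user_agent))` would produce the string that the second part goes on to split and count.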
This is the second part of the code:
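    # split the text into lower-case words
    words = words.split(" ")
    words = [element.lower() for element in words]

    # build one "count word" string per unique word
    sort = []
    for word in set(words):
        sort.append(str(words.count(word)) + " " + word)

    # print the entries in reverse sorted order until x reaches limit
    x = 0
    for item in sorted(sort, reverse=True):
        print item
        if x == limit:
            break
        x = x + 1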
Here I use .split(" ") to break the text into words, then lowercase every word to get a true count, since "The" and "the" would otherwise be counted as two different words. Next, the first for loop iterates over set(words), which holds the unique words, and for each one appends a string made of the number of times that word appears in the words list, a space, and the word itself.
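As a small illustration of the splitting and counting step, here is a Python 3 sketch that swaps in collections.Counter for the set-and-count loop. The function names `tokenize` and `count_words` and the sample sentence are mine, not part of the original script. It also uses .split() with no argument, which splits on any run of whitespace, so double spaces and newlines do not produce empty or fused words the way .split(" ") can.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on any run of whitespace."""
    return text.lower().split()


def count_words(text):
    """Return a Counter mapping each word to the number of times it appears."""
    return Counter(tokenize(text))


# Hypothetical sample text with a double space and a newline in it.
sample = "The code is new.  the CODE works\nand the tests pass"
counts = count_words(sample)

print(counts["the"])   # "The" and "the" are counted together
print(counts["code"])  # so are "code" and "CODE"

# For comparison: splitting on a single space leaves an empty string
# and keeps "works\nand" glued together as one word.
print(sample.lower().split(" "))
```

Punctuation is still attached to words here ("new." is not the same word as "new"), just as in the original approach; stripping it would be a separate step.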
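The second for loop sorts the list and prints the results. Notice that sorted() is not a function I defined; it is built into Python. Passing reverse=True puts the entries in descending order so the highest counts come first. Because each entry is a string, though, the comparison is lexicographic rather than numeric: an entry beginning with "9" sorts above one beginning with "10". The limit check also runs after the print, so the loop prints limit + 1 entries.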
    --End--
    9 programmers
    9 other
    9 one
    9 new
    9 few
    9 code
    8 when
    8 says
    8 print
    8 first
    8 didn't
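To rank by the integer count instead of by a "count word" string, one option is to sort (word, count) pairs with a key function and slice off the first few. The sketch below is Python 3 and uses made-up counts; `top_words` is a hypothetical helper, not something from the script above.

```python
def top_words(counts, limit=10):
    """Return up to `limit` (word, count) pairs, highest count first."""
    # Sort by count descending, then alphabetically so ties have a fixed order.
    ordered = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ordered[:limit]


# Strings are compared character by character, so "9 ..." lands above "10 ...".
as_strings = sorted(["10 the", "9 code", "2 new"], reverse=True)
print(as_strings)

# Made-up counts for illustration only.
counts = {"the": 10, "code": 9, "new": 2, "few": 9}
for word, count in top_words(counts, limit=3):
    print(count, word)
```

Slicing with `[:limit]` returns at most `limit` entries, which avoids keeping a separate counter variable in the print loop. If the counts are already in a collections.Counter, its most_common(limit) method does a similar job.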