from bs4 import BeautifulSoup
from urllib2 import urlopen
from time import sleep # be nice

BASE_URL = "http://www.chicagoreader.com"

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html, "lxml")

def get_category_links(section_url):
    soup = make_soup(section_url)
    boccat = soup.find("dl", "boccat")
    category_links = [BASE_URL + dd.a["href"] for dd in boccat.findAll("dd")]
    return category_links

def get_category_winner(category_url):
    soup = make_soup(category_url)
    category = soup.find("h1", "headline").string
    winner = [h2.string for h2 in soup.findAll("h2", "boc1")]
    runners_up = [h2.string for h2 in soup.findAll("h2", "boc2")]
    return {"category": category,
            "category_url": category_url,
            "winner": winner,
            "runners_up": runners_up}

if __name__ == '__main__':
    food_n_drink = ("http://www.chicagoreader.com/chicago/"
                    "best-of-chicago-2011-food-drink/BestOf?oid=4106228")

    categories = get_category_links(food_n_drink)
    data = [] # a list to store our dictionaries
    for category in categories:
        winner = get_category_winner(category)
        data.append(winner)
        sleep(1) # be nice

    print data
Thanks, this helped more than anything I've read.
Lines 31, 12, and 9 are incorrect.
Can you fix them?
Thanks very much for your instructions! Spent a little time playing around with this script today and learned a lot.
I had to sign in to thank you for your example code and the explanation to go with it. I was able to write my first python application pretty quick based on your example.
Probably the best example I've read that helped me the most.
thanks, really helpful approach.
Yeah, I am getting the same issue as @tkeville, so I tried putting in gibberish arguments for the tags and got the same result. Turns out it's returning an empty object of type None... super weird.
@aznyellojersey that is also a flaw, but I don't think that's the only flaw in this code.
Anyone got help for some noobs?
Ah lol, either the site has changed or the dl elements are gone/renamed, at least at the URL I am working with...
Yes, the site has changed. The "h1", "headline" structure is now a div id="storyline" class="boc1". You need to read the BeautifulSoup docs to determine what the right search string needs to be. However, the process stays the same.
It works... and returns "u" in front of the text. Anyone know why, and how we can remove it? Thanks for assisting.
Above script returns:
[{'category': u"Best restaurant that's been around forever and is still worth the trip\xa0", 'runners_up': [u'Frontera Grill', u'Chicago Diner ', u'Sabatino\u2019s', u'Twin Anchors'], 'winner': [u'Lula Cafe'], 'category_url': 'http://www.chicagoreader.com/chicago/BestOf?category=1979894&year=2011'}, {'category': u'Best fancy restaurant in Chicago\xa0', ...}]
@NicholasBravobi the "u" denotes that the string is represented as unicode.
You can get rid of it by returning category.encode('utf-8') as the value for "category" in the get_category_winner function. "winner" and "runners_up" are lists rather than single strings, so each string inside them has to be encoded the same way.
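Another way to look at the "u" question: the prefix only shows up when a whole list or dict is printed, not when the strings themselves are used. Here is a rough sketch that builds a readable line from one result, using the first entry of the output above as sample data. The helper names `clean` and `format_result` are made up for this example and are not part of the gist.

```python
def clean(text):
    """Swap non-breaking spaces for normal ones and trim whitespace."""
    return text.replace(u"\xa0", u" ").strip()


def format_result(result):
    """Build one readable line from a dict shaped like the gist's output."""
    winners = u", ".join(clean(w) for w in result["winner"])
    runners = u", ".join(clean(r) for r in result["runners_up"])
    return u"{0}: {1} (runners-up: {2})".format(
        clean(result["category"]), winners, runners)


# Sample copied from the first entry of the output posted above.
sample = {
    "category": u"Best restaurant that's been around forever "
                u"and is still worth the trip\xa0",
    "winner": [u"Lula Cafe"],
    "runners_up": [u"Frontera Grill", u"Chicago Diner ",
                   u"Sabatino\u2019s", u"Twin Anchors"],
}

line = format_result(sample)
```

Printing `line` for each dictionary in `data`, instead of printing `data` itself, should give plain text without the "u" markers or the trailing `\xa0`.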
This is my 1st attempt at Python.
Here's what I get when I run the code provided:
Traceback (most recent call last):
  File "python-tut1.py", line 31, in <module>
    categories = get_category_links(food_n_drink)
  File "python-tut1.py", line 12, in get_category_links
    soup = make_soup(section_url)
  File "python-tut1.py", line 9, in make_soup
    return BeautifulSoup(html, "lxml")
  File "/usr/local/lib/python2.7/site-packages/bs4/__init__.py", line 156, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
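For anyone hitting the same error: it means the lxml package is not installed, so BeautifulSoup cannot use it as a parser. Installing lxml is one fix. Another is to fall back to "html.parser", which ships with Python. A small sketch of the fallback, assuming Python 3 (`importlib.util.find_spec` is not available in Python 2); `pick_parser` is a made-up helper name, not something from the gist or from bs4.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if that package is installed, else "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    # "html.parser" is the parser from Python's standard library,
    # so it needs no extra install.
    return "html.parser"


# In make_soup, the parser name would then be passed through:
#     return BeautifulSoup(html, pick_parser())
parser_name = pick_parser()
```

The two parsers can build slightly different trees from messy HTML, so results may not be identical on every page.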
Why do we need to use "if __name__ == '__main__':" and what does it mean?
I found that if I delete this, the program still works.
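A short answer, as far as I understand it: Python sets `__name__` to "__main__" only when a file is run directly. When the same file is imported from another script, `__name__` is the module's name instead, so the guarded block is skipped. That is why deleting the check changes nothing when you run the script on its own. A tiny sketch, where `main` is just a stand-in for the scraping loop:

```python
def main():
    """Stand-in for the scraping loop in the gist."""
    return "scraping would start here"


# True when run as `python script.py`.
# False on `import script`, so importing the file does not
# kick off the scrape as a side effect.
if __name__ == "__main__":
    print(main())
```

With the guard in place, another script could import `get_category_links` or `get_category_winner` from the gist without triggering the full scrape.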
I am super new to this programming lark, so apologies if I am completely missing something obvious. I was trying to run the above just to see if I could get it to run, and I get the following:
Traceback (most recent call last):
  File "/home/tom/scraping.py", line 34, in <module>
    winner = get_category_winner(category)
  File "/home/tom/scraping.py", line 19, in get_category_winner
    category = soup.find("h1", "headline").string
AttributeError: 'NoneType' object has no attribute 'string'
I don't know what that is about?
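What that error is about: `soup.find` returns None when nothing on the page matches, and None has no `.string` attribute. Most likely the page no longer has an h1 with class "headline", as mentioned earlier in the thread. One way to avoid the crash is to check for None before reading `.string`. A sketch with a made-up helper, `text_or_none`; the `SimpleNamespace` object below only imitates a found tag so the example runs without fetching a page.

```python
from types import SimpleNamespace


def text_or_none(tag):
    """Return tag.string, or None when find() matched nothing."""
    return tag.string if tag is not None else None


# Stand-ins for what soup.find(...) might return (illustration only).
found = SimpleNamespace(string="Best fancy restaurant in Chicago")
missing = None

category = text_or_none(found)    # the heading text
no_category = text_or_none(missing)  # None instead of an AttributeError
```

In the gist this would look like `category = text_or_none(soup.find("h1", "headline"))`. The selector itself still has to be updated to match the current page.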