ClanKiller.com
http://forums.clankiller.com/

plasmasky
http://forums.clankiller.com/viewtopic.php?f=24&t=2103

Author:  Satis [ Fri Dec 08, 2006 7:58 pm ]
Post subject:  plasmasky

So, my alternate domain, http://plasmasky.com , now has actual content. I decided to make an aggregation engine for local concert venues. Basically, there are probably 10-20 clubs and such that hold concerts...but there's no single place that shows everything that's going on. It's irritating to check 10-20 websites, and even more irritating to miss a good show because you didn't hear about it. So, I'm using PHP and cURL to pull the concert listing pages from the venues' websites and parse out the tour dates. I've got the base class built and am now parsing 2 venues. It's just a matter of coding up the rest.

Anyway, I won't be sharing the complete code for the class, but I may share the cURL code. The parsing portions are what take the most time and are the most easily broken, since they rely on each page's layout staying relatively static. I consider the parsing part to be proprietary.

So, for anyone in the Dallas area, there's FINALLY going to be a single page where you can see all the concerts in the DFW area. And for anyone else, I may post some basic cURL stuff to show you how it's done.

Author:  Pig [ Sat Dec 09, 2006 1:52 am ]

You shouldn't need cURL unless you have to pass headers, post data, cookies, etc. Do these sites really require all that? For most sites you can just use file_get_contents() to scrape them. I agree about the parsing of the page. A good regex can take a while to build. If the content is programmatically created, then you are usually safe. If it is posted by hand, good luck. :/
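
For a plain GET with nothing special, the file_get_contents() route might look something like the sketch below. The function name fetch_page and the example URL are made up for illustration, and it assumes allow_url_fopen is enabled on the server.

```php
<?php
// Hypothetical helper: fetch a page with a plain GET (no headers, cookies or POST data).
// Assumes allow_url_fopen is on when $url is an http:// address.
function fetch_page(string $url): ?string
{
    // @ suppresses the warning on failure; the return value is checked instead.
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

// Usage (placeholder URL, not a real venue page):
// $html = fetch_page('http://example.com/shows.html');
```

Returning null on failure lets the caller skip a venue that is down without aborting the whole run.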

Author:  Satis [ Sat Dec 09, 2006 9:58 am ]

The first page I parsed (Gypsy Tea Room) could've been grabbed with a regular file_get_contents(), but the second one (American Airlines Center) required me to pass a POST variable to pull the data I want... In the interest of keeping things simple and modular, I'm just doing it all with cURL.
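
To give an idea of the POST case, here is a rough sketch, not the actual plasmasky code. The function names and the field names in the usage line are invented for the example, since the real page's POST variable isn't named here. It assumes the PHP curl extension is loaded.

```php
<?php
// Hypothetical helper: build the cURL options for a simple form POST.
function post_options(array $fields): array
{
    return [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // url-encoded request body
        CURLOPT_RETURNTRANSFER => true,  // hand the page back as a string instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 15,    // seconds; don't hang on a dead venue site
    ];
}

// Hypothetical helper: POST $fields to $url and return the body, or null on failure.
function fetch_with_post(string $url, array $fields): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, post_options($fields));
    $html = curl_exec($ch);
    return $html === false ? null : $html;
}

// Usage (placeholder URL and made-up field names):
// $html = fetch_with_post('http://example.com/events', ['month' => '12', 'year' => '2006']);
```

Passing an empty field list to a GET-style variant of the same helper would cover the simpler pages too, which is the "do it all with cURL" approach.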

As for the parsing...thankfully the 2 I've done aren't hand-coded, but the Gypsy Tea Room one was a bitch. The developer has apparently never heard of CSS...there's just a crapload of antiquated HTML formatting tags scattered throughout. It made for some serious pain. Thankfully it does follow a pattern. The AA Center one was really clean...good, clean, CSS-driven code. After the first one, it was almost a treat parsing that one.
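
For a clean, class-based listing, the parsing can be sketched with DOMDocument and DOMXPath instead of a regex. This is only an illustration: the markup, the class names "event", "date" and "artist", and the function name are all invented, not taken from either venue's site.

```php
<?php
// Hypothetical parser for markup like:
//   <li class="event"><span class="date">Dec 15, 2006</span> <span class="artist">Example Band</span></li>
function parse_events(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect libxml warnings quietly.
    $prev = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $xpath = new DOMXPath($doc);
    $events = [];
    foreach ($xpath->query('//li[@class="event"]') as $li) {
        $date   = $xpath->query('.//span[@class="date"]', $li)->item(0);
        $artist = $xpath->query('.//span[@class="artist"]', $li)->item(0);
        if ($date === null || $artist === null) {
            continue; // row doesn't match the expected pattern; skip it
        }
        $ts = strtotime(trim($date->textContent));
        if ($ts === false) {
            continue; // date text was not understood
        }
        $events[] = [
            'date'   => date('Y-m-d', $ts),
            'artist' => trim($artist->textContent),
        ];
    }
    return $events;
}
```

Rows that don't fit the pattern are skipped rather than guessed at, so a layout change shows up as missing shows instead of wrong ones.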

What boggles my mind is why someone hasn't already done this. And why these damn venues don't have RSS feeds.

Author:  Pig [ Sat Dec 09, 2006 1:03 pm ]

Yeah, if you need to post then cURL is best. Yeah, it's amazing how many people don't use RSS. It's the future, yo. It would probably take half an hour to set up if they already have it stored programmatically.
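
As a rough idea of how little is involved, a bare-bones RSS 2.0 feed can be built from stored events roughly like this. The function name and the array keys 'date', 'artist' and 'url' are assumptions for the sketch, not any venue's actual schema.

```php
<?php
// Hypothetical feed builder: turns an array of events into a minimal RSS 2.0 document.
function events_to_rss(string $title, string $link, array $events): string
{
    // Escape text so characters like & and < stay valid inside XML.
    $esc = fn(string $s): string => htmlspecialchars($s, ENT_XML1 | ENT_QUOTES, 'UTF-8');

    $items = '';
    foreach ($events as $e) {
        $items .= "    <item>\n"
            . "      <title>" . $esc($e['date'] . ': ' . $e['artist']) . "</title>\n"
            . "      <link>" . $esc($e['url']) . "</link>\n"
            . "    </item>\n";
    }

    return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        . "<rss version=\"2.0\">\n"
        . "  <channel>\n"
        . "    <title>" . $esc($title) . "</title>\n"
        . "    <link>" . $esc($link) . "</link>\n"
        . "    <description>Upcoming shows</description>\n"
        . $items
        . "  </channel>\n"
        . "</rss>\n";
}

// Usage (made-up data):
// echo events_to_rss('Example Venue', 'http://example.com/', [
//     ['date' => '2006-12-15', 'artist' => 'Example Band', 'url' => 'http://example.com/shows/1'],
// ]);
```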

All times are UTC - 6 hours