plasmasky 
Felix Rex

Joined: Fri Mar 28, 2003 6:01 pm
Posts: 16646
Location: On a slope
So, my alternate domain, http://plasmasky.com , now has actual content. I decided to make an aggregation engine for local concert venues. Basically, there are probably 10-20 clubs and such that hold concerts, but there's no single place that shows everything that's going on. It's irritating to check 10-20 websites, and even more irritating to miss a good show because you didn't hear about it. So, I'm using PHP and cURL to pull the concert listing pages from the venues' websites and parse out the tour dates. I've got the base class built and am now parsing 2 venues. It's just a matter of coding up the rest.

Anyway, I won't be sharing the complete code for the class, but I may share the cURL code. The parsing portions take the most time and are the most easily broken, since they depend on the page layout remaining relatively static. I consider the parsing part to be proprietary.

So, for anyone in the Dallas area, there's FINALLY going to be one page where you can see all the concerts in the DFW area. And for anyone else, I may post some basic cURL stuff to show you how it's done.

_________________
They who can give up essential liberty to obtain a little temporary safety, deserve neither liberty nor safety.


Fri Dec 08, 2006 7:58 pm
Duke

Joined: Mon Mar 31, 2003 8:59 am
Posts: 1358
Location: right behind you
You shouldn't need cURL unless you have to pass headers, POST data, cookies, etc. Do these sites really require all that? For most sites you can just use file_get_contents() to scrape. I agree about the parsing of the page. A good regex can take a while to build. If the content is programmatically generated, then you are usually safe. If it is posted by hand, good luck. :/
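
For reference, the plain file_get_contents() approach could look roughly like the sketch below. The function name fetchPage and the 10-second timeout are made up for illustration, and the type hints assume PHP 7 or later. It needs allow_url_fopen enabled for remote URLs.

```php
<?php
// Hypothetical helper: fetch a page with a plain GET, no cURL involved.
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    // The 'http' options only take effect for http:// and https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,
        ],
    ]);

    // Suppress the warning and turn a failed fetch into an exception instead.
    $html = @file_get_contents($url, false, $context);
    if ($html === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    return $html;
}
```

This covers simple GET requests only. Once a site wants POST data or cookies, cURL is usually the more comfortable tool.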


Sat Dec 09, 2006 1:52 am
Felix Rex

Joined: Fri Mar 28, 2003 6:01 pm
Posts: 16646
Location: On a slope
The first page I parsed (Gypsy Tea Room) could've been grabbed with a regular file_get_contents(), but the second one (American Airlines Center) required me to pass a POST variable to pull the data I want. In the interests of keeping things simple and modular, I'm just doing it all with cURL.
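
A minimal sketch of one cURL routine handling both cases, GET when there are no fields and POST when there are. This is not the actual class code. The function names and the field names in the usage note are hypothetical, and the real POST variable the American Airlines Center page expects is not shown here.

```php
<?php
// Hypothetical helper: build the cURL options for a GET, or for a POST
// when $postFields is non-empty.
function buildCurlOptions(string $url, array $postFields = []): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 10,    // seconds; an arbitrary choice
    ];

    if ($postFields !== []) {
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }

    return $options;
}

// Hypothetical helper: run the request and return the response body.
function fetchWithCurl(string $url, array $postFields = []): string
{
    $handle = curl_init();
    curl_setopt_array($handle, buildCurlOptions($url, $postFields));

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($handle));
    }

    return $body;
}
```

A call would then look something like fetchWithCurl($listingUrl, ['view' => 'concerts']), where both the URL and the field name are placeholders.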

As for the parsing...thankfully the 2 I've done aren't hand-coded, but the Gypsy Tea Room one was a bitch. The developer has apparently never heard of CSS...there's just a crapload of antiquated HTML formatting tags scattered throughout. It made for some serious pain. Thankfully it does follow a pattern. The AA Center one was really clean...good, clean, CSS-driven code. After the first one, it was almost a treat parsing that one.
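
To show why clean, CSS-driven markup is the easy case, here is a rough sketch using PHP's DOM extension. It is not the parser used on the site. The markup shape (li elements with class "event" containing "date" and "act" spans) and the function name parseEvents are invented for illustration.

```php
<?php
// Hypothetical parser for an invented listing format:
// <li class="event"><span class="date">...</span><span class="act">...</span></li>
function parseEvents(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $events = [];

    // Matches only elements whose class attribute is exactly "event".
    foreach ($xpath->query('//li[@class="event"]') as $item) {
        $date = $xpath->query('.//span[@class="date"]', $item)->item(0);
        $act  = $xpath->query('.//span[@class="act"]', $item)->item(0);

        // Skip rows that do not follow the expected pattern.
        if ($date === null || $act === null) {
            continue;
        }

        $events[] = [
            'date' => trim($date->textContent),
            'act'  => trim($act->textContent),
        ];
    }

    return $events;
}
```

With old-style table and font tag layouts there are no class names to hook into, so the selectors end up depending on element positions, which is where the pain comes from.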

What boggles my mind is why someone hasn't already done this. And why these damn venues don't have RSS feeds.



Sat Dec 09, 2006 9:58 am
Duke

Joined: Mon Mar 31, 2003 8:59 am
Posts: 1358
Location: right behind you
Yeah, if you need to post then cURL is best. Yeah, it's amazing how many people don't use RSS. It's the future, yo. It would probably take half an hour to set up if they already have it stored programmatically.
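
To back up the half-hour estimate, a bare-bones RSS 2.0 feed built from an array of shows might look like the sketch below. The function name buildRssFeed, the array keys (act, date, url), and the channel description text are all assumptions, not anything a venue actually uses.

```php
<?php
// Hypothetical helper: turn a list of events into a minimal RSS 2.0 document.
// Each event is assumed to be an array with 'act', 'date' and 'url' keys.
function buildRssFeed(string $title, string $link, array $events): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);

    $channel = $doc->createElement('channel');
    $rss->appendChild($channel);

    // Adding values as text nodes lets DOM escape characters such as & and <.
    $addText = function (DOMElement $parent, string $name, string $value) use ($doc) {
        $child = $doc->createElement($name);
        $child->appendChild($doc->createTextNode($value));
        $parent->appendChild($child);
    };

    // RSS 2.0 requires a title, link and description on the channel.
    $addText($channel, 'title', $title);
    $addText($channel, 'link', $link);
    $addText($channel, 'description', 'Upcoming shows');

    foreach ($events as $event) {
        $item = $doc->createElement('item');
        $channel->appendChild($item);
        $addText($item, 'title', $event['act'] . ' (' . $event['date'] . ')');
        $addText($item, 'link', $event['url']);
    }

    return $doc->saveXML();
}
```

If the shows already live in a database, the only extra work is the query that fills the $events array.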


Sat Dec 09, 2006 1:03 pm