Can you give a few more examples? It would help to understand the URL pattern if you showed a page from the actual site you are trying to scrape.
Sure. Here is one: http://www.nfl.com/player/chrisjohnson/262/profile, where the variable part is 262, and another: http://www.nfl.com/player/russellwilson/2532975/profile. So the variable is always between the player name and profile. Ideally, I'd like a base URL of http://www.nfl.com/player with a request for { $playername }/\S*/profile. I just found something in the Guzzle docs, but I don't know the syntax for it: http://guzzle.readthedocs.org/en/latest/http-client/uri-templates.html
Okay.
Consider the URL: http://www.nfl.com/player/chrisjohnson/262/profile
When the server receives this URL, it looks the player up in the database by ID, not by name. So for the NFL site the identifier is not chrisjohnson; it's 262.
Try this :
http://www.nfl.com/player/idontcare/262/profile
and you will still see Chris Johnson's profile.
So, to scrape, you will have to use player IDs.
And the docs URL you mentioned is just a helper for building URLs from variable parameters. For example, to get the URL you would do:
$player = new Guzzle\Http\Client('http://www.nfl.com/player/xyz/{id}/', array(
'id' => $player_id,
));
$profile = $player->get('profile');
The player name in the URL is just there for SEO and readability.
Awesome, that makes a lot more sense now. That does lead me to one more question: how should I go about automating the collection of the player IDs that I will now need? Thanks for your time on this, by the way; things are making more sense.
There are many ways you can go about this. I will list two of them.
You can use NFL's search functionality:
/search?category=name&filter={ player_name }&playerType=current
Example:
$player = new Guzzle\Http\Client('http://www.nfl.com/players/');
$results = $player->get(['/search{?data*}', [
'data' => [
'category' => 'name',
'playerType' => 'current',
'filter' => $player_name,
]
]]);
// $player_id = select the player url and parse for id
$profile = $player->get(['/player/idontcare/{id}/profile', [
    'id' => $player_id,
]]);
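To fill in the "parse for id" step, one option is a small regex over the profile links in the results markup. This is a hypothetical helper (the function name is mine); the URL shape is taken from the examples above:

```php
<?php
// Extract the numeric player ID from a profile URL such as
// http://www.nfl.com/player/chrisjohnson/262/profile.
function extractPlayerId(string $profileUrl): ?string
{
    if (preg_match('#/player/[^/]+/(\d+)/profile#', $profileUrl, $m)) {
        return $m[1];
    }
    return null; // not a profile link
}
```

You would run this over each result link's href and keep the matches.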
Or scrape a team's roster page, e.g.: /search?category=team&filter=3430&playerType=current
Hmmm, I tried the example above, but it sent me to the results page with "Displaying 1 - 25 of 275089". So I tried a different query that would better suit my needs (filtering by position), but got the same issue. Here is what I had:
$position = 'quarterback';
$player = new Guzzle\Http\Client('http://www.nfl.com/players/');
$results = $player->get(['/search{?data*}', [
'category' => 'position',
'playerType' => 'current',
'conference' => 'ALL',
'd-447263-p' => '1',
'filter' => $position,
'conferenceAbbr' => 'null'
]]);
return $results->send();
Any way I could borrow you on IRC for a minute or so? (My IRC name is the same as here.)
EDIT: Still looking into this. From the docs and articles it seems like the getQuery() method is the way to do this, but when I put:
$player->getQuery()
->set('category', 'position')
->set('playerType', 'current')
->set('conference', 'ALL')
->set('d-447263-p', 1)
->set('filter', $position)
->set('conferenceAbbr', 'null');
I get "Call to undefined method Guzzle\Http\Client::getQuery()"
Okay, it took a while, but I've now got the syntax for the query in Guzzle right:
$position = 'quarterback';
$client = new Guzzle\Http\Client('http://www.nfl.com/players/');
$request = $client->get();
$q = $request->getQuery();
$q->set('category', 'position');
$q->set('playerType', 'current');
$q->set('conference', 'ALL');
$q->set('d-447263-p', 1);
$q->set('filter', $position);
$q->set('conferenceAbbr', 'null');
return $request->send();
Now I just need to figure out how to loop over 'd-447263-p' (which looks like the page number) to cover however many pages there may be.
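A minimal sketch of that loop, assuming d-447263-p really is the one-based page number. Only the URL building is shown here in plain PHP; the request sending stays as in the Guzzle code above, and $totalPages is a hypothetical stand-in for a value you would parse out of "Displaying 1 - 25 of N":

```php
<?php
// Build the query string for one results page. The parameter names
// are the ones used in the working request above.
function buildSearchQuery(string $position, int $page): string
{
    return http_build_query([
        'category'       => 'position',
        'playerType'     => 'current',
        'conference'     => 'ALL',
        'd-447263-p'     => $page, // page number (assumed)
        'filter'         => $position,
        'conferenceAbbr' => 'null',
    ]);
}

$totalPages = 3; // hypothetical; derive it from the result count
$urls = [];
for ($page = 1; $page <= $totalPages; $page++) {
    $urls[] = 'http://www.nfl.com/players/search?' . buildSearchQuery('quarterback', $page);
}
```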
Oh! I just saw I made a mistake. Because we are referring to {?data*}, we have to wrap the query parameters in an array under the data key. I have now rectified the answer above.
Great that figured out the solution. I would like to add a bit: Goutte provides a method that can select a link by its text. So just look for the text Next in the search results footer and recursively run the scraper until the Next link goes away!
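A sketch of that follow-the-Next-link approach with Goutte. selectLink(), link(), and click() are real Goutte/DomCrawler calls, but the function name, the callback, and the stopping condition here are my assumptions, untested against the live site:

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

// Visit $startUrl, hand each page's crawler to $handlePage (e.g. your
// ID-parsing code), and follow the "Next" link until it disappears.
function scrapeAllPages(Client $client, string $startUrl, callable $handlePage): void
{
    $crawler = $client->request('GET', $startUrl);
    while (true) {
        $handlePage($crawler);
        $next = $crawler->selectLink('Next');
        if ($next->count() === 0) {
            break; // no "Next" link: last page reached
        }
        $crawler = $client->click($next->link());
    }
}
```

The recursion in the post becomes a plain loop here; both stop on the same condition, the absence of a Next link.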