Variable URLs in web scraping

replied 11 years ago

Can you give a few more examples? It would help to understand the URL pattern if you showed a page from the actual site you are trying to scrape.

Last updated 3 years ago.

chrisbinge

replied 11 years ago

Sure. Here is one: http://www.nfl.com/player/chrisjohnson/262/profile where the variable is the 262, and here is another URL: http://www.nfl.com/player/russellwilson/2532975/profile. So the variable is always between the player name and "profile". Ideally, I'd like a base URL of http://www.nfl.com/player with a request for { $playername }/\S*/profile. I just found something in the Guzzle docs, but I don't know the syntax for it: http://guzzle.readthedocs.org/en/latest/http-client/uri-templates.html

mcraz

replied 11 years ago

OK.

Consider the URL: http://www.nfl.com/player/chrisjohnson/262/profile

When the server receives this URL, it looks up the player in the database by ID, not by name. So chrisjohnson is not what identifies the player on the NFL site. It's the 262.

Try this:

http://www.nfl.com/player/idontcare/262/profile and you will still see Chris Johnson's profile.

So, to scrape, you will have to use player IDs.

And the docs URL you mentioned only covers a helper for building URLs by passing in variable parameters. For example, to get the URL above you would do:

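// Guzzle 3: {id} in the base URL is filled in from the config array.
$player = new Guzzle\Http\Client('http://www.nfl.com/player/xyz/{id}/', array(
	'id' => $player_id,
));

// get() only builds the request; send() actually performs it.
$profile = $player->get('profile')->send();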

The player name in the URL is just there for SEO and readability.
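
To make the point concrete, here is a minimal sketch in plain PHP (no Guzzle) of what filling in the {id} placeholder amounts to. The function name profileUrl and the {slug} placeholder are made up for illustration; this only mimics simple placeholder substitution and is not Guzzle's URI template implementation.

```php
<?php
// Hypothetical helper: builds a profile URL from a numeric player ID.
// The slug is cosmetic, so any placeholder text works for it.
function profileUrl(int $playerId, string $slug = 'idontcare'): string
{
    $template = 'http://www.nfl.com/player/{slug}/{id}/profile';

    // Replace each placeholder with its value.
    return strtr($template, [
        '{slug}' => rawurlencode($slug),
        '{id}'   => (string) $playerId,
    ]);
}

echo profileUrl(262), "\n";
echo profileUrl(2532975, 'russellwilson'), "\n";
```

Only the ID argument changes which player comes back; the slug can stay at its default.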

chrisbinge

replied 11 years ago

Awesome, that makes a lot more sense now. That does lead me to one more question: how should I go about automating the collection of the player IDs that I will now need? Thanks for your time on this, by the way; things are making more sense.

mcraz

replied 11 years ago

There are many ways you can go about this. I will list two of them.

You know the player's name (NOT the username, the name)

You can use NFL's search functionality:

/search?category=name&filter={ player_name }&playerType=current

Example:

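$player = new Guzzle\Http\Client('http://www.nfl.com/players/');

// Relative path, so it is appended to the base URL: /players/search?category=name&...
$results = $player->get(['search{?data*}', [
    'data' => [
        'category'   => 'name',
        'playerType' => 'current',
        'filter'     => $player_name,
    ]
]])->send();

// $player_id = select the player url and parse for id

// Profiles live under /player/ (singular), so use an absolute URL here.
$profile = $player->get(['http://www.nfl.com/player/idontcare/{id}/profile', [
    'id' => $player_id,
]])->send();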

Get the IDs of every player on a team in one go

Scrape the team roster page, e.g.: /search?category=team&filter=3430&playerType=current
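
For the "parse for id" step, one possible approach is a regular expression over the profile links in the results page. This is only a sketch: the function name extractPlayerIds and the sample markup are invented for illustration, and the real search page may structure its links differently.

```php
<?php
// Returns a map of player ID => URL slug for every profile link found in $html.
function extractPlayerIds(string $html): array
{
    $ids = [];

    // Assumed link shape: /player/<slug>/<numeric id>/profile
    $pattern = '#/player/([^/"\']+)/(\d+)/profile#';

    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $ids[(int) $match[2]] = $match[1];
        }
    }

    return $ids;
}

// Made-up sample markup, only for demonstration.
$sample = '<a href="/player/chrisjohnson/262/profile">Johnson, Chris</a>'
        . '<a href="/player/russellwilson/2532975/profile">Wilson, Russell</a>';

print_r(extractPlayerIds($sample));
```

Keying the result by ID also drops duplicate links to the same player.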

chrisbinge

replied 11 years ago

Hmmm, I tried to use the example above, but it sent me to the results page with "Displaying 1 - 25 of 275089". So I tried working on a different query that would better suit my needs (filtering by position), but got the same issue. Here is what I had:

$position = 'quarterback';
$player = new Guzzle\Http\Client('http://www.nfl.com/players/');

$results = $player->get(['/search{?data*}', [
    'category'       => 'position',
    'playerType'     => 'current',
    'conference'     => 'ALL',
    'd-447263-p'     => '1',
    'filter'         => $position,
    'conferenceAbbr' => 'null'
]]);

return $results->send();

Any way I could borrow you on IRC for a minute or so? (My IRC name is the same as here.)

EDIT: Still looking into this, and from the docs and articles it seems like the getQuery() method is the way to do this, but when I put:

$player->getQuery()
    ->set('category', 'position')
    ->set('playerType', 'current')
    ->set('conference', 'ALL')
    ->set('d-447263-p', 1)
    ->set('filter', $position)
    ->set('conferenceAbbr', 'null');

I get "Call to undefined method Guzzle\Http\Client::getQuery()"

chrisbinge

replied 11 years ago

Okay, it took a while, but I've got the syntax for the query in Guzzle working now:

$position = 'quarterback';
$client = new Guzzle\Http\Client('http://www.nfl.com/players/');
$request = $client->get();

$q = $request->getQuery();

$q->set('category', 'position');
$q->set('playerType', 'current');
$q->set('conference', 'ALL');
$q->set('d-447263-p', 1);
$q->set('filter', $position);
$q->set('conferenceAbbr', 'null');

return $request->send();

Now I just need to figure out how to repeat the request with an increasing 'd-447263-p' value so that it covers however many pages there may be. Maybe a loop?
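
A loop could look roughly like the sketch below. It is deliberately independent of Guzzle: $fetchPage and $extractIds are hypothetical callables (one would set 'd-447263-p' to the page number and send the request, the other would pull IDs out of the response), and the stop condition assumes that a page past the end yields no new IDs, which has not been checked against the real site.

```php
<?php
// Walks pages 1, 2, 3, ... and collects IDs until a page adds nothing new.
// $maxPages is a safety cap so a misbehaving site cannot loop forever.
function collectAllPages(callable $fetchPage, callable $extractIds, int $maxPages = 500): array
{
    $all = [];

    for ($page = 1; $page <= $maxPages; $page++) {
        $found = $extractIds($fetchPage($page));
        $new   = array_diff($found, $all);

        if ($new === []) {
            break; // empty page, or only IDs we have already seen
        }

        $all = array_merge($all, $new);
    }

    return $all;
}

// Fake data standing in for real HTTP responses.
$fakePages = [1 => [262, 100], 2 => [2532975], 3 => []];

$ids = collectAllPages(
    function (int $page): int { return $page; },                       // "fetch" just returns the page number
    function (int $page) use ($fakePages): array { return $fakePages[$page] ?? []; }
);

print_r($ids);
```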

mcraz

replied 11 years ago

Oh! I just saw that I made a mistake.

Because the template refers to {?data*}, we have to wrap the query parameters in an array under the data key.

I have corrected my previous answer.

Great that you figured out the solution. I would like to add a bit:

Goutte provides a method that can select a link by its text. So just look for the text "Next" in the footer of the search section and run the scraper recursively until the "Next" link goes away!
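
To show why the data key matters, here is roughly what {?data*} turns into once the array is supplied. This is a plain-PHP approximation using http_build_query, not Guzzle's own template expansion, and the function name expandSearchQuery is made up for the example.

```php
<?php
// Approximates the expansion of "search{?data*}": each key/value pair in
// $data becomes one query parameter.
function expandSearchQuery(array $data): string
{
    return 'search?' . http_build_query($data, '', '&', PHP_QUERY_RFC3986);
}

echo expandSearchQuery([
    'category'   => 'name',
    'playerType' => 'current',
    'filter'     => 'chris johnson',
]), "\n";
// prints: search?category=name&playerType=current&filter=chris%20johnson
```

If the parameters are passed without the data key, the template has no variable named data to expand, so no query string is produced.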
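
As a rough illustration of the "follow Next until it disappears" idea without depending on Goutte, the sketch below uses PHP's DOM extension. The function name findNextLink and the sample markup are assumptions; with Goutte itself, the crawler's selectLink('Next') would be the place to start.

```php
<?php
// Returns the href of the first link whose text is exactly "Next",
// or null when there is no such link (i.e. we are on the last page).
function findNextLink(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//a[normalize-space(.) = "Next"]');

    if ($links === false || $links->length === 0) {
        return null;
    }

    return $links->item(0)->getAttribute('href');
}

// Made-up footer markup, only for demonstration.
var_dump(findNextLink('<p><a href="/players/search?d-447263-p=2">Next</a></p>'));
var_dump(findNextLink('<p>Previous</p>'));
```

The scraper would fetch a page, collect its IDs, then request whatever findNextLink returns, stopping when it gets null.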
