I’ve been playing around with the Packagist API recently. Part of what I needed to do was grab some basic metadata about each package available through Packagist (https://packagist.org/apidoc#get-package-data) - for every single package that exists.
That’s a lot of packages - 157,911 at last count.
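For a sense of what that involves, here is a minimal, self-contained sketch (not part of the original workflow) that hits the two Packagist endpoints used later in this post: the package list, and the per-package metadata URL, with guzzlehttp/guzzle picked purely as an example package.
<?php
// Assumes Guzzle has been installed via Composer.
require __DIR__ . '/vendor/autoload.php';

$client = new GuzzleHttp\Client(['base_uri' => 'https://packagist.org']);

// The full list of package names, as "vendor/package" strings.
$packageNames = json_decode(
    $client->get('packages/list.json')->getBody()->getContents()
)->packageNames;

printf("%d packages listed\n", count($packageNames));

// Metadata for a single package, using the same /p/{vendor/package}.json
// URL that the pool example below requests for every package.
$metadata = json_decode(
    $client->get('p/guzzlehttp/guzzle.json')->getBody()->getContents()
);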
My go-to library for making concurrent HTTP requests is Guzzle, so naturally my first thought was to use Guzzle’s Pool to handle the multiple requests.
Making use of a pool
Normally, I make use of GuzzleHttp\Pool in order to make requests concurrently. More specifically, I make use of GuzzleHttp\Pool::batch(). The code below is a simplification of the code I would normally use:
<?php
$client = new GuzzleHttp\Client(['base_uri' => 'https://packagist.org']);

// Fetch the full list of package names ("vendor/package" strings).
$packageNames = json_decode(
    $client->get('packages/list.json')
        ->getBody()
        ->getContents()
)->packageNames;

// Used to map each request's index back to its vendor/package pair.
$requestIndexToPackageVendorPairMap = [];

$requests = function () use ($packageNames, &$requestIndexToPackageVendorPairMap) {
    foreach ($packageNames as $packageVendorPair) {
        $requestIndexToPackageVendorPairMap[] = $packageVendorPair;

        yield new GuzzleHttp\Psr7\Request('GET', "https://packagist.org/p/{$packageVendorPair}.json");
    }
};

GuzzleHttp\Pool::batch(
    $client,
    $requests(),
    [
        'concurrency' => 50,
        'fulfilled' => function (Psr\Http\Message\ResponseInterface $response, $index) use (&$requestIndexToPackageVendorPairMap) {
            // Do something with the response.
        },
        'rejected' => function ($reason, $index) use (&$requestIndexToPackageVendorPairMap) {
            // Handle the failed request.
        },
    ]
);
Where did all my memory go?
I’ve never run into issues using the GuzzleHttp\Pool class in this way before. However, I’ve also never had to put this many requests through a pool.
Something that was pointed out to me in a GitHub issue I opened (when I mistakenly thought this was a memory leak bug within Guzzle itself) is that GuzzleHttp\Pool::batch() collects the response for every request it makes and returns them all in an array. In other words, every response in the pool is kept in memory until the whole batch completes - and with a large request pool, that quickly becomes a problem.
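To make the effect visible, here is a rough sketch (assuming a small slice of the package list so it finishes quickly; not the original code) that compares memory usage before and after a Pool::batch() call - the returned array holds a result for every single request.
<?php
$client = new GuzzleHttp\Client(['base_uri' => 'https://packagist.org']);

// Only a few hundred packages, purely for demonstration purposes.
$packageNames = array_slice(
    json_decode($client->get('packages/list.json')->getBody()->getContents())->packageNames,
    0,
    500
);

$requests = function () use ($packageNames) {
    foreach ($packageNames as $packageVendorPair) {
        yield new GuzzleHttp\Psr7\Request('GET', "https://packagist.org/p/{$packageVendorPair}.json");
    }
};

$before = memory_get_usage();

// batch() keeps every response (or rejection reason) and returns them all...
$results = GuzzleHttp\Pool::batch($client, $requests(), ['concurrency' => 50]);

// ...so memory grows roughly in proportion to the number of requests made.
printf(
    "%d results held in memory, ~%.1f MB allocated\n",
    count($results),
    (memory_get_usage() - $before) / 1048576
);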
Stop collecting the responses!
Thankfully, the solution to this is a really simple one - access the Pool’s promise directly, and wait for it to complete. An example of this is shown below:
<?php
// Wait on the pool's promise directly rather than calling Pool::batch(),
// so each response is handed to the callbacks and then discarded instead
// of being collected into an array.
(new GuzzleHttp\Pool(
    $client,
    $requests(),
    [
        'concurrency' => 50,
        'fulfilled' => function (Psr\Http\Message\ResponseInterface $response, $index) use (&$requestIndexToPackageVendorPairMap) {
            // Do something with the response.
        },
        'rejected' => function ($reason, $index) use (&$requestIndexToPackageVendorPairMap) {
            // Handle the failed request.
        },
    ]
))->promise()->wait();
With this small change (which, admittedly, is exactly how the Quick Start documentation shows it being used), you no longer have runaway memory issues.
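If you want to convince yourself that memory now stays bounded, one option (a rough sketch reusing $client and $requests() from the earlier snippet, not something from the original post) is to log peak memory from inside the fulfilled callback every thousand or so completed requests:
<?php
$completed = 0;

(new GuzzleHttp\Pool(
    $client,
    $requests(),
    [
        'concurrency' => 50,
        'fulfilled' => function (Psr\Http\Message\ResponseInterface $response, $index) use (&$completed) {
            // Process the response here, then let it go out of scope.

            // Log peak memory every 1,000 completed requests; it should
            // level off rather than climb with the number of requests.
            if (++$completed % 1000 === 0) {
                printf(
                    "%d requests completed, %.1f MB peak memory\n",
                    $completed,
                    memory_get_peak_usage() / 1048576
                );
            }
        },
        'rejected' => function ($reason, $index) {
            // Handle the failed request.
        },
    ]
))->promise()->wait();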
I’ll take this as a lesson to read and understand the code I’m using a bit better!