[PHP] Faster array lookup than using in_array()

If you use arrays in PHP, one of the most common tasks you’ll find yourself doing is determining if Item A is in Array X. The function you would probably use in this case is PHP’s in_array.

bool in_array ( mixed $needle , array $haystack [, bool $strict = FALSE ] )
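One detail of that signature is worth a quick look before the timings: the optional $strict flag. By default in_array() compares with ==, so two numeric-looking strings can match even when they are not identical. Here is a minimal sketch; the sample values are made up for illustration and are not part of the experiment.

```php
<?php
// Illustrative values only; they are not taken from the experiment.
$haystack = array("10", "abc");

// Loose comparison (the default): "1e1" == "10" because both are numeric strings.
$loose = in_array("1e1", $haystack);

// Strict comparison: type and value must match exactly (===).
$strict = in_array("1e1", $haystack, true);

var_dump($loose);  // bool(true)
var_dump($strict); // bool(false)
```

With non-numeric strings such as the ones used in the experiment, both modes should give the same answer.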

This function works great, and I recommend sticking to it when it makes sense. However, when you’re dealing with a very large haystack and need to run in_array() on thousands of values, you’ll discover that the cost adds up: in_array() isn’t particularly fast once it is called thousands of times. Having recently run into this situation, I set up a little experiment to compare in_array() against two other approaches.


The haystack in my experiment was an array of 60,000 strings, each 50 characters long.

$haystack = array("String1", "String2", "String3" /* , etc... */);

The needle was a string of 50 characters.

Method A – Using in_array()

if (in_array($needle, $haystack))
{
	echo("Method A : needle " . $needle . " found in haystack<BR>");
}

Method B – Using isset()
Basically, I reformatted the haystack so that the values of my original array became keys instead and the new value for each key was set to 1.

$new_haystack = array();
foreach ($haystack as $v)
	$new_haystack[$v] = 1;

So my haystack became:

$new_haystack["String1"] = 1;
$new_haystack["String2"] = 1;
$new_haystack["String3"] = 1;
etc.

Then, all you need to do is look up the key:

if (isset($new_haystack[$needle]))
{
	echo("Method B : needle " . $needle . " found in haystack<BR>");
}
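The keyed array can probably be built without a hand-written loop as well. Here is a minimal sketch, assuming every haystack value is a string or an integer (the only types PHP accepts as array keys); the sample data is made up.

```php
<?php
// Made-up sample data standing in for the real haystack.
$haystack = array("String1", "String2", "String3");

// Same result as the foreach loop: every value becomes a key mapped to 1.
$new_haystack = array_fill_keys($haystack, 1);

// array_flip() also turns values into keys; the new values are the old
// positions (0, 1, 2, ...). isset() still works because none of them is null.
$flipped = array_flip($haystack);

var_dump(isset($new_haystack["String2"])); // bool(true)
var_dump(isset($flipped["String1"]));      // bool(true), even though the value is 0
var_dump(isset($new_haystack["String9"])); // bool(false)
```

Whichever way the keyed array is built, the point is to build it once and reuse it for every lookup.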

Method C – Using array_intersect()
When all you really need to know is if needle is in haystack, using array_intersect() can also work.

if (count(array_intersect(array($needle), $haystack))>0)
{
	echo("Method C : needle " . $needle . " found in haystack<BR>");
}

With these different methods in place, I executed them against the same $haystack and $needle, and the results were clear:

Method A : 0.003180980682373 seconds
Method B : 0.0000109672546 seconds
Method C : 0.045687913894653 seconds

Method B wins! Keep in mind that this only really becomes interesting with very large data sets. For those of you wondering how long it took to re-arrange the haystack for Method B to use, the answer is 0.025528907775879 seconds.
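The post does not show how the timings were taken, so here is a rough sketch of one way to set up a comparable measurement. It assumes microtime(true) for timing and a haystack of randomly generated strings. randomString() is a hypothetical helper written for this sketch, and the numbers it prints will differ from machine to machine.

```php
<?php
// Hypothetical helper: random alphanumeric string of $len characters.
function randomString($len)
{
    $chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $max = strlen($chars) - 1;
    $out = '';
    for ($i = 0; $i < $len; $i++) {
        $out .= $chars[mt_rand(0, $max)];
    }
    return $out;
}

// Assumption: 60,000 strings of 50 characters, as described in the post.
$haystack = array();
for ($i = 0; $i < 60000; $i++) {
    $haystack[] = randomString(50);
}
$needle = $haystack[30000]; // a needle that is known to be present

// Method A: linear scan with in_array().
$t = microtime(true);
$foundA = in_array($needle, $haystack);
$timeA = microtime(true) - $t;

// Method B: one-time rebuild (values become keys), then a key lookup.
$t = microtime(true);
$new_haystack = array();
foreach ($haystack as $v) {
    $new_haystack[$v] = 1;
}
$timeRebuild = microtime(true) - $t;

$t = microtime(true);
$foundB = isset($new_haystack[$needle]);
$timeB = microtime(true) - $t;

// Method C: intersect a one-element array with the haystack.
$t = microtime(true);
$foundC = count(array_intersect(array($needle), $haystack)) > 0;
$timeC = microtime(true) - $t;

printf("Method A : %.9f seconds\n", $timeA);
printf("Method B : %.9f seconds (rebuild: %.9f seconds)\n", $timeB, $timeRebuild);
printf("Method C : %.9f seconds\n", $timeC);
```

A single lookup is a noisy measurement, so repeating each method many times and averaging would give steadier numbers.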

In this experiment, determining whether 100,000 strings are in the data set went from 318.098 seconds with in_array() to 1.1222 seconds using isset(), a figure that matches 100,000 lookups at the Method B rate plus the one-time rebuild of the haystack. That’s pretty decent.
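To see where the rebuild cost fits in, the arithmetic can be sketched from the timings reported above. This is only a back-of-the-envelope calculation that assumes every lookup costs the same as the single measured one; the variable names are made up for the sketch.

```php
<?php
// Timings reported in the post, in seconds.
$perLookupA = 0.003180980682373; // in_array()
$perLookupB = 0.0000109672546;   // isset()
$rebuild    = 0.025528907775879; // one-time cost of building the keyed haystack

$lookups = 100000;
$totalA = $lookups * $perLookupA;            // in_array() on every lookup
$totalB = $rebuild + $lookups * $perLookupB; // rebuild once, then isset()

// What it would cost to rebuild the keyed haystack on every single lookup.
$totalBRebuildEachTime = $lookups * ($rebuild + $perLookupB);

// Smallest number of lookups for which Method B, rebuild included, is ahead.
$breakEven = (int) ceil($rebuild / ($perLookupA - $perLookupB));

printf("in_array():               %.3f s\n", $totalA);                // 318.098 s
printf("isset(), one rebuild:     %.3f s\n", $totalB);                // 1.122 s
printf("isset(), rebuild per use: %.3f s\n", $totalBRebuildEachTime);
printf("break-even:               %d lookups\n", $breakEven);         // 9 lookups
```

Under these assumptions Method B pays for its rebuild after roughly nine lookups. The gain disappears if the keyed haystack is rebuilt for every lookup, for example by flipping the array inside the lookup function itself.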

20 thoughts on “[PHP] Faster array lookup than using in_array()”

  1. This is lightning fast, love it!

    But I think you mixed up your variable names in Method B, here’s what I did:
    $arrA = array(); /* fill $arrA with the plain strings, no explicit keys */

    $arrB = array();
    foreach ( $arrA as $key ) $arrB[$key] = 1;

    foreach( $arrA as $key ) {
    if ( !isset($arrB[$key]) ) { echo "$key not in \$arrB!"; }
    }

  2. How did you measure the timings?

    I made a similar test, and in_array is really much faster than the others!
    <?php
    $a = [];
    $mc_default = [];
    $mc_my = [];

    $d = 'nonexisted key';

    $testcnt = 20;

    function myin_array($val, array $arr)
    {
    $newarr = array_flip($arr);
    return isset($newarr[$val]);
    }

    function randomString($len = 10)
    {
    $key_chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $num_chars = strlen($key_chars);
    $key = '';
    for ($i = 0; $i < $len; $i++) {
    $key .= substr($key_chars, rand(1, $num_chars) - 1, 1);
    }

    return $key;
    }

    for ($i = 0; $i < 100000; $i++) {
    $len = 50;
    $a[] = randomString($len);
    }

    for ($i = 0; $i < $testcnt; $i++) {
    $start = microtime(true);
    $c = myin_array($d, $a);
    $end = microtime(true);
    $mc_my[] = ($end - $start);
    }

    for ($i = 0; $i < $testcnt; $i++) {
    $start = microtime(true);
    $c = in_array($d, $a);
    $end = microtime(true);
    $mc_default[] = ($end - $start);
    }

    $def_avg = array_sum($mc_default) / count($mc_default);
    $my_avg = array_sum($mc_my) / count($mc_my);

    echo 'Defaults (in_array)=' . implode(', ', $mc_default) . ' Avg: ' . $def_avg;
    echo 'My func=' . implode(', ', $mc_my) . ' Avg: ' . $my_avg;

    ?>

    And my results look like this:
    Defaults (in_array)=0.003767, 0.00376, 0.003816, 0.003758, 0.0038010000000001, 0.003768, 0.004038, 0.003969, 0.0038659999999999, 0.003829, 0.00381, 0.003772, 0.00376, 0.003746, 0.003783, 0.00381, 0.003758, 0.0037470000000001, 0.003772, 0.004051
    Avg: 0.00381905

    My func=0.02696418762207, 0.023251056671143, 0.024040937423706, 0.023049116134644, 0.023766040802002, 0.023875951766968, 0.025014162063599, 0.02332615852356, 0.023674011230469, 0.023051977157593, 0.023787021636963, 0.023398876190186, 0.023919105529785, 0.023699045181274, 0.023690938949585, 0.023056983947754, 0.023715019226074, 0.024850130081177, 0.023258924484253, 0.023529052734375
    Avg: 0.023845934867859

    php -v
    PHP 5.5.9-1ubuntu4.4 (cli) (built: Sep 4 2014 06:57:30)
    Linux Mint 32-bit 4Gb RAM
    php as mod_apache without any optimizers and cachers

    1. Hi Donna,

      The times you recorded for the default PHP “in_array” function seem on par with what I had found. However, your custom method is very slow, which leads me to believe that perhaps array_flip is doing something funky behind the scenes. In my test, I remember I explicitly set each key to 1 – no array_flip.

      I’ll try this again as soon as I get some free time.

      By the way, I’m pretty sure I was using PHP 5.3.10 during my experiment.

  3. Thanks, nice.

    Small correction though, “Basically, I reformatted the haystack so that the keys became values and I set the value to 1.”
    Actually you’re turning values to keys!

  4. Awesomepants. Your tip just made my script run about 50x faster. (I was regularly having to compare a list of 300,000 email addresses against a list of up to 300,000 email addresses that had already been contacted…)

  5. I regularly have to determine whether up to 200,000 “needles” are present in a “haystack” of 2 million strings. This tip (combined with the one from Sven) made my script gain 50% (2 times faster). Thanks!

  6. I’m having to work with 90k+ rows of data, and in_array was so slow the script wouldn’t execute. I used Method B and it saved the day. Thanks!

  7. Well, if you include the rearranging time in Method B, the total time becomes more than Method A. And in the last timing, for the 100,000 records, you don’t seem to have taken it into account.
