Parallel processing is all around nowadays. Because of the increase of the number of cpu cores and the lower hardware cost which allows cheaper cluster-systems, parallel processing seems to be the next big thing.

Java 8 cares for this fact with the new stream API and the simplification of creating parallel processing on collections and arrays. Let’s have a look how this works.

Let’s say myList is a List of Integers, containing 500.000 Integer values. The way to sum-up this integer values in the pre-java 8 era was done using a for each loop.

`for (int i :myList)`

result+=i;

since java 8 we can do the same thing using streams

`myList.stream().sum();`

Parallelize this processing is very easy, we just use the keyword parallelStream() instead of stream or parallel() if we still have a stream.

So

myList.parallelStream().sum()

will do the trick. So it’s easy to spread a calculation to threads and to the cpu cores available. But as we know the overhead of multithreading and parallel processing is expensive. The question is when to use parallel-streams and when a serial stream would be better related to performance.

First of all let’s have a look what happens behind the scene. The parallel stream uses the Fork/Join Framework for processing. This means that the stream-source is getting forked (splitted) and hands over to the fork/join-pool workers for execution.

But here we find the first point to think about, not all stream-sources are splittable as good as others. Think about a ArrayList which internal data representation based upon an array. Splitting such a source is pretty easy, because is possible to calculate the index of the middle element and split-up the array.

If we have a LinkedList it’s more complicated to split the data elements. The implementation has to go through all elements from the first entry to find the element where the split can be done. So LinkedLists for example performs badly for parallel streams.

This is the first fact on parallel stream performance we can retain.

**S – source collection must be efficiently splittable**

Splitting a collection, managing the fork and join tasks, object creation and garbage collecting is an algorithmic overhead as well. This is only worthwhile when the work that has to be done on the cpu cores is non-trivial and/or the collection is large enough. And of course we have a lot of cpu cores.

One bad example would be the calculation of the max value of 5 integer values.

`IntStream.rangeClosed(1, 5).reduce( Math::max).getAsInt();`

The overhead of prepare and process the data for fork/join is so huge that a serial stream is much faster here. The Math.max function is not very CPU cost intensive here and we have less data elements.

But it’s getting more and more worthwhile when the function which is executed per element is more complex – to be exact “is more cpu intensive”. Calculating the sinus of every element instead of the max value for example.

When programming a chess-game the evaluation of every chess-move is also such an example. Many evaluations can be done in parallel. And we have a large number of possible next moves.

This is perfect for parallel processing.

And this is the second fact on parallel stream performance we can retain:

**N*Q – factor “number of elements * cost per element” should be large**

But this also means in reverse that the collection can be smaller when the operation per element is costlier.

Or when then operation per element is not so cpu intensive than we need a very large collection with many elements so that the usage of parallel streams pays off.

This directly depends on the third fact we can retain

**C – number of CPU Cores – more is better > 1 a must have**

On a single core machine parallel streams always perform worse than serial streams in cause of the management overhead. The same as with companies with many project leads and one man doing the work.

More is better – unfortunately in real world this is not correct for all cases, for example when the collection is too small and the CPU cores starts up – maybe from energy safe mode – just to find out that there is nothing to do.

And to determine wether to use a parallel-stream or not there are requirements to the per-element function as well. This is not so much a matter of performance but If parallel streams work as expected.

The function have to be …

- …independent which means that computation for each element must not rely on or impact that of any other element.
- …non-interference which means that the function won’t modify the underlying datasource when processing.
- …stateless.

Here we have an example of a stateful lambda function used in a parallel streams. This example is taken from the java JDK API and shows a simplified distinct() implementation.

`Set seen = Collections.synchronizedSet(new HashSet());`

stream.parallel().map(e -> { if (seen.add(e)) return 0; else return e; })...

So that’s lead us the to the fourth fact we can retain:

**F – the per-element function has to be independent**

to sum this up

Are there some other circumstances when we should not parallelize our streams? Yes, there are.

Always think about what your per element function is doing and if this suits in the world of parallel processing. When your function is calling some synchronized functionality, then you probably would have no benefit from parallelize your stream due that your parallel streams will often wait on this synchronization barrier.

The same problem occurs when you’re calling blocking i/o operations.

And for that matter using a I/O Based source as stream is also something known to not perform well because data is read sequential, so such a source is hardly splittable.

Thanks a ton for this lesson

Very clear and informative post on a complicated subject that has been misunderstood by so many developers

Is there a reason why you decided to include the spell checker markers in the output shown here?

Parallel Stream processing ??? run aCollection.stream().parallel.forEach(….). Yes processing will be asynchronous. Now count amount of threads. Yes, if you will put tasks where fork-join, required,you’ll be supported by more threads. But not in this case..

Pingback: Common parallel stream pitfalls – Khiem Blog

Hi, I’m a beginner in Java. I am a bit confused about the speed of parallel() method of stream in Java8 compared to sequential() method.

Here’s the situation, I found an example from the book “Java 8 Lambdas.

Functional Programming For The Masses” book to find an example.

The exact location is Chapter 6, Section 4

This example talks about parallelizing the simulation of a dice roll event using Monte Carlo simulation.

If you roll the dice twice fairly and then add up the points on the winning side, you will get a number from 2 to 12. The sum of the points is at least 2 because the smallest number of points on the six sides of the dice is 1 and we rolled the dice twice; the maximum sum of points is not more than 12 because the side of the dice with the most points is only 6 points. We want to find the probability that the number of dots will fall at each value between 2 and 12.

One way to solve this problem is to find all combinations of dice rolls, e.g., getting 2 points by rolling 1 point the first time and 1 point the second time. There are a total of 36 possible combinations, so the probability of getting a 2 is 1/36.

Another solution is to simulate a dice roll using random numbers from 1 to 6, and then divide the total number of rolls by the number of times each point is obtained. This is a simple Monte Carlo simulation. The more dice rolls you simulate, the more accurate the result will be, so we want to increase the number of simulations as much as possible.

Here is the pseudo-code from the book:

public Map parallelDiceRolls() {

double fraction = 1.0 / N;

return IntStream.range(0, N)

.parallel()

.mapToObj(twoDiceThrows())

.collect(groupingBy(side -> side,

summingDouble(n -> fraction)));

}

The book says that the parallel() method in this example can bring a speedup, faster than the sequential() method. But in practice, I found that sequential() is faster regardless of whether the data volume is 100,000 or 1 million. And I’ve tried it on different machines with more than 2 CPUs, as well as the latest M1 Macbook pro

In the example here, the conditions match what you said in your blog about

– Large data volume

– source collection must be efficiently splittable

– number of CPU Cores – more is better > 1 a must have

– factor “number of elements * cost per element” should be large

– the per-element function has to be independent

Why is this so, if you know why please let me know, thanks a lot