Pythonism

joy in simplicity – platonic world of code and math

Comparing ParallelPython and Picloud


Picloud is a paid cloud computing platform for Python. After signing up you get two login keys and install a Python module on the local machine you will use to send jobs to the cloud. You enter the keys in a config file, and Python connects to the service when it sees an “import cloud” line in your code. The fees run at 5 cents per processor-hour. Not a lot.
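In case you haven’t seen it, the basic pattern is a call that returns a job id, plus a result lookup on that id. Here is a minimal sketch (slow_function is just a stand-in for whatever you want to offload):

import cloud   # the Picloud client module; it reads your keys from its config file

def slow_function(n):
    """stand-in for any computation worth offloading"""
    return sum(z**2 for z in range(1, n+1))

jid = cloud.call(slow_function, 1000)   # ships the function and its argument to the cloud
print cloud.result(jid)                 # blocks until the job finishes, then returns the answer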

Parallel Python is a parallel computing module that works on clusters or just multi-core machines. After “import pp” you set up an instance of a job server and send computations to it using the methods of the server instance.
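The equivalent minimal sketch for Parallel Python (again, slow_function is just a placeholder):

import pp

def slow_function(n):
    return sum(z**2 for z in range(1, n+1))

server = pp.Server()                          # autodetects the number of local cores
job = server.submit(slow_function, (1000,))   # submit returns a callable job handle
print job()                                   # calling the handle waits for and returns the result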

Both are very easy to use, so I thought I’d do some timing tests and compare them.


My machine here at home is a 6-core AMD Phenom with 4GB of RAM (lucky me!). The cores you get from Picloud are 1.2 GHz Xeons, each with 340MB of RAM. Each job gets one of these processors, unless you set the high CPU option, in which case each gets 2.5 compute units and 850MB of RAM. For these tests I used the high CPU option.

With parallel computing a lot of the art is decomposing a problem into slices that can each run on one core in parallel, usually with some merge code that combines the results from each job into a final answer. Both these systems work by passing the cores a function and a tuple of its inputs. The people at Picloud reckon that with this functional method you still have the power to do anything an ordinary serial machine can do, and probably faster.
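To make that split/merge pattern concrete, here is a generic sketch of the decomposition step (not specific to either library) that produces the sub-ranges used below:

def split_range(lo, hi, pieces):
    """split [lo, hi] into roughly equal sub-ranges, one per job"""
    step = (hi - lo + 1) // pieces
    bounds = []
    for i in range(pieces):
        start = lo + i*step
        end = hi if i == pieces - 1 else start + step - 1
        bounds.append((start, end))
    return bounds

print split_range(1, 10000000, 10)   # the ten slices handed out below; the merge step is just sum()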

I chose a very simple calculation to perform – finding the sum of the squares of the first 10,000,000 integers. The range 1 to 10,000,000 was split into 10 sub-ranges, one and a bit for each core on my machine, or one for each 2.5-unit core on Picloud. There were five tests:

  • local machine using Parallel Python to spread the 10 jobs across my 6 cores
  • local machine without any parallelisation, just Python computing a function
  • Picloud with 10 separate parallel jobs, each on high CPU, decomposition done on the client
  • Picloud with 10 separate parallel jobs, each on high CPU, using the map function
  • Picloud without parallelisation, just one high CPU instance for the function

We’ll get to the results in a sec.

The commands used were similar for both systems, and I hope the code is fairly readable.

Here is the code:

import pp
import time
import cloud

a = pp.Server()   # Parallel Python job server; autodetects the local cores

def sq(x):
    """add all squares between x[0] and x[1] inclusive"""
    tot = 0
    for z in range(x[0], x[1]+1):
        tot += z**2
    return tot

# ten sub-ranges covering 1..10,000,000, one per job
inputs = [[1,1000000],
          [1000001,2000000],
          [2000001,3000000],
          [3000001,4000000],
          [4000001,5000000],
          [5000001,6000000],
          [6000001,7000000],
          [7000001,8000000],
          [8000001,9000000],
          [9000001,10000000]]

# test 1: local machine, no parallelisation
ti = time.time()
answer = sq((1,10000000))
print "local naive method", answer
print "that took", time.time()-ti, "seconds"

# test 2: local machine, one pp job per sub-range
ti = time.time()
jobs = []
for x in inputs:
    jobs.append(a.submit(sq, (x,)))   # submit returns a callable job handle
answer = 0
for z in jobs:
    answer += z()                     # calling the handle waits for the result
print "locally parallelised", answer
print "that took", time.time()-ti, "seconds"

# test 3: Picloud, a single high CPU job for the whole range
ti = time.time()
job3 = cloud.call(sq, (1,10000000), _high_cpu=True)
answer = cloud.result(job3)
print "non-parallel on picloud", answer
print "that took", time.time()-ti, "seconds"

# test 4: Picloud, one cloud.call per sub-range (decomposition done on the client)
ti = time.time()
jobs2 = []
for x in inputs:
    jobs2.append(cloud.call(sq, x, _high_cpu=True))
answer = 0
for z in jobs2:
    answer += cloud.result(z)
print "parallel on picloud using cloud.call()", answer
print "that took", time.time()-ti, "seconds"

# test 5: Picloud, one batched cloud.map over all the sub-ranges
ti = time.time()
jobs2 = cloud.map(sq, inputs, _high_cpu=True)
answer = sum(cloud.result(jobs2))
print "parallel on picloud using cloud.map()", answer
print "that took", time.time()-ti, "seconds"

and here is the output.

The slowest was the local machine without parallelisation, which is not so surprising. My own machine with parallel code was similar to Picloud running badly designed parallel code that didn’t take advantage of the cloud.map() function. Picloud used as it should be, with map(), was the best… whahey!!

This is far from complete and I would be interested if anyone wants to duplicate or extend what I have done here. I suspect that if the scale of the task were increased these rankings might change, since longer jobs would reduce the effects of latency and other overheads. Also, I didn’t push the RAM envelope.

Another thing worth mentioning is that Python, being interpreted, will be a lot slower than a multi-processor system using C++. I love the language though, so I will continue to use both these systems and dream about doing something original with them. Picloud also includes a simulator, so you can test and debug code on the local machine without paying until everything’s perfect.
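If memory serves, switching the simulator on is a single call before you submit anything (I’m writing this from memory, so check the Picloud docs for the exact incantation):

import cloud

cloud.start_simulator()          # jobs now run locally instead of hitting the paid service
jid = cloud.call(sq, (1, 100))   # same API as before, no bill
print cloud.result(jid)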

Cheers Picloud, and how about cancelling my 1 cent bill as a token of thanks for the free advert!! And special thanks to Aaron for his feedback.

Cheers also to Parallel Python… it gets the Pythonism simplicity award.


Written by Trip Technician

July 11, 2011 at 1:28 am

3 Responses


  1. regarding c++, it may be interesting to mention shedskin, a (restricted!)-python-to-c++ compiler: http://shedskin.googlecode.com. http://shed-skin.blogspot.com. shedskin by itself can often give a speedup of 50 times or more, especially for computationally intensive code. but it can also be combined with parallel python or some other parallel processing library.

    mark dufour

    July 11, 2011 at 12:06 pm

  2. Hey,

    Aaron here from the PiCloud team. I saw this nice comparison post and wanted to give you some pointers on why you are seeing faster speeds with one job than in parallel.

    With every job call/result, there is some overhead experienced. There’s network overhead with the actual function as well as some internal overhead to run your jobs.

    What you are seeing is a case where the overhead exceeds the gains from parallelization, so we’ll need to do longer tests and minimize overhead.

    First off, there is additional overhead incurred on the first call to PiCloud, namely resolving the PiCloud server, opening a connection to it, and doing module dependencies analysis. For a more precise benchmark, do an untimed and quick cloud.call before testing PiCloud, so that all subsequent tested functions are starting “warm”.

    Secondly, you’ll want to minimize your chatting with PiCloud by using cloud.map in lieu of multiple calls and use batch results. Here’s the code:

    ti=time.time()
    jobs2 = cloud.map(sq,*zip(*inputs),_high_cpu=True) #replace call for loop
    answer = sum(cloud.result(jobs2)) #replace result for loop
    print "sum of squares from 1 to 6000000 parallel on picloud is",answer

    (See our docs for more examples of batch queries: http://docs.picloud.com/basic_examples.html#correct )

    Finally, consider running longer jobs to minimize the effects of overhead (this applies to parallel python as well).

    Oh and shoot me an email (aaron at picloud dot com) with the email address you use if you’d like some credits. :)

    Thanks,
    Aaron

    PiCloud Inc.

    July 14, 2011 at 12:40 am

    • thanks a lot for that, I’ll implement what you have told me soon. :-)

      Trip Technician

      July 14, 2011 at 4:46 am

