R
R: Iterators
These are handy if you want to access a massive list/array/container without loading it all into memory first. They are an ugly bolt-on library though, and cannot be used from within a for loop! So they are sort of limited in utility.
R: Apply
I regularly confuse these. The apply functions repeatedly apply a function to a lot of individual elements. Many parallel routines are parallel versions of these higher-level functions.
- lapply: apply a function to each element of a list/vector.
- sapply: simplify the lapply return list to a vector or array if possible.
- apply: apply a function to rows, columns, or elements of an array.
- tapply: apply a function to subsets of a list/vector.
- mapply: apply a function to the "transpose" of a list. For example, pass two lists of length three; the function is applied to the first items of both lists, then the second items, then the third.
Examples below:
Performance tip: if you know the size and type of each value that sapply will return, create an example vector/matrix of that size and type and use vapply, passing it the example object as the third parameter, FUN.VALUE (everything else stays the same). This can be faster, and more memory-efficient for large outputs.
R: Plotting
- par(new = TRUE): keeps the next plot from overwriting your previous plot (overwriting is the default behaviour).
- par(family = 'HersheySans'): change the font family to a vector-drawn font. Always use vector-drawn fonts for publication-quality plots.
- par(font = 2): change the font format. 1 - default, 2 - bold, 3 - italic, 4 - bold italic.
- par(ann = FALSE): do not annotate the plot. In this case you must label your axes after the plotting function is called.
- par(mfrow = c(2,2)): make a 2 x 2 grid of plots. Allows you to put several plots together.
To save a plot after it has been made, use one of the dev.copy(...), dev.copy2pdf(...), etc. commands.
R: Decision Tree Example
Parallel Programming in Python
From what I saw today, it looks like the real packages to learn are multiprocessing and numexpr for performance increases. We also talked about parallelizing across multiple IPython sessions, but I think it might be worth going over the R parallel implementations, as the Python ones are pretty complicated.
Python: Garbage Collection
Python, like R, relies on a "garbage collector" to clean up un-needed variables and limit memory usage. "Every so often" a garbage collection task runs and deletes objects that can no longer be reached. You can force the garbage collector to run at any time by calling gc.collect() from the gc module.
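A minimal sketch of forcing a collection, assuming CPython. The Node class and the collect_cycle function are made up for illustration. Two objects that point at each other cannot be freed by reference counting alone, so the weak reference only goes dead once the collector has run.

```python
import gc
import weakref


class Node:
    """Hypothetical class, used only to build a reference cycle."""

    def __init__(self):
        self.partner = None


def collect_cycle():
    """Build an unreachable cycle, force a collection, report if it was freed."""
    a, b = Node(), Node()
    a.partner, b.partner = b, a   # a and b now refer to each other
    probe = weakref.ref(a)        # lets us check later whether a still exists
    del a, b                      # the cycle is unreachable from here on
    gc.collect()                  # force a full garbage collection pass
    return probe() is None        # True once the cycle has been freed


if __name__ == "__main__":
    print(collect_cycle())
```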
Python: Out of Core Computation
Out of core or "external memory" computation leaves the data on disk, bringing into memory only what is needed, or what fits, at any given time. For some computations, this works out well (but note: disk access is always much slower than memory access). The np.memmap class does this; other techniques: pytables, hdf5, numpy.
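As a rough sketch of the np.memmap idea (the file name, array size and chunk size below are arbitrary choices of mine, and NumPy is assumed to be installed): write an array to disk, then sum it one chunk at a time so that only a slice is ever pulled into memory.

```python
import os
import tempfile

import numpy as np


def chunked_sum(path, n_items, chunk=1000):
    """Sum a float64 array stored on disk, reading one chunk at a time."""
    data = np.memmap(path, dtype=np.float64, mode="r", shape=(n_items,))
    total = 0.0
    for start in range(0, n_items, chunk):
        # Only this slice is read from disk into memory.
        total += float(data[start:start + chunk].sum())
    return total


def demo():
    """Create a small on-disk array, then sum it out of core."""
    n = 10_000
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")  # hypothetical file name
        out = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
        out[:] = np.arange(n)
        out.flush()   # make sure the data is written to disk
        del out       # drop the writable mapping before re-opening
        return chunked_sum(path, n)


if __name__ == "__main__":
    print(demo())
```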
Python: Just in Time Compilation
Just-in-time compilation for Python can really speed up computations, at the small cost of bizarre syntax.
Python: Forking Processes
The script below (fork.py) uses a fork to run a separate piece of code (child.py) in a child process:
child.py
fork.py
Use os.waitpid(child_pid, 0) if you need to wait for the child process to finish. Otherwise the parent will exit and the child will live on. fork() is a Unix system call. It doesn't work on Windows, except under Cygwin.
This must be used very carefully. The child gets a copy of ALL of the parent's data, including file handles, open sockets, database connections, etc. Be sure to exit the child using os._exit(0) rather than sys.exit(0), or else the child process will try to clean up resources that the parent process is still using. Because of this, fork() can lead to code that is difficult to maintain long-term.
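The fork / waitpid / os._exit pattern described above can be sketched roughly as follows. This is not the original fork.py or child.py; run_in_child and the task argument are names I made up, and it only runs on Unix.

```python
import os


def run_in_child(task):
    """Run task() in a forked child; return the child's exit code (Unix only)."""
    pid = os.fork()
    if pid == 0:
        # Child process: do the work, then leave via os._exit so that we
        # skip the clean-up handlers inherited from the parent.
        code = 1
        try:
            task()
            code = 0
        finally:
            os._exit(code)
    # Parent process: block until the child has finished.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_in_child(lambda: sum(range(1000))))
```

The finally clause means the child always ends in os._exit, even when task() raises, so it can never fall through into the parent's code.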
Python: Threads (Multi?)
- process: resources needed to execute a program.
- thread: a path of execution within a process. Faster to create and destroy than a process.
Threads within the same process share the same address space. This means they can share the same memory and can easily communicate with each other. However, the Python interpreter (CPython) uses the Global Interpreter Lock (GIL). The GIL lets only one thread of a Python program execute Python bytecode at a time. This protects the interpreter's internals from race conditions, but it also means that only one core is used at any given time, which prevents proper multi-threading for CPU-bound code.
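To illustrate the shared address space, here is a small sketch (the function and variable names are my own) where several threads update the same dictionary. The GIL does not make `+=` atomic, so the update is still wrapped in a threading.Lock.

```python
import threading


def count_with_threads(n_threads=4, n_increments=1000):
    """Have several threads increment one shared counter; return the total."""
    counter = {"value": 0}      # lives in the process's shared memory
    lock = threading.Lock()

    def worker():
        for _ in range(n_increments):
            with lock:          # guard the read-modify-write step
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


if __name__ == "__main__":
    print(count_with_threads())
```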
Python: Multiprocessing
Unlike fork, multiprocessing works on Windows (better portability). Its start-up time is slightly longer than for threads. Multiprocessing spawns separate processes, like fork, and as such they each have their own memory. Anything passed to a spawned process must be picklable, so passing non-picklable objects, such as sockets, to spawned processes is not possible.
Multiprocessing pools:
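A minimal pool sketch, not the original example. The square and parallel_squares names are mine. I ask for the "fork" start method explicitly so the snippet behaves the same on any Unix Python; on Windows you would use the default context instead and keep the `if __name__ == "__main__":` guard.

```python
import multiprocessing as mp


def square(x):
    """Worker function; defined at module level so it can be pickled."""
    return x * x


def parallel_squares(values, workers=2):
    """Map square() over values using a pool of worker processes."""
    ctx = mp.get_context("fork")   # assumption: running on Unix
    with ctx.Pool(processes=workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(parallel_squares(range(5)))
```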
Shared memory:
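A sketch of sharing a single float between processes through multiprocessing's Value wrapper, under the same Unix "fork" assumption. The function names are hypothetical. The `+=` here is a separate read and write, so updates from different processes can overwrite each other and the result may come out lower than expected.

```python
import multiprocessing as mp


def add_many_unsafe(total, n):
    """Increment the shared value n times WITHOUT holding a lock."""
    for _ in range(n):
        total.value += 1.0   # read, add, write back: not atomic


def unsafe_total(n_procs=4, n=500):
    """Run several unsynchronised writers against one shared double."""
    ctx = mp.get_context("fork")    # assumption: running on Unix
    total = ctx.Value("d", 0.0)     # "d" = C double in shared memory
    procs = [ctx.Process(target=add_many_unsafe, args=(total, n))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return total.value              # at most n_procs * n, possibly less


if __name__ == "__main__":
    print(unsafe_total())
```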
Race condition above! We can prevent this by using Lock (we had multiple processes talking to the same piece of memory in the above condition, so now we will explicitly prevent that using a lock).
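A sketch of the Lock fix, again with made-up names and the Unix "fork" start method as an assumption. Every process has to take the same lock before touching the shared value, so no update is lost.

```python
import multiprocessing as mp


def add_many(total, lock, n):
    """Increment the shared value n times, holding the lock each time."""
    for _ in range(n):
        with lock:               # only one process may update at a time
            total.value += 1.0


def locked_total(n_procs=4, n=500):
    """Run several writers against one shared double, guarded by a Lock."""
    ctx = mp.get_context("fork")            # assumption: running on Unix
    total = ctx.Value("d", 0.0, lock=False)  # raw shared double
    lock = ctx.Lock()                        # explicit lock shared by all
    procs = [ctx.Process(target=add_many, args=(total, lock, n))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return total.value


if __name__ == "__main__":
    print(locked_total())
```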
Multiprocessing also allows you to share a block of memory through the Array ctypes wrapper. This allows us to work with 1D arrays, not just single floats.
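A rough sketch of the Array wrapper (function names and the slicing scheme are my own; Unix "fork" start method assumed, and values is assumed to be non-empty). Each process scales its own slice of one shared 1D array in place, and the parent reads the result back afterwards.

```python
import multiprocessing as mp


def scale_slice(shared, start, stop, factor):
    """Multiply shared[start:stop] by factor, in place."""
    for i in range(start, stop):
        shared[i] *= factor


def scale_in_parallel(values, factor=2.0, n_procs=2):
    """Scale a list of floats using several processes and one shared Array."""
    ctx = mp.get_context("fork")      # assumption: running on Unix
    shared = ctx.Array("d", values)   # 1D array of C doubles in shared memory
    n = len(shared)
    step = (n + n_procs - 1) // n_procs   # slice length per process
    procs = [ctx.Process(target=scale_slice,
                         args=(shared, s, min(s + step, n), factor))
             for s in range(0, n, step)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return shared[:]                  # copy the contents out as a plain list


if __name__ == "__main__":
    print(scale_in_parallel([1.0, 2.0, 3.0, 4.0]))
```

The slices do not overlap, so the processes never write to the same element.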