Kerrighed: use cases

Introducing Kerrighed
Load balancing
Checkpoint/restart
Conclusion
Cyril Brulebois
[email protected]
Kerrighed
http://www.kerrighed.org/
Kerlabs
http://www.kerlabs.com/
What’s Kerrighed?
Single-System Image (SSI) cluster system
Patched Linux kernel, plus userland tools
Started at INRIA in 1999, collaboration with University of
Rennes 1 and EDF
Since 2006, mainly developed by Kerlabs, an INRIA spin-off
Released under the GPL
Last releases:
Kerrighed 2.4.4, based on Linux 2.6.20 (January 29th, 2010)
Kerrighed 3.0.0, based on Linux 2.6.30 (June 14th, 2010)
What’s new? What’s planned?
Feature table (columns: v2.4.4, v3.0.0, in 2011), by area:
Memory: Sharing, Injection
CPU: Process migration, Thread migration, Pool of threads migration, Configurable scheduler
IPC: SysV, POSIX
Network: Migratable streams, Cluster IP, High performance
C/R: Single process, Applications, Open files, File substitution, IPC
Misc: Node hotplug
Managing nodes (1/2)
To manage cluster and nodes, a single command: krgadm.
# krgadm cluster status
status: up on 4 nodes
# krgadm nodes status
1:online
2:online
3:online
4:online
5:present
6:present
7:present
8:present
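The status listing is easy to post-process. A minimal sketch, assuming the `N:state` line format shown above; the helper name nodes_in_state is hypothetical and not part of krgadm:

```shell
#!/bin/sh
# nodes_in_state STATE: read "krgadm nodes status"-style lines ("N:state")
# on stdin and print the IDs of the nodes in the given state.
nodes_in_state() {
    awk -F: -v state="$1" '$2 == state { print $1 }'
}
```

For instance, `krgadm nodes status | nodes_in_state present` would list the nodes that are visible but not yet part of the cluster.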
Managing nodes (2/2)
Adding 2 nodes:
# krgadm nodes add --count 2
Waiting for 2 nodes to join... done
Adding nodes [5,6]... done
Adding 2 nodes, in a different way:
# krgadm nodes add --total 8
Waiting for 2 nodes to join... done
Adding nodes [7,8]... done
Managing features
Use of capabilities to turn Kerrighed features on/off as
appropriate. Examples:
krgcapset -d +DISTANT_FORK
krgcapset --pid 6291711 -e +CAN_MIGRATE
Also possible through libkerrighed.
Most common capabilities:
DISTANT_FORK: may fork remotely.
CAN_MIGRATE: may be migrated while running.
CHECKPOINTABLE: may be checkpointed.
SEE_LOCAL_PROC_STAT: only see local resources.
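As an illustration, the per-process form of krgcapset shown above can be applied to every process with a given name. This is only a sketch: the function name allow_migration is hypothetical, and it assumes pgrep is available on the node where it runs.

```shell
#!/bin/sh
# allow_migration NAME: enable CAN_MIGRATE on every process whose name is
# exactly NAME, using the "krgcapset --pid PID -e +CAP" form from the slide.
allow_migration() {
    for pid in $(pgrep -x "$1"); do
        krgcapset --pid "$pid" -e +CAN_MIGRATE || return 1
    done
}
```

Usage would be along the lines of `allow_migration R`.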
1. Introducing Kerrighed
What’s Kerrighed?
Managing nodes
Managing features
2. Load balancing
Use case 1: build platform
Use case 2: network computing
Use case 3: distributed rendering
Use case 4: webservers
Use case 5: parallel computing
Use case 6: LTSP
There’s more!
3. Checkpoint/restart
Use case 1: long running computations
Use case 2: playing with sockets
Use case 1: build platform
Setup: $C cores
Capability: DISTANT_FORK
Trivial:
make -j$C
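Put together, a build wrapper could look like the sketch below. The function name build_on_cluster is hypothetical. It assumes that `krgcapset -d +DISTANT_FORK` (as shown earlier) makes the capability available to the processes started afterwards, and that the core count reported by getconf covers the whole cluster when SEE_LOCAL_PROC_STAT is off; set $C explicitly if that does not hold.

```shell
#!/bin/sh
# build_on_cluster [MAKE ARGS...]: allow remote forks, then run make with
# one job per core. $C overrides the detected core count.
build_on_cluster() {
    # Assumption: this count reflects all cluster cores, not just local ones.
    cores=${C:-$(getconf _NPROCESSORS_ONLN)}
    krgcapset -d +DISTANT_FORK || return 1
    make -j"$cores" "$@"
}
```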
Use case 2: network computing
Capabilities: DISTANT_FORK and/or CAN_MIGRATE
Example: BOINC (Berkeley project), running *@home, PrimeGrid,
etc.
How: Just start BOINC! It runs as many children as there are
cores. It starts new children as they return.
Use case 3: distributed rendering
Setup: $C cores
Example: Blender, rendering a 1 → $N frame range.
How: Blender can render in batch mode, either a single frame at a
time or a whole frame range.
blender -b foo.blender -F PNG -o //render_####.png -f $i
Naive approach
Trivial implementation:
Spawn $C processes.
Wait for all of them to return.
Back to spawning until the last frame is rendered.
Issue: if some frames are quicker to render than others, the global
wait will leave some cores idle.
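The naive loop can be sketched in plain sh as follows. The function name render_naive is hypothetical; the per-frame command is passed as arguments and the frame number is appended to it.

```shell
#!/bin/sh
# render_naive FIRST LAST CORES COMMAND...: run "COMMAND <frame>" for each
# frame, in batches of CORES, waiting for a whole batch before the next one.
render_naive() {
    first=$1 last=$2 cores=$3
    shift 3
    i=$first
    while [ "$i" -le "$last" ]; do
        n=0
        # Spawn up to $cores background jobs.
        while [ "$n" -lt "$cores" ] && [ "$i" -le "$last" ]; do
            "$@" "$i" &
            i=$((i + 1))
            n=$((n + 1))
        done
        # Global wait: cores whose frame finished early sit idle here.
        wait
    done
}
```

With the Blender command line from the previous slide, a call might read `render_naive 1 "$N" "$C" blender -b foo.blender -F PNG -o //render_####.png -f`.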
Smarter approach
A more efficient implementation:
Spawn $C processes.
Wait for one of them to return.
Spawn a new process unless the last frame has been reached.
Back to waiting.
This keeps $C processes running until the last frame is reached,
with almost no idle time.
Many other renderers and mostly anything scriptable can be run
this way, with this single and simple “job scheduler”. No need for
complex client/server solutions.
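One way to get this behaviour without hand-writing the wait-for-one loop is to delegate it to `xargs -P`, which starts a new process as soon as a running one returns. This is a substitute for the loop described above, not necessarily what was used for the talk; `-P` is a GNU/BSD extension rather than POSIX, and the function name render_pool is hypothetical.

```shell
#!/bin/sh
# render_pool FIRST LAST CORES COMMAND...: run "COMMAND <frame>" for each
# frame, keeping up to CORES processes busy at any time.
render_pool() {
    first=$1 last=$2 cores=$3
    shift 3
    # Nothing to do for an empty range.
    [ "$first" -le "$last" ] || return 0
    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$i"
        i=$((i + 1))
    done | xargs -P "$cores" -n 1 "$@"
}
```

A call might read `render_pool 1 "$N" "$C" blender -b foo.blender -F PNG -o //render_####.png -f`.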
Use case 4: webservers
Example: Apache with the prefork MPM (Multi-Processing Module),
which is non-threaded and pre-forking.
Drawback: Socket handling.
Short-term solution: Enable an extra capability to have all sockets
listen on a given node, which acts as an entry point.
Long-term solution: Use cluster IP.
Use case 5: parallel computing
Example: R and its multicore package.
Code: Replace %do% with %dopar%
library("doMC")
registerDoMC()
x <- foreach(i=1:42) %dopar% svd(matrix(rnorm(1000*1000),ncol=1000))
Cores are automatically detected, but the worker count can be
tweaked by calling:
options(cores=42)
Use case 6: LTSP
Example: Run one VNC server per user on the first node.
Launched applications get load-balanced over the whole cluster.
Possible issue with desktop environments: Heavy use of local
networking services (e.g. D-Bus).
Possible solutions:
Same as the web servers use case: use an extra capability to
direct all sockets to a given node.
Cluster IP?
Probably a bit more complex: loopback, global address space
for UNIX sockets, etc.
There’s more!
Schedulers for DISTANT_FORK and CAN_MIGRATE can be tweaked,
extended, or replaced. Configurable through configfs, thanks to:
Probes (e.g. free RAM).
Filters (e.g. reaching some threshold).
Policies.
Process sets and node sets.
Some possible policies, in addition to load balancing:
Swap avoidance
Disk I/O balancing
Slightly more complex: keep interactive applications local, move
others away, and welcome them back when there are no more
interactive applications (use case: Network of Workstations in
universities).
3. Checkpoint/restart
Use case 1: long running computations
Why: Even with parallel algorithms running on powerful clusters,
computations can take hours, days, weeks, or more.
Checkpoint/restart is useful in case of hardware failures, system
errors, etc.
Example: R.
How: Either enable the CHECKPOINTABLE capability, or use a
wrapper which also creates a new session for the program to be
checkpointed:
krgcr-run -- R
Create a checkpoint every hour:
while :; do checkpoint "$(pgrep -x R | head -1)"; sleep 3600; done
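A slightly more defensive variant of the periodic loop stops once the process is gone. This is a sketch only: the function name checkpoint_every and its optional MAX argument are hypothetical, and the liveness test with `kill -0` assumes the caller is allowed to signal the process.

```shell
#!/bin/sh
# checkpoint_every PID INTERVAL [MAX]: checkpoint PID every INTERVAL seconds
# while it is alive; stop after MAX checkpoints if MAX is given and non-zero.
checkpoint_every() {
    pid=$1 interval=$2 max=${3:-0}
    taken=0
    while kill -0 "$pid" 2>/dev/null; do
        checkpoint "$pid" || return 1
        taken=$((taken + 1))
        if [ "$max" -gt 0 ] && [ "$taken" -ge "$max" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

For the hourly case, a call might read `checkpoint_every 6291648 3600`.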
Step by step (1/2)
Starting:
$ krgcr-run R
Running application 6291648
R version 2.11.1 (2010-05-31)
[...]
>
Checkpointing:
$ checkpoint 6291648
Freezing application in which process 6291648 is involved...
Checkpointing application in which process 6291648 is involved...
Identifier: 6291648
Version: 1
Description: No description
Date: Thu Jul 8 22:54:06 2010
Unfreezing application in which process 6291648 is involved...
Step by step (2/2)
Contents of the checkpoint:
$ ls /var/chkpt/6291648/v1/
description.txt  node_1.bin        task_6291648.bin
global.bin       shared_obj_1.bin  user_info_1.txt
Restarting:
$ restart 6291648 1
Restarting application 6291648 (v1) ...
Application 6291648 has been successfully restarted
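When several checkpoints exist, the most recent version can be looked up before restarting. A minimal sketch, assuming each version lives in a `v<N>` directory under /var/chkpt/<application> as in the listing above; the function name latest_checkpoint_version is hypothetical.

```shell
#!/bin/sh
# latest_checkpoint_version APP [ROOT]: print the highest N for which
# ROOT/APP/vN exists (ROOT defaults to /var/chkpt); fail if there is none.
latest_checkpoint_version() {
    app=$1 root=${2:-/var/chkpt}
    best=
    for dir in "$root/$app"/v*/; do
        [ -d "$dir" ] || continue
        v=${dir%/}
        v=${v##*/v}
        # Skip anything that is not a plain version number.
        case $v in
            ''|*[!0-9]*) continue ;;
        esac
        if [ -z "$best" ] || [ "$v" -gt "$best" ]; then
            best=$v
        fi
    done
    [ -n "$best" ] && echo "$best"
}
```

A restart of the latest version might then read `restart 6291648 "$(latest_checkpoint_version 6291648)"`.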
Use case 2: playing with sockets
Problem: checkpointing sockets isn’t supported yet.
Solution:
Force checkpointing: -i option for checkpoint.
Use file descriptor substitution.
Plugging a given file descriptor on a given checkpoint identifier:
$ cat /var/chkpt/6291711/v1/user_info_1.txt
tty    |0001FFFF88003ED8AC48|/dev/pts/1|6291711:0,6291711:1,6291711:2
socket |0001FFFF88007E5F3168|socket:[162646]|6291711:3
$ restart -s 0001FFFF88007E5F3168,0 6291711 1
Restarting application 6291711 (v1) ...
Application 6291711 has been successfully restarted
Future: Use that to restore the Unix socket to the X server?
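The substitution key can be extracted from user_info_1.txt rather than copied by hand. A sketch, assuming each record is one line of `|`-separated fields (type, key, path, owners) as in the example above; the function name socket_keys is hypothetical.

```shell
#!/bin/sh
# socket_keys FILE: print the key (second field) of every record whose
# type (first field, ignoring padding) is "socket".
socket_keys() {
    awk -F'|' '{ t = $1; gsub(/[ \t]+/, "", t) } t == "socket" { print $2 }' "$1"
}
```

With a single socket in the checkpoint, the restart line might read `restart -s "$(socket_keys /var/chkpt/6291711/v1/user_info_1.txt),0" 6291711 1`.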
Conclusion
Kerrighed’s strong features right now:
Stability
Flexibility
Can be configured/tweaked to suit specific needs
General solution for many common use cases
Kerrighed’s next features (short to mid-term):
Performance
More flexibility
Partial thread support
The end
Thanks for your attention! Questions?
A few pointers:
Want to play?
→ http://www.kerrighed.org/
Want to talk?
→ kerrighed.{users,dev}@listes.irisa.fr
→ #kerrighed on Freenode
Want to apply for a nice job?
→ [email protected]