Saturday, 19 January 2013

Sneaky resources

What's the fuss all about?

Every so often an application has to deal with some sort of resources. There's been plenty of books and articles written about it and they are all useful. They even attempt to go beyond managing memory as this is only one of many resources. The resources often mentioned are network connections, database connections, file system objects and the like. Not only the application has to release them when it's done with them but also has to ensure that they are not held when errors occur and a different execution path is taken.

In C++ error conditions are very often signalled by throwing an exception which automatically unwinds the call stack. This is very handy if one knows how to leverage it. For unwary this could be pernicious though. In some ways C is more convenient in this regard as there could not be any exception thrown when a C function is called. The real problem I often see in code is when both worlds have to coexist:

  • C++ -> C call
    The most likely scenario is when a C++ code calls some C functions that return some sort of handles (very often pointers) to some resources (or simply memory) which are expected to be released with another (complementary) C function call.
  • C -> C++ call
    The other scenario is when a C code calls some C++ function (quite often some sort of C++ callback function registered with a C API). Care need to be taken to not let any exceptions back into C as this causes undefined behaviour. If you're lucky the program aborts immediately, otherwise you have to gape at meaningless call stacks and waste some time to figure out what happened. In this situation (if you know that C calls C++) a good idea is to set a breakpoint on a library routine that is responsible for delivering exceptions (for example __cxa_throw in glibc) and laboriously debug the program to find which exception gets into C.
In this article I want to focus on the C++ calling C scenario.

Most of resources are released by the operating system if the application exits or terminates. So network connections, file descriptors and all sort of sockets are closed, memory is released etc. There are some less obvious resources though which result in leaving undeleted temporary files behind, unreleased IPC resources or (for paranoid) unwiped memory (RSA encryption/decryption). I admit I haven't ever attempted to deal with application recovery and I'm usually happy if the application at least reports some error message and simply exits. But in the face of a less obvious resources leakage and simply for diligence I like to have all toys tidied up nicely no matter what bad course the application has taken. This does not include an immediate abortion which is really horrible and there's not much one can do about it. Even if releasing resources on application exit doesn't matter now, some bits of code or the entire application may get reused somewhere where it matters.

Some unwary programmer may write something like this:

void
foo()
{
  char *const p = get_some_c_string();
  std::string s(p);
  // do some fancy stuff with the string
  free(p);
}
Now this is a typical scenario when talking about exceptions. In this situation a lot of operations with std::string (including its creation) can throw. This will inevitably leak memory returned by get_some_c_string(). I won't dwell on it as a lot has been said about it elsewhere and it usually ends up using some sort of smart pointers.

What to do? What to do?

Let me introduce some small utility function that creates a scoped pointer that draws upon std::unique_ptr from C++11.
template<class T, class D = std::default_delete<T>>
constexpr std::unique_ptr<T, D>
make_scoped(T *const ptr, D deleter = D())
{
  return std::unique_ptr<T, D>(ptr, deleter);
}
Now let's assume that one wants to verify a certificate using OpenSSL. This task itself could probably make a lot of people nervous not to mention they would have to manage related resources properly. Now let's focus on adding some policies to the verification process:
void
addPolicy(X509_VERIFY_PARAM *const params, const std::string& policy)
{
  auto policyObj =
    make_scoped(OBJ_txt2obj(policy.c_str(), 1), ASN1_OBJECT_free);
    
  if (!policyObj) {
    throw std::runtime_error("Cannot create policy object");
  }

  if (!X509_VERIFY_PARAM_add0_policy(params, policyObj.get())) {
    throw std::runtime_error("Cannot add policy");
  }

  // committed
  (void)policyObj.release();
}
An interesting use case here is a "transactional" approach. Note that OpenSSL functions that have 0 in their name take ownership of objects passed to them. This is explained in the notes section here but it may appear to you to be the opposite if you read it for the first time. The addPolicy() function creates a managed policy object and do not hesitate to throw if something goes wrong. But once we successfully passed the ownership of the policy object to another object, we can give up its ownership in our scope (remember that C functions do not throw) and the "transaction" is "committed". Nice and easy. Then some part of initialization of the verification process could look like this:
const auto params =
  make_scoped(X509_VERIFY_PARAM_new(), X509_VERIFY_PARAM_free);
  
(void)X509_VERIFY_PARAM_set_flags(
  params.get(), X509_V_FLAG_POLICY_CHECK | X509_V_FLAG_EXPLICIT_POLICY);
(void)X509_VERIFY_PARAM_clear_flags(
  params.get(), X509_V_FLAG_INHIBIT_ANY | X509_V_FLAG_INHIBIT_MAP);
  
addPolicy(params.get(), "1.2.3.4");

// set up other stuff and do the verification
// ...
This is even nicer. It's a bliss and harmony.

I mentioned that letting the program exit without releasing memory shouldn't make any harm as the operating system would reclaim it anyway. But if you're paranoid and deal with some secrets in memory, then this becomes an issue. To address this problem just do the following:

void
someRsaStuff()
{
  auto rsa = make_scoped(RSA_new(), RSA_free);
  // do some stuff with the RSA key
  // ...
}
Note that RSA_free() function erases the memory.

It's also convenient to do some more complicated things with lambdas which in pre-C++11 would have to be wrapped into a functor:

const auto untrustedCerts =
  make_scoped(
    sk_X509_new_null(),
    [](STACK_OF(X509)* const s) { sk_X509_pop_free(s, X509_free); });
// now add untrusted certificates to the stack
// ...

No worries

Using some C++11 goodies can help you to manage different sort of resources and focus on the problem rather worrying about releasing stuff especially in error paths. It also gives a confidence and clarity of what is released where and how. In some scenarios you can also use the transactional approach.

Source code for the examples above is available here. On my Fedora 17 I built it as follows:

g++ -std=c++11 scoped-ptr-example.cpp -lcrypto -o /tmp/scoped
For those who can't afford using C++11 but can use Boost libraries, the following could be a replacement for a scoped pointer with a custom deleter:
#include <boost/function.hpp>
#include <boost/interprocess/smart_ptr/unique_ptr.hpp>

namespace detail {

template<class T>
class unique_ptr_deleter
{
public:
    template<class D>
    unique_ptr_deleter(const D& d) : deleter(d) {}

    void operator()(T* const p) throw() { deleter(p); }

private:
    const boost::function<void (T* const)> deleter;
};

} // namespace detail

template<class T>
struct unique_ptr
{
    typedef boost::interprocess::unique_ptr<
      T, detail::unique_ptr_deleter<T> > type;
};
This can be used as follows:
const unique_ptr<RSA>::type rsa(RSA_new(), RSA_free);

Saturday, 12 January 2013

Debugging threaded application crash

The story

Some time ago I had a nasty issue of an application crashing due to heap corruption. Quite quickly I discovered it was related to the intensively multi-threaded nature of the test case that reproduced the crash that happened in the field. The call stacks used to end up somewhere in malloc()/free() pair. I said they used to, as there where different ways it was crashing (different call stacks in the back trace). It was another reason to think that multiple threads were making harm to each other. To make this story more exciting (or painful to me at that time), the problem couldn't be reproduced on x86 PC. Only some specific timing conditions on the MIPSEL system could make the crash happen.

The malloc()/free() functions are thread safe (even in uClibc). I proved it to myself by looking at the back traces showing some internal locking functions called from within malloc()/free() (obviously this could also be inferred from the code). Once the memory is malloc-ed in one thread, there's no harm the other thread can do as there's no way malloc() can have internal structures corrupted from the other thread. The free() function is a different story. Although it is not possible to free memory simultaneously (corrupt allocator structures), it is possible to do it twice. Once per thread. This was my theory from the day one.

Tools

First thing to do is to throw some ready to use tools at the problem. Ideally I'd have looked at some reverse debugging tool, had I known at that time they existed for real. Recently someone I worked with referred me to UndoDB which I'd have at least evaluated, had I known about it and if it were available for MIPSEL (it's not as of the time of this writing). Most memory debugging tools have receded having their functionality taken over by Valgrind. As I was (un)lucky to observe the problem only on the MIPSEL platform, there was no chance to use it as it does not support MIPS(EL) and is not likely to support it in the near future. Another one is efence which is excellent in detecting memory corruption but it uses memory very intensively (every malloc() call results in at least two pages allocated due to the way it works). In my particular case the system was running out of memory when using efence.

It was time to get a grip and write my own rough and ready tool.

Let's crack on with it

The idea was to interpose malloc()/free() and their friends. Once they are interposed, I could put my own housekeeping information in allocated blocks to help analyse the cause of the crash. And even more, I could detect some conditions, e. g. double free() calls (thanks to Tish for the inspiration).

First of all I wanted to have some counter incremented on every free() call and once the non-zero value was detected I wanted it to abort immediately so I could analyse the state and the context. It was also useful to store some information (return address, thread ID) about the caller that malloc-ed the memory chunk and the caller that free-ed it.

To make easier finding my housekeeping information when examining memory, I wanted to put some magic and easy to spot values there. They would also serve as execution fences, i. e. with LSB set on a value interpreted as a function pointer, the call on many platforms would result in SIGBUS.

Implementation

I got to the final implementation in few iterations after making some mistakes. At least it might save someone's else time although getting through all these mistakes was enriching. The implementation is (intended to be) thread safe. Note that a naive use of mutexes here causes a lot of problems (atomic operations are used directly instead). Similarly calling printf()-like functions is not easy/possible as they call malloc()/free().

The intention of this implementation is to detect double free() call under the control of the debugger (similar to the way efence works). It is MIPSEL specific but it should be easy to adapt for other platforms. It's available here. Below I only present the gist of it.

template<class Fn>
Fn getNextFunction(char const* const name) {
  ::dlerror();
  if (void* const sym = ::dlsym(RTLD_NEXT, name)) {
    return reinterpret_cast<Fn>(sym);
  }
  
  ABORT_HERE;
}

void* malloc(size_t size) {
  size_t ra;
  asm volatile("move %0, $ra" : "=r" (ra)); // get the return address

  typedef void*(*fn_malloc_t)(size_t);
  static fn_malloc_t fn_malloc = getNextFunction<fn_malloc_t>("malloc");

  char* const ptr =
    static_cast<char*>(fn_malloc(sizeof(MallocInfo) +
    size +
    sizeof(MallocInfoBack)));

  if (!ptr) {
    return ptr;
  }

  const MallocInfo info = {
    { MAGIC1, MAGIC1, MAGIC1, MAGIC1 },
    ra, pthread_self(), 0u, 0u, size, 0u,
    { MAGIC2, MAGIC2, MAGIC2, MAGIC2 }
    };

  const MallocInfoBack infoBack = {
    { MAGIC3, MAGIC3, MAGIC3, MAGIC3 },
    pthread_self(), size,
    { MAGIC4, MAGIC4, MAGIC4, MAGIC4 }
  };

  *reinterpret_cast<MallocInfo*>(ptr) = info;
  *reinterpret_cast<MallocInfoBack*>(ptr + sizeof(MallocInfo) + size) =
    infoBack;

  return ptr + sizeof(MallocInfo);
}

void free(void* ptr) {
  size_t ra;
  asm volatile("move %0, $ra" : "=r" (ra)); // get the return address
  typedef void (*fn_free_t)(void*);
  static fn_free_t fn_free = getNextFunction<fn_free_t>("free");

  if (ptr) {
    ptr = static_cast<char*>(ptr) - sizeof(MallocInfo);
    MallocInfo* const mi = reinterpret_cast<MallocInfo*>(ptr);

    if (0 != __sync_fetch_and_add(&mi->freeCnt, 1)) {
      // this is it - someone alreade freed the memory

      // now these two bits of information are preserved in the global variables
      // as the compiler may (re)use registers and stack heavily and it's easier
      // to find out in the disassembly where these values are stored when they
      // are assigned to global variables
      gRaFree = mi->raFree;
      gTidTerminator = mi->tidTerminator;

      ABORT_HERE;
    }

    mi->raFree = ra;
    mi->tidTerminator = pthread_self();
  }

  fn_free(ptr);
}
To compile/build:
mipsel-linux-g++ \
  -shared -fPIC -O2 -o libkrismalloc.so \
  kris-malloc-mipsel.cpp -pthread -ldl
Now you can LD_PRELOAD it after transferring it onto the target system:
gdb \
  -ex "set exec-wrapper env LD_PRELOAD=/kris/libkrismalloc.so" \
  --args <app> <args>
You can also compile it into an object file and link it statically into your executable, although it's slightly less flexible.

Usage

Basically you need a bit of luck if the problem is strongly related to thread race conditions. You can make yourself luckier by reducing the test case to the very minimum and making it more thread intensive if possible. This will let you get to the crash point sooner so you can repeat attempts more often.

On my MIPSEL system I got something like this:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x42361520 (LWP 28702)]
0x2aac09d0 in free () from /kris/libkrismalloc.so
Lucky me. I did x/i $pc and got:
0x2aac09d0 <free+196>:  sw  zero,0(zero)
I was very lucky. I could examine the housekeeping information in the allocated block and it's still not clobbered.

How to examine preserved information? The argument passed to free() should be passed in the $a0 register due to MIPS(EL) calling convention. If you compile the interposing library with -O2, which is useful to keep the speed and to not affect the race conditions between threads, then you may be a wee bit less lucky. The $a0 register might get reused for something else and you need to look around in the disassembly for the register where the $a0 is copied to (it has to be copied if $a0 gets reused). In my case it used to be $s0 register with additional bonus of offsetting the memory block pointer to the housekeeping information. Hence I was examining it like this (I marked some elements with (1), (2), ... to comment on them later on):

(gdb) x/20xw $s0
0xbd96c8: (1) 0xaaaaaaaa     0xaaaaaaaa     0xaaaaaaaa     0xaaaaaaaa
0xbd96d8: (2) 0x2b566d84 (3) 0x2cf21520 (4) 0x2b6ae248 (5) 0x2cf21520
0xbd96e8: (6) 0x00000088 (7) 0x00000002 (8) 0xbbbbbbbb     0xbbbbbbbb
0xbd96f8:     0xbbbbbbbb     0xbbbbbbbb     0x00000000     0x00000000
0xbd9708:     0x0025002f     0x00450037     0x006c0066     0x00730061

##
# get the name of the function that originally freed the memory
# (note that I have release build with debug information)
#

(gdb) x/i 0x2b6ae248
0x2b6ae248 <operator delete(void *, unsigned int, const char *, int, struct {...} *)+120>:  lw  gp,16(s8)
A legend (memory interpretation according to the MallocInfo structure): (1) - MallocInfo begin marker (2) - the address of the caller that malloc-ed memory (3) - the thread ID of the caller that malloc-ed memory (4) - the address of the caller that originally free-ed memory (5) - the thread ID of the caller that originally free-ed memory (6) - the size of the allocated block (7) - free() calls counter (8) - MallocInfo end marker Moving forwards and backwards with memory examination you should also find the MallocInfoBack structure. It helps a lot if you are unlucky and some part of any housekeeping information gets clobbered.

The happy end

This method helped me to track the problem down to a std::string variable being not locked properly. Adding a lock on a mutex when accessing it solved the problem. The code base was large enough to make it impossible to find the problem by simple code inspection and adding to that thread unsafe use of this particular variable wasn't obvious from the code at a first glance. So having a nice crash and a right approach can set you right on the track.

Tuesday, 8 January 2013

System V shared memory fun

Why, oh why?

Let me begin with a disclaimer that if you ever have a chance to deal with shared memory, you're better off with POSIX shared memory API or simply use files and mmap(2). You still need some means of synchronisation though, like atomic operations and good spinning strategies. System V shared memory (shmget(2), shmop(2), etc.) on the other hand is an old skool toolkit.

One of the reasons I mention it here is a nostalgia as my first exposure to parallel programming in the academia was with the use of processes and System V. If you're too cool for old skool, better start with threads instead. The very first *nix system I ever used was FreeBSD. The very first *nix system I happened to use in my professional career was Solaris. And I still use hjkl to move around in vi since once I was given access to a Solaris server terminal with no arrow keys on the keyboard and vi being the only editor available on the system (visiting a large IT company in Asia as a support engineer). For these and other reason I'm probably becoming an old grumpy man.

Nostalgia is of course not a good enough reason to mention System V shared memory. In fact I'm not that old and my career is less than ten years but I've come across a few projects where it was used. Although I was lucky to get a good start from my academia on this subject, I had to learn more about it the hard way. Maintaining and improving code around it has taught me some lessons. Whatever one may think of this API, it's still quite popular and more available in a variety of systems than the better POSIX alternative.

A smidgen of an introduction

Good old books about the theory of parallel programming talk about semaphores, processes and use a historical V and P notation for operations on semaphores. See an overview here. And if you're stuck to a fork(3) and execv(3) pair, better forget about the latter one here. It's mind bending sometimes when there's a lot of forking going on but no exec-ing. With some strategy applied we should be fine though.

Example

Let's have a look at a problem where there's a one server and multiple clients. All parties need to send (share) data somehow. They also need some exclusion mechanism to use a shared resource which in this case is a dazzling computing power (the server). A scenario where a client wants to use the server would be as in the following drama:

Scene:
  somewhere in an inter-process jungle

Characters:
  Master of puppets
  Server
  a number of clients

Appliances:
  3 semaphores: SemClient, SemServer, SemResultReady

Act I (prologue):
  Master of puppets enters the stage and initializes 1 semaphore to "up"
  (or "unlocked" if you're too cool for old skool)
  
  Master of puppets:
    V (SemServer)

  Two other semaphores are left "down" (or "locked" if you're too cool...)
  
  Master of puppets:
    Fork (Server)       [Let it be a server]
    Fork (Client)       [Let it be a client]
    Fork (Client)       [Let it be a client]
    ...

Act II:
  The server enters timidly the scene

  Server:
    P (SemClient)       [Is there anybody here?]
    
  A client enters the scene with a stately procession
  
  Client:
    P (SemServer)       [I'd like to speak to the server in private]
    
  The client gets private access to the server
  
  Client:
    (produces data in the shared memory area)
    V (SemClient)       [Hey server, here's some data I'd like you to munge]
    P (SemResultReady)  [Let me know when you're done]
    
  Server:
    (processes data in the shared memory area)
    V (SemResultReady)  [Mmm.. yummy data, here's the result]
    
  Client:
    V (SemServer)       [Thanks. Bye!]
    
Act III (epilogue):

  Master of puppets decides to cease everyone.
  
  Master of puppets:
    JoinAll ()
    
  Master of puppets leaves the scene
This boils down to the following lines:
Client:
  // get exclusive access to the server (and the shared memory area)
  P (SemServer)
  (generate some data in the shared memory area)
  // notify the server about the data to process
  V (SemClient)
    
  // wait for server
  P (SemResultReady)
  (consume the result)
  
  // release server access
  V (SemServer)
    
Server:
  // wait for a client
  P (SemClient)
  (process the data)
  // notify the client
  V (SemResultReady)

Implementation

I try to simplify things by isolating the mind bending bits and writing things in such a way that I can focus on the gist of the problem rather than nasty details. I also try to use concepts from the problem domain. That's the strategy I mentioned before to cope with the complexity.

Sources are available here. The way I compile the sources and run the program is as follows (x86 Linux/Fedora 17, GCC 4.7.2):

##
# Version without debugging messages
#
$ g++ -Wall -DNDEBUG -O3 -std=c++11 \
sysv-ipc-example/*.cpp -I./sysv-ipc-example -o /tmp/ipc-test

$ !$
/tmp/ipc-test
Client 4315
 result   : 863000
 expected : 863000 [OK]
Client 4316
 result   : 863200
 expected : 863200 [OK]
Client 4317
 result   : 863400
 expected : 863400 [OK]
Client 4318
 result   : 863600
 expected : 863600 [OK]
Client 4319
 result   : 863800
 expected : 863800 [OK]

##
# Version with some drama to follow (no -DNDEBUG)
#
$ g++ -Wall -O3 -std=c++11 \
sysv-ipc-example/*.cpp -I./sysv-ipc-example -o /tmp/ipc-test

$ !$
Start main 4345...
Starting server 4346...
Starting client 4347[Server] P (SemClient)

Starting client 4348[Client 4347] P (SemServer)

[Client 4347] V (SemClient)
[Client 4347] P (SemResultReady)
[Server] V (SemResultReady)
[Server] P (SemClient)

... (lots of drama)

Client 4351
 result   : 870200
 expected : 870200 [OK]
[Client 4351] V (SemConsole)
Terminating client 4351
Finished waiting for 4351
Terminating main...
~IpcManager() : freeing resources in 4345

Let's have a look at some code excerpts in client-server-example.cpp. Client() and Server() almost replicate what I sketched about their roles above. To make the program finite there's some expected (constant) number of clients and transactions to occur but in an undetermined order. The data "protocol" between the server and the clients is established by the Packet structure. Note that there's also additional semaphore SemConsole to avoid interleaving log messages from processes. The essence of the client implementation looks as follows:

void
Client ()
{
  /* ... */
  
  int sum = 0;
  Packet *const pData = static_cast (GetShm ());
  
  for (unsigned i = 0; i < NO_OF_PACKETS_PER_CLIENT; ++i) {
    // wait for server access
    CLIENT (P (SemServer))
    // generate data:
    // easy to predict the result but still individual for every client
    std::fill_n (pData->numbers, NO_OF_ITEMS_IN_PACKET, pid);
    // notify the server
    CLIENT (V (SemClient))
    
    // wait for the server and get the result
    CLIENT (P (SemResultReady))
    sum += pData->result;
    
    // free server access
    CLIENT (V (SemServer))
  }
  
  /* ... */
}
I use CLIENT and SERVER macros to add some debugging messages so they form a nice story on the console.

Now some juicy details of the API usage in IpcManager.cpp. For example the V operation looks like this:

void
IpcManager::V (const unsigned short sem) const
{
  ::sembuf sb = { sem, 1, 0 };
  const int status = ::semop (m_semId, &sb, 1);
  CHECK (0 == status);
}
That's it. I invite you to read semop(2) rather than repeating what's already explained there. And the last but not the least, how we initialize all that stuff:
// only the first instance of the manager (creator) is responsible
// for resources
IpcManager::IpcManager (const char *key, const size_t memSize, const int semNum)
  : m_bCreator (true), m_data (nullptr)
{
  assert (key && key[0]);
  m_key = ::ftok (key, key[0]);
  CHECK (-1 != m_key);
  
  m_memId = ::shmget (m_key, memSize, IPC_CREAT | 0600);
  CHECK (-1 != m_memId);
  
  m_semId = ::semget (m_key, semNum, IPC_CREAT | 0600);
  CHECK (-1 != m_semId);
  
  for (int i = 0; i < semNum; ++i) {
    const int status = ::semctl (m_semId, i, SETVAL, nullptr);
    CHECK (-1 != status);
  }
}
Note that the key passed to ftok(3) is literary a key that identifies a shared resource so that processes know which one to connect to in order to communicate (it's a bit like a token for a conference call). Note that the key must be actually a path to an existing file system object.

Bear in mind that all processes are forked from the very first one. This means that after forking all of them have their memory in the exactly same state as their parent. This also means that all variables have exactly the same values (which could be problematic if you fork from a multi-threaded process and have some mutexes locked and only one thread left in the child process that waits for some of the locked mutexes). As System V API resources have to be allocated and deallocated somewhere (see the Gotchas section), we declare only one process to be responsible for both operations. It doesn't have to be like this, it just have to be done once. The IpcManager object is initialized in the first process so it becomes a "creator". The same object manages forking new processes and ensures they won't attempt to de-initialize the resources.

Gotchas

Managing resources

To me the most difficult aspect of System V API is ensuring reliable shared memory deallocation. Bear in mind that if you don't deallocate it, the system (not just a single process) will be leaking and sooner or later the host will not be able to operate properly if there are new processes requesting System V memory. A modern system usually has only few megabytes of System V memory available so it's not something you can splurge. ipcs(1) and ipcrm(1) are your friends here.

What makes the deallocation actually tricky is deciding who should do it. If the process that was supposed to deal with it crashed, then we've got a problem. Different strategies can be applied depending on the situation but they tend to be complicated. In this regards aforementioned POSIX shared memory API seems to be much better.

Memory segment initialization

Although the shmget(2) documentation promises:

When a new shared memory segment is created, its contents are initialized to zero values [...]
that's certainly not true on all systems. I spent days chasing a problem on a server farm consisting of tens of machines of different makes, hardware, systems, versions etc. After contriving literary a hunt, it appeared that the problem was only on particular machines and was diagnosed to be non-conforming implementations not initializing shared segment.

Now I'm a bit sceptical about such promises if the software is intended to run on undefined hardware and system. Unless it's a promise made by a portable and well maintained library, I prefer to ensure that a resource I'm given is initialized properly.