407 lines
15 KiB
Text
407 lines
15 KiB
Text
|
Using RCU (Read-Copy-Update) for synchronization
|
||
|
================================================
|
||
|
|
||
|
Read-copy update (RCU) is a synchronization mechanism that is used to
|
||
|
protect read-mostly data structures. RCU is very efficient and scalable
|
||
|
on the read side (it is wait-free), and thus can make the read paths
|
||
|
extremely fast.
|
||
|
|
||
|
RCU supports concurrency between a single writer and multiple readers,
|
||
|
thus it is not used alone. Typically, the write-side will use a lock to
|
||
|
serialize multiple updates, but other approaches are possible (e.g.,
|
||
|
restricting updates to a single task). In QEMU, when a lock is used,
|
||
|
this will often be the "iothread mutex", also known as the "big QEMU
|
||
|
lock" (BQL). Also, restricting updates to a single task is done in
|
||
|
QEMU using the "bottom half" API.
|
||
|
|
||
|
RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
|
||
|
sections of code with "critical sections", and the update side will wait
|
||
|
for the execution of all *currently running* critical sections before
|
||
|
proceeding, or before asynchronously executing a callback.
|
||
|
|
||
|
The key point here is that only the currently running critical sections
|
||
|
are waited for; critical sections that are started _after_ the beginning
|
||
|
of the wait do not extend the wait, despite running concurrently with
|
||
|
the updater. This is the reason why RCU is more scalable than,
|
||
|
for example, reader-writer locks. It is so much more scalable that
|
||
|
the system will have a single instance of the RCU mechanism; a single
|
||
|
mechanism can be used for an arbitrary number of "things", without
|
||
|
having to worry about things such as contention or deadlocks.
|
||
|
|
||
|
How is this possible? The basic idea is to split updates in two phases,
|
||
|
"removal" and "reclamation". During removal, we ensure that subsequent
|
||
|
readers will not be able to get a reference to the old data. After
|
||
|
removal has completed, a critical section will not be able to access
|
||
|
the old data. Therefore, critical sections that begin after removal
|
||
|
do not matter; as soon as all previous critical sections have finished,
|
||
|
there cannot be any readers who hold references to the data structure,
|
||
|
and these can now be safely reclaimed (e.g., freed or unref'ed).
|
||
|
|
||
|
Here is a picture:
|
||
|
|
||
|
thread 1 thread 2 thread 3
|
||
|
------------------- ------------------------ -------------------
|
||
|
enter RCU crit.sec.
|
||
|
| finish removal phase
|
||
|
| begin wait
|
||
|
| | enter RCU crit.sec.
|
||
|
exit RCU crit.sec | |
|
||
|
complete wait |
|
||
|
begin reclamation phase |
|
||
|
exit RCU crit.sec.
|
||
|
|
||
|
|
||
|
Note how thread 3 is still executing its critical section when thread 2
|
||
|
starts reclaiming data. This is possible, because the old version of the
|
||
|
data structure was not accessible at the time thread 3 began executing
|
||
|
that critical section.
|
||
|
|
||
|
|
||
|
RCU API
|
||
|
=======
|
||
|
|
||
|
The core RCU API is small:
|
||
|
|
||
|
void rcu_read_lock(void);
|
||
|
|
||
|
Used by a reader to inform the reclaimer that the reader is
|
||
|
entering an RCU read-side critical section.
|
||
|
|
||
|
void rcu_read_unlock(void);
|
||
|
|
||
|
Used by a reader to inform the reclaimer that the reader is
|
||
|
exiting an RCU read-side critical section. Note that RCU
|
||
|
read-side critical sections may be nested and/or overlapping.
|
||
|
|
||
|
void synchronize_rcu(void);
|
||
|
|
||
|
Blocks until all pre-existing RCU read-side critical sections
|
||
|
on all threads have completed. This marks the end of the removal
|
||
|
phase and the beginning of reclamation phase.
|
||
|
|
||
|
Note that it would be valid for another update to come while
|
||
|
synchronize_rcu is running. Because of this, it is better that
|
||
|
the updater releases any locks it may hold before calling
|
||
|
synchronize_rcu. If this is not possible (for example, because
|
||
|
the updater is protected by the BQL), you can use call_rcu.
|
||
|
|
||
|
void call_rcu1(struct rcu_head * head,
|
||
|
void (*func)(struct rcu_head *head));
|
||
|
|
||
|
This function invokes func(head) after all pre-existing RCU
|
||
|
read-side critical sections on all threads have completed. This
|
||
|
marks the end of the removal phase, with func taking care
|
||
|
asynchronously of the reclamation phase.
|
||
|
|
||
|
The foo struct needs to have an rcu_head structure added,
|
||
|
perhaps as follows:
|
||
|
|
||
|
struct foo {
|
||
|
struct rcu_head rcu;
|
||
|
int a;
|
||
|
char b;
|
||
|
long c;
|
||
|
};
|
||
|
|
||
|
so that the reclaimer function can fetch the struct foo address
|
||
|
and free it:
|
||
|
|
||
|
call_rcu1(&foo.rcu, foo_reclaim);
|
||
|
|
||
|
void foo_reclaim(struct rcu_head *rp)
|
||
|
{
|
||
|
struct foo *fp = container_of(rp, struct foo, rcu);
|
||
|
g_free(fp);
|
||
|
}
|
||
|
|
||
|
For the common case where the rcu_head member is the first of the
|
||
|
struct, you can use the following macro.
|
||
|
|
||
|
void call_rcu(T *p,
|
||
|
void (*func)(T *p),
|
||
|
field-name);
|
||
|
void g_free_rcu(T *p,
|
||
|
field-name);
|
||
|
|
||
|
call_rcu1 is typically used through these macro, in the common case
|
||
|
where the "struct rcu_head" is the first field in the struct. If
|
||
|
the callback function is g_free, in particular, g_free_rcu can be
|
||
|
used. In the above case, one could have written simply:
|
||
|
|
||
|
g_free_rcu(&foo, rcu);
|
||
|
|
||
|
typeof(*p) atomic_rcu_read(p);
|
||
|
|
||
|
atomic_rcu_read() is similar to atomic_load_acquire(), but it makes
|
||
|
some assumptions on the code that calls it. This allows a more
|
||
|
optimized implementation.
|
||
|
|
||
|
atomic_rcu_read assumes that whenever a single RCU critical
|
||
|
section reads multiple shared data, these reads are either
|
||
|
data-dependent or need no ordering. This is almost always the
|
||
|
case when using RCU, because read-side critical sections typically
|
||
|
navigate one or more pointers (the pointers that are changed on
|
||
|
every update) until reaching a data structure of interest,
|
||
|
and then read from there.
|
||
|
|
||
|
RCU read-side critical sections must use atomic_rcu_read() to
|
||
|
read data, unless concurrent writes are prevented by another
|
||
|
synchronization mechanism.
|
||
|
|
||
|
Furthermore, RCU read-side critical sections should traverse the
|
||
|
data structure in a single direction, opposite to the direction
|
||
|
in which the updater initializes it.
|
||
|
|
||
|
void atomic_rcu_set(p, typeof(*p) v);
|
||
|
|
||
|
atomic_rcu_set() is similar to atomic_store_release(), though it also
|
||
|
makes assumptions on the code that calls it in order to allow a more
|
||
|
optimized implementation.
|
||
|
|
||
|
In particular, atomic_rcu_set() suffices for synchronization
|
||
|
with readers, if the updater never mutates a field within a
|
||
|
data item that is already accessible to readers. This is the
|
||
|
case when initializing a new copy of the RCU-protected data
|
||
|
structure; just ensure that initialization of *p is carried out
|
||
|
before atomic_rcu_set() makes the data item visible to readers.
|
||
|
If this rule is observed, writes will happen in the opposite
|
||
|
order as reads in the RCU read-side critical sections (or if
|
||
|
there is just one update), and there will be no need for other
|
||
|
synchronization mechanism to coordinate the accesses.
|
||
|
|
||
|
The following APIs must be used before RCU is used in a thread:
|
||
|
|
||
|
void rcu_register_thread(void);
|
||
|
|
||
|
Mark a thread as taking part in the RCU mechanism. Such a thread
|
||
|
will have to report quiescent points regularly, either manually
|
||
|
or through the QemuCond/QemuSemaphore/QemuEvent APIs.
|
||
|
|
||
|
void rcu_unregister_thread(void);
|
||
|
|
||
|
Mark a thread as not taking part anymore in the RCU mechanism.
|
||
|
It is not a problem if such a thread reports quiescent points,
|
||
|
either manually or by using the QemuCond/QemuSemaphore/QemuEvent
|
||
|
APIs.
|
||
|
|
||
|
Note that these APIs are relatively heavyweight, and should _not_ be
|
||
|
nested.
|
||
|
|
||
|
Convenience macros
|
||
|
==================
|
||
|
|
||
|
Two macros are provided that automatically release the read lock at the
|
||
|
end of the scope.
|
||
|
|
||
|
RCU_READ_LOCK_GUARD()
|
||
|
|
||
|
Takes the lock and will release it at the end of the block it's
|
||
|
used in.
|
||
|
|
||
|
WITH_RCU_READ_LOCK_GUARD() { code }
|
||
|
|
||
|
Is used at the head of a block to protect the code within the block.
|
||
|
|
||
|
Note that 'goto'ing out of the guarded block will also drop the lock.
|
||
|
|
||
|
DIFFERENCES WITH LINUX
|
||
|
======================
|
||
|
|
||
|
- Waiting on a mutex is possible, though discouraged, within an RCU critical
|
||
|
section. This is because spinlocks are rarely (if ever) used in userspace
|
||
|
programming; not allowing this would prevent upgrading an RCU read-side
|
||
|
critical section to become an updater.
|
||
|
|
||
|
- atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
|
||
|
rcu_assign_pointer. They take a _pointer_ to the variable being accessed.
|
||
|
|
||
|
- call_rcu is a macro that has an extra argument (the name of the first
|
||
|
field in the struct, which must be a struct rcu_head), and expects the
|
||
|
type of the callback's argument to be the type of the first argument.
|
||
|
call_rcu1 is the same as Linux's call_rcu.
|
||
|
|
||
|
|
||
|
RCU PATTERNS
|
||
|
============
|
||
|
|
||
|
Many patterns using read-writer locks translate directly to RCU, with
|
||
|
the advantages of higher scalability and deadlock immunity.
|
||
|
|
||
|
In general, RCU can be used whenever it is possible to create a new
|
||
|
"version" of a data structure every time the updater runs. This may
|
||
|
sound like a very strict restriction, however:
|
||
|
|
||
|
- the updater does not mean "everything that writes to a data structure",
|
||
|
but rather "everything that involves a reclamation step". See the
|
||
|
array example below
|
||
|
|
||
|
- in some cases, creating a new version of a data structure may actually
|
||
|
be very cheap. For example, modifying the "next" pointer of a singly
|
||
|
linked list is effectively creating a new version of the list.
|
||
|
|
||
|
Here are some frequently-used RCU idioms that are worth noting.
|
||
|
|
||
|
|
||
|
RCU list processing
|
||
|
-------------------
|
||
|
|
||
|
TBD (not yet used in QEMU)
|
||
|
|
||
|
|
||
|
RCU reference counting
|
||
|
----------------------
|
||
|
|
||
|
Because grace periods are not allowed to complete while there is an RCU
|
||
|
read-side critical section in progress, the RCU read-side primitives
|
||
|
may be used as a restricted reference-counting mechanism. For example,
|
||
|
consider the following code fragment:
|
||
|
|
||
|
rcu_read_lock();
|
||
|
p = atomic_rcu_read(&foo);
|
||
|
/* do something with p. */
|
||
|
rcu_read_unlock();
|
||
|
|
||
|
The RCU read-side critical section ensures that the value of "p" remains
|
||
|
valid until after the rcu_read_unlock(). In some sense, it is acquiring
|
||
|
a reference to p that is later released when the critical section ends.
|
||
|
The write side looks simply like this (with appropriate locking):
|
||
|
|
||
|
qemu_mutex_lock(&foo_mutex);
|
||
|
old = foo;
|
||
|
atomic_rcu_set(&foo, new);
|
||
|
qemu_mutex_unlock(&foo_mutex);
|
||
|
synchronize_rcu();
|
||
|
free(old);
|
||
|
|
||
|
If the processing cannot be done purely within the critical section, it
|
||
|
is possible to combine this idiom with a "real" reference count:
|
||
|
|
||
|
rcu_read_lock();
|
||
|
p = atomic_rcu_read(&foo);
|
||
|
foo_ref(p);
|
||
|
rcu_read_unlock();
|
||
|
/* do something with p. */
|
||
|
foo_unref(p);
|
||
|
|
||
|
The write side can be like this:
|
||
|
|
||
|
qemu_mutex_lock(&foo_mutex);
|
||
|
old = foo;
|
||
|
atomic_rcu_set(&foo, new);
|
||
|
qemu_mutex_unlock(&foo_mutex);
|
||
|
synchronize_rcu();
|
||
|
foo_unref(old);
|
||
|
|
||
|
or with call_rcu:
|
||
|
|
||
|
qemu_mutex_lock(&foo_mutex);
|
||
|
old = foo;
|
||
|
atomic_rcu_set(&foo, new);
|
||
|
qemu_mutex_unlock(&foo_mutex);
|
||
|
call_rcu(foo_unref, old, rcu);
|
||
|
|
||
|
In both cases, the write side only performs removal. Reclamation
|
||
|
happens when the last reference to a "foo" object is dropped.
|
||
|
Using synchronize_rcu() is undesirably expensive, because the
|
||
|
last reference may be dropped on the read side. Hence you can
|
||
|
use call_rcu() instead:
|
||
|
|
||
|
foo_unref(struct foo *p) {
|
||
|
if (atomic_fetch_dec(&p->refcount) == 1) {
|
||
|
call_rcu(foo_destroy, p, rcu);
|
||
|
}
|
||
|
}
|
||
|
|
||
|
|
||
|
Note that the same idioms would be possible with reader/writer
|
||
|
locks:
|
||
|
|
||
|
read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock);
|
||
|
p = foo; p = foo;
|
||
|
/* do something with p. */ foo = new;
|
||
|
read_unlock(&foo_rwlock); free(p);
|
||
|
write_mutex_unlock(&foo_rwlock);
|
||
|
free(p);
|
||
|
|
||
|
------------------------------------------------------------------
|
||
|
|
||
|
read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock);
|
||
|
p = foo; old = foo;
|
||
|
foo_ref(p); foo = new;
|
||
|
read_unlock(&foo_rwlock); foo_unref(old);
|
||
|
/* do something with p. */ write_mutex_unlock(&foo_rwlock);
|
||
|
read_lock(&foo_rwlock);
|
||
|
foo_unref(p);
|
||
|
read_unlock(&foo_rwlock);
|
||
|
|
||
|
foo_unref could use a mechanism such as bottom halves to move deallocation
|
||
|
out of the write-side critical section.
|
||
|
|
||
|
|
||
|
RCU resizable arrays
|
||
|
--------------------
|
||
|
|
||
|
Resizable arrays can be used with RCU. The expensive RCU synchronization
|
||
|
(or call_rcu) only needs to take place when the array is resized.
|
||
|
The two items to take care of are:
|
||
|
|
||
|
- ensuring that the old version of the array is available between removal
|
||
|
and reclamation;
|
||
|
|
||
|
- avoiding mismatches in the read side between the array data and the
|
||
|
array size.
|
||
|
|
||
|
The first problem is avoided simply by not using realloc. Instead,
|
||
|
each resize will allocate a new array and copy the old data into it.
|
||
|
The second problem would arise if the size and the data pointers were
|
||
|
two members of a larger struct:
|
||
|
|
||
|
struct mystuff {
|
||
|
...
|
||
|
int data_size;
|
||
|
int data_alloc;
|
||
|
T *data;
|
||
|
...
|
||
|
};
|
||
|
|
||
|
Instead, we store the size of the array with the array itself:
|
||
|
|
||
|
struct arr {
|
||
|
int size;
|
||
|
int alloc;
|
||
|
T data[];
|
||
|
};
|
||
|
struct arr *global_array;
|
||
|
|
||
|
read side:
|
||
|
rcu_read_lock();
|
||
|
struct arr *array = atomic_rcu_read(&global_array);
|
||
|
x = i < array->size ? array->data[i] : -1;
|
||
|
rcu_read_unlock();
|
||
|
return x;
|
||
|
|
||
|
write side (running under a lock):
|
||
|
if (global_array->size == global_array->alloc) {
|
||
|
/* Creating a new version. */
|
||
|
new_array = g_malloc(sizeof(struct arr) +
|
||
|
global_array->alloc * 2 * sizeof(T));
|
||
|
new_array->size = global_array->size;
|
||
|
new_array->alloc = global_array->alloc * 2;
|
||
|
memcpy(new_array->data, global_array->data,
|
||
|
global_array->alloc * sizeof(T));
|
||
|
|
||
|
/* Removal phase. */
|
||
|
old_array = global_array;
|
||
|
atomic_rcu_set(&new_array->data, new_array);
|
||
|
synchronize_rcu();
|
||
|
|
||
|
/* Reclamation phase. */
|
||
|
free(old_array);
|
||
|
}
|
||
|
|
||
|
|
||
|
SOURCES
|
||
|
=======
|
||
|
|
||
|
* Documentation/RCU/ from the Linux kernel
|