Page tree
Skip to end of metadata
Go to start of metadata

Introduction

Go的一个主要设计目标是改善传统的多线程编程语言,使得并发编程更加容易且更少出错,其主要通过两方面:

  • 让线程(goroutine)更加轻量化和容易创建
  • 在线程间使用特殊的消息通讯(channel)

论文中主要研究了六个开源软件,分别是:Docker,Kubernetes,etcd,gRPC,CockroachDB,BoltDB,总共包含171个并发Bug。通过分析bug的根源,重现,以及修复来研究这些bug。

研究问题主要关注:消息传递和共享内存哪种进程间通讯机制更不易于出错。相比共享内存,Go则更推崇消息传递的方式。

对于Bug,主要按照两个维度进行分类:

  • 按bug的原因分为:共享内存bug和消息传递bug;
  • 按bug的影响分为:阻塞bug和非阻塞bug;

Bug示例:

bug 1
// message passing, blocking bug
func finishReq(timeout time.Duration) ob {
   ch := make(chan ob)
   go func() {
      result := fn()
      ch <- result // block
   }()
   select {
   case result = <-ch:
      return result
   case <-time.After(timeout):
      return nil
   }
}
fix
// message passing, blocking bug
func finishReq(timeout time.Duration) ob {
   ch := make(chan ob, 1)
   go func() {
      result := fn()
      ch <- result // block
   }()
   select {
   case result = <-ch:
      return result
   case <-time.After(timeout):
      return nil
   }
}

说明:如果timeout了,则goroutine无法正常结束


Background and Applications

Goroutine

以goroutine作为并发单元,按 M-N 的方式绑定到内核线程。Go支持使用匿名函数来创建线程,在匿名函数前定义的所有本地变量都可以被子goroutine访问到,然而这很容易造成竞态。

Shared Memory

传统的方式 Mutex(lock/unlock), RWMutex(read/write lock), Cond(条件变量), atomic(read/write);新的方式 Once, WaitGroup。不过误用 waitgroup 很容易引发bug。

Message Passing

新的方式 channel,包含 buffered 和 unbuffered。使用 channel 具有一些潜规则(必须初始化、从nil channel读写导致阻塞、往closed channel发送数据或关闭一个closed channel将导致panic)。

select 带有随机性,而这种随机性也有可能引入并发bug。


Go Concurrency Usage Patterns

Goroutine Usages

在实际开发环境中:

  • 写的代码是否包含很多goroutine(静态);
  • Go程序在运行时是否创建很多goroutine(动态);

论文中统计了6个研究对象每千行代码创建的goroutine数量在0.18~0.83之间,同时除了Kubernetes和BoltDB它们也大量使用匿名函数。

对比 gRPC-Go 和 gRPC-C,在处理相同的请求数量的情况下(因为Go与C性能不同,不能以时间为参考),可以看到:

  • gRPC-Go goroutine的数量是 gRPC-C thread的数倍之多;
  • gRPC-Go goroutine的运行时间是整个程序运行时间的60%~90%, gRPC-C thread则基本是100%;

观察结果1:相比C,Goroutines生命周期更短但创建得更加频繁。

Concurrency Primitive Usages

统计结果显示,共享内存用得比消息传递更多。

观察结果2:虽然传统的共享内存线程间通讯还保留着重度使用,Go程序员也使用了大量的消息传递方式。


Bug Study Methodology

收集Bug的方式:从github上搜索带有 race、deadlock、 concurrency、mutex、atomic等关键字的提交,然后进行随机抽样和过滤,再人工处理,最后总共包含171个bug。

Bug分类:

  • 按bug的行为:
    • 如果一个或多个goroutine被卡住,无法继续往前执行,则称之为阻塞bug;
    • 如果goroutine可以正常结束,但是他们的目的没有达到,则称之为非阻塞bug;
  • 按bug的原因:shared memory or passing messages;


Blocking Bugs

Root Causes of Blocking Bugs

观察结果3:与普遍认知(消息传递不易出错)相反,更多的阻塞性bug是因为消息传递引发的。

(mis)Protoction of Shared Memory

Mutex:重复加锁,获取锁的顺序冲突,忘记解锁;

RWMutex:进程A重复获取读锁,进程B获取写锁,导致死锁;

WaitGroup:

bug 2
   var group sync.WaitGroup
   group.Add(len(pm.plugins))
   for _, p := range pm.plugins {
      go func(p *plugin) {
         defer group.Done()
      }(p)
      group.Wait()
   }
fix
   var group sync.WaitGroup
   group.Add(len(pm.plugins))
   for _, p := range pm.plugins {
      go func(p *plugin) {
         defer group.Done()
      }(p)
   }
   group.Wait()

Misuse of Message Passing

常规的一些bug 如前面提到的(bug 1);另外,当使用 Go 一些特定的库时,也需要额外注意,比如:

bug 3
ctx := context.Background()
hctx, hcancel := context.WithCancel(ctx)
if timeout > 0 {
   hctx, hcancel = context.WithTimeout(ctx, timeout)
}
fix
ctx := context.Background()
var hctx context.Context
var hcancel context.CancelFunc
if timeout > 0 {
   hctx, hcancel = context.WithTimeout(ctx, timeout)
} else {
   hctx, hcancel = context.WithCancel(ctx)
}

说明:WithCancel 创建了一个goroutine,而后如果timeout > 0,hcancel 指向新对象,导致goroutine无法正常结束。

观察结果5:所有因消息传递导致的阻塞bug都跟Go新的消息传递语义如channel有关,它们都难以检测特别是跟其它同步机制混用的时候。

Detection of Blocking Bugs

Go提供了内置的死锁监测,然而对于前面的bug,只有少部分能被检测出来。原因是:

  • 只要还有goroutine在执行,则不会认为是死锁;
  • 只检测了Go并发基元阻塞的goroutine,因等待系统资源而阻塞的goroutine并不考虑;


Non-Blocking Bugs

Root Causes of Non-blocking Bugs

虽然与传统语言相似,有些bug却是因为对Go的新特性理解不透彻导致:

bug 4
for i := 0; i <= 10; i++ {
   go func() {
      fmt.Println(i)
   }()
}
fix
for i := 0; i <= 10; i++ {
   go func(i int) {
      fmt.Println(i)
   }(i)
}
说明:匿名函数导致的竞态

另外也有 WaitGroup 的误用导致的bug:

bug 5
func (p *peer) send(){
   p.mu.Lock()    
   defer p.mu.Unlock()
   switch p.status {
   case idle:
      go func(){
         p.wg.Add(1)    
         // ..
         p.wg.Done()
      }()
   case stopped:  
   }
}

func (p *peer) stop(){
   p.mu.Lock()
   p.status = stopped
   p.mu.Unlock()
   p.wg.Wait()
}
fix
func (p *peer) send(){
   p.mu.Lock()    
   defer p.mu.Unlock()
   switch p.status {
   case idle:
      p.wg.Add(1)
      go func(){
         // ..
         p.wg.Done()
      }()
   case stopped:  
   }
}

func (p *peer) stop(){
   p.mu.Lock()
   p.status = stopped
   p.mu.Unlock()
   p.wg.Wait()
}
说明:无法保证goroutine 比 p.wg.Wait() 更早执行

channel重复关闭导致的Bug:

bug 6
select {
case <- c.closed:
default:
   close(c.closed)
}
fix
Once.Do(func(){
    close(c.closed)
})
说明:当多个goroutine同时执行该代码时,有可能使channel被关闭多次,导致panic

select 随机性导致的Bug:

bug 7
ticker := time.NewTicker()
for {
   f()
   select {
   case <- stopChn:
      return
   case <- ticker:
   }
}
fix
ticker := time.NewTicker()
for {
   select {
   case <- stopChn:
      return
   default:   
   }
   f()
   select {
   case <-stopChn:
      return
   case <-ticker:
   }
}
说明:由于select的随机性,可能导致 stopChn 后 f() 被执行多一次

Timer误用导致的Bug:

bug 8
timer := time.NewTimer(0)
if dur > 0 {
   timer := time.NewTimer(dur)
}
select {
case <-timer.C:
case <-ctx.Done():
   return nil
}
fix
var timeout <-chan time.Time
if dur > 0 {
   timeout = time.NewTimer(dur).C
}
select {
case <-timeout:
case <-ctx.Done():
   return nil
}
说明:当dur=0时,则select总是直接返回,ctx.Done() 不起作用


Detection of Non-Blocking Bugs

Go 提供了数据竞态检测器,可以在 build 的时候加入 -race 标志。


原始论文


Introduction

A major design goal of Go is to improve traditional multithreaded programming languages and make concurrent programming easier and less error-prone.

  • making threads (called goroutines) lightweight and easy to create
  • using explicit messaging (called channel) to communicate across threads

In this paper, we conduct the first empirical study on Go concurrency bugs using six open-source, productiongrade Go applications: Docker [13] and Kubernetes [36], two datacenter container systems, etcd [15], a distributed key-value store system, gRPC [19], an RPC library, and CockroachDB [10] and BoltDB [6], two database systems.

Our study focuses on a long-standing and fundamental question in concurrent programming: between message passing [27, 37] and shared memory, which of these inter-thread communication mechanisms is less error-prone. Go encourages the use of channels over shared memory with the belief that explicit message passing is less error-prone.

Along the cause dimension, we categorize bugs into those that are caused by misuse of shared memory and those caused by misuse of message passing. Along the second dimension, we separate bugs into those that involve (any number of) goroutines that cannot proceed (we call them blocking bugs) and those that do not involve any blocking (non-blocking bugs).

Background and Applications

All local variables declared before an anonymous function are accessible to the anonymous function, and are potentially shared between a parent goroutine and a child goroutine created using the anonymous function, causing data race.

Misusing WaitGroup can cause both blocking bugs (Section 5) and non-blocking bugs (Section 6).

channel can only be used after initialization, and sending data to (or receiving data from) a nil channel will block a goroutine forever. Sending data to a closed channel or close an already closed channel can trigger a runtime panic.

The select statement allows a goroutine to wait on multiple channel operations. A select will block until one of its cases can make progress or when it can execute a default branch. When more than one cases in a select are valid, Go will randomly choose one to execute. This randomness can cause concurrency bugs as will be discussed in Section 6.

Go Concurrency Usage Patterns

Goroutine Usages

Thus, we ask “do real Go programmers tend to write their code with many goroutines (static)?” and “do real Go applications create a lot of goroutines during runtime (dynamic)?”

Observation 1: Goroutines are shorter but created more frequently than C (both statically and at runtime).

Concurrency Primitive Usages

Shared memory synchronization operations are used more often than message passing, and Mutex is the most widely-used primitive across all applications.

Observation 2: Although traditional shared memory thread communication and synchronization remains to be heavily used, Go programmers also use significant amount of messagepassing primitives. Implication 1: With heavier usages of goroutines and new types of concurrency primitives, Go programs may potentially introduce more concurrency bugs.

Implication 1: With heavier usages of goroutines and new types of concurrency primitives, Go programs may potentially introduce more concurrency bugs.

Bug Study Methodology

Collecting concurrency bugs

To collect concurrency bugs, we first filtered GitHub commit histories of the six applications by searching their commit logs for concurrencyrelated keywords, including “race”, “deadlock”, “synchronization”, “concurrency”, “lock”, “mutex”, “atomic”, “compete”, “context”, “once”, and “goroutine leak”.

We then randomly sampled the filtered commits, identified commits that fix concurrency bugs, and manually studied them.

Bug Taxonomy

The first dimension is based on the behavior of bugs. If one or more goroutines are unintentionally stuck in their execution and cannot move forward, we call such concurrency issues blocking bugs. If instead all goroutines can finish their tasks but their behaviors are not desired, we call them non-blocking ones.

The second dimension is along the cause of concurrency bugs. Concurrency bugs happen when multiple threads try to communicate and errors happen during such communication. Our idea is thus to categorize causes of concurrency bugs by how different goroutines communicate: by accessing shared memory or by passing messages.

Blocking Bugs

Root Causes of Blocking Bugs

Observation 3: Contrary to the common belief that message passing is less error-prone, more blocking bugs in our studied Go applications are caused by wrong message passing than by wrong shared memory protection.

Mutex:28 blocking bugs are caused by misusing locks (Mutex), including double locking, acquiring locks in conflicting orders, and forgetting to unlock.

RWMutex:Go’s write lock requests have a higher privilege than read lock requests. This unique lock implementation can lead to a blocking bug when a goroutine (th-A) acquires one RWMutex twice with read locking, and these two read lock operations are interleaved by a write lock operation from another goroutine (th-B). When th-A’s first read lock operation succeeds, it will block th-B’s write lock operation, since write locking is exclusive. However, th-B’s write lock operation will also block th-A’s second read lock operation, since the write lock request has a higher privilege in Go’s implementation. Neither th-A nor th-B will be able to proceed.

Observation 5: All blocking bugs caused by message passing are related to Go’s new message passing semantics like channel. They can be difficult to detect especially when message passing operations are used together with other synchronization mechanisms.

Detection of Blocking Bugs

There are two reasons why the built-in detector failed to detect other blocking bugs. First, it does not consider the monitored system as blocking when there are still some running goroutines. Second, it only examines whether or not goroutines are blocked at Go concurrency primitives but does not consider goroutines that wait for other systems resources.

Non-Blocking Bugs

Root Causes of Blocking Bugs

Interestingly, we found seven non-blocking bugs whose root causes are traditional but are largely caused by the lack of a clear understanding in new Go features. 

Observation 7: About two-thirds of shared-memory nonblocking bugs are caused by traditional causes. Go’s new multithread semantics and new libraries contribute to the rest onethird.

Detection of Non-Blocking Bugs

Go provides a data race detector which uses the same happenbefore algorithm as ThreadSanitizer [53]. It can be enabled by building a program using the ‘-race’ flag. During program execution, the race detector creates up to four shadow words for every memory object to store historical accesses of the object. It compares every new access with the stored shadow word values to detect possible races.

There are three possible reasons why the data race detector failed to report many non-blocking bugs. First, not all non-blocking bugs are data races; the race detector was not designed to detect these other types. Second, the effectiveness of the underlying happen-before algorithm depends on the interleaving of concurrent goroutines. Finally, with only four shadow words for each memory object, the detector cannot keep a long history and may miss data races.


附件:Understanding Real-World Concurrency Bugs in Go.pdf

  • No labels
Write a comment...