A performance story: How a faulty QSFP crippled a Ceph cluster


At work, we encountered a strange case of performance degradation on one of our production Ceph clusters which provides storage to our public cloud ~okeanos.

We wrote a post at our tech blog, explaining the problem, our debugging journey, how we solved it and what we’ve done for the future. Long story short, a QSFP in the underlying network infrastructure, introduced huge packet loss and caused performance issues to our cloud. Of course, it was not Ceph’s fault :).

The post can be found here.