It's "a free and open-source, software-defined storage platform," according to Wikipedia, providing object storage, block storage, and file storage "built on a common distributed cluster foundation". The charter advisory board for Ceph included people from Canonical, CERN, Cisco, Fujitsu, Intel, Red Hat, SanDisk, and SUSE.
And Nite_Hawk (Slashdot reader #1,304) is one of its core engineers — a former Red Hat principal software engineer named Mark Nelson. (He's now leading R&D for a small cloud systems company called Clyso that provides Ceph consulting.) And he's returned to Slashdot to share a blog post describing "a journey to 1 TiB/s". This gnarly tale-from-Production starts while assisting Clyso with "a fairly hip and cutting edge company that wanted to transition their HDD-backed Ceph cluster to a 10 petabyte NVMe deployment" using object-based storage devices [or OSDs]...)
I can't believe they figured it out first. That was the thought going through my head back in mid-December after several weeks of 12-hour days debugging why this cluster was slow... Half-forgotten superstitions from the 90s about appeasing SCSI gods flitted through my consciousness...
Ultimately they decided to go with a Dell architecture we designed, which quoted at roughly 13% cheaper than the original configuration despite having several key advantages. The new configuration has less memory per OSD (still comfortably 12GiB each), but faster memory throughput. It also provides more aggregate CPU resources, significantly more aggregate network throughput, a simpler single-socket configuration, and utilizes the newest generation of AMD processors and DDR5 RAM. By employing smaller nodes, we halved the impact of a node failure on cluster recovery....
The initial single-OSD test looked fantastic for large reads and writes and showed nearly the same throughput we saw when running FIO tests directly against the drives. As soon as we ran the 8-OSD test, however, we observed a performance drop. Subsequent single-OSD tests continued to perform poorly until several hours later when they recovered. So long as a multi-OSD test was not introduced, performance remained high. Confusingly, we were unable to invoke the same behavior when running FIO tests directly against the drives. Just as confusing, we saw that during the 8 OSD test, a single OSD would use significantly more CPU than the others. A wallclock profile of the OSD under load showed significant time spent in io_submit, which is what we typically see when the kernel starts blocking because a drive's queue becomes full...
For over a week, we looked at everything from bios settings, NVMe multipath, low-level NVMe debugging, changing kernel/Ubuntu versions, and checking every single kernel, OS, and Ceph setting we could think of. None these things fully resolved the issue. We even performed blktrace and iowatcher analysis during "good" and "bad" single OSD tests, and could directly observe the slow IO completion behavior. At this point, we started getting the hardware vendors involved. Ultimately it turned out to be unnecessary. There was one minor, and two major fixes that got things back on track.
It's a long blog post, but here's where it ends up:
Fix One: "Ceph is incredibly sensitive to latency introduced by CPU c-state transitions. A quick check of the bios on these nodes showed that they weren't running in maximum performance mode which disables c-states."
Fix Two: [A very clever engineer working for the customer] "ran a perf profile during a bad run and made a very astute discovery: A huge amount of time is spent in the kernel contending on a spin lock while updating the IOMMU mappings. He disabled IOMMU in the kernel and immediately saw a huge increase in performance during the 8-node tests." In a comment below, Nelson adds that "We've never seen the IOMMU issue before with Ceph... I'm hoping we can work with the vendors to understand better what's going on and get it fixed without having to completely disable IOMMU."
Fix Three: "We were not, in fact, building RocksDB with the correct compile flags... It turns out that Canonical fixed this for their own builds as did Gentoo after seeing the note I wrote in do_cmake.sh over 6 years ago... With the issue understood, we built custom 17.2.7 packages with a fix in place. Compaction time dropped by around 3X and 4K random write performance doubled."
The story has a happy ending, with performance testing eventually showing data being read at 635 GiB/s — and a colleague daring them to attempt 1 TiB/s. They built a new testing configuration targeting 63 nodes — achieving 950GiB/s — then tried some more performance optimizations...
Ceph (software)
Ceph (pronounced /ˈsɛf/) is a free and open-source software-defined storage platform that provides object storage,[7] block storage, and file storage built on a common distributed cluster foundation. Ceph provides completely distributed operation without a single point of failure and scalability to the exabyte level, and is freely available. Since version 12 (Luminous), Ceph does not rely on any other conventional filesystem and directly manages HDDs and SSDs with its own storage backend BlueStore and can expose a POSIX filesystem.
Ceph replicates data with fault tolerance,[8] using commodity hardware and Ethernet IP and requiring no specific hardware support. Ceph is highly available and ensures strong data durability through techniques including replication, erasure coding, snapshots and clones. By design, the system is both self-healing and self-managing, minimizing administration time and other costs.
Large-scale production Ceph deployments include CERN,[9][10] OVH[11][12][13][14] and DigitalOcean.[15][16]
Design
see Wikipedia article
History
Ceph was created by Sage Weil for his doctoral dissertation,[33] which was advised by Professor Scott A. Brandt at the Jack Baskin School of Engineering, University of California, Santa Cruz (UCSC), and sponsored by the Advanced Simulation and Computing Program (ASC), including Los Alamos National Laboratory (LANL), Sandia National Laboratories (SNL), and Lawrence Livermore National Laboratory (LLNL).[34] The first line of code that ended up being part of Ceph was written by Sage Weil in 2004 while at a summer internship at LLNL, working on scalable filesystem metadata management (known today as Ceph's MDS).[35] In 2005, as part of a summer project initiated by Scott A. Brandt and led by Carlos Maltzahn, Sage Weil created a fully functional file system prototype which adopted the name Ceph. Ceph made its debut with Sage Weil giving two presentations in November 2006, one at USENIX OSDI 2006[36] and another at SC'06.[37]
After his graduation in autumn 2007, Weil continued to work on Ceph full-time, and the core development team expanded to include Yehuda Sadeh Weinraub and Gregory Farnum. On March 19, 2010, Linus Torvalds merged the Ceph client into Linux kernel version 2.6.34[38][39] which was released on May 16, 2010. In 2012, Weil created Inktank Storage for professional services and support for Ceph.[40][41]
In April 2014, Red Hat purchased Inktank, bringing the majority of Ceph development in-house to make it a production version for enterprises with support (hotline) and continuous maintenance (new versions).[42]
In October 2015, the Ceph Community Advisory Board was formed to assist the community in driving the direction of open source software-defined storage technology. The charter advisory board includes Ceph community members from global IT organizations that are committed to the Ceph project, including individuals from Red Hat, Intel, Canonical, CERN, Cisco, Fujitsu, SanDisk, and SUSE.[43]
In November 2018, the Linux Foundation launched the Ceph Foundation as a successor to the Ceph Community Advisory Board. Founding members of the Ceph Foundation included Amihan, Canonical, China Mobile, DigitalOcean, Intel, OVH, ProphetStor Data Services, Red Hat, SoftIron, SUSE, Western Digital, XSKY Data Technology, and ZTE.[44]
Ceph: A Journey to 1 TiB/s
Jan 19, 2024
Mark Nelson (nhm)
No comments:
Post a Comment