http://opentechnow.blogspot.com/2010/02/linux-ssd-optimization-guide.html

I've got a small home server with a software RAID-5 for storing my files. It also runs a few virtual machines and acts as a NAT router for internet access. Nothing expensive, just some Frankensteinian patchwork built from old hardware left over when I upgraded my workstation. Nevertheless, I granted it a brand new Intel X25-M SSD last week.

Did I mention that this server is running Gentoo Linux? I thought this would be a good time to do a fresh install and get everything right that might have gone wrong the first time. Besides, installing Linux is always an interesting (and masochistic) experience, especially when your chosen distribution has no installer :)

Because getting my partitions and file systems aligned also proved to be a difficult task, I thought why not make a small article out of this!

Erase Block Size

SSDs always operate on entire blocks of memory. This is because flash memory needs to be erased before it can be written, which requires applying a large voltage to the memory cells, and that can only be done for an entire block of cells at once (probably because this kind of power would affect other cells around the one being erased, at least that's my guess).

Anyway, this means that if you write 1 KB of data to an SSD with an erase block size of 128 KB, the SSD needs to read 127 KB from the target block, erase the block and write the old data plus the new data back into the block. That's something one just has to accept when using an SSD. Modern SSD firmware will do its best to pre-erase blocks when it's idle and try to write new data into these pre-erased blocks (by mapping data to other locations on the drive without the knowledge of the OS).
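The read-modify-write cost described above can be put into numbers. The following is only a rough sketch: `blocks_touched` is a hypothetical helper name, the 128 KiB erase block is an example value, and the drive's own remapping is ignored. It counts how many erase blocks a single write overlaps, depending on where the write starts.

```shell
#!/bin/sh
# Sketch: how many erase blocks does a single write touch?
# All sizes are in bytes; the 128 KiB erase block is an example value.

# blocks_touched OFFSET LENGTH ERASE_BLOCK
# Prints the number of erase blocks overlapped by the write.
blocks_touched() {
    first=$(( $1 / $3 ))                 # index of the first block touched
    last=$(( ($1 + $2 - 1) / $3 ))       # index of the last block touched
    echo $(( last - first + 1 ))
}

erase=$(( 128 * 1024 ))

# 1 KiB written at the start of a block
blocks_touched 0 1024 "$erase"

# 128 KiB written 64 KiB into a block (straddles a block boundary)
blocks_touched $(( 64 * 1024 )) "$erase" "$erase"

# the same 128 KiB written at an offset that is a multiple of the block size
blocks_touched "$erase" "$erase" "$erase"
```

The second and third calls write the same amount of data; only the starting offset differs, which is why alignment matters.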
Still, consider what happens if a file system just sees the SSD as a brick of memory and writes data at a random position: a write that straddles an erase block boundary forces the SSD to erase and write two blocks, even though one would have sufficed for the amount of data being written. To fix this, the drive's firmware would have to do data mapping on the byte level, which likely isn't going to happen (in the worst case, you would need more memory for the remapping table than the drive's capacity!)

If the file system's write was aligned to a multiple of the SSD's erase block size, the same data would fit into a single erase block.

Thus, it's generally a good idea to make sure your file system's writes are aligned to multiples of your SSD's erase block size. As I found out, this isn't quite as easy as it sounds. The first road block is already encountered when you partition a hard drive:

Partition Alignment

If the partitions of a hard drive aren't aligned to begin at multiples of 128 KiB, 256 KiB or 512 KiB (depending on the SSD used), aligning the file system is useless because everything is skewed by the start offset of the partition. Thus, the first thing you have to take care of is aligning the partitions you create.

[Illustrations: a cylinder; a sector]

Traditionally, hard drives were addressed by indicating the cylinder, head and sector at which data was to be read or written. These represented the radial position, the drive head (= platter and side) and the angular position of the data respectively. With LBA (logical block addressing), this is no longer the case. Instead, the entire hard drive is addressed as one continuous stream of data.

Linux' fdisk, however, still uses a virtual C-H-S system where you can define any number of heads and sectors yourself (the cylinders are calculated automatically from the drive's capacity), with partitions always starting and ending at multiples of heads x sectors, i.e. on cylinder boundaries. Thus, you need to choose a number of heads and sectors such that the resulting cylinder size (heads x sectors x 512 bytes) lines up with the SSD's erase block size.
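Whether a given virtual geometry lines up with an erase block size is simple arithmetic. Here is a minimal sketch, assuming 512-byte sectors; `cylinder_bytes`, `is_aligned` and `check_geometry` are hypothetical helper names, and the geometries passed in are only examples.

```shell
#!/bin/sh
# Sketch: check whether fdisk cylinder boundaries fall on erase block boundaries.
# Assumes 512-byte sectors; all function names here are made up for illustration.
SECTOR_BYTES=512

# cylinder_bytes HEADS SECTORS -> size of one cylinder in bytes
cylinder_bytes() {
    echo $(( $1 * $2 * SECTOR_BYTES ))
}

# is_aligned BYTES ERASE_BLOCK_BYTES -> succeeds if BYTES is a whole
# number of erase blocks
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

# check_geometry HEADS SECTORS ERASE_BLOCK_KIB -> prints a one-line verdict
check_geometry() {
    cyl=$(cylinder_bytes "$1" "$2")
    if is_aligned "$cyl" $(( $3 * 1024 )); then
        echo "$1 heads x $2 sectors: $cyl-byte cylinders align to $3 KiB"
    else
        echo "$1 heads x $2 sectors: $cyl-byte cylinders do NOT align to $3 KiB"
    fi
}

check_geometry 255 63 128   # the traditional default geometry
check_geometry 32 32 512
check_geometry 224 56 128
```

A geometry only helps if every cylinder boundary is a multiple of the erase block size, so it is worth running the check against the largest erase block size you expect to encounter.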
I found two posts which detail this process: Aligning Filesystems to an SSD's Erase Block Size and Partition alignment for OCZ Vertex in Linux. The first one recommends 224 heads and 56 sectors, but I can't quite understand where those numbers come from, so I used the advice from the post on the OCZ forums with 32 heads and 32 sectors, which means fdisk uses a cylinder size of 1024 sectors. And because fdisk partitions in units of cylinders (= heads x sectors x 512 bytes), fdisk's unit size is now 512 KiB, which happens to be an SSD's maximum erase block size. Nice!

To make fdisk use 32 heads and 32 sectors, remove all partitions from a hard drive and then launch fdisk with the following command line when you create the first partition:

    fdisk -S 32 -H 32 /dev/sda

The OCZ post also recommends starting at the second cylinder (the second 512 KiB unit) because the first partition is otherwise shifted by one track. Don't ask me why :)

Here's how I partitioned my SSD in the end:

[partition listing not included]

For a normal hard drive, I'd probably use 128 heads and 32 sectors now to achieve 4 KiB boundaries for my partitions.

RAID Chunk Size

If you plan on running a software RAID array, I've seen chunk sizes of 64 KiB and 128 KiB being recommended. This can be specified using the --chunk parameter for mdadm, e.g.

    mdadm --create /dev/md3 --level=1 --chunk=128 --raid-devices=2 /dev/sda3 /dev/sdb3

(Note that mdadm only uses the chunk size for striped levels such as RAID 0, 4, 5, 6 and 10, so for a RAID-1 mirror like the one in this example it has no effect on the data layout.)

Probably the larger chunk size is more useful if you are storing large files on the RAID partition, but I haven't found any advice which included benchmarks or at least a solid explanation yet.

File System Alignment

Now that the partitions have been taken care of, the file systems need to use proper alignment as well. Generally all file systems use some kind of allocation blocks, usually with a size of 4 KiB. But increasing this size to 128 KiB (or even 512 KiB) would waste a lot of space since any file would use up space in a multiple of that number. Luckily, Linux file systems can be tweaked a lot.
I'm using ext4, where the -E stride,stripe-width parameters control the alignment. The HowTos/Disk Optimization page in the CentOS wiki gives this advice:

    The drive calculation works like this: You divide the chunk size by the block size for one spindle/drive only. This gives you your stride size. Then you take the stride size, and multiply it by the number of data-bearing disks in the RAID array. This gives you the stripe width to use when formatting the volume. This can be a little complex, so some examples are listed below.

    For example if you have 4 drives in RAID5 and it is using 64K chunks and given a 4K file system block size. The stride size is calculated for the one disk by (chunk size / block size), (64K/4K) which gives 16K. While the stripe width for RAID5 is 1 disk less, so we have 3 data-bearing disks out of the 4 in this RAID5 group, which gives us (number of data-bearing drives * stride size), (3*16K) gives you a stripe width of 48K.

The Linux Kernel RAID wiki offers further insight:

    Calculation

    chunk size = 128kB (set by mdadm cmd, see chunk size advise above)
    block size = 4kB (recommended for large files, and most of time)
    stride = chunk / block = 128kB / 4k = 32kB
    stripe-width = stride * ( (n disks in raid5) - 1 ) = 32kB * ( (3) - 1 ) = 32kB * 2 = 64kB

    If the chunk-size is 128 kB, it means, that 128 kB of consecutive data will reside on one disk. If we want to build an ext2 filesystem with 4 kB block-size, we realize that there will be 32 filesystem blocks in one array chunk.

    stripe-width=64 is calculated by multiplying the stride=32 value with the number of data disks in the array.

    A raid5 with n disks has n-1 data disks, one being reserved for parity. (Note: the mke2fs man page incorrectly states n+1; this is a known bug in the man-page docs that is now fixed.) A raid10 (1+0) with n disks is actually a raid 0 of n/2 raid1 subarrays with 2 disks each.

Both quotes attach a K or kB suffix to the results, but stride and stripe-width are counts of file system blocks, not sizes: dividing 64 KiB by 4 KiB gives 16 blocks, and those plain numbers are what gets passed to mkfs.
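The two calculations quoted above can be sketched in a few lines of shell. This is only an illustration of the arithmetic: `ext_raid_opts` is a hypothetical helper name (it is not part of e2fsprogs), and the number of data-bearing disks has to be worked out by hand for your RAID level.

```shell
#!/bin/sh
# Sketch: derive the ext2/3/4 stride and stripe-width values from RAID parameters.
# ext_raid_opts is a made-up helper name used for illustration only.

# ext_raid_opts CHUNK_KIB BLOCK_KIB DATA_DISKS
# Prints the -E option string for mkfs.
ext_raid_opts() {
    stride=$(( $1 / $2 ))          # file system blocks per RAID chunk
    width=$(( stride * $3 ))       # blocks in one full stripe across the data disks
    echo "-E stride=${stride},stripe-width=${width}"
}

# 4-disk RAID-5, 64 KiB chunks, 4 KiB blocks: 3 data-bearing disks
ext_raid_opts 64 4 3

# 3-disk RAID-5, 128 KiB chunks, 4 KiB blocks: 2 data-bearing disks
ext_raid_opts 128 4 2
```

The printed string can be pasted into an mkfs.ext4 command line; both values are unitless block counts.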
So these are the stride and stripe-width parameters I'd use:

Intel SSDs with an erase block size of 128 KiB (or 512 KiB; Intel isn't quite straightforward with this, see the comments section for a discussion on the subject. If anyone from Intel is reading this, help us out! ;-)) that are not part of a software RAID:

    -E stride=32,stripe-width=32

OCZ Vertex SSDs with an erase block size of 512 KiB that are not part of a software RAID:

    -E stride=128,stripe-width=128

Normal hard drives that are not part of a software RAID: trust the defaults.

Any software RAID:

    -E stride=raid chunk size / file system block size,stripe-width=stride x number of data-bearing disks

Thus, I set up the file systems on the Intel SSD like this:

    mkfs.ext4 -b 1024 -E stride=128,stripe-width=128 -O ^has_journal /dev/sda1
    mkfs.ext4 -b 4096 -E stride=32,stripe-width=32 /dev/sda3

mkfs.ext4 defaulted to 1024 byte allocation units on my boot partition, so I adjusted the stride up to 128 blocks (128 x 1 KiB = 128 KiB) according to the advice from the CentOS wiki. The alignment of my boot partition is probably not of any relevance because the system will read maybe 10 files from it and not modify anything, but I wanted to stay consistent :)

==============================================

http://www.zdnet.com/blog/perlow/geek-sheet-a-tweakers-guide-to-solid-state-drives-ssds-and-linux/9190

Geek Sheet: A Tweaker's Guide to Solid State Drives (SSDs) and Linux
By Jason Perlow | November 18, 2010, 8:45am PST

Summary: Got Linux? Got SSD? Here’s your tweaking guide.

Editor’s note: this post was originally published in July of 2008. It’s been updated with current information.

Is 20th century conventional Winchester multi-platter, multi-head random-access disk technology too quaint for you? Want to run your PC or server on storage devices that consume far less energy than the traditional alternatives? Want a portable or mobile storage unit that will never fail due to G-forces or “crashing”?
Looking for a highly reliable and fast random access storage medium to use for your most important data? Do you have $250.00-$500.00 in spare change lying around?

Then Solid State Drives (SSDs) are for you.

Right now, the cost of using 2.5″ SATA SSDs as exclusive primary storage devices is rather high. One of the higher-performing 120GB units costs around $240-$250, or $2 per gigabyte. Compared with commodity pricing on traditional mechanical disk storage, where 1TB hard disks can now be found for as little as $60-$75 each (approximately 6 cents per gigabyte), that puts SSDs out of reach for most consumers as bulk storage.

For the most part, they’ve been relegated to executive-class ultra power-miserly and ultra-thin notebooks like the MacBook Air and other high-end business class notebooks running on Windows. But that doesn’t mean these babies can’t perform well on Linux and other Unix operating systems like FreeBSD, and they can be an easy way to boost the performance of your system provided that you don’t use them as your primary storage device. At their $2 per GB price point, unless you are really rolling in cash, you probably don’t want to be stuffing these with anything less than mission-critical data.

SSDs perform exceedingly well for things like MySQL databases, provided you tweak your kernel, BIOS, and filesystems accordingly. They’re also excellent for boot drives, root filesystems, and storing small VM files. For example, I run my Ubuntu Linux 10.10 boot drive and a 20GB Windows XP VM on mine, a 120GB OCZ Vertex 2.

When this piece was originally posted in 2008, veteran Linux hacker Geoff “Mandrake” Harrison was happy to provide me with some insight from an O’Reilly MySQL conference in which he and his colleagues presented on how to make your MySQL machines fly using SSDs.
Mandrake’s employer at the time had spent an obscene amount of money on SSDs, and he was happy to brag about how much power they were saving and the performance they were getting out of them relative to good old SATA2 disk units.

Also See: MagicFab SSD Checklist (Ubuntu.com)

Tweak #1: If your system motherboard uses a disk caching bus, change the BIOS setting from “Write Through” to “Write Back”.

The standard practice on Linux with conventional drives is to set it as “Write Through”, but the simpler architecture of an SSD results in poorer performance with this default setting. Once you’ve enabled it in the BIOS, you can set this on a drive-by-drive level by executing this command as the root user:

    [root@techbroiler ~]# hdparm -W1 /dev/sda

Similarly, if you have conventional drives in your system in addition to SSDs, you can issue the following for each of them (substituting the conventional drive's device name) to disable write-back caching:

    [root@techbroiler ~]# hdparm -W0 /dev/sda

To make these changes persist between reboots, add the commands to the /etc/rc.local file.

Tweak #2: Use the “noop” simple I/O scheduler.

By default, Linux uses an “elevator” so that platter reads and writes are done in an orderly and sequential manner. Since an SSD is not a conventional disk, the “elevator” scheduler actually gets in the way. By adding

    block/sda/queue/scheduler = noop

to your /etc/sysfs.conf (requires the sysfsutils package) or

    elevator=noop

to the kernel boot parameters in your /etc/default/grub file (this boot parameter applies to all drives, so it assumes the SSD is your only drive) you will greatly improve read and write performance on your SSD.

For those of you using Linux in virtual machines on conventional drives such as JBOD and SAN-based arrays, this is a good practice as well, since most VMs are implemented in image files (such as .vmdk on VMware and .vhd on Hyper-V) and there is no need to treat I/O to a virtual disk the same as a physical one.

Tweak #3: Change the file system mount options on SSDs to “noatime” and mount your /tmp in RAM.
On certain Linux distributions, such as Ubuntu, the default is “relatime”. This tells the kernel to update the Last Accessed Time attribute on files only when it is older than the last modification (or, on recent kernels, more than a day old). Conversely, “noatime” tells the kernel not to write access times at all, which considerably improves performance. Linus himself suggests using it in circumstances such as this, so I consider it to be gospel. Here’s what my /etc/fstab looks like:

    UUID=aaf49668-2624-4238-a486-baf341361be6 / ext4 noatime,discard,data=ordered,errors=remount-ro 0 1
    tmpfs /tmp tmpfs nodev,nosuid,noexec,mode=1777 0 0

Note that I have set data=ordered as opposed to data=writeback. Writeback mode journals metadata only and does not order data writes ahead of it, so it is faster but less safe after a crash; ext2 has no journal at all, so neither option applies to it. In my case, I’m using ext4 with its journal enabled, so data=ordered is a good compromise.

Additionally, because I have a large amount of RAM on my system (16GB) I’ve moved the temp file system into RAM as opposed to running it on disk. This also creates a significant performance boost and reduces unnecessary writes to the SSD. If you have at least 4GB of RAM, this is a good idea.

Tweak #4: Ditch the journal and RAID your SSDs.

File system journaling is done primarily for increased reliability, but it’s a drag on performance. Given that SSDs by their nature are going to be less prone to reliability quirks than a conventional drive, Mandrake suggests creating a RAID1 of two SSD units and formatting the file system to ext2, or formatting them to ext3 and mounting them as ext2 in the /etc/fstab. Dump your MySQL database on a RAID of SSDs, and you’ll be in performance hog heaven.

[EDIT 11/19/2010: It has been suggested in the comments that formatting the filesystem as ext4 and mounting with the journal enabled and RAID1 may actually be faster than ext2 RAID1 and provide additional referential integrity, and mounting an ext4 RAID1 in unjournalized mode may also be faster than ext2 RAID1. Your mileage may vary.]
[EDIT 07/27/2008: Some concerns were raised about what could happen if the power goes out and you lose referential integrity of the FS and are unable to replay it from the journal, so you might want to sync the database to a traditional disk with a journaled FS for backups.]

Gonna mortgage your house or send your children into slavery for sweet SSD love? Talk Back and let me know.

==============================================

http://www.nuclex.org/blog/personal/80-aligning-an-ssd-on-linux

SATURDAY, FEBRUARY 20, 2010

Linux SSD Optimization Guide

Solid state drives went from being the new kid on the block to being a stock option in a very short period of time. There has been a lot of speculation concerning SSDs, most of it concentrated on their reliability. Although it is now clear that journaling filesystems aren't as big a problem as initially thought, you may still want to optimize your computer's operating system and limit the number of writes that the drive receives. Also, some of the optimizations presented in this article are not about diverting writes from the drive at all; their performance impact comes from how the kernel treats the drive and its cache options.

We will start by tuning the underlying filesystem, then we will subdue misbehaving applications, and after that we will alter the way that the kernel schedules disk writes and the caching method.

Filesystem options

If you are just installing the OS on your new laptop and you are concerned about the lifespan of your SSD, you might as well choose a non-journaling filesystem. You should be aware that journaling ones are more resilient to errors, but if you are using it on a laptop there is little chance of your system suddenly turning off due to an unexpected power failure. Also, you can jar and bump your drive as much as you want; the bits and bytes on the drive will stay right where they are.
Format your partition(s) with the ext2 file system if you think you don't need journaling, otherwise go with its younger incarnations, like ext3 or ext4. Whichever you choose, don't forget to enable the noatime option, which will save a few writes by not updating the access time on the files that you use.

You can also apply these two modifications to an existing system by altering the /etc/fstab file. You can only change from an ext3 filesystem back to ext2, and to do that you just change the corresponding type in the fstab file. Adding the noatime option requires you to append it to the field immediately after the filesystem type. On my computer, the unmodified line looks like this:

    UUID=57ee6d85-c209-47fc-9ebc-d626dfd99df5 / ext3 errors=remount-ro 0 1

After applying the changes, it's like this:

    UUID=57ee6d85-c209-47fc-9ebc-d626dfd99df5 / ext2 noatime,errors=remount-ro 0 1

Note that there are no spaces in the mount options field. Also, you can't go from ext4 back to ext2, since there have been quite a lot of changes in between the two.

If you are an advanced user and you would like to use ext4 without the journal, all you need to create one is a Linux distro with a recent e2fsprogs package and the following command:

    mke2fs -t ext4 -O ^has_journal /dev/YOUR_DEVICE

Less Writes

Having cut down on the amount of disk usage that the Linux operating system does just to keep your files organized and accessible, it's time to discipline the applications a little. To do that, we will go back to /etc/fstab, where we will append these lines:

    tmpfs /tmp tmpfs defaults 0 0
    tmpfs /var/tmp tmpfs defaults 0 0
    tmpfs /var/log tmpfs defaults 0 0

They basically move the /tmp and /var/tmp folders and the part of /var which holds the system logs (not that important if you aren't running a server) into RAM, keeping them away from your SSD.
However, those folders won't be pre-populated as they would have been if they were persistent, so we need to re-create the necessary folders in order to keep various applications happy. Here is a script that takes care of that:

    # Re-create the log directories that services expect under /var/log
    for dir in apparmor apt ConsoleKit cups dist-upgrade fsck gdm installer news ntpstats samba unattended-upgrades ; do
        if [ ! -e "/var/log/$dir" ] ; then
            mkdir "/var/log/$dir"
        fi
    done

To make sure that the script will run every time the computer is booted, you need to add it to the /etc/rc.local file, down at the bottom, just above the exit 0 line. Note that the above script creates the folders required by a system running the GDM display manager from GNOME; if you are a KDE fan, add kdm to the for dir in [...] enumeration. The same goes for any other missing folder that shows up in the errors exposed by dmesg.

Another heavy disk user, believe it or not, is your web browser. It keeps a cache with elements from the pages you visit, like images and scripts, that can be loaded straight from disk on subsequent visits to the website. Although that may sound like a good idea, a fast Internet connection can offset the caching speed boost. So, the best way to see if caching is useful to you or not is to test your browser both with this function on and off. I will only show you how to toggle the cache in Mozilla Firefox, but a Google search should allow you to quickly find how to do that for other browsers. It's a pretty straightforward job: type about:config in your browser's address bar, and in the page that opens set the browser.cache.disk.enable variable to false by double-clicking on it. Job done!

Kernel parameters tuning

Since platter-based hard drives were the most widespread storage option, and they still are, it's only natural for the kernel to be optimized with them in mind. One of these optimizations is the disk write scheduler, or elevator. The default is typically cfq or deadline, depending on your distribution and kernel, but by switching to noop we will get better disk throughput.
The way to achieve that depends on the bootloader you are using, but it basically comes down to passing the elevator=noop parameter to the kernel at boot.

For the original GRUB, adding a parameter to the configuration and making it stick is easy: you edit the /boot/grub/menu.lst file, heading straight for the ## ## Start Default Options ## section. Lower in the file is a line that looks like this:

    # kopt=vga=normal root=/dev/sda2 ro

(note that it might differ significantly on your system, but the # kopt= part is what you are looking for). To those parameters we will add our elevator=noop magic. On my system, I ended up with:

    ## ## Start Default Options ##
    ## default kernel options
    ## default kernel options for automagic boot options
    ## If you want special options for specifiv kernels use kopt_x_y_z
    ## where x.y.z is kernel version. Minor versions can be omitted.
    ## e.g. kopt=root=/dev/hda1 ro
    # kopt=vga=normal root=/dev/sda2 ro elevator=noop

After you have made that edit, you have to ask grub to apply the new parameters, and you do that by running update-grub as root.

If you're on a newer system that employs GRUB's ext4-supporting sibling, GRUB2, you need to follow a different path of action. We'll be looking at the /etc/default/grub file, which is a great deal shorter and cleaner than its predecessor's configuration. Somewhere in its upper half is the GRUB_CMDLINE_LINUX_DEFAULT= parameter. Add the elevator=noop bit to it, and remember to preserve the quotes. As usual, here's how my setup looks after the modifications:

    [...]
    GRUB_TIMEOUT="10"
    GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=noop"
    GRUB_CMDLINE_LINUX=""
    [...]

Make GRUB2 aware of your modifications by running update-grub2.

One more optimization to apply: the caching method. Regular hard drives work better with the write-through caching method, but that's not true for SSDs.
Solid state drives seem to work better when another type of caching is used, namely write-back. There is a catch, though: not all of them support it. Fortunately, you can easily find out whether it works for your drive. Run the following command as root:

    hdparm -W1 /dev/sda

replacing sda with the corresponding path for your setup (if you have more than one drive, run it for all of them, one by one, and note the results). If no error is returned, you can add the command to the /etc/rc.local file, above the exit 0 line, and it will be run each time your computer starts.

That's it. Your SSD should now have a longer lifespan, and your Linux operating system will make the most of the advantages that this kind of drive has to offer.
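To confirm after a reboot that the elevator setting described earlier actually took effect, you can read the scheduler file that the kernel exposes under /sys/block. The following is a minimal sketch: `active_scheduler` is a hypothetical helper name, and it relies on the kernel marking the active scheduler with square brackets, as in "noop deadline [cfq]". The scheduler names offered by your kernel may differ from the ones used in this article.

```shell
#!/bin/sh
# Sketch: report the active I/O scheduler of every block device.
# active_scheduler is a made-up helper name used for illustration only.

# active_scheduler LINE
# Prints the name enclosed in square brackets on a scheduler line;
# prints nothing if the line contains no brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue          # skip devices without a scheduler file
    dev=${f#/sys/block/}             # strip the leading path...
    dev=${dev%%/*}                   # ...and everything after the device name
    echo "$dev: $(active_scheduler "$(cat "$f")")"
done
```

The write cache setting can be checked in a similar spirit by running hdparm -W /dev/sda without a value, which reports the current state instead of changing it.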