Backblaze blog hard drive
8/19/2023

Survival Analysis: Backblaze Hard Drives

Backblaze, a cloud backup service, provides one of the best public services on the internet by periodically posting hard drive failure rates for the drives in their datacenter. Because they use large numbers of consumer hard drives, which are the same ones I would consider buying to use in my desktop computer, I like to consult their blog whenever I am shopping for a new one.

The key metric presented in Backblaze's blog posts, the annualized drive failure rate, is a reasonably good starting point for understanding which hard drives are more reliable than others. However, not all of Backblaze's drives are the same age, and an individual drive's chance of failure may vary over its lifetime in a way that is not captured by a simple summary statistic. For example, a drive model's failure rate may hold steady at a reasonable value while the drive is under warranty, only to climb after it hits a certain age.

Capturing patterns like this is the domain of survival analysis, and in particular a visualization called the "survival curve". This is a plot of the fraction of a population that hasn't yet had some terminal event happen to it ("death" or "failure") as a function of time elapsed since some starting point. It is a commonly used technique in studies of medical patient survival rates after receiving some treatment, but equipment failure is another good application.

Backblaze commendably makes fully granular source data available in addition to the summary statistics in their blog posts. Ross Lazarus, an Australian computational biologist, used this dataset to build survival curves for hard drives in a series of blog posts in 2016. Reading this series was the first I had heard of a survival curve, and it seems like such a great visualization for understanding this data that I'm surprised Backblaze doesn't report it themselves. In this post I'll repeat Ross Lazarus' analysis using data that has been updated through Q3 2019. I also refine his technique somewhat by using Apache Spark to improve the performance and expressiveness of the aggregation.

To construct the survival curve, we will use a Kaplan-Meier estimator. This estimator builds survival curves in a way that makes good use of incomplete observations. This is important because not every hard drive will be in the datacenter long enough to fail. For example, if a hard drive is retired from the datacenter after only one year to make space for one with a larger capacity, then this drive didn't fail; instead, it is "right-censored" in the data. This means we don't know exactly how long it would have taken to fail, but we know that it worked without failing for at least one year. The Kaplan-Meier estimator enables us to use this partial information to build the survival curve instead of throwing it away.

Backblaze provides a dataset that has one row per drive, per day that it operated, with a full snapshot of the drive's self-reported SMART stats on that day. For our purposes, we need only one row per drive with two pieces of information: how long it operated for and whether it failed or not. In theory, this is a straightforward aggregate to run, but the raw dataset Backblaze provides is many gigabytes in size, so it doesn't fit into memory and can't be processed directly in Pandas. A person who wears suits to work might refer to this as a "big data" problem.

Ross Lazarus solved this by exploiting the fact that the dataset is ordered and partitioned by day. His Python program loads one day at a time and keeps track of the first and last day it observes each serial number. This is a clever approach that works pretty well, but specifying it in this way (the "imperative" style of programming) isn't as expressive as it could be using a declarative language like SQL. Using SQL, I could express our desired aggregate as

    select
        manufacturer,
        model,
        serial_number,
        max(date) as retired_date,
        min(date) as launched_date,
        count(date) as observed_days,
        max(failure) as failure
    from source_table
    group by manufacturer, model, serial_number

I find this to be a more concise description of what's going on. Unfortunately, to actually run this query, I would typically need to first load the data into a database server such as Postgres. Upon load, the database program would rewrite all that data into its internal data format, effectively making a copy of the large input dataset.
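The two steps this post describes, collapsing the daily rows to one row per drive and then feeding durations and failure flags to a Kaplan-Meier estimator, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used by Ross Lazarus or the Spark job mentioned in this post: the rows are made-up toy data reduced to `(date, serial_number, failure)`, and the function names `summarize_drives` and `kaplan_meier` are hypothetical.

```python
from datetime import date, timedelta

# Toy data (invented for illustration, not Backblaze data):
# serial -> (days observed, whether the drive failed on its last day).
spec = {"A": (2, True), "B": (3, False), "C": (4, True), "D": (4, False)}
rows = []
for serial, (n_days, fails) in spec.items():
    for offset in range(n_days):
        day = date(2019, 1, 1) + timedelta(days=offset)
        failure = 1 if (fails and offset == n_days - 1) else 0
        rows.append((day, serial, failure))
rows.sort(key=lambda r: r[0])  # mimic a dataset ordered by day


def summarize_drives(rows):
    """Stream daily rows and keep one running summary per serial number."""
    summary = {}
    for day, serial, failure in rows:
        s = summary.get(serial)
        if s is None:
            summary[serial] = {"launched": day, "retired": day,
                               "observed_days": 1, "failure": failure}
        else:
            s["launched"] = min(s["launched"], day)
            s["retired"] = max(s["retired"], day)
            s["observed_days"] += 1
            s["failure"] = max(s["failure"], failure)
    return summary


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct failure time t.

    S(t) is the running product of (1 - failed / at_risk), where at_risk
    counts drives observed for at least t days. Censored drives stay in
    the risk set until their last observed day, then drop out without
    counting as failures.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    curve, survival = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        failed = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - failed / at_risk
        curve.append((t, survival))
    return curve


summary = summarize_drives(rows)
durations = [s["observed_days"] for s in summary.values()]
events = [s["failure"] for s in summary.values()]
curve = kaplan_meier(durations, events)
print(curve)  # [(2, 0.75), (4, 0.375)]
```

Drive B is retired after three days without failing, so it counts toward the four drives at risk on day 2 but is gone from the risk set by day 4, which is how right-censored drives still contribute information to the curve.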