My company recognized
early, near the inception of the product, that if we were able to collect
enough operational data about how our products are performing in the field, get
it back home and analyze it, we'd be able to dramatically reduce support costs.
We'd also be able to create a feedback loop that allows engineering to improve the product very quickly, according to the demands being placed on it in the field.
Looking at it from that
perspective, to get it right, you need to do it from the inception of the
product. If you look at how much data we get back for every array we sell in the field, we could be receiving anywhere from 10,000 to 100,000 data points per minute from each array. We bring those back home, put them into a database, and run a lot of intensive analytics on that data.
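To give a sense of that scale, here's a rough back-of-envelope calculation. The fleet size used here is a made-up figure purely for illustration:

```python
# Rough ingest math for the telemetry rates quoted above.
# The 1,000-array fleet size is a hypothetical figure for illustration.
POINTS_PER_MIN_HIGH = 100_000   # upper bound quoted per array
ARRAYS = 1_000                  # hypothetical fleet size

per_array_per_day = POINTS_PER_MIN_HIGH * 60 * 24   # 144 million points
fleet_per_day = per_array_per_day * ARRAYS          # 144 billion points

print(f"Per array, per day (high end): {per_array_per_day:,}")
print(f"Fleet-wide, per day:           {fleet_per_day:,}")
```

Even at the low end of the quoted range, a modest fleet lands in the tens of billions of data points per day, which is why the database behind it matters so much.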
Once you're doing that, you start leveraging the data. You're making support recommendations and so on, but then you realize you could do a lot more with it. For example, we can do dynamic cache sizing: we can figure out how much cache a customer needs based on an analysis of their real workloads.
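To sketch how that kind of sizing could work (this is one plausible approach, not necessarily how our analysis is actually implemented), you can rank blocks in an access trace by frequency and size the cache to cover the hottest blocks that account for a target share of the I/O:

```python
from collections import Counter

def cache_size_for_hit_rate(block_accesses, block_size_bytes, target_hit_rate=0.9):
    """Estimate the cache capacity needed to serve `target_hit_rate` of accesses.

    Hypothetical sketch: rank blocks by access frequency and size the cache
    to hold the hottest blocks covering the target share of the trace.
    """
    freq = Counter(block_accesses)
    total = sum(freq.values())
    covered = 0
    blocks_needed = 0
    for _, count in freq.most_common():
        covered += count
        blocks_needed += 1
        if covered / total >= target_hit_rate:
            break
    return blocks_needed * block_size_bytes

# Toy trace: a skewed workload where a few blocks are hot.
trace = [1, 1, 1, 2, 2, 3, 1, 2, 4, 1, 5, 1, 2, 1]
print(cache_size_for_hit_rate(trace, block_size_bytes=4096, target_hit_rate=0.8))
```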
We've found that big data is really paying off for us, and we want to keep increasing that payoff. To do that, we need to be able to run bigger queries faster. We have a team of data scientists, and we don't want them sitting here twiddling their thumbs. That's what brought us to Vertica.
We have a very tight
feedback loop. In one release we put out, we may make some changes in the way
certain things happen on the back end, for example, the way NVRAM is
drained. There are some very particular details around that, and we can observe
very quickly how that performs under different workloads. We can make tweaks
and do a lot of tuning.
Without the kind of data we have, we might have multiple cases being opened on performance in the field, escalations, looking at cores, and then simulating things in the lab. It's a very labor-intensive, slow process with very little data to base decisions on.
When you bring home operational data from all your products in the field,
you're now talking about being able to figure out in near real-time the
distribution of workloads in the field and how people access their storage. I
think we have a better
understanding of the way storage works in the real world than any
other storage vendor, simply because we have the data.
I don't remember the exact year, but it was roughly eight years ago that I became aware of Vertica. At some point, there was an announcement that Mike Stonebraker was involved in a group that was going to productize the C-Store database, which was sort of an academic experiment at MIT, to understand the benefits and capabilities of a real column store.
I was immediately
interested and contacted them. I was working at another storage company at the
time. I had a 20 terabyte
(TB) data
warehouse,
which at the time was one of the largest Oracle on Linux data
warehouses in the world.
They didn't want to
touch that opportunity just yet, because they were just starting out in alpha
mode. I hooked up with them again a few years later, when I was CTO at a
different company, where we developed what's substantially an extract,
transform, and load (ETL) platform.
By then, they were
well along the road. They had a great product and it was solid. So we tried it
out, and I have to tell you, I fell in love with Vertica because of the
performance benefits that it provided.
When you start thinking about collecting as many different data points as we like to collect, you have to recognize that you're going to end up with a couple of choices on a row store: either you have very narrow tables, and a lot of them, or you waste a lot of I/O retrieving entire rows when you just need a couple of fields.
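A quick worked example makes that overhead concrete. Suppose a hypothetical 50-column table of 8-byte values with 100 million rows, and a query that only needs two of the columns (this ignores compression, which only widens the gap in a column store):

```python
# Rough I/O comparison for reading 2 fields from a wide table.
# All figures are hypothetical and ignore compression and encoding.
COLUMNS, BYTES_PER_COL, ROWS = 50, 8, 100_000_000
NEEDED_COLUMNS = 2

row_store_bytes = COLUMNS * BYTES_PER_COL * ROWS            # whole rows come back
column_store_bytes = NEEDED_COLUMNS * BYTES_PER_COL * ROWS  # only the two columns

print(f"Row store scan:    {row_store_bytes / 1e9:.1f} GB")
print(f"Column store scan: {column_store_bytes / 1e9:.1f} GB")
print(f"I/O ratio:         {row_store_bytes / column_store_bytes:.0f}x")
```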
That was what piqued my
interest at first. But as I began to use it more and more, I realized that the
performance benefits you could gain by using Vertica properly were another
order of magnitude beyond what you would expect just with the column-store
efficiency.
That's because of certain features that Vertica offers, such as something called pre-join projections. At a high level, they let you maintain the normalized logical integrity of your schema while, under the hood, keeping an optimized, denormalized physical layout on disk for query performance.
You can be efficient with a denormalized structure on disk because Vertica allows you to do some very efficient types of encoding on your data. All of the low-cardinality columns that would have been wasting space in a row store end up taking almost no space at all.
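To illustrate, here's a minimal sketch of a pre-join projection with run-length encoding (RLE) on the low-cardinality columns, submitted through the vertica-python client. The table and column names, and the connection details, are hypothetical:

```python
import vertica_python

# Hypothetical schema: a sensor-sample fact table joined to an array
# dimension table. RLE collapses the repetitive, low-cardinality columns.
PREJOIN_PROJECTION_DDL = """
CREATE PROJECTION samples_prejoin (
    array_id  ENCODING RLE,
    model     ENCODING RLE,
    metric    ENCODING RLE,
    sample_ts,
    value
) AS
SELECT s.array_id, a.model, s.metric, s.sample_ts, s.value
FROM samples s JOIN arrays a ON s.array_id = a.array_id
ORDER BY s.array_id, s.metric, s.sample_ts
SEGMENTED BY HASH(s.array_id) ALL NODES;
"""

# Connection details are placeholders.
with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="dbadmin", password="...",
                            database="telemetry") as conn:
    conn.cursor().execute(PREJOIN_PROJECTION_DDL)
```

The ORDER BY clause matters here: sorting on the low-cardinality columns first is what lets RLE reduce them to almost nothing.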
It's been my impression that Vertica is the data warehouse you would have wanted to build 10 or 20 years ago, but nobody had done it yet.
Nowadays, when I'm evaluating other big data platforms, I always have to look at them from the perspective of: it's great, we can get some parallelism here, and there are certain operations we can do that might be difficult on other platforms. But I always have to compare them to Vertica, and frankly, I always find that Vertica comes out on top in terms of features, performance, and usability.
I built the environment at
my current company from the ground up. When I got here, there were roughly 30
people. It's a very small company. We started with Postgres. We started with
something free. We didn't want a large budget dedicated to the backing infrastructure just yet; we weren't ready to monetize it.
So, we started on
Postgres and we've scaled up now to the point where we have about 100 TBs on
Postgres. We get decent performance out of the database for the things that we
absolutely need to do, which are micro-batch updates and transactional
activity. We get that performance because the database lives here.
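For the micro-batch side, a minimal sketch of the kind of upsert pattern involved might look like this in Python with psycopg2 (the table, columns, and connection string are hypothetical, and ON CONFLICT assumes Postgres 9.5 or later):

```python
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical micro-batch upsert: insert a small batch of sensor samples,
# updating in place on conflict. Connection details are placeholders.
conn = psycopg2.connect("dbname=telemetry user=etl")
rows = [(1, "2015-06-01 00:00:00", 0.42),
        (2, "2015-06-01 00:00:00", 0.37)]

with conn, conn.cursor() as cur:
    execute_values(
        cur,
        """
        INSERT INTO sensor_samples (array_id, sample_ts, value)
        VALUES %s
        ON CONFLICT (array_id, sample_ts) DO UPDATE SET value = EXCLUDED.value
        """,
        rows,
    )
conn.close()
```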
I don't know what
the largest unsharded Postgres instance is in the world, but I feel
like I have one of them. It's a challenge to manage and leverage. Now, we've
gotten to the point where we're really enjoying doing larger queries. We want to understand the entire installed base and run analyses that extend across all of it.
We want to understand the
lifecycle of a volume. We want to understand how it grows, how it lives, what
its performance characteristics are, and then how gradually it falls into
senescence when people stop using it. It turns out there is a lot of really rich information that we now have access to, which lets us understand storage lifecycles in a way I don't think was possible before.
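A hedged sketch of what one such lifecycle query might look like, against a hypothetical table of periodic capacity snapshots, uses a window function to track week-over-week growth per volume:

```python
# Hypothetical lifecycle query: week-over-week growth for each volume,
# computed with a window function over periodic capacity snapshots.
LIFECYCLE_SQL = """
SELECT volume_id,
       snapshot_week,
       used_bytes,
       used_bytes - LAG(used_bytes) OVER (
           PARTITION BY volume_id ORDER BY snapshot_week
       ) AS weekly_growth_bytes
FROM volume_snapshots;
"""
```

A volume whose weekly growth trends toward zero for long stretches would be a candidate for the senescence phase described above.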
But to do that, we
need to take our infrastructure to the next level. So we've been doing that and
we've loaded a large amount of our sensor data, the numerical data I talked about, into Vertica, started to compare the queries, and then started to use Vertica more and more for all the analysis we're doing.
Internally, we're using Vertica just because of the performance benefits. I can give you an example. We had one particularly large query. It was to look
at certain aspects of latency over a month across the entire installed base to
understand a little bit about the distribution, depending on different factors,
and so on.
We ran that query in
Postgres, and depending on how busy the server was, it took anywhere from
12 to 24 hours to run. On Vertica, to run the same query on the same data takes
anywhere from three to seven seconds.
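The actual query isn't reproduced here, but a hedged sketch of its shape, against a hypothetical latency-sample table, might look like this:

```python
import vertica_python

# Hypothetical schema: per-minute latency samples across the installed base.
QUERY = """
SELECT array_id,
       APPROXIMATE_PERCENTILE(latency_ms USING PARAMETERS percentile = 0.95)
           AS p95_latency_ms
FROM latency_samples
WHERE sample_ts >= '2015-01-01' AND sample_ts < '2015-02-01'
GROUP BY array_id;
"""

# Connection details are placeholders.
with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="analyst", password="...",
                            database="telemetry") as conn:
    cur = conn.cursor()
    cur.execute(QUERY)
    for array_id, p95 in cur.fetchall():
        print(array_id, p95)
```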
I anticipated that
because we were aware upfront of the benefits we'd be getting. I've seen it
before. We knew how to structure our projections to get that kind of
performance. We knew what kind of infrastructure we'd need under it. I'm really
excited. We're getting exactly what we wanted and better.
This is only a three-node cluster. Look at the performance we're getting. On the smaller
queries, we're getting sub-second latencies. On the big ones, we're getting
sub-10 second latencies. It's absolutely amazing. It's game changing.
People can sit at
their desktops now, manipulate data, come up with new ideas and iterate without
having to run a batch and go home. It's a dramatic productivity increase. Data
scientists tend to be fairly impatient. They're highly paid people, and you
don’t want them sitting at their desk waiting to get an answer out of the
database. It's not the best use of their time.
When it comes to the cloud
model for deployment, there's the ease of adding nodes without downtime, the
fact that you can create a K-safe cluster. If my cluster is 16 nodes wide now and I want two nodes of redundancy, it's very similar to RAID: you can specify that, and the database will take care of it for you. You don't have to worry about the database going down and losing data as a result of a node failure.
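Concretely, in Vertica that redundancy is requested with a single call, assuming the physical design has enough buddy projections to support it (connection details here are placeholders):

```python
import vertica_python

# Mark the design K-safe with K=2: the cluster can lose any two nodes
# and keep serving data. Connection details are placeholders.
with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="dbadmin", password="...",
                            database="telemetry") as conn:
    cur = conn.cursor()
    cur.execute("SELECT MARK_DESIGN_KSAFE(2);")
    print(cur.fetchall())
```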
I love the fact that you don’t have to pay
extra for that. If I want to put more cores or nodes on it or I want to
put more redundancy into my design, I can do that without paying more for it.
Wow! That’s kind of revolutionary in itself.
It's great to see a database company incented
to give you great performance. They're incented to help you work better with
more nodes and more cores. They don't have to worry about people not being able
to pay the additional license fees to deploy more resources. In that sense,
it's great.
We have our own private cloud -- that’s how I
like to think of it -- at an offsite colocation facility. We do disaster recovery (DR) here. At the same time, we have a K-safe cluster. We had a hardware
glitch on one of the nodes last week, and the other two nodes stayed up, served
data, and everything was fine.
Those kinds of features are critical, and that
ability to be flexible and expand is critical for someone who is trying to
build a large cloud infrastructure, because you're never going to know in
advance exactly how much you're going to need.
If you do your job right as a cloud provider,
people just want more and more and more. You want to get them hooked and you
want to get them enjoying the experience. Vertica lets you do that.
Disclosure: PeerSpot has made contact with the reviewer to validate that the person is a real user. The information in the posting is based upon a vendor-supplied case study, but the reviewer has confirmed the content's accuracy.