WEBVTT

00:00.000 --> 00:07.000
Hi, can you hear me?

00:07.000 --> 00:08.000
Okay, cool.

00:08.000 --> 00:09.000
Hi, I'm Anna.

00:09.000 --> 00:12.000
I'm going to be your MC for the next couple of talks.

00:12.000 --> 00:18.000
Next up, we have James Belchamber, with "The performance impact of auto instrumentation".

00:18.000 --> 00:21.000
You can kick it off now.

00:21.000 --> 00:24.000
Right, hey, everybody.

00:24.000 --> 00:28.000
Yeah, I finished this about 10 minutes ago, so stick with me.

00:29.000 --> 00:31.000
I didn't last year.

00:31.000 --> 00:33.000
Last year, I did this thing where it was my first talk.

00:33.000 --> 00:34.000
So I did: hey, FOSDEM!

00:34.000 --> 00:36.000
And I'm going to do it again because it was really fun.

00:36.000 --> 00:38.000
So: hello, FOSDEM!

00:38.000 --> 00:41.000
Woo!

00:41.000 --> 00:46.000
And that's the sort of energy that you've got after a whole day,

00:46.000 --> 00:49.000
and how many people were at Brightnight last night?

00:49.000 --> 00:50.000
It was a couple.

00:50.000 --> 00:51.000
Okay, so it wasn't just me.

00:51.000 --> 00:52.000
Awesome.

00:52.000 --> 00:54.000
And if I've got this much energy, and you guys weren't,

00:54.000 --> 00:56.000
mostly weren't at Brightnight, then let's keep that up.

00:56.000 --> 00:57.000
Right, so everyone.

00:57.000 --> 00:58.000
My name's James.

00:58.000 --> 01:02.000
I've spent the last year or so in the trenches of an observability

01:02.000 --> 01:03.000
transformation.

01:03.000 --> 01:06.000
So I thought I'd come and tell you about it a little bit.

01:06.000 --> 01:08.000
So I'm an ops guy.

01:08.000 --> 01:14.000
I've spent about 20 years now working up from fixing computers

01:14.000 --> 01:15.000
door to door.

01:15.000 --> 01:19.000
I was a Linux sysadmin, then that became uncool,

01:19.000 --> 01:21.000
so I became a DevOps engineer.

01:21.000 --> 01:24.000
Now I'm a platform engineer, I guess.

01:25.000 --> 01:27.000
So everything's coming from this sort of perspective.

01:27.000 --> 01:31.000
I'm now working with a great software developer who will be referenced in this talk a little bit

01:31.000 --> 01:34.000
as well, so we've got that side covered too.

01:34.000 --> 01:40.000
You might know me from my talk last year, Introducing Observability to an Airline.

01:40.000 --> 01:41.000
So I'm still there.

01:41.000 --> 01:43.000
Still plugging away.

01:45.000 --> 01:47.000
Look at this guy.

01:47.000 --> 01:50.000
It was so young and so happy.

01:51.000 --> 01:55.000
So the stack that we built to support our auto instrumentation needs

01:55.000 --> 01:57.000
has kind of evolved into a platform now.

01:57.000 --> 02:00.000
And we've done some pretty cool things since then.

02:00.000 --> 02:04.000
So we have collectors in our base images that are ready to

02:04.000 --> 02:07.000
forward telemetry from them to our back end systems.

02:07.000 --> 02:10.000
We've got per-team customization on that as well.

02:10.000 --> 02:14.000
So we've got a nice suite of metrics that are coming from node exporter

02:14.000 --> 02:15.000
and windows exporter.

02:15.000 --> 02:19.000
We do have some Windows instances — that we can use for systems monitoring.

02:19.000 --> 02:24.000
Notably, by the way, just as an aside: node exporter, if you're on AWS, gives you

02:24.000 --> 02:27.000
far better information than you're getting from AWS itself.

02:27.000 --> 02:29.000
So use that if you can.

02:29.000 --> 02:33.000
And we have everything as code as well, which is lovely, including our dashboards,

02:33.000 --> 02:38.000
which have key metrics that are then added and updated for all services based on what they're using

02:38.000 --> 02:40.000
and what we should be keeping an eye on.

02:40.000 --> 02:44.000
Following sensible USE (utilization, saturation,

02:44.000 --> 02:45.000
and errors)

02:45.000 --> 02:49.000
and RED (requests, errors, and duration)

02:49.000 --> 02:52.000
guidance for the infrastructure and application layers.

02:52.000 --> 02:55.000
Again, this is customizable on a per-team basis.

02:55.000 --> 03:00.000
And of course, a lot of this is driven by auto instrumentation.

03:00.000 --> 03:04.000
Auto... am I allowed to swear?

03:04.000 --> 03:10.000
Auto instrumentation is fudging brilliant.

03:10.000 --> 03:16.000
And who here has contributed to the auto instrumentation libraries?

03:16.000 --> 03:19.000
I was hoping someone had, but fair enough.

03:19.000 --> 03:22.000
These people are cool, and wherever they are, I love them for doing this.

03:22.000 --> 03:24.000
It's made my job so much easier.

03:24.000 --> 03:26.000
They're basically there.

03:26.000 --> 03:28.000
You hate this one.

03:28.000 --> 03:31.000
Some people hate it when I say this, but it's basically like pattern matching.

03:31.000 --> 03:33.000
It reaches into applications.

03:33.000 --> 03:39.000
It recognises common patterns, mostly functions in frameworks and standard libraries.

03:39.000 --> 03:43.000
And it writes sensible spans to go with that code.

03:43.000 --> 03:48.000
So if you auto instrument a web app, for example, you're not going to get everything.

03:48.000 --> 03:50.000
You're not going to get like your custom logic.

03:50.000 --> 03:53.000
But you're going to get the connection coming in.

03:53.000 --> 03:55.000
You're going to get the response going out.

03:55.000 --> 03:59.000
You're going to get, like, Spring Framework function calls.

03:59.000 --> 04:01.000
You're going to get database connections and stuff.

04:01.000 --> 04:04.000
You're going to see that all in your spans.

04:04.000 --> 04:12.000
It's a ridiculous uplift, and it's basically free.

04:12.000 --> 04:14.000
Basically free.

04:14.000 --> 04:22.000
Now, some people that I worked with have pointed out that code needs to be executed.

04:22.000 --> 04:25.000
And that uses CPU cycles.

04:25.000 --> 04:31.000
These applications, which are being meticulously designed to be as performant as possible,

04:31.000 --> 04:36.000
mostly written in Java.

04:36.000 --> 04:41.000
So, yeah: have you thought about the performance, James?

04:41.000 --> 04:45.000
And you know, they're right.

04:45.000 --> 04:50.000
James, performance is impacted by auto instrumentation.

04:50.000 --> 04:53.000
Luckily, this is something that I did some testing on.

04:53.000 --> 04:56.000
And so for this talk, I fleshed out my knowledge.

04:56.000 --> 05:00.000
I did some demos and stuff like this, and this is what I'm going to be sharing with you.

05:00.000 --> 05:07.000
So before we go deep, let's do some basic testing on a couple of languages that we can currently auto instrument.

05:07.000 --> 05:12.000
I would like to say that this is not me showing you internal testing with the client.

05:12.000 --> 05:19.000
This is me using demonstrative testing that I can use to show what I've experienced.

05:19.000 --> 05:23.000
So the stuff that you're going to see up here is not benchmarking.

05:23.000 --> 05:28.000
You shouldn't be looking at what I'm showing you today as a benchmark of how these things perform.

05:28.000 --> 05:34.000
But it does help us tell the story of what I've experienced over the last two years.

05:34.000 --> 05:35.000
Let's start with Java.

05:35.000 --> 05:38.000
So Java auto instrumentation is incredibly easy.

05:38.000 --> 05:42.000
If you've got Java in your stack at the moment, auto instrument it — it's so easy.

05:42.000 --> 05:53.000
You download the agent, attach it with -javaagent, then you add two environment variables and restart the application.

05:53.000 --> 05:55.000
That's basically it.

05:56.000 --> 06:04.000
And like most of these solutions, what it's going to do is start pouring out OpenTelemetry protocol on port 4318.

06:04.000 --> 06:07.000
To localhost by default, I think.

06:07.000 --> 06:12.000
And then if you want it to go somewhere else, then you can add another environment variable and stuff like that.

06:12.000 --> 06:15.000
And we, in this instance, we did want it to go somewhere else.

06:15.000 --> 06:18.000
So for all these tests, we're hosting the collector externally.

06:18.000 --> 06:24.000
That just means that you'll see a little bit more network traffic, but you shouldn't see any other performance impact from the collector.
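
The attach-and-restart flow described above, with the export target pointed at the external collector, looks roughly like this — a sketch, in which the service name, collector endpoint, and app.jar are placeholders, not the actual setup from the talk:

```shell
# Grab the OpenTelemetry Java agent (published on the
# opentelemetry-java-instrumentation releases page).
curl -LO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Two environment variables: name the service, and override the
# default OTLP target (http://localhost:4318) with the external collector.
export OTEL_SERVICE_NAME=hello-app
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4318

# Attach the agent and restart the application.
java -javaagent:./opentelemetry-javaagent.jar -jar app.jar
```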

06:25.000 --> 06:27.000
So, to test auto instrumentation,

06:27.000 --> 06:29.000
I wrote a very simple Spring Boot application.

06:29.000 --> 06:32.000
This is possibly my first Java application.

06:32.000 --> 06:37.000
And it just responds to any request on slash hello with hello.

06:37.000 --> 06:39.000
Obviously, this isn't quite it.

06:39.000 --> 06:41.000
There is some other boilerplate.

06:41.000 --> 06:43.000
It is Java.

06:43.000 --> 06:47.000
I like Java, but you get the gist.
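
A minimal Spring Boot version of that endpoint — not the actual demo code, just the shape, with my own class and package names:

```java
package demo;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// One endpoint, /hello, answering with "hello" — plus the boilerplate.
@SpringBootApplication
@RestController
public class HelloApplication {

    @GetMapping("/hello")
    public String hello() {
        return "hello";
    }

    public static void main(String[] args) {
        SpringApplication.run(HelloApplication.class, args);
    }
}
```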

06:47.000 --> 06:49.000
So let's start with the most basic test.

06:49.000 --> 06:52.000
We're going to do a get and a sleep, a get and a sleep.

06:52.000 --> 06:55.000
So we're running 10 virtual users.

06:55.000 --> 06:57.000
We're running over 10 minutes.

06:57.000 --> 07:02.000
And we're just going to get my IP address on slash hello and then sleep for a second.
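
That test description, written as a k6-style script — the talk doesn't name the load tool, and the target address is a placeholder — is about this much code:

```javascript
import http from 'k6/http';
import { sleep } from 'k6';

// 10 virtual users, looping for 10 minutes.
export const options = {
  vus: 10,
  duration: '10m',
};

// Each iteration: GET /hello, then sleep for a second.
export default function () {
  http.get('http://203.0.113.10:8080/hello');
  sleep(1);
}
```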

07:02.000 --> 07:06.000
We wouldn't normally do this, but this is just for the demo.

07:06.000 --> 07:08.000
And this is what we're getting.

07:08.000 --> 07:12.000
So although it looks significant here, like this kind of stuff here,

07:12.000 --> 07:16.000
I want you to pay attention to the scales as well.

07:16.000 --> 07:19.000
I'm probably walking straight out of the camera shot here, but anyway.

07:19.000 --> 07:23.000
You can see here like five milliseconds to 20 milliseconds on this.

07:23.000 --> 07:25.000
You can see these bars over here.

07:25.000 --> 07:26.000
This is all relative.

07:26.000 --> 07:28.000
But this is all within margin of error.

07:28.000 --> 07:31.000
And so I'm just kind of showing this to you as an example of what we're doing.

07:31.000 --> 07:33.000
We can see the manual instrumentation at the top.

07:33.000 --> 07:34.000
This is average requests per second.

07:34.000 --> 07:36.000
This is the average P99.

07:36.000 --> 07:38.000
And then you've got your P99 and your requests.

07:38.000 --> 07:40.000
Errors don't come into it actually all that much.

07:40.000 --> 07:41.000
So I didn't really need that.

07:41.000 --> 07:43.000
That was probably a waste of space.

07:43.000 --> 07:49.000
And as you can see, I'm not really seeing any difference between auto instrumentation.

07:49.000 --> 07:51.000
Oh, manual instrumentation.

07:51.000 --> 07:54.000
I asked Marnia who's the software person I'm working with.

07:54.000 --> 07:57.000
It's a manual instrument this Java app that I wrote as well.

07:57.000 --> 08:00.000
So we've got no instrumentation, auto instrumentation,

08:00.000 --> 08:02.000
and manual instrumentation.
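
For contrast with the agent approach, manual instrumentation means creating spans yourself through the OpenTelemetry API — a sketch of the idea, not Marnia's actual code, assuming the opentelemetry-api dependency is on the classpath:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class HelloHandler {

    // Tracer scoped to this app; the name is just an identifier.
    private static final Tracer tracer =
            GlobalOpenTelemetry.getTracer("hello-app");

    public String hello() {
        // Open a span around the work and make it current, so any
        // nested calls attach their spans underneath it.
        Span span = tracer.spanBuilder("GET /hello").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            return "hello";
        } finally {
            span.end(); // always end the span, even on exceptions
        }
    }
}
```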

08:02.000 --> 08:05.000
I also installed nothing new on the back end.

08:05.000 --> 08:07.000
So we can see one second sleeps.

08:07.000 --> 08:11.000
If you can't see these properly, I will upload them as well, don't worry.

08:12.000 --> 08:14.000
So everything hovers around the same place though,

08:14.000 --> 08:15.000
except for memory.

08:15.000 --> 08:21.000
This is what you're seeing here is this is the weight of auto instrumentation on the RAM.

08:21.000 --> 08:23.000
Of course, we're loading this agent in.

08:23.000 --> 08:25.000
But these are, these are on AWS.

08:25.000 --> 08:27.000
These are T3A micro.

08:27.000 --> 08:31.000
So these are two cores with one gigabyte of RAM.

08:31.000 --> 08:33.000
So, you know, there's not, again,

08:33.000 --> 08:35.000
relatively not that much going on here.

08:35.000 --> 08:38.000
And of course, you can see over in the network transmit there.

08:38.000 --> 08:41.000
We're seeing more obviously because we're sending out the spans.

08:41.000 --> 08:43.000
But pay attention: these are kilobits.

08:43.000 --> 08:47.000
So these are tiny changes.

08:47.000 --> 08:51.000
So essentially this is boring and this was just me using it to show you

08:51.000 --> 08:53.000
like what we're doing today.

08:53.000 --> 08:57.000
So we're going to remove the sleep now and we're just going to see what happens.

08:57.000 --> 09:00.000
So this is just your simple spring boot.

09:00.000 --> 09:02.000
Hello world application.

09:02.000 --> 09:07.000
And as you can see, we've got something really weird going on.

09:07.000 --> 09:11.000
So manual instrumentation is, this is not margin of error.

09:11.000 --> 09:13.000
Manual instrumentation is performing

09:13.000 --> 09:17.000
a lot worse than auto instrumentation and none.

09:17.000 --> 09:21.000
So auto instrumentation and no instrumentation seem to be performing the same.

09:21.000 --> 09:27.000
But we seem to be getting this ridiculous performance impact

09:27.000 --> 09:32.000
from manual instrumentation, manually instrumenting my little application.

09:32.000 --> 09:35.000
Let's pin that before you all crucify me.

09:35.000 --> 09:37.000
We will talk about that in a second.

09:37.000 --> 09:40.000
And again, looking at the instance stats, we're not seeing all that much going on.

09:40.000 --> 09:45.000
Again, weird manual instrumentation is using a whole lot more memory.

09:45.000 --> 09:50.000
Otherwise, this is just a story of a busy server.

09:50.000 --> 09:52.000
Let's pin that.

09:52.000 --> 09:57.000
Let's go and have a look at another language and see if we can get our bearings on this.

09:57.000 --> 10:01.000
Go auto instrumentation is also pretty easy.

10:02.000 --> 10:04.000
Technically, of course, it shouldn't be possible.

10:04.000 --> 10:08.000
The way that auto instrumentation works is it tends to attach to a virtual machine

10:08.000 --> 10:11.000
or an interpreter and it adds stuff.

10:11.000 --> 10:14.000
And how do you add stuff to a completely compiled application?

10:14.000 --> 10:15.000
Well, you use eBPF.

10:15.000 --> 10:17.000
eBPF is really cool.

10:17.000 --> 10:20.000
I'm not really going to talk about that too much here.

10:20.000 --> 10:23.000
But it's very much a work in progress at the moment.

10:23.000 --> 10:26.000
I think they introduced a beta release.

10:26.000 --> 10:27.000
Is that true?

10:27.000 --> 10:29.000
Yeah, a couple of days ago.

10:29.000 --> 10:32.000
So do look at Go auto instrumentation.

10:32.000 --> 10:34.000
Again, you compile it.

10:34.000 --> 10:36.000
You do have to compile it currently.

10:36.000 --> 10:38.000
Upload it to your server.

10:38.000 --> 10:41.000
The way that it works is that it runs as a separate application.

10:41.000 --> 10:45.000
And you point it at the application binary.

10:45.000 --> 10:47.000
Which is different.
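
Running it looks roughly like this — paths and endpoints are placeholders, and since this is still moving quickly, check the opentelemetry-go-instrumentation README for the current invocation:

```shell
# Build the instrumentation binary from source (currently required).
git clone https://github.com/open-telemetry/opentelemetry-go-instrumentation
cd opentelemetry-go-instrumentation
make build

# Run it next to the target process, pointing it at the compiled Go
# binary to instrument. It attaches eBPF probes, so it needs privileges.
sudo OTEL_GO_AUTO_TARGET_EXE=/usr/local/bin/myapp \
     OTEL_SERVICE_NAME=myapp \
     OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.internal:4318 \
     ./otel-go-instrumentation
```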

10:47.000 --> 10:50.000
So again, a Gin application.

10:50.000 --> 10:55.000
I asked Marnia to write this one, because I don't know Go.

10:55.000 --> 10:57.000
We'll see how this performs.

10:57.000 --> 10:59.000
So again, sleep one second.

10:59.000 --> 11:00.000
We're not really seeing much.

11:00.000 --> 11:01.000
These are the reasonable gaps.

11:01.000 --> 11:03.000
This is just margin of error.

11:03.000 --> 11:06.000
And with sleep one second, on resource use:

11:06.000 --> 11:11.000
Again, we're not seeing crazy amounts of differences between it.

11:11.000 --> 11:13.000
So let's go straight to no sleep.

11:13.000 --> 11:15.000
That's the interesting bit.

11:15.000 --> 11:20.000
And this is more like what I would expect to see than what we saw

11:20.000 --> 11:22.000
with the manual instrumentation —

11:22.000 --> 11:25.000
that we would see actual differences.

11:25.000 --> 11:27.000
Sometimes significant, sometimes not.

11:27.000 --> 11:31.000
Again, this isn't crazy significant.

11:31.000 --> 11:35.000
In fact, it's the auto instrumentation that's performing the worst

11:35.000 --> 11:36.000
here.

11:36.000 --> 11:39.000
We're looking at five milliseconds versus ten milliseconds.

11:39.000 --> 11:43.000
Those aren't crazy big gaps.

11:43.000 --> 11:50.000
So yeah, we're seeing a significant drop-off in requests per second as well.

11:50.000 --> 11:53.000
And again, we're seeing server busy.

11:53.000 --> 11:59.000
We are again seeing auto instrumentation using more memory than manual instrumentation.

11:59.000 --> 12:02.000
But these are all relatively small differences.

12:02.000 --> 12:08.000
Okay, that was a lot of me taking screenshots of Grafana and passing them off as slides.

12:08.000 --> 12:13.000
So let's go and look at some lessons that we can learn so far from the talk.

12:13.000 --> 12:18.000
So we definitely, 100% know that they were right to suspect that code,

12:18.000 --> 12:24.000
upon being executed, does use resources.

12:24.000 --> 12:28.000
We also know that manual instrumentation uses resources.

12:28.000 --> 12:31.000
Okay, I'll get to the serious insights.

12:31.000 --> 12:37.000
One thing I didn't tell you is that Marnia, being a great developer, is not a Java developer.

12:37.000 --> 12:41.000
Another thing I didn't tell you is that he did this incredibly quickly.

12:41.000 --> 12:47.000
So we basically just got this application working and we just left it there and we started doing testing.

12:48.000 --> 12:50.000
Marnia is a Go developer.

12:50.000 --> 12:54.000
So the manual instrumentation of our little go application,

12:54.000 --> 12:58.000
although implemented just as quickly, was dripping with her previous experience.

12:58.000 --> 13:01.000
Like, you know, she knew how to write Go applications.

13:01.000 --> 13:04.000
And so this is the old adage, right?

13:04.000 --> 13:08.000
It's like how assembly can be more performant than Python.

13:08.000 --> 13:14.000
And that's true, but can your assembly be more performant than your Python?

13:14.000 --> 13:19.000
We did actually go back and rewrite the Java code.

13:19.000 --> 13:26.000
And I did this mostly because I didn't want anyone that was working on that project to think that I'd misrepresented them.

13:26.000 --> 13:31.000
And as you can see, once we worked out what was going wrong, we reran the tests,

13:31.000 --> 13:36.000
and now auto instrumentation is performing about the same as manual instrumentation.

13:36.000 --> 13:43.000
But it was a great example of how manually instrumenting an application can catch you out,

13:43.000 --> 13:45.000
just like with any software development.

13:45.000 --> 13:47.000
These are the instance stats, for posterity.

13:47.000 --> 13:50.000
Again, I'll put this in the slides that I'll upload in a bit.

13:50.000 --> 13:54.000
And you can see how now manual instrumentation is slotting in right between auto and none,

13:54.000 --> 13:57.000
which is the kind of thing we'd expect.

13:57.000 --> 14:02.000
So auto instrumentation by comparison is much more consistent.

14:02.000 --> 14:09.000
You definitely can squeeze more performance out of manually instrumenting an application.

14:09.000 --> 14:18.000
But honestly, even if you can, at the expense of maybe getting caught out and introducing bugs and stuff like that, it might not matter.

14:18.000 --> 14:25.000
What we might have built there was possibly one of the simplest applications that ever served an HTTP request.

14:25.000 --> 14:33.000
I'd wager that more resources were spent on creating and shipping a span for each request than actually serving the request itself.

14:33.000 --> 14:36.000
So what we've explored here is certainly interesting.

14:36.000 --> 14:41.000
It's given us some basic orientation on the performance of auto instrumentation.

14:41.000 --> 14:48.000
But I think it's time for us to start looking at the performance of real applications.

14:48.000 --> 14:52.000
I did say real applications, but I'm going to start with PetClinic.

14:52.000 --> 14:54.000
PetClinic is a... you've heard of PetClinic.

14:54.000 --> 14:59.000
I heard a couple of giggles, so giggles, laughs, manly laughs.

14:59.000 --> 15:02.000
PetClinic is an example application written in Spring.

15:03.000 --> 15:07.000
It's a bit clunky, a bit slow, it has cute pets.

15:07.000 --> 15:11.000
I mean, it's basically a sample application that allows you to learn the tooling, right?

15:11.000 --> 15:14.000
And it's kind of a pretend CRM system for vets.

15:14.000 --> 15:16.000
So we're going to instrument this.

15:16.000 --> 15:18.000
We're going to script some interactions this time.

15:18.000 --> 15:20.000
I'm not just going to do the one thing.

15:20.000 --> 15:23.000
I think I did like 17 different requests across the system.

15:23.000 --> 15:26.000
Searching, browsing, and form submissions.

15:26.000 --> 15:28.000
We'll see how we do.

15:29.000 --> 15:33.000
Now, the first thing I'll note is that I thought it was a fairly lightweight test.

15:33.000 --> 15:36.000
But again, we've got this very simple server,

15:36.000 --> 15:40.000
and we've got this Java application that can't really handle that much.

15:40.000 --> 15:42.000
It's not designed to be performant.

15:42.000 --> 15:46.000
And 20 virtual users working through 17 different interactions.

15:46.000 --> 15:49.000
It struggled, as you can see.

15:49.000 --> 15:54.000
Error bars actually started coming up here.

15:54.000 --> 15:57.000
So yeah, auto instrumentation.

15:57.000 --> 16:01.000
Is auto instrumentation making a difference to this application?

16:01.000 --> 16:03.000
No.

16:03.000 --> 16:04.000
No.

16:04.000 --> 16:07.000
324 versus 326 requests.

16:07.000 --> 16:11.000
I mean, this is an application that's under a huge amount of load.

16:11.000 --> 16:15.000
But we're not seeing the kind of gaps that we were seeing before.

16:15.000 --> 16:18.000
But maybe that was just because I put it under a huge amount of load.

16:18.000 --> 16:21.000
The reality — and these are, by the way, the instance stats:

16:21.000 --> 16:24.000
The only difference here really is in memory.

16:24.000 --> 16:27.000
It's worth saying that PetClinic uses an in-memory database,

16:27.000 --> 16:30.000
and I only gave it a gigabyte of RAM.

16:30.000 --> 16:34.000
So it didn't quite run out, but it was struggling.

16:34.000 --> 16:37.000
There's no other pressure on this, because let's be fair,

16:37.000 --> 16:42.000
even if this was a real CRM for vets,

16:42.000 --> 16:45.000
you'd have maybe three users making five requests a day.

16:45.000 --> 16:51.000
So this is one virtual user, like, and this is our RED testing.

16:51.000 --> 16:54.000
And even then, like, when we're taking off the pressure,

16:54.000 --> 16:59.000
what you can see here is — and this is what I've seen in reality —

16:59.000 --> 17:03.000
You know, the measurable differences from our basic testing,

17:03.000 --> 17:07.000
from our performance testing, from our, you know, all of that stuff.

17:07.000 --> 17:09.000
They've gone here.

17:09.000 --> 17:15.000
We're actually seeing that the auto instrumentation has a very negligible impact.

17:15.000 --> 17:20.000
And again, on the back end, RAM seems to be the only thing where I ever see

17:20.000 --> 17:23.000
a difference: auto instrumentation uses a little bit more RAM.

17:23.000 --> 17:25.000
Again, that's like 60 megabytes,

17:25.000 --> 17:28.000
out of the gigabyte it's got to play with.

17:28.000 --> 17:31.000
And so this brings us to lesson six.

17:31.000 --> 17:34.000
An application that is actually doing something,

17:34.000 --> 17:37.000
more significant than just serving a simple response on an endpoint,

17:37.000 --> 17:42.000
is probably only going to be negligibly impacted by auto instrumentation.

17:42.000 --> 17:47.000
Or indeed by manual instrumentation that's been done well.

17:48.000 --> 17:51.000
Going to go through a couple more, and then we'll move on.

17:51.000 --> 17:57.000
Let's pick something that's real that's doing less.

17:57.000 --> 18:03.000
Do you know that there's a module you can add to httpd that starts pouring out spans?

18:03.000 --> 18:04.000
It's good.

18:04.000 --> 18:08.000
And, of course, it should be fast — it's a C++ application.

18:08.000 --> 18:09.000
So that's really cool.

18:09.000 --> 18:11.000
I don't know if it's auto instrumentation.

18:11.000 --> 18:14.000
I guess it's just instrumentation.

18:14.000 --> 18:20.000
But one pass at this quickly turned up that I had to turn off the error checks.

18:20.000 --> 18:23.000
It's just going to hit the "It works!" page.

18:23.000 --> 18:26.000
You know, you load up httpd and it says "It works!"?

18:26.000 --> 18:28.000
That's a 403.

18:28.000 --> 18:30.000
Who knew?

18:30.000 --> 18:32.000
And yeah.

18:32.000 --> 18:34.000
Oh.

18:34.000 --> 18:42.000
So this is httpd just responding with the simple, you know, HTML page: "It works!"

18:42.000 --> 18:51.000
And as you can see, we're adding an order of magnitude on top of that in terms of response times.

18:51.000 --> 19:03.000
And we can do three times the throughput without auto instrumentation versus with auto instrumentation.

19:03.000 --> 19:07.000
And at the back end, you know, the disk was much more active.

19:07.000 --> 19:14.000
Interestingly, I think that I could do a lot of tuning here because I think that it was writing a bunch of logs that it didn't need to write.

19:14.000 --> 19:15.000
They weren't errors.

19:15.000 --> 19:17.000
That's just by default what it was putting there.

19:17.000 --> 19:21.000
And this only had eight gigs of disk space.

19:21.000 --> 19:24.000
So I suspect there's tuning possible.

19:24.000 --> 19:29.000
But this kind of represents the fact that you should really be thinking about what your applications are doing.

19:29.000 --> 19:36.000
Because if you've got an httpd instance there that's just meant to be very quickly responding to static requests,

19:36.000 --> 19:40.000
You know, that's an example of something where you might see an impact.

19:40.000 --> 19:44.000
But this is not how people use httpd, is it?

19:44.000 --> 19:47.000
People use httpd to host WordPress.

19:47.000 --> 19:51.000
And then I'll give you with each other.

19:51.000 --> 19:54.000
Let's put WordPress behind this and try again.

19:54.000 --> 19:59.000
You can see here that I ran WordPress with and without.

19:59.000 --> 20:03.000
I could have chosen better labels for this: with and without httpd instrumentation.

20:03.000 --> 20:10.000
And then I instrumented WordPress itself with PHP based auto instrumentation.

20:10.000 --> 20:14.000
The rare double auto.

20:14.000 --> 20:25.000
And as you can see here, the big gap that we saw between httpd being auto instrumented and not auto instrumented —

20:25.000 --> 20:27.000
That basically disappears here.

20:27.000 --> 20:32.000
And actually it's the PHP-based auto instrumentation that you start to see having an impact.

20:32.000 --> 20:35.000
Not significant.

20:35.000 --> 20:39.000
I'd say that I'd still be happy with that personally because I think that it's worth it.

20:39.000 --> 20:42.000
We'll talk about that in a second.

20:42.000 --> 20:44.000
I'm not entirely sure what's happening here.

20:44.000 --> 20:53.000
But I will note that for each request to load the sample page on WordPress, it created 35 spans.

20:53.000 --> 20:58.000
Twenty-two of those spans were database queries.

20:58.000 --> 21:00.000
This is for this single sample page.

21:00.000 --> 21:03.000
I don't understand WordPress.

21:03.000 --> 21:07.000
But yeah, otherwise everything looks pretty much within margin of error.

21:07.000 --> 21:12.000
And honestly, that's not, I've done WordPress hosting.

21:12.000 --> 21:14.000
And that's not what people worry about, right?

21:14.000 --> 21:19.000
When you're doing WordPress hosting, you're thinking about your caching efficiency.

21:19.000 --> 21:23.000
So again, you see that performance impact and you go, oh no, big performance impact.

21:23.000 --> 21:25.000
And then you think about your application.

21:25.000 --> 21:30.000
Well, actually most of your WordPress performance — yep, ten minutes, oh, I'll speed up.

21:30.000 --> 21:37.000
Most of your WordPress performance comes from caching; you're looking at caching efficiency for your performance impact.

21:37.000 --> 21:43.000
But it does underline the need for good performance testing, which you should be doing anyway.

21:43.000 --> 21:49.000
But when someone asks me, what will the performance impact of auto instrumentation be?

21:49.000 --> 21:53.000
My answer is — my answer is: well, test it.

21:53.000 --> 21:56.000
Because it does depend on your application.

21:56.000 --> 22:02.000
And most of this auto instrumentation stuff has been pretty easy to install.

22:02.000 --> 22:08.000
So the harder ones were the httpd module — I had to spend a bit of time on that —

22:08.000 --> 22:12.000
and WordPress, because WordPress doesn't install with Composer,

22:12.000 --> 22:17.000
and the auto instrumentation installs with Composer. But you'll work it out.
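
For anyone attempting the same, the rough installation shape is below — the package names are from memory of the PHP contrib repo, so treat them as assumptions and verify against the current opentelemetry-php docs:

```shell
# The PHP auto instrumentation needs the C extension...
pecl install opentelemetry
echo "extension=opentelemetry" >> /etc/php.ini   # ini path varies by distro

# ...plus Composer packages. WordPress isn't Composer-managed, so the
# composer.json lives alongside it and the autoloader gets included
# (e.g. via auto_prepend_file).
composer require open-telemetry/sdk \
                 open-telemetry/exporter-otlp \
                 open-telemetry/opentelemetry-auto-wordpress
```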

22:17.000 --> 22:20.000
But yeah, the little bits do add up.

22:20.000 --> 22:24.000
So yeah, get into performance testing of applications.

22:24.000 --> 22:28.000
And now is a great time to start.

22:28.000 --> 22:36.000
Okay, so far I've been relying on you, dear audience, to believe me when I say, hey, all of this is worth it.

22:36.000 --> 22:40.000
I would like to dwell a little bit on why it's worth it.

22:40.000 --> 22:44.000
Firstly, looking at applications in terms of traces.

22:44.000 --> 22:49.000
Even if you have only the most perfunctory spans, and most of your application is just custom code,

22:49.000 --> 22:51.000
You're still going to get the request span.

22:51.000 --> 22:53.000
You're still going to get things like the database calls.

22:53.000 --> 22:57.000
This is actually an example that we had from a real application with our client.

22:57.000 --> 22:59.000
This isn't a real trace.

22:59.000 --> 23:01.000
I drew this up — Tempo doesn't output this.

23:01.000 --> 23:04.000
And this is a representation of the problem that we had.

23:04.000 --> 23:07.000
We had an application that had a huge amount of custom logic.

23:07.000 --> 23:10.000
And we were basically not seeing anything, even with auto instrumentation.

23:10.000 --> 23:17.000
But one of the things that we were being told is that the application we were running — that they had developed — was slow

23:17.000 --> 23:20.000
because our database wasn't performing as well.

23:20.000 --> 23:24.000
We were hosting the database and our database wasn't performing well enough.

23:24.000 --> 23:30.000
And of course databases never perform perfectly.

23:30.000 --> 23:34.000
So of course the database engineers went off and they tried improving the response times in the database.

23:34.000 --> 23:37.000
And they could see that there were some slow queries and things like that.

23:37.000 --> 23:43.000
But from the traces we could see that when we got very long requests that were causing us problems,

23:43.000 --> 23:46.000
we were getting very short database queries.

23:46.000 --> 23:50.000
So we were able to flip the script on that problem and be able to say:

23:50.000 --> 23:58.000
Hey, yeah, we know that there are slow queries, but those slow queries are not connected to the slow requests in your application.

23:58.000 --> 24:02.000
And this was because it was a Java application, so we could just auto instrument it.

24:02.000 --> 24:06.000
It was incredibly easy.

24:06.000 --> 24:08.000
The other thing is automated metrics.

24:08.000 --> 24:16.000
A lot of applications, especially older ones, aren't outputting the metrics that you need to be able to do proper RED dashboards.

24:16.000 --> 24:19.000
And even if they do, they might be wrong.

24:19.000 --> 24:28.000
With auto instrumentation, you can pull all of this out, and we could create dashboards that upper management had been asking for, maybe for years.

24:28.000 --> 24:33.000
And they were not being put ahead of, you know, feature requests and stuff.

24:33.000 --> 24:38.000
It was very easy with tracing to go and generate metrics from that.
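
As an illustration of what generating metrics from traces can look like (RED here meaning Rate, Errors, Duration), here is a minimal Python sketch; the `Span` shape and the `red_metrics` helper are hypothetical stand-ins for this talk, not Tempo's or OpenTelemetry's actual data model:

```python
from dataclasses import dataclass

# Hypothetical minimal span shape; real spans carry far more
# (trace id, parent, attributes) and come from your tracing backend.
@dataclass
class Span:
    service: str
    duration_ms: float
    is_error: bool

def red_metrics(spans, window_seconds=60):
    """Derive Rate, Errors, Duration per service from raw spans."""
    per_service = {}
    for s in spans:
        m = per_service.setdefault(s.service, {"count": 0, "errors": 0, "durations": []})
        m["count"] += 1
        m["errors"] += 1 if s.is_error else 0
        m["durations"].append(s.duration_ms)
    return {
        svc: {
            "rate_per_s": m["count"] / window_seconds,       # Rate
            "error_ratio": m["errors"] / m["count"],         # Errors
            "p50_ms": sorted(m["durations"])[len(m["durations"]) // 2],  # Duration
        }
        for svc, m in per_service.items()
    }
```

In practice this aggregation is usually done for you, for example by a collector-side processor; the point is that every RED number on a dashboard is derivable from spans you already have.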

24:38.000 --> 24:41.000
There's also things like the service graph, I'll get through this very quickly.

24:41.000 --> 24:47.000
You've seen this in other things, but I never really understood the value of this because I was just like,

24:47.000 --> 24:51.000
Well, it shows you how your application works, but you know that, right?

24:51.000 --> 24:58.000
But of course, a lot of enterprises go back to maybe the 60s, 70s, 80s, 90s, you know,

24:58.000 --> 25:02.000
and they've got stuff that was developed maybe 10, 20 years ago.

25:02.000 --> 25:05.000
And only very few people actually understand what's going on.

25:05.000 --> 25:10.000
And to keep understanding what's going on, they have to work very broadly instead of being able to go deep on stuff.

25:10.000 --> 25:17.000
Traces just give you this map of everything and correct everyone's understanding really, really quickly.

25:17.000 --> 25:22.000
And even when you've got applications that are not instrumented,

25:22.000 --> 25:26.000
where you can't auto instrument and you're not able to go and manual instrument,

25:26.000 --> 25:30.000
there was somebody in the last year that was doing COBOL.

25:30.000 --> 25:34.000
Does anyone here do COBOL this year?

25:34.000 --> 25:36.000
Yeah.

25:36.000 --> 25:38.000
You're pointing, yes, someone else?

25:38.000 --> 25:42.000
You're grassing on them, you're grassing on them for using COBOL.

25:42.000 --> 25:46.000
And so yeah, like, you know, there's no instrumentation library for COBOL.

25:46.000 --> 25:52.000
So far as I know, but you can inject the context for the trace into the application,

25:52.000 --> 25:56.000
which means that you can go, oh, it went off to this application.

25:56.000 --> 25:57.000
I can't see traces for that.

25:57.000 --> 26:01.000
But I can click a button and see all the log lines that match it.

26:01.000 --> 26:05.000
God, that's useful.
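
The trick described here, propagating trace context into a system you can't instrument and then matching its log lines, rests on the W3C `traceparent` header. A rough Python sketch of the idea, where `log_line` is a hypothetical helper rather than any real library's API:

```python
import re

# W3C Trace Context "traceparent": version-traceid-spanid-flags,
# e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Extract the trace id from an incoming traceparent header, or None."""
    m = TRACEPARENT_RE.match(header or "")
    return m.group("trace_id") if m else None

def log_line(message, traceparent=None):
    """Format a log line; tag it with trace_id when context was propagated."""
    trace_id = parse_traceparent(traceparent)
    prefix = f"trace_id={trace_id} " if trace_id else ""
    return prefix + message
```

Once the uninstrumented application stamps its logs with the propagated trace id like this, "click a button and see all the log lines that match" is just a query on that id.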

26:05.000 --> 26:12.000
And as we've seen from our experiments, the performance impact here is not really apparent at the scale of what most applications are doing.

26:12.000 --> 26:14.000
You should be doing performance testing.

26:14.000 --> 26:19.000
But if you don't have any of the things that I just listed across your whole estate,

26:19.000 --> 26:21.000
you should be leaning towards auto instrumentation.

26:21.000 --> 26:23.000
You should be leaning towards picking up these libraries,

26:23.000 --> 26:26.000
testing them on your systems, implementing them and getting them out there,

26:26.000 --> 26:30.000
because there's not much of a reason not to, it's free observability.

26:30.000 --> 26:33.000
And it's so useful.

26:33.000 --> 26:36.000
I do want to talk about manual instrumentation for a moment,

26:36.000 --> 26:38.000
because you might be thinking that I'm telling you here,

26:38.000 --> 26:41.000
don't do manual instrumentation, only do auto instrumentation.

26:41.000 --> 26:45.000
It is true that manually instrumenting code is a ton of work.

26:46.000 --> 26:49.000
This, for example, is a simple application, and all the bits in yellow

26:49.000 --> 26:52.000
are the bits that you need to add to instrument your application manually.

26:52.000 --> 26:54.000
Of course, this is unfair as an example,

26:54.000 --> 26:57.000
because the more complexity you have in the application,

26:57.000 --> 27:01.000
the more of your code is application logic versus instrumentation.

27:01.000 --> 27:04.000
So the balance would change.

27:04.000 --> 27:06.000
But this is still a ton of work,

27:06.000 --> 27:09.000
especially on instrumenting existing applications.

27:09.000 --> 27:11.000
Doing it for new code is really, really easy.

27:11.000 --> 27:12.000
You should just do it.

27:12.000 --> 27:13.000
Just start doing it.

27:14.000 --> 27:17.000
The other problem is that humans, of course, create bugs.

27:17.000 --> 27:20.000
They also fix bugs.

27:20.000 --> 27:22.000
They also create bugs.

27:22.000 --> 27:24.000
Auto instrumentation has already been written.

27:24.000 --> 27:26.000
It's been tested in a lot of places already,

27:26.000 --> 27:28.000
and the bugs have been worked out.

27:28.000 --> 27:31.000
To varying degrees, some are further along than others.

27:31.000 --> 27:33.000
And of course, because it's open source,

27:33.000 --> 27:36.000
you might find it easier to fix the few problems that you're having

27:36.000 --> 27:39.000
in your auto instrumentation than you would

27:40.000 --> 27:43.000
in debugging your application and getting it working

27:43.000 --> 27:47.000
with manual instrumentation, which, of course, is the beauty of open source.

27:47.000 --> 27:49.000
And the other thing is that open source,

27:49.000 --> 27:53.000
auto instrumentation, you hate this slide, don't you?

27:53.000 --> 27:57.000
I've not found a better way of explaining it than this, frankly.

27:57.000 --> 28:00.000
The reality is auto instrumentation is awesome.

28:00.000 --> 28:02.000
It's fudging awesome.

28:02.000 --> 28:04.000
But it's also reasonably dumb.

28:04.000 --> 28:07.000
It's essentially doing pattern matching under the hood.

28:07.000 --> 28:10.000
It's finding things and decorating them and stuff.
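
A toy illustration of that "finding things and decorating them" idea: a hypothetical `instrument` helper that wraps a well-known method by name and records a span-like dict, which is roughly the pattern-matching trick agents perform at load time (real agents rewrite bytecode or patch modules far more carefully than this):

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter

def instrument(obj, name):
    """Replace obj.<name> with a wrapper that records a span per call."""
    original = getattr(obj, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Record the span even if the call raised.
            RECORDED_SPANS.append({
                "name": f"{obj.__name__}.{name}",
                "duration_s": time.perf_counter() - start,
            })

    setattr(obj, name, wrapper)
```

The wrapper only sees what the patched entry point sees, which is exactly why auto instrumentation knows about well-known libraries but not about your custom logic.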

28:10.000 --> 28:12.000
There are people who are going to come up to me afterwards

28:12.000 --> 28:16.000
and be like, that's not correct. Please do.

28:16.000 --> 28:20.000
But it's guessing which attributes should and shouldn't be added,

28:20.000 --> 28:24.000
and of course, that's not going to match your application.

28:24.000 --> 28:28.000
And so how well it works is going to depend on whether you're mostly

28:28.000 --> 28:31.000
writing custom logic and custom code or whether you're mostly

28:31.000 --> 28:35.000
leveraging applications that have already been auto instrumented.

28:35.000 --> 28:39.000
So it's kind of like the bus that you're going to get almost to your destination

28:39.000 --> 28:44.000
while you're still working on that custom rocket engine that's in your garage.

28:44.000 --> 28:47.000
The rocket car isn't ready yet,

28:47.000 --> 28:50.000
and the walk from the bus stop isn't that far.

28:50.000 --> 28:54.000
That sounded so much better when I wrote it down.

28:54.000 --> 28:59.000
And with auto instrumentation, you can start your observability journey now,

28:59.000 --> 29:00.000
instead of waiting.

29:00.000 --> 29:03.000
Remember, observability isn't tracing.

29:03.000 --> 29:06.000
Observability is a different way of working.

29:06.000 --> 29:09.000
Instead of your telemetry being something that you design

29:09.000 --> 29:12.000
to answer known questions,

29:12.000 --> 29:16.000
and then implement

29:16.000 --> 29:20.000
as part of your software development,

29:20.000 --> 29:25.000
telemetry becomes something that you implement to expose the internal state of the system.

29:25.000 --> 29:30.000
And then you can start asking questions of it outside of that loop.

29:30.000 --> 29:33.000
You really want to get to this for two reasons.

29:33.000 --> 29:38.000
One, it's going to be a bionic arm for your troubleshooting.

29:38.000 --> 29:42.000
But secondly, it means that everyone else has got this capability as well.

29:42.000 --> 29:44.000
They don't have to ask you what's going on in your application.

29:44.000 --> 29:47.000
They don't have to ask you to implement that new counter.

29:47.000 --> 29:49.000
You can just say, no, no, no, no.

29:49.000 --> 29:50.000
The traces are there.

29:50.000 --> 29:53.000
We've made sure that the attributes are all there,

29:53.000 --> 29:57.000
and you know, just go and pick it out.

29:58.000 --> 30:01.000
Right, have I got time for miscellaneous?

30:01.000 --> 30:03.000
Okay, really, really quick.

30:03.000 --> 30:05.000
Okay, firstly, AWS Lambda.

30:05.000 --> 30:08.000
AWS Lambda works.

30:08.000 --> 30:11.000
So, if you're using that, the only thing is that by default

30:11.000 --> 30:14.000
the AWS distribution of OpenTelemetry

30:14.000 --> 30:17.000
includes the collector in the Lambda layer.

30:17.000 --> 30:20.000
If anyone works on Lambdas, you'll know that one of the things you want to do

30:20.000 --> 30:22.000
is you want them to start up and complete very, very quickly

30:22.000 --> 30:24.000
and having a whole collector start and then stop again

30:24.000 --> 30:26.000
adds a huge amount of weight to that.

30:26.000 --> 30:28.000
I would recommend adding an external collector

30:28.000 --> 30:30.000
and having the Lambda send to that external collector instead.

30:30.000 --> 30:33.000
Your mileage may vary, but this is something that's really important

30:33.000 --> 30:36.000
when it comes to performance of auto instrumentation.

30:36.000 --> 30:39.000
The other thing is Grafana Beyla.

30:39.000 --> 30:41.000
Well, I'd say generally eBPF.

30:41.000 --> 30:44.000
eBPF is awesome. As we could see with Go,

30:44.000 --> 30:45.000
it was your only option.

30:45.000 --> 30:47.000
With C, C++ and Rust,

30:47.000 --> 30:50.000
It's going to be your only option for getting traces out.

30:50.000 --> 30:54.000
But it gets you less unless you configure it.

30:54.000 --> 30:58.000
It might change as we go on, but at the moment, if you use eBPF-based

30:58.000 --> 31:01.000
auto instrumentation, it will give you less in the way of spans.

31:01.000 --> 31:03.000
It will give you less context.

31:03.000 --> 31:06.000
Someone will come up again afterwards and correct me on that

31:06.000 --> 31:07.000
and I'd love to see it.

31:07.000 --> 31:10.000
But my experience is that it's giving you less,

31:10.000 --> 31:12.000
and also it's about as performant, maybe a little bit less

31:12.000 --> 31:14.000
sometimes, maybe a little bit more.

31:14.000 --> 31:20.000
So my approach is still to use traditional auto instrumentation

31:20.000 --> 31:25.000
where you can use it and to use eBPF where you can't.

31:25.000 --> 31:29.000
Again, I've got the, this is just for posterity.

31:29.000 --> 31:31.000
You'll see this afterwards.

31:31.000 --> 31:33.000
So yeah, these are our lessons.

31:33.000 --> 31:36.000
You should probably be using auto instrumentation.

31:36.000 --> 31:38.000
Time is up.

31:38.000 --> 31:41.000
You should probably be using auto instrumentation.

31:41.000 --> 31:54.000
So just to say at the end there, we are looking for people

31:54.000 --> 31:59.000
with community-led projects that want to leverage this kind of stuff.

31:59.000 --> 32:02.000
We benefited a huge amount from this, so we want to give back.

32:02.000 --> 32:05.000
So if you're running a community-led project, an open source project,

32:05.000 --> 32:08.000
come talk to us, we'd be really happy to help out because it helps us out

32:08.000 --> 32:10.000
with some experience as well.

32:10.000 --> 32:14.000
So maybe we can give one minute if somebody has a question.

32:18.000 --> 32:20.000
Are there any questions that I can perfectly answer?

32:20.000 --> 32:21.000
There you go.

32:28.000 --> 32:30.000
Okay. Thank you for the presentation.

32:30.000 --> 32:35.000
So in my experience, auto instrumentation tends to create a huge volume

32:35.000 --> 32:43.000
of traces, metrics, logs, which in the end cause costs for the storage of those

32:43.000 --> 32:49.000
that are sometimes one order of magnitude higher than the cost of running the application itself.

32:49.000 --> 32:56.000
So how do you handle that, like, what are the strategies to reduce the costs?

32:56.000 --> 32:59.000
Okay, strategies for reducing cost.

32:59.000 --> 33:03.000
How much do I want people to hate me?

33:03.000 --> 33:07.000
Okay, don't use Datadog.

33:07.000 --> 33:12.000
So what we're using at the moment is we're using Grafana Tempo, not an ad,

33:12.000 --> 33:21.000
but it sends it straight to S3, so you get this ability to do all the things you can do with S3

33:21.000 --> 33:24.000
to store this stuff and you can use any other back end as well.

33:24.000 --> 33:27.000
Storage of spans doesn't have to be expensive.

33:27.000 --> 33:30.000
I think that's just a choice that the industry made.

33:30.000 --> 33:34.000
But the other thing is, yeah, once you've gotten to the case where you're spending

33:34.000 --> 33:37.000
we're not, we're just not spending that much.

33:37.000 --> 33:41.000
I think, you know, I'm not going to quote a number, but we're spending like, like,

33:41.000 --> 33:43.000
nought point nought one X

33:43.000 --> 33:47.000
of what we're spending on running the applications, and we're using auto instrumentation.

33:47.000 --> 33:51.000
So, and all I can tell you is that the way that we've done that, the strategy was

33:51.000 --> 33:58.000
saving money by using a tool like Grafana Tempo, which pushes it to S3,

33:58.000 --> 34:04.000
which gives us those economies of S3 rather than the economies of maybe something like,

34:04.000 --> 34:07.000
you know, Datadog as an example.

34:07.000 --> 34:12.000
And very specifically, I do call out Datadog because many people have come to me and said,

34:12.000 --> 34:16.000
we love this idea, but we don't do it because it's so expensive and I find out that the reason

34:16.000 --> 34:19.000
is that they're using Datadog. If you're listening, Datadog, lower your prices.

34:19.000 --> 34:27.000
Thank you, James.

34:27.000 --> 34:29.000
I think that's it.

