WEBVTT

00:00.000 --> 00:10.000
Thanks for coming to my talk.

00:10.000 --> 00:12.000
Thanks for staying this late.

00:12.000 --> 00:17.000
There's just one talk after mine, so keep going.

00:17.000 --> 00:21.000
This talk is going to be about distributed tracing in server-side Swift.

00:21.000 --> 00:26.000
And as with my previous talk, this also originated from Google

00:26.000 --> 00:27.000
Summer of Code.

00:27.000 --> 00:29.000
But in my case, it was back in 2020.

00:29.000 --> 00:32.000
So you can see me pitching it on the Swift forums.

00:32.000 --> 00:35.000
And that's when I started working on Swift Distributed Tracing

00:35.000 --> 00:38.000
with my mentor Konrad.

00:38.000 --> 00:42.000
And since then, I've been a maintainer of the Swift

00:42.000 --> 00:43.000
Distributed Tracing library.

00:43.000 --> 00:46.000
And I also created a couple of spinoff projects,

00:46.000 --> 00:50.000
including an OpenTelemetry tracer.

00:50.000 --> 00:55.000
Our agenda for today: I will give you a quick overview

00:55.000 --> 00:57.000
of Swift observability.

00:57.000 --> 01:01.000
Then we'll take a quick deep dive into the Swift Distributed Tracing

01:01.000 --> 01:03.000
library itself.

01:03.000 --> 01:06.000
Then I'll show you how to make tracing even more

01:06.000 --> 01:09.000
exciting if you pull it together with logging.

01:09.000 --> 01:12.000
And then we'll take a quick look at OpenTelemetry

01:12.000 --> 01:14.000
and Swift OTel.

01:14.000 --> 01:17.000
And then a couple of next steps.

01:17.000 --> 01:22.000
So as Tim mentioned earlier, there are three signals

01:22.000 --> 01:25.000
in observability in Swift.

01:25.000 --> 01:29.000
Currently there's logging, metrics, and then,

01:29.000 --> 01:32.000
last but not least, distributed tracing.

01:32.000 --> 01:36.000
Throughout this talk, I'll be using a fake server

01:36.000 --> 01:38.000
that is a product catalog API.

01:38.000 --> 01:40.000
It's written in server-side Swift, of course,

01:40.000 --> 01:42.000
using the Hummingbird project.

01:42.000 --> 01:45.000
And you can see there are just two endpoints:

01:45.000 --> 01:49.000
one to get all products and another one to get a single product.

01:49.000 --> 01:51.000
Here we can see

01:51.000 --> 01:53.000
this service running in VS Code.

01:53.000 --> 01:57.000
And by default, this is as much observability as we get.

01:57.000 --> 02:01.000
We can see the server started, but after that we're kind of on our own.

02:01.000 --> 02:05.000
So let's gradually add more observability to this.

02:05.000 --> 02:08.000
First up, we have logging.

02:08.000 --> 02:15.000
Logging is a way to record specific points in time.

02:15.000 --> 02:19.000
And you can attach more information to those points

02:19.000 --> 02:21.000
through metadata.

02:21.000 --> 02:25.000
But the downside is it's kind of hard to understand a full request.

02:25.000 --> 02:28.000
So for example, if I fetch the products,

02:28.000 --> 02:33.000
I may log when I start querying the database,

02:33.000 --> 02:37.000
but I may not know how it turned out.
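
NOTE
A minimal swift-log sketch of what's described above; the label and metadata keys are illustrative:
import Logging
// A log statement records one specific point in time, with structured metadata attached.
let logger = Logger(label: "ProductService")
logger.info("Querying products", metadata: ["db.table": "products"])
// Unless we also log the outcome, we never learn how the query turned out:
logger.error("Products query failed", metadata: ["error": "connection refused"])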

02:37.000 --> 02:39.000
So let's add

02:39.000 --> 02:41.000
some additional logging here.

02:41.000 --> 02:45.000
What I'm doing is I use Hummingbird's log requests middleware.

02:45.000 --> 02:48.000
And this already gives me additional logs.

02:48.000 --> 02:52.000
As you can see, for each request I'm getting log statements,

02:52.000 --> 02:57.000
and they include metadata like the path and request method.

02:57.000 --> 03:00.000
We can also add our own custom logging.

03:00.000 --> 03:04.000
For example, here we log if the Postgres query failed.

03:04.000 --> 03:07.000
Which it did in this case, unfortunately,

03:07.000 --> 03:10.000
and we got a log statement for this.

03:10.000 --> 03:13.000
Next up, we're going to add metrics.

03:13.000 --> 03:17.000
In contrast to logs, metrics are aggregated.

03:17.000 --> 03:21.000
We don't look at a specific request anymore,

03:21.000 --> 03:23.000
but rather look at the system as a whole.

03:23.000 --> 03:27.000
And this is great for providing a high-level overview

03:27.000 --> 03:29.000
of how our system is doing.

03:29.000 --> 03:33.000
And we can build stuff like alerts on top of it.

03:33.000 --> 03:36.000
To add metrics to Hummingbird,

03:36.000 --> 03:39.000
I'm just adding the metrics middleware

03:39.000 --> 03:41.000
and behind the scenes I also set up

03:41.000 --> 03:45.000
the Swift Metrics library to export metrics to Prometheus.
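
NOTE
A sketch of what the middleware records, using the swift-metrics API directly; the labels are illustrative, and the Prometheus exporter is configured separately at bootstrap:
import Metrics
// Count requests per endpoint; dimensions become Prometheus labels.
let requests = Counter(label: "http_requests_total", dimensions: [("path", "/products")])
requests.increment()
// Track in-flight requests, like the active-requests panel on the dashboard.
let activeRequests = Gauge(label: "http_active_requests")
activeRequests.record(1)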

03:45.000 --> 03:49.000
So in this case, you can see there's the rate

03:49.000 --> 03:52.000
of active requests per endpoint.

03:52.000 --> 03:55.000
And I just hit the endpoint a bunch of times.

03:55.000 --> 03:57.000
That's where we saw a spike.

03:57.000 --> 04:00.000
And then it went down again.

04:00.000 --> 04:04.000
And now let's get to the actual topic of this talk.

04:04.000 --> 04:06.000
Distributed tracing.

04:06.000 --> 04:11.000
So that's the third signal that we're adding to our product catalog API.

04:11.000 --> 04:14.000
Distributed tracing is again specific to one request.

04:14.000 --> 04:18.000
But as opposed to logging, it kind of gives you an overview

04:18.000 --> 04:22.000
of the entire lifecycle of that request.

04:22.000 --> 04:27.000
So you can see each operation that makes up this request.

04:27.000 --> 04:33.000
You can also see high-level operations. Span A in this case

04:33.000 --> 04:35.000
is a high-level operation.

04:35.000 --> 04:39.000
But then you can also get detailed and see which sub-operations

04:39.000 --> 04:41.000
this is made up of.

04:42.000 --> 04:46.000
And you can also tell how long each operation takes.

04:46.000 --> 04:51.000
So this is great for detecting bottlenecks, for example.

04:51.000 --> 04:55.000
If you see that stuff is running sequentially

04:55.000 --> 04:58.000
instead of concurrently, for example, you may want to shift things

04:58.000 --> 05:01.000
around to make the overall request faster.

05:01.000 --> 05:05.000
To add tracing, you can see the pattern now.

05:05.000 --> 05:08.000
I'm again just adding a middleware, and behind the scenes

05:08.000 --> 05:12.000
I set up tracing to export to a locally running Jaeger.
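
NOTE
A sketch of the middleware pattern being described, with Hummingbird 2-style names written from memory; treat the exact types as assumptions and check the Hummingbird docs:
import Hummingbird
let router = Router()
router.middlewares.add(LogRequestsMiddleware(.info))  // per-request log statements
router.middlewares.add(MetricsMiddleware())           // request counters and timers
router.middlewares.add(TracingMiddleware())           // one span per handled request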

05:12.000 --> 05:16.000
And here is Jaeger in action.

05:16.000 --> 05:18.000
So here we can see a single trace.

05:18.000 --> 05:22.000
That shows that I hit the specific products endpoint,

05:22.000 --> 05:24.000
and if we drill down into it,

05:24.000 --> 05:27.000
we can see the span that was created.

05:27.000 --> 05:32.000
And it again includes metadata like the HTTP method.

05:32.000 --> 05:35.000
We can also create our own spans.

05:35.000 --> 05:40.000
So in this case, we just wrap our Postgres SQL

05:40.000 --> 05:42.000
statements in another span.

05:42.000 --> 05:47.000
And then this span, because it runs inside the request handling operation,

05:47.000 --> 05:51.000
ends up as a child span of the one that Hummingbird created.
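
NOTE
A minimal sketch of wrapping a query in a custom span via swift-distributed-tracing's withSpan; the database helper is hypothetical:
import Tracing
let products = try await withSpan("SELECT products") { _ in
    // Runs inside the request-handling span, so it becomes that span's child.
    try await database.query("SELECT * FROM products")  // hypothetical query helper
}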

05:51.000 --> 05:54.000
So in Jaeger, we can see that we started with the products

05:54.000 --> 05:56.000
endpoint and got a span for that.

05:56.000 --> 06:02.000
And then we switched over to the SQL query span.

06:02.000 --> 06:06.000
We can also add additional information to our spans using attributes.

06:06.000 --> 06:10.000
For example, here I included the SQL statement that we're actually running.

06:10.000 --> 06:12.000
And we can also see this in Jaeger.

06:12.000 --> 06:15.000
For example, here I fetch a product that didn't exist,

06:15.000 --> 06:22.000
and in the SQL statement span, I can also see the SQL query.
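
NOTE
A sketch of attaching attributes to a span; the key follows OpenTelemetry's semantic conventions:
import Tracing
withSpan("SELECT product by id") { span in
    // Attributes are typed key-value pairs that show up on the span in Jaeger.
    span.attributes["db.statement"] = "SELECT * FROM products WHERE id = $1"
    // ... run the query ...
}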

06:22.000 --> 06:28.000
In this case, this was still all running within one service.

06:28.000 --> 06:30.000
So that's not really distributed tracing.

06:30.000 --> 06:33.000
Which is fun, but it could be better.

06:33.000 --> 06:36.000
So here I'm calling out to a different server.

06:36.000 --> 06:41.000
In this case, an OpenFeature-based feature flagging system.

06:41.000 --> 06:47.000
And I'm just checking a feature flag to see if I should fake a failure in my server.

06:47.000 --> 06:52.000
And then if we hit the endpoint again, now stuff is getting interesting.

06:52.000 --> 06:57.000
So here we can see that I still started with my products endpoint.

06:57.000 --> 07:00.000
But then I'm calling out to this thing called Flipt,

07:00.000 --> 07:03.000
which is the feature flagging system that I'm using.

07:03.000 --> 07:09.000
And we can see exactly when and for how long we call out to this service.

07:09.000 --> 07:16.000
To take things even further, this is now integrating the product catalog

07:16.000 --> 07:19.000
into an entire microservices architecture.

07:19.000 --> 07:22.000
So it's not just the product catalog service anymore,

07:22.000 --> 07:25.000
but just one part of this microservices setup.

07:25.000 --> 07:30.000
And in this case, for example, there are a couple of microservices involved.

07:30.000 --> 07:33.000
So we start by putting something in our shopping cart,

07:33.000 --> 07:35.000
which then calls out to different services,

07:35.000 --> 07:37.000
which in turn call out to the product catalog,

07:37.000 --> 07:39.000
which calls out to feature flagging.

07:39.000 --> 07:45.000
And then on this slide, we can clearly see that after all this initial work is done,

07:45.000 --> 07:49.000
we then, from the API, call out to this recommendation service,

07:49.000 --> 07:53.000
for example, to get recommended products based on what we put in the cart.

07:53.000 --> 07:56.000
But if these two took up a bunch of time,

07:56.000 --> 08:01.000
then maybe this could indicate that we should figure out a way to parallelize this instead

08:01.000 --> 08:05.000
to speed up the overall response time.

08:05.000 --> 08:09.000
What we also get is a nice service diagram.

08:09.000 --> 08:13.000
So this clearly shows us the dependencies between our services.

08:13.000 --> 08:19.000
So here we can see who's calling the product catalog.

08:19.000 --> 08:24.000
Now that you saw how distributed tracing works in action,

08:24.000 --> 08:29.000
I'll take a quick deep dive into the Swift Distributed Tracing library itself.

08:29.000 --> 08:33.000
If you've ever worked with Swift Log or Swift Metrics,

08:33.000 --> 08:36.000
this will be pretty familiar to you.

08:36.000 --> 08:38.000
It also provides just the interface,

08:38.000 --> 08:45.000
and then there are implementations that actually implement the tracer that is being used.

08:46.000 --> 08:50.000
Same as logging and metrics, it has three target audiences.

08:50.000 --> 08:54.000
We have library authors, instrumentation authors,

08:54.000 --> 08:57.000
and application developers.

08:57.000 --> 08:59.000
Let's start with the library authors.

08:59.000 --> 09:01.000
These are the ones like Hummingbird, for example,

09:01.000 --> 09:04.000
AsyncHTTPClient, and PostgresNIO.

09:04.000 --> 09:08.000
And their job is to start spans where it makes sense.

09:08.000 --> 09:11.000
So for example, Hummingbird starting a server span.

09:11.000 --> 09:15.000
They don't know or care about the specific tracer that is being used.

09:15.000 --> 09:20.000
They just use this abstraction provided by Swift Distributed Tracing.

09:20.000 --> 09:23.000
Then instrumentation authors, on the other hand,

09:23.000 --> 09:28.000
they care about which specific distributed tracing system is being used in the end.

09:28.000 --> 09:30.000
But they don't start spans themselves.

09:30.000 --> 09:35.000
They just take care of exporting those spans to the system.

09:35.000 --> 09:38.000
And then last but not least, we have application developers.

09:38.000 --> 09:43.000
So they are the ones that choose a specific tracer implementation.

09:43.000 --> 09:46.000
Then they just use libraries that have built-in support,

09:46.000 --> 09:49.000
so that could be adding Hummingbird's middleware, for example.

09:49.000 --> 09:53.000
And then optionally, as you saw, we can also create custom spans,

09:53.000 --> 09:57.000
but that's not required to get distributed tracing.

09:57.000 --> 10:01.000
So let's talk a bit about the Tracer protocol.

10:01.000 --> 10:04.000
And this is what instrumentation authors would implement.

10:05.000 --> 10:08.000
As you can see, it has a couple of requirements. First of all,

10:08.000 --> 10:13.000
it needs to know about the specific type of span that this tracer creates.

10:13.000 --> 10:18.000
And, most importantly, it has the ability to start spans.

10:18.000 --> 10:24.000
And then whatever a span is might be different in each specific implementation of a tracer,

10:24.000 --> 10:28.000
but for library authors or application developers,

10:28.000 --> 10:32.000
this specific type of span doesn't really matter.
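
NOTE
A simplified sketch of the Tracer protocol's shape; the real protocol in swift-distributed-tracing takes more parameters (span kind, start instant, source location) and provides defaults for them:
import ServiceContextModule
import Tracing
protocol SimplifiedTracer {
    // Each tracer ships its own span type; callers only see the Span protocol.
    associatedtype TracerSpan: Span
    func startSpan(_ operationName: String, context: ServiceContext) -> TracerSpan
    func forceFlush()
}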

10:32.000 --> 10:36.000
Speaking of spans, spans can be mutated,

10:36.000 --> 10:38.000
but only until they're finished.

10:38.000 --> 10:42.000
And you have to end them, like finish them,

10:42.000 --> 10:46.000
by calling the end method in order for them to then be exported.

10:46.000 --> 10:50.000
So you start a span, for example, for the postgres query,

10:50.000 --> 10:54.000
and then at some point when the postgres query ends, you also end the span.

10:54.000 --> 10:56.000
So you have a start and an end date,

10:56.000 --> 11:01.000
and at that point, the specific tracer would export it to a tracing backend.
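
NOTE
A minimal sketch of the manual span lifecycle; the span stays mutable until end() is called, at which point the bootstrapped tracer exports it:
import Tracing
let span = startSpan("SELECT products")
defer { span.end() }  // finishing hands the span to the tracer for export
span.attributes["db.system"] = "postgresql"
// ... run the Postgres query ...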

11:02.000 --> 11:08.000
Spans are also uniquely identified, which brings me to the service context.

11:08.000 --> 11:13.000
The service context is basically a type-safe dictionary,

11:13.000 --> 11:15.000
there's nothing fancy in there.

11:15.000 --> 11:20.000
And then the specific tracer implementation puts the IDs it wants to use in there,

11:20.000 --> 11:24.000
so it could be a span ID and a trace ID, for example.

11:24.000 --> 11:28.000
The service context is stored in a task local.

11:28.000 --> 11:33.000
And this is how we get automatic parent-child span relations.

11:33.000 --> 11:37.000
So for example, on this slide, I'm creating a span called parent,

11:37.000 --> 11:42.000
and then inside, I create another span called child,

11:42.000 --> 11:47.000
and this is using task locals under the hood

11:47.000 --> 11:52.000
to establish this connection from parent to child.
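
NOTE
A sketch of the slide's example: nesting withSpan calls links the spans through the task-local ServiceContext:
import Tracing
await withSpan("parent") { _ in
    // The task-local context now carries the parent span's identifiers...
    await withSpan("child") { _ in
        // ...so this span is automatically recorded as a child of "parent".
    }
}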

11:52.000 --> 11:55.000
Again, this works great locally,

11:55.000 --> 11:58.000
but if we actually want to take tracing distributed,

11:58.000 --> 12:03.000
then we have to get the service context across boundaries as well.

12:03.000 --> 12:06.000
So it could be async or process boundaries.

12:06.000 --> 12:10.000
So if we take the example of a client server from earlier,

12:10.000 --> 12:14.000
this means that the client, when it makes a request and creates a span,

12:14.000 --> 12:19.000
it needs to inject these identifiers into the HTTP headers,

12:19.000 --> 12:22.000
and then the server needs to extract those,

12:22.000 --> 12:27.000
and then the server creates a span based on this restored service context

12:27.000 --> 12:29.000
that was propagated.

12:29.000 --> 12:33.000
This propagation is done using the Instrument protocol.

12:33.000 --> 12:37.000
The Instrument protocol is inherited by the Tracer protocol,

12:37.000 --> 12:39.000
so if you're building a tracer yourself,

12:39.000 --> 12:42.000
you also have to conform to these methods.

12:42.000 --> 12:48.000
The instrument itself is agnostic about the thing that carries it,

12:48.000 --> 12:52.000
so in the client-server example that was HTTP headers,

12:52.000 --> 12:56.000
but it could also be different kinds of metadata.

12:56.000 --> 13:04.000
What it does know about is the specific values that it wants to transport.
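
NOTE
A sketch of carrier-specific propagation, assuming a plain [String: String] dictionary as the HTTP header carrier:
import Instrumentation
import ServiceContextModule
struct DictionaryInjector: Injector {
    func inject(_ value: String, forKey key: String, into carrier: inout [String: String]) {
        carrier[key] = value
    }
}
struct DictionaryExtractor: Extractor {
    func extract(key: String, from carrier: [String: String]) -> String? {
        carrier[key]
    }
}
// Client side: write the current context into the outgoing headers.
var headers: [String: String] = [:]
InstrumentationSystem.instrument.inject(ServiceContext.current ?? .topLevel, into: &headers, using: DictionaryInjector())
// Server side: restore the propagated context before starting the server span.
var context = ServiceContext.topLevel
InstrumentationSystem.instrument.extract(headers, into: &context, using: DictionaryExtractor())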

13:04.000 --> 13:09.000
Then let's take things another step further,

13:09.000 --> 13:12.000
and integrate tracing with logging.

13:12.000 --> 13:16.000
So this feature is built on a thing called metadata providers

13:16.000 --> 13:20.000
in Swift Log, and basically what a metadata provider

13:20.000 --> 13:24.000
does is go from void to logging metadata.

13:24.000 --> 13:30.000
In the case of tracing, we use these to transform the task-local service context,

13:30.000 --> 13:34.000
which we just pick up, into specific log metadata.

13:34.000 --> 13:41.000
That allows us to associate log statements with specific spans and cross-reference them.
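
NOTE
A hand-rolled sketch of a Logger.MetadataProvider; the context key is hypothetical, since real tracers ship providers that know their own ID types:
import Logging
import ServiceContextModule
enum ExampleTraceIDKey: ServiceContextKey {
    typealias Value = String
}
let provider = Logger.MetadataProvider {
    // Runs for every log statement: () -> Logger.Metadata.
    guard let traceID = ServiceContext.current?[ExampleTraceIDKey.self] else { return [:] }
    return ["trace-id": "\(traceID)"]
}
LoggingSystem.bootstrap({ label, metadataProvider in
    StreamLogHandler.standardOutput(label: label, metadataProvider: metadataProvider)
}, metadataProvider: provider)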

13:41.000 --> 13:43.000
So here we can see this in action.

13:43.000 --> 13:47.000
So just by choosing my metadata provider,

13:47.000 --> 13:53.000
I get these metadata fields in my logs generated by Hummingbird,

13:53.000 --> 13:57.000
and then I can just clearly see if I search for a trace ID,

13:57.000 --> 14:01.000
which log statements belong to this request.

14:01.000 --> 14:03.000
And then I could also take this trace ID,

14:03.000 --> 14:09.000
take it to Jaeger, and look up the particular trace that it belongs to.

14:11.000 --> 14:17.000
Then let's briefly talk about OpenTelemetry.

14:17.000 --> 14:19.000
It's an open observability standard.

14:19.000 --> 14:23.000
It supports logging, metrics, distributed tracing,

14:23.000 --> 14:25.000
and hopefully some profiling as well.

14:25.000 --> 14:29.000
It has this thing called the OpenTelemetry protocol, OTLP,

14:29.000 --> 14:33.000
and this is how different components

14:33.000 --> 14:37.000
that speak OpenTelemetry's language talk to each other.

14:37.000 --> 14:43.000
And it's supported by a variety of observability tools.

14:43.000 --> 14:45.000
So if you have something that speaks OTel,

14:45.000 --> 14:49.000
you can use pretty much any observability backend you want.

14:49.000 --> 14:53.000
Speaking of, if you do want to do that in Swift,

14:53.000 --> 14:55.000
we have Swift OTel,

14:55.000 --> 14:59.000
and Swift OTel provides these OTLP exporters,

14:59.000 --> 15:03.000
currently for distributed tracing and also for metrics,

15:03.000 --> 15:05.000
logging is in the works.

15:05.000 --> 15:09.000
And it does that by acting as an instrumentation author,

15:09.000 --> 15:13.000
so it implements the Tracer protocol.

15:13.000 --> 15:17.000
So it knows how to start spans in an OpenTelemetry format,

15:17.000 --> 15:21.000
and then it knows how to export those spans

15:21.000 --> 15:25.000
to an OTLP-compatible backend.

15:25.000 --> 15:29.000
This is how we can bootstrap Swift OTel.

15:29.000 --> 15:31.000
So here we can see,

15:31.000 --> 15:35.000
that we create an OTLP gRPC span exporter,

15:35.000 --> 15:43.000
which means that we send these spans via OTLP over gRPC

15:43.000 --> 15:45.000
to the observability backend.

15:45.000 --> 15:49.000
We have a bunch of other stuff in this setup,

15:49.000 --> 15:51.000
like batching the spans to not export them

15:51.000 --> 15:53.000
as soon as they end,

15:53.000 --> 15:57.000
but to first create a batch and then export that batch,

15:57.000 --> 15:59.000
and then we just create our OTel tracer,

15:59.000 --> 16:01.000
and then this last line is very familiar

16:01.000 --> 16:03.000
if you've worked with Swift Log or Swift Metrics:

16:03.000 --> 16:07.000
that's where we just bootstrap the system,

16:07.000 --> 16:09.000
and at that point,

16:09.000 --> 16:11.000
everyone that creates spans,

16:11.000 --> 16:13.000
including libraries that you're using,

16:13.000 --> 16:15.000
but also the spans that you create on your own,

16:15.000 --> 16:19.000
will start using OTLP under the hood,

16:19.000 --> 16:23.000
without needing additional libraries,

16:23.000 --> 16:25.000
like a hummingbird-opentelemetry, for example,

16:25.000 --> 16:27.000
or a postgres-opentelemetry;

16:27.000 --> 16:31.000
it's just Postgres and Hummingbird supporting Swift

16:31.000 --> 16:33.000
Distributed Tracing as the abstraction layer,

16:33.000 --> 16:35.000
and then you as the end user say,

16:35.000 --> 16:37.000
I do want to use

16:37.000 --> 16:39.000
the OpenTelemetry tracer,

16:39.000 --> 16:45.000
and that's how it works, as easy as that.
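
NOTE
A reconstruction of the bootstrap slide based on the swift-otel 0.x API as remembered; module and type names may have drifted between releases, so check the project's README:
import Instrumentation
import NIO
import OpenTelemetry
import OtlpGRPCSpanExporting
let group = MultiThreadedEventLoopGroup(numberOfThreads: 1)
// Spans leave the process via OTLP over gRPC...
let exporter = OtlpGRPCSpanExporter(config: .init(eventLoopGroup: group))
// ...but are first collected into batches rather than exported one by one.
let processor = OTel.BatchSpanProcessor(exportingTo: exporter, eventLoopGroup: group)
let otel = OTel(serviceName: "ProductCatalog", eventLoopGroup: group, processor: processor)
try otel.start().wait()
// The familiar last line: from here on, every span, including those created by
// libraries like Hummingbird, flows through the OTel tracer.
InstrumentationSystem.bootstrap(otel.tracer())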

16:45.000 --> 16:47.000
Next steps,

16:47.000 --> 16:53.000
I quickly teased the logging integration in Swift OTel,

16:53.000 --> 16:55.000
so we do have the metadata provider,

16:55.000 --> 16:59.000
which puts trace and span IDs inside your log statements,

16:59.000 --> 17:03.000
but we want to go a step further:

17:03.000 --> 17:05.000
because OpenTelemetry supports logging, metrics,

17:05.000 --> 17:07.000
and distributed tracing,

17:07.000 --> 17:11.000
we want to export our log statements via OTLP as well.

17:11.000 --> 17:15.000
Then we want to roll out Swift Distributed Tracing

17:15.000 --> 17:19.000
into more libraries to get even more of this tracing for free.

17:19.000 --> 17:21.000
So, for example, earlier,

17:21.000 --> 17:23.000
you saw me wrapping the Postgres statement myself;

17:23.000 --> 17:25.000
instead it would be nice for PostgresNIO

17:25.000 --> 17:27.000
to have built-in support for Swift

17:27.000 --> 17:29.000
Distributed Tracing, so I don't have to do that anymore.

17:29.000 --> 17:31.000
Same for AsyncHTTPClient,

17:31.000 --> 17:33.000
but at least for this one,

17:33.000 --> 17:35.000
I already created a pull request,

17:35.000 --> 17:37.000
so check it out,

17:37.000 --> 17:41.000
and then finally we want to get to a Swift OTel 1.0,

17:41.000 --> 17:45.000
which will then include log exporting as well.

17:45.000 --> 17:49.000
Thank you so much for listening.

17:49.000 --> 17:51.000
I have some handy QR code here,

17:51.000 --> 17:53.000
if you want to check out Swift Distributed Tracing.

17:59.000 --> 18:01.000
Thank you.

18:07.000 --> 18:08.000
Anyone?

18:08.000 --> 18:09.000
Last call?

18:09.000 --> 18:10.000
Okay.

18:10.000 --> 18:12.000
Can you put your slides up somewhere?

18:12.000 --> 18:14.000
Yes, I will do that after this one.

18:14.000 --> 18:17.000
And the example that I've been using

18:17.000 --> 18:21.000
for the microservices demo will be on GitHub as well.

18:35.000 --> 18:37.000
So, the question was

18:37.000 --> 18:41.000
if distributed actors in Swift already support distributed tracing.

18:41.000 --> 18:43.000
They don't at this point,

18:43.000 --> 18:47.000
but it's kind of teasing it.

18:47.000 --> 18:49.000
So yes, it could work.

18:49.000 --> 18:53.000
I played around with it in the past as well.

19:07.000 --> 19:11.000
I don't know of future plans for distributed actors,

19:11.000 --> 19:15.000
but feel free to ask Konrad about it.

19:21.000 --> 19:23.000
And then in the back, you had one?

19:27.000 --> 19:28.000
X-ray.

19:28.000 --> 19:32.000
Yeah, I do have a support library called Swift OTel X-Ray.

19:32.000 --> 19:35.000
Sorry, I have to repeat the question.

19:35.000 --> 19:38.000
The question was,

19:38.000 --> 19:41.120
whether Swift Distributed Tracing works with Amazon X-Ray already.

19:41.120 --> 19:44.400
And the answer is yes, I did use it at a previous job.

19:44.400 --> 19:48.200
And for that, I just had to create an additional support

19:48.200 --> 19:52.120
library, because the way X-Ray does context propagation

19:52.120 --> 19:52.880
is a bit different.

19:52.880 --> 19:55.360
Its trace IDs look different from the standard

19:55.360 --> 19:58.200
that OpenTelemetry uses.

19:58.200 --> 20:02.760
So let me briefly go back to this setup.

20:02.760 --> 20:05.400
Here we can see that I'm using the OTel random ID

20:05.400 --> 20:08.320
generator and the W3C propagator.

20:08.320 --> 20:10.960
If you want to use X-Ray, you just configure it

20:10.960 --> 20:14.080
with an X-Ray-specific ID generator and an

20:14.080 --> 20:16.240
X-Ray-specific propagator, and then you're done.

20:16.240 --> 20:19.080
And these two are implemented in the Swift OTel X-Ray

20:19.080 --> 20:20.200
repository.
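
NOTE
A sketch of the X-Ray variant; the generator and propagator type names are assumptions approximating what the swift-otel-xray package provides:
import OpenTelemetry
import OpenTelemetryXRay
let otel = OTel(
    serviceName: "ProductCatalog",
    eventLoopGroup: group,
    idGenerator: XRayIDGenerator(),  // X-Ray-formatted trace IDs
    propagator: XRayPropagator(),    // X-Ray's tracing header instead of W3C
    processor: processor
)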

20:20.200 --> 20:22.080
All right, that's it.

20:22.080 --> 20:24.800
All right, ask more questions in the chat

20:24.800 --> 20:26.800
or meet him in the hallway.

20:26.800 --> 20:28.920
I'm sure he'd love a high five.

20:28.920 --> 20:32.160
Great share color.

20:32.160 --> 20:34.280
And thank you so much for the detailed talk.

20:34.280 --> 20:36.080
I love when my two worlds collide.

20:36.080 --> 20:37.800
I'm a recovering Kubernetes maintainer.

20:37.800 --> 20:40.760
So it's always great to see cloud native colliding

20:40.760 --> 20:42.080
with Swift.

20:42.080 --> 20:42.600
Amazing.

20:42.600 --> 20:43.440
Thank you so much.

