WEBVTT

00:00.000 --> 00:13.000
Okay, can you hear me? Can you hear me? Just a sound check, just to hear me.

00:13.000 --> 00:21.400
Perfect, so thank you, so let's start. My name is Lorenzo, I'm a humble

00:21.400 --> 00:27.640
Flink contributor, while my friend here Hong is actually a proper Flink

00:27.640 --> 00:32.360
committer, okay, an Apache Flink committer, and we both work for AWS, but we are

00:32.360 --> 00:35.680
not talking about AWS today, we are talking about Apache Flink and

00:35.680 --> 00:44.520
Prometheus as open source, okay, so no AWS services here. I mean, I'm

00:44.520 --> 00:48.960
asking your cooperation just to understand you all. The first question, I

00:48.960 --> 00:52.480
think I know the answer, because we are in an observability track: how many

00:52.480 --> 00:59.440
of you are familiar with Prometheus? Okay, kind of expected, I would say. But how

00:59.440 --> 01:07.560
many of you are familiar with Flink? Okay, much fewer, fine. We will explain very

01:07.560 --> 01:12.760
briefly what Flink is and what we are talking about. Today we are talking

01:12.760 --> 01:17.320
about Apache Flink and Prometheus together, for an observability use case, so

01:17.320 --> 01:20.920
you're definitely in the right track. Spoiler: we are not talking about

01:20.920 --> 01:25.680
observing Flink, that's a different story, it's a solved problem, we are not talking

01:25.680 --> 01:33.320
about that. So I'm actually handing over to Hong to explain what we are

01:33.320 --> 01:38.200
talking about. Thanks Lorenzo. So I think let's set the scene a little bit

01:38.200 --> 01:42.000
first, right. So I think the main thing is observability, right: what do you

01:42.000 --> 01:44.760
want to observe? The first thing, so, most of the time we think about stuff like

01:44.760 --> 01:48.960
servers, right. So we have a server, a bunch of compute instances that you

01:48.960 --> 01:51.840
want to monitor; the earlier talks were about tracing, about logging,

01:51.840 --> 01:55.280
right. You have a server, you want to monitor it, and it's standard, right: you have

01:55.280 --> 01:58.280
some form of dashboarding, some form of alerting, right, the Prometheus and

01:58.280 --> 02:02.400
Grafana stack, standard, right. So most of the time this works, you know, so why,

02:02.400 --> 02:06.800
why do you need anything else, there's no problem to solve, right? But nowadays we

02:06.800 --> 02:09.400
have lots of stuff, right, there's lots of other devices that you

02:09.400 --> 02:11.720
probably want to observe, and you probably want to think about how you're

02:11.720 --> 02:15.160
going to build a platform to look at these things. So just a couple of examples

02:15.160 --> 02:19.040
here, right. You have a bunch of devices at home, right, IoT devices, right;

02:19.040 --> 02:22.120
you have phones, your phone applications, that might have, you know,

02:22.120 --> 02:25.200
network connectivity issues, right, sometimes the data is on, sometimes data is

02:25.200 --> 02:29.600
off; you might have different kinds of vehicles around, right; you might have,

02:29.600 --> 02:33.200
let's say, delivery parcels you might want to track, right, so observability; I

02:33.200 --> 02:35.800
don't know, satellites, for example; I don't know, stuff that loses

02:35.800 --> 02:39.880
connection periodically, right. So basically you have things with intermittent

02:39.880 --> 02:44.080
connectivity, you have dispersed devices, and most of the time

02:44.120 --> 02:48.280
you have many, many devices. You don't just have, like, I guess you have

02:48.280 --> 02:51.400
a lot of servers as well, but most of the time you have probably a lot more

02:51.400 --> 02:54.640
devices, right, a lot of different types of device IDs. And how are you going to

02:54.640 --> 02:57.600
observe this? Can we use the same stack, that's the question, right. We have a

02:57.600 --> 03:01.240
solid solution, you know, we're familiar with this interface, we know how to use

03:01.240 --> 03:05.240
PromQL, we have lots of, you know, skills within this stack, we're

03:05.240 --> 03:09.000
familiar with it, can we reuse this? That's the question, right, and hopefully we

03:09.000 --> 03:13.360
can, but we're going to explore it a little bit. So that's the landscape.

03:13.360 --> 03:17.400
Now let's take an easy one first, right, observing servers, right. So we know,

03:17.400 --> 03:20.160
most of the time, like, there are different ways of doing it, but

03:20.160 --> 03:23.240
Prometheus offers two different ways; let's talk about this one way

03:23.240 --> 03:26.800
first, right. So you can put in a Prometheus exporter, put in some form of end-

03:26.800 --> 03:31.680
point, you have a scraper, right, Prometheus scrapes, and then you store in a time

03:31.680 --> 03:35.440
series DB, and then you set up your visualization and alerting, right, using

03:35.440 --> 03:38.960
PromQL queries, and then you set up your graphs. A simple pull model, right.
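
The pull model just described relies on each target exposing a plain-text endpoint that Prometheus scrapes. A minimal JDK-only sketch of the text exposition format a scraper reads (the metric and label names are hypothetical, and this is not a real Prometheus client library):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExpositionFormat {
    // Render one sample line in Prometheus text exposition format:
    //   metric_name{label="value",...} <value>
    public static String render(String name, Map<String, String> labels, double value) {
        StringBuilder sb = new StringBuilder(name);
        if (!labels.isEmpty()) {
            sb.append('{');
            boolean first = true;
            for (Map.Entry<String, String> e : labels.entrySet()) {
                if (!first) sb.append(',');
                sb.append(e.getKey()).append("=\"").append(e.getValue()).append('"');
                first = false;
            }
            sb.append('}');
        }
        return sb.append(' ').append(value).toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("instance", "server-1"); // hypothetical label
        System.out.println("# TYPE cpu_usage gauge");
        System.out.println(render("cpu_usage", labels, 0.42));
        // prints: cpu_usage{instance="server-1"} 0.42
    }
}
```

An exporter serves lines like these over HTTP (conventionally at /metrics) and Prometheus pulls them on its scrape interval; real client libraries also handle escaping, HELP/TYPE metadata and timestamps.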

03:38.960 --> 03:43.280
So Prometheus pulls from the data source that you want. But can you

03:43.280 --> 03:47.640
apply the same to IoT devices, right, just do the same thing? Possible? Probably

03:47.640 --> 03:49.760
not, right. How are you going to set up a server, how are you going to set up an

03:49.760 --> 03:52.880
endpoint, how are you going to be able to collect this data in a periodic

03:52.880 --> 03:56.880
manner? You have intermittent connectivity, what are you

03:56.880 --> 04:01.360
going to do? So what we do is, we use the push model into Prometheus, right,

04:01.360 --> 04:04.280
that's Prometheus remote write. So what you do is, you will basically have

04:04.280 --> 04:08.600
those components remote-write, using a particular API protocol that we have, the

04:08.600 --> 04:12.640
remote write protocol, and you write into Prometheus, right. So this kind of,

04:12.640 --> 04:16.480
you probably, I guess, what's this custom component? I guess it depends on your

04:16.480 --> 04:20.000
system; it's generic, in a sense, because there are lots of different types

04:20.000 --> 04:23.600
of devices, right. So how do you monitor these things? Most of the time you want to

04:23.600 --> 04:28.840
set up some integration from your device into something else. Let me

04:28.840 --> 04:31.680
stop here, it doesn't matter; the point is that you're pushing from those devices

04:31.680 --> 04:36.440
into Prometheus. But what are the characteristics of this kind of problem

04:36.480 --> 04:40.040
space, right? So one of the characteristics is that most of the time your devices

04:40.040 --> 04:45.240
have small hardware. You don't want to set up, like, huge memory or

04:45.240 --> 04:48.720
huge CPU on, for example, your home device; like, I guess if you

04:48.720 --> 04:51.880
use up a lot of memory and space on your phone, that's not a good

04:51.880 --> 04:55.160
application, right. So the point is, you probably want to keep

04:55.160 --> 04:59.000
it as light as possible, right. You want to send things, I guess,

04:59.000 --> 05:02.200
as frequently as possible, in case some of the connection gets dropped, right; you

05:02.200 --> 05:05.840
never know when you drop some packets. So there are some types of problems

05:05.840 --> 05:09.040
that are specific to this problem space; we'll explain more about it later, and

05:09.040 --> 05:13.800
why the current stack might not deal very well with it. So yeah,

05:13.800 --> 05:19.080
so let's see, let's explore what PromQL offers, just a quick summary

05:19.080 --> 05:22.120
in case: so you can do all kinds of things in PromQL, right. So you have all

05:22.120 --> 05:25.680
your data in Prometheus, right, then you can filter and you aggregate the

05:25.680 --> 05:28.600
series in real time; it's really, really powerful, right, the semantics you can

05:28.600 --> 05:32.840
do with it. You can do rate, you can do derivations, derivatives sorry, and you

05:32.840 --> 05:36.040
can do, I guess, windowing as well; there's lots of powerful things you can

05:36.040 --> 05:39.720
do with PromQL, yes. But then there are a couple of limitations as well, so,

05:39.720 --> 05:44.320
a couple of things around joins. So enrichment is something that's pretty

05:48.080 --> 05:51.200
much out of reach. So for example, if you wanted to, let's say, send a record with

05:51.200 --> 05:54.040
a limited amount of information, with little, let's say just a unique ID of

05:54.040 --> 05:56.840
your device, and you want to enrich it with extra information, that you

05:56.840 --> 06:00.400
want to do a join over in PromQL, in PromQL that's not something

06:00.400 --> 06:05.160
that you can easily solve, right. If you wanted to, you could, but then, you

06:05.160 --> 06:09.520
know, it gets complex very easily. The other one is maybe some form of custom

06:09.520 --> 06:12.880
derived metrics: let's say you have some form of state machine, right, that

06:12.880 --> 06:17.720
you have in your device, and you want to use that as a join with another

06:17.720 --> 06:20.920
metric. So it's not that it's bad, it works, right; it's just that in these

06:20.920 --> 06:25.800
specific use cases it doesn't work. And when you have super high cardinality

06:25.800 --> 06:30.600
data, right, you have lots of unique IDs, what are you going to do with it, right? So,

06:30.600 --> 06:33.800
like, if you have, let's say, a hundred thousand devices, or fifty

06:33.800 --> 06:38.480
thousand devices, stuff like that, you start to hit the limit, where you want to run,

06:38.480 --> 06:43.680
like, windowing over a hundred-thousand cardinality, you

06:43.680 --> 06:47.000
get into some problems. So: poor performance on high cardinality data. So
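
The join and enrichment cases above become plain code when moved out of the query layer: look up static context by device ID before the sample ever reaches the TSDB. A minimal JDK-only sketch of the idea (the record fields and reference table are hypothetical, and this is not the Flink API):

```java
import java.util.Map;

public class Enricher {
    // A raw event carries only a device ID and a reading.
    public record RawEvent(String deviceId, double value) {}

    // The enriched event adds a dimension looked up from reference data.
    public record Enriched(String deviceId, String location, double value) {}

    // Reference data would typically come from a database; here it is a map.
    public static Enriched enrich(RawEvent e, Map<String, String> locationById) {
        String location = locationById.getOrDefault(e.deviceId(), "unknown");
        return new Enriched(e.deviceId(), location, e.value());
    }

    public static void main(String[] args) {
        Map<String, String> reference = Map.of("dev-1", "room-a"); // hypothetical lookup table
        Enriched out = enrich(new RawEvent("dev-1", 21.5), reference);
        System.out.println(out); // location "room-a" attached before storage
    }
}
```

Done upstream, the extra dimension is attached once per event instead of being joined on every dashboard refresh; the trade-off is that you now own the enrichment code.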

06:43.680 --> 06:47.000
let's, I guess, go back to the IoT devices that we talked about, just to simplify

06:47.000 --> 06:50.600
the diagram, right. On the left you have some observed assets, just to simplify

06:50.600 --> 06:54.080
it, genericize it, right: something you want to observe, basically, right. The

06:54.080 --> 06:57.200
problem space we're looking at is high cardinality; we're looking at probably,

06:57.200 --> 07:01.040
potentially, millions of devices. High frequencies: some of these devices

07:01.040 --> 07:06.080
send lots and lots of data constantly, right. Maybe different standardization

07:06.080 --> 07:09.440
as well: some of them send very frequently, some of them send, sometimes, once a

07:09.440 --> 07:12.080
second; how do you standardize, how do you make sure that you have the same

07:12.080 --> 07:15.680
target across them, right? And then a couple of things as well, like,

07:15.680 --> 07:19.640
maybe they send repeated data, right, you get duplicates in case you

07:19.640 --> 07:24.280
drop some of them, right. So there are different types of problems. And then, just

07:24.280 --> 07:27.480
to summarize here: you must use the push model, we talked about that, not

07:27.480 --> 07:32.000
pull, the push model fits better; high cardinality is usually a common problem;

07:32.000 --> 07:35.480
high frequency, or different frequencies, as well; and usually you want to

07:35.480 --> 07:39.920
keep it light, send something super short, right. So usually you don't have

07:39.920 --> 07:43.200
the context info; for example, on a device you probably don't know

07:43.200 --> 07:47.360
the make, the model; like, again, you can add it, but then you will increase

07:47.360 --> 07:50.560
the payload, that's a trade-off, right, so you can also reduce it; and maybe the

07:50.560 --> 07:54.160
contextual info might change over time, right; you want to keep it light, so stuff

07:54.160 --> 07:59.040
like that. So if you just use the same stack, Prometheus and PromQL, right, then you

07:59.040 --> 08:02.000
run into the problems that we talked about. On high cardinality you have

08:02.000 --> 08:05.800
an expensive operation, right, and you're doing that constantly, every time you

08:05.800 --> 08:09.040
refresh. I guess you can cache and stuff like that, but the point is,

08:09.040 --> 08:13.760
in the end you still have this expensive query, if you want live data,

08:13.760 --> 08:16.640
that you have to run constantly, every time you refresh it. That's

08:16.640 --> 08:19.960
another thing. We talked about enrichment as well: you probably can't enrich

08:19.960 --> 08:23.000
as easily. You can plug it into Grafana, there are lots of

08:23.000 --> 08:27.640
different Grafana plugins, very powerful, but the point is that all your

08:27.640 --> 08:31.640
complexity is on this side, right. So what can we do to simplify it, right, and help you

08:31.640 --> 08:36.160
to, for example, you know, build up something more extensible, where complexity can

08:36.160 --> 08:39.320
be shifted out of it? So that's where Flink can help you, right. So what you can

08:39.320 --> 08:43.760
do is that Flink can step in as a pre-processing engine, because Flink has

08:43.760 --> 08:48.440
very powerful time semantics; it can handle lots of, like, it has built-in

08:48.440 --> 08:52.520
handling of time semantics, and hopefully it's going to be explained a little bit later,

08:52.520 --> 08:57.160
but basically it can take in time series data, it can process it for

08:57.160 --> 09:01.920
you, it can simplify it for you, and there's lots of, I guess, things similar to

09:01.920 --> 09:05.720
PromQL, if you think about that, but you can do it at a large distributed scale, right,

09:05.720 --> 09:10.120
and run it across, like, distributed clusters that can process a lot of data, and

09:10.120 --> 09:13.920
you can do it in a reliable way: if your cluster goes down, you

09:13.920 --> 09:17.640
can replay the data and produce the same results, right. So it's the kind of thing

09:17.640 --> 09:22.000
that can scale very well, and it's the kind of thing that can produce consistent

09:22.000 --> 09:25.360
results, which is important when you want to look at, like, counting, for example: a

09:25.360 --> 09:28.040
simple count, you want to count the right number, you cannot count twice, right,

09:28.040 --> 09:31.360
so stuff like that. You can produce consistent results, pre-process it,

09:31.360 --> 09:35.680
generate your enriched, aggregated metrics, put them in Prometheus, and after that

09:35.680 --> 09:39.880
you can just query those. So it's kind of allowing you to shift, I guess,

09:39.880 --> 09:43.000
the complexity somewhere else. Is it possible? You know, of course there are trade-

09:43.000 --> 09:46.520
offs, which we'll talk about, but, you know, this is one potential solution that can

09:46.520 --> 09:52.200
help you. So now over to Lorenzo. Yep, thank you. So let's talk about

09:52.200 --> 09:58.000
Flink, just very briefly, and specifically for this type of use case, not

09:58.000 --> 10:03.640
Flink in general. For who is not familiar with Flink, what is Flink? This is

10:03.640 --> 10:06.920
actually from the Flink website, from the Apache Flink website; this is the

10:06.920 --> 10:10.880
definition. Let's go through it, because it actually contains all the relevant

10:10.880 --> 10:16.160
key points of Flink. So, Flink is a framework and a distributed processing

10:16.160 --> 10:21.280
engine for stateful computation over bounded and unbounded data streams. So

10:21.280 --> 10:26.000
what does it mean? First of all, it is a framework, a framework that allows you to

10:26.000 --> 10:33.240
code. So you write code, Java mainly, let's say JVM languages; I'm

10:33.240 --> 10:38.520
saying Java for simplicity, but you can use any JVM language, or Python.

10:38.520 --> 10:44.280
Okay, so you can write code, and this is where the power is, because you can do

10:44.280 --> 10:49.080
a super simple aggregation, you know, a count over a tumbling window, that's super easy, you

10:49.080 --> 10:54.040
have very high-level primitives for that, but you can go down as much as you

10:54.040 --> 10:59.440
need, so you can really implement the business logic, writing code in Java

10:59.440 --> 11:04.680
or Python. It is also a distributed processing engine, because it allows you to run

11:04.680 --> 11:10.640
at scale, in a distributed way, and it scales a lot, okay, and it's designed to

11:10.640 --> 11:19.080
be fast, to handle huge throughput. It is mainly designed for stateful

11:19.080 --> 11:21.920
computation. What is stateful computation? Every time you need the

11:21.920 --> 11:27.160
application to keep something in mind. A count is stateful, because you need to

11:27.160 --> 11:32.000
keep the count. Keeping the count is not difficult, you have a variable in

11:32.000 --> 11:37.200
memory; what is difficult is doing it safely when the application stops and restarts. If

11:37.200 --> 11:41.160
you just keep a variable, when you stop and restart everything is lost, so

11:41.160 --> 11:45.560
this is the part that is complicated, and Flink provides guarantees on this:

11:45.560 --> 11:49.880
checkpointing, savepointing; I am not going down into that, but this is out of the

11:49.880 --> 11:54.840
box with Flink. So it provides exactly-once internal state

11:54.840 --> 11:59.720
guarantees, so the count will never be messed up if the application crashes and

11:59.720 --> 12:04.720
restarts; this is guaranteed. I'm using the count as a simple example, but it can be

12:04.720 --> 12:09.400
something much more complicated, for example mini state machines about the state

12:09.400 --> 12:13.160
of one device: you can have a state machine for each device, and this is state,

12:13.160 --> 12:17.720
okay, and Flink will handle this for you. And the last part, bounded and un-

12:17.720 --> 12:22.080
bounded, what is this? It's probably not so important for our

12:22.080 --> 12:25.240
use case, because we are focusing on streaming, but actually the

12:25.240 --> 12:30.960
thing is, Flink generalizes batch as a sub-case of streaming. So with

12:30.960 --> 12:35.680
the same API you can actually implement the use case running on a batch of

12:35.680 --> 12:41.280
data: you have a finite data set, you run it as you would do with Spark; then you switch

12:41.280 --> 12:47.760
to a stream, and you run the same thing as streaming, running continuously. The API is

12:47.760 --> 12:52.920
the same, and that is the way Flink does it internally, because it really generalizes

12:52.920 --> 12:59.720
the bounded data set, the batch data set, as just a stream with bounds.
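
The "batch is a stream with bounds" idea can be illustrated with plain Java: the same windowed-count logic consumes a bounded input and simply terminates, as a batch job would, while an unbounded input would keep producing windows. A JDK-only sketch with made-up timestamps; this is not the Flink windowing API:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WindowedCount {
    // Count events per tumbling window of windowMillis, based on event timestamps.
    // The loop is the same whether the input is a bounded data set or a stream;
    // with a bounded input it simply ends, which is how batch generalizes to streaming.
    public static Map<Long, Integer> countPerWindow(Iterable<Long> timestamps, long windowMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = (ts / windowMillis) * windowMillis; // tumbling window assignment
            counts.merge(windowStart, 1, Integer::sum);            // the stateful count
        }
        return counts;
    }

    public static void main(String[] args) {
        // A bounded "batch" of events at t=10ms, 20ms and 70ms, with 60ms windows.
        System.out.println(countPerWindow(List.of(10L, 20L, 70L), 60));
        // window [0,60) holds two events, window [60,120) holds one
    }
}
```

In Flink the count would live in checkpointed state, so a crash and restart resumes with the window contents intact instead of starting from zero.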

12:59.720 --> 13:04.160
Again, this last part is not so important for us today, because we are

13:04.160 --> 13:08.640
talking about stream processing, so it's an engine that keeps running. Another

13:08.640 --> 13:13.440
important thing about Flink is that it has multiple connectors. Connectors are the

13:13.440 --> 13:17.760
components of Flink that allow you to connect to an external system, so Flink

13:17.760 --> 13:24.160
can read from many, many systems; this is totally not exhaustive, okay, just

13:24.160 --> 13:30.960
a few samples; and write to many, many systems: streams, messaging, but

13:30.960 --> 13:35.000
also bounded data sets, you can read files, you can read from a database; same

13:35.000 --> 13:40.200
thing for writing, and again this is not exhaustive. Notice that Prometheus is not

13:40.200 --> 13:48.480
there, or used not to be there, and I will get there in a minute. So let's go back

13:48.480 --> 13:52.400
to our use case, what we are talking about, what Hong introduced: we

13:52.400 --> 13:57.800
want to do something like this. So we have our observed assets, we ingest them, we

13:57.800 --> 14:01.520
don't care how you ingest, there are many, many tools for doing that, different

14:01.520 --> 14:06.720
technologies, we don't care. So you ingest them, and you have all the events, all the

14:06.720 --> 14:11.480
raw events from your devices, and we want to pre-process them with Flink, and

14:11.480 --> 14:16.720
then write directly to Prometheus, which we consider like a time series

14:16.720 --> 14:21.960
DB, and then the rest is what you are actually familiar with. Actually, you

14:21.960 --> 14:26.680
need an additional component in the middle, because you cannot push into Flink:

14:26.680 --> 14:32.280
Flink is an application, you cannot push into the application, you need to read;

14:32.280 --> 14:38.720
Flink needs to read, needs to pull the data from some source. And because you need

14:38.720 --> 14:45.480
to flow data in, you need to stream data in, the most appropriate transport, let's

14:45.480 --> 14:49.800
say, for this kind of use case is a stream storage, like Kafka, for example. Kafka

14:49.800 --> 14:54.280
is probably the best fit, but there are also others: you can use messaging, you can use

14:54.280 --> 14:59.200
services like Kinesis streams; again, messaging will probably be fine, but

14:59.200 --> 15:06.040
usually with Flink, with stateful stream processing, a log semantic like Kafka's is

15:06.040 --> 15:09.360
better, because it allows you to keep guarantees and go back to some point and

15:09.360 --> 15:13.360
replay if needed. But anyhow, you need a component in the middle, so your

15:13.360 --> 15:19.560
ingestion layer has to ingest this data, whatever they are, and send them to

15:19.560 --> 15:24.560
Kafka or any other stream storage, and then you will have Flink consuming in

15:24.560 --> 15:31.320
real time, with quite low latency, from Kafka, doing the processing, and then

15:31.320 --> 15:38.640
writing directly to Prometheus. So, what can you do with Flink, effectively? Well,

15:38.640 --> 15:41.800
you can solve the problems that Hong was introducing, because it's very easy,

15:41.800 --> 15:48.040
with the Flink API, with the programming interface, to reduce cardinality, because

15:48.040 --> 15:52.880
you can aggregate horizontally, and you can actually aggregate horizontally not just

15:52.880 --> 15:58.640
by the dimensions you already have, but, because you can do enrichment, because you

15:58.640 --> 16:02.480
can easily look up an external database, for example, to look up, as

16:02.480 --> 16:08.360
we were saying, the model, the maker of a device, the location, who is the

16:08.360 --> 16:12.360
owner of the device, all data that you probably have in a database; with Flink

16:12.360 --> 16:16.440
it is very easy to do it efficiently, not, you know, in the dumb way of just querying

16:16.440 --> 16:21.400
the database every time, looking up and adding additional dimensions to your data.

16:21.560 --> 16:26.320
These dimensions are actually necessary for doing meaningful observation,

16:26.320 --> 16:31.400
and because your aggregation becomes easy, you can use these additional dimensions to

16:31.400 --> 16:36.160
reduce cardinality, reduce the number of actual metrics, because maybe for some

16:36.160 --> 16:44.800
use cases you don't want to, you know, follow each device alone; you

16:44.800 --> 16:50.400
actually want to aggregate them, and if you want, you can reduce them, okay, you can

16:50.400 --> 16:55.080
reduce cardinality. Granularity means frequency, in a sense, because it's not

16:55.080 --> 17:00.760
uncommon that, with these devices, the frequency at which they send raw events

17:00.760 --> 17:05.400
is very high; they're actually sending events not just for observability, they're

17:05.400 --> 17:11.280
sending events for other jobs, and they may send multiple times per second, and

17:11.280 --> 17:16.640
multiple samples per second is too much for observing; for

17:16.640 --> 17:21.800
dashboarding and alerting it is a lot. Of course you can throw it at Prometheus, but it's

17:21.800 --> 17:25.520
a waste of resources. And of course you can do filtering, if you want: filtering

17:25.520 --> 17:29.360
out outliers, whatever. And you can very easily calculate the right metrics; and

17:29.360 --> 17:33.640
this, the right metrics, it's not just, you know, summing two other metrics; of

17:33.640 --> 17:37.800
course you can do that, and you can probably do it easily also in PromQL,

17:37.800 --> 17:42.640
but, as we were saying, you can for example implement some simple state

17:42.640 --> 17:46.960
machines. So based on raw events, which are not necessarily metrics, which are

17:46.960 --> 17:52.360
coming from the device, you can decide that that device was moving from this

17:52.360 --> 17:57.360
state to this other state and then back, and because Flink is stateful it will keep

17:57.360 --> 18:01.960
the state also if you stop and restart, or your application crashes; so you keep

18:01.960 --> 18:09.560
track of this, and then you emit the state periodically as a metric, okay. Of course
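
The per-device state machine emitted as a numeric metric can be sketched in plain Java. The states and event names below are hypothetical, and the map stands in for what Flink would keep in managed, checkpointed state:

```java
import java.util.HashMap;
import java.util.Map;

public class DeviceStateMachine {
    // Hypothetical device states, numbered so they can be emitted as a gauge value.
    enum State { IDLE, MOVING, FAULT } // ordinals: 0, 1, 2

    // One entry per device; in Flink this would be keyed, checkpointed state.
    private final Map<String, State> stateByDevice = new HashMap<>();

    // Apply one raw event and return the device's new state as a number,
    // which is what you would periodically emit as a metric sample.
    public int onEvent(String deviceId, String event) {
        State next = switch (event) {
            case "start" -> State.MOVING;
            case "stop"  -> State.IDLE;
            case "error" -> State.FAULT;
            default      -> stateByDevice.getOrDefault(deviceId, State.IDLE);
        };
        stateByDevice.put(deviceId, next);
        return next.ordinal();
    }

    public static void main(String[] args) {
        DeviceStateMachine machine = new DeviceStateMachine();
        machine.onEvent("dev-1", "start");
        System.out.println(machine.onEvent("dev-1", "error")); // 2, i.e. FAULT
    }
}
```

Because each transition depends on the previous state kept across events, this is exactly the stateful logic that is awkward in PromQL but natural in application code.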

18:09.560 --> 18:13.240
you need to turn it into a number, we are talking about metrics, but there are ways

18:13.240 --> 18:17.240
of course; but this is something you can do, it is very powerful, and you could

18:17.240 --> 18:26.160
put any logic there, it's programming logic. So, the top three features are very

18:26.160 --> 18:32.440
useful to scale, because they simplify scale, you know: you reduce the number of

18:32.440 --> 18:37.320
samples you're sending to Prometheus, because we are talking about tens of

18:37.320 --> 18:41.440
thousands, hundreds of thousands of devices; it's a lot if each one

18:41.440 --> 18:46.880
is sending one metric per second, and probably each device is sending a

18:46.880 --> 18:51.800
set of 10 metrics or something, so you very easily reach huge numbers in

18:51.800 --> 18:58.320
cardinality and frequency; and with the top ones you actually reduce the

18:58.320 --> 19:02.920
throughput to Prometheus, you can drop the granularity you just don't

19:02.920 --> 19:06.760
need; the others are up to you, it really depends. The bottom ones,

19:06.760 --> 19:11.040
enrichment and the right metrics, actually simplify analysis, or make some analysis

19:11.040 --> 19:15.240
possible. As I was saying, for example, if you don't have the maker, if you don't have

19:15.240 --> 19:18.800
the location of the device, and maybe the device doesn't know the location, you

19:18.800 --> 19:23.000
know it because of an external process, you have it in a database, but the device

19:23.000 --> 19:28.920
doesn't know it, the device is only sending you the ID, how can you analyze something

19:28.920 --> 19:33.240
that is happening in a room, if you just know the ID of the

19:33.240 --> 19:37.920
device, unless you do some form of enrichment? So actually the bottom ones

19:37.920 --> 19:45.200
unlock some of the observability cases, unlock some cases that you cannot do

19:45.200 --> 19:53.720
otherwise. Sorry, just checking we are on time. So let's look at

19:53.720 --> 19:59.160
the integration between Flink and Prometheus, what we are talking about. Just

19:59.160 --> 20:03.680
a bit on the Flink application: this is a very simplified

20:03.680 --> 20:07.840
representation of a typical Flink application. In a Flink application you

20:07.840 --> 20:12.680
always have one or more sources; they are connecting to the external system,

20:12.680 --> 20:17.480
Kafka in this case, or whatever, you have many connectors doing that. You do one or

20:17.480 --> 20:22.520
more transformations, whatever we described, and then you end up sending the

20:22.520 --> 20:26.600
data to one or more sinks; we are showing one here, but you can actually have more,

20:26.600 --> 20:31.880
we'll talk about that later. And our idea is: you have your observed assets, you

20:31.880 --> 20:36.280
ingest them, we omitted in this picture the Kafka in the middle, then you have your

20:36.280 --> 20:40.840
application reading from Kafka, transforming, and sinking, in this case directly,

20:40.840 --> 20:44.800
with no intermediary, to Prometheus, using the push remote write

20:44.800 --> 20:51.840
Prometheus interface, okay. So, if you are familiar with Flink, I'm sure you

20:51.840 --> 20:57.280
have heard about the Flink Prometheus metric reporters, okay. So you were maybe

20:57.280 --> 21:01.560
wondering, what is this guy talking about, it is a solved problem, it's already

21:01.560 --> 21:05.160
there. We're talking about something completely different, okay: the

21:05.160 --> 21:08.440
Flink-provided Prometheus metric reporter is

21:08.440 --> 21:14.480
meant for observing Flink, it is meant for observing the application. It works perfectly,

21:14.480 --> 21:21.640
it works well, it's a solved problem; but you cannot use it, you cannot hack it to handle

21:21.640 --> 21:27.400
your data. It will never scale, because your Flink application, even if it's running

21:27.400 --> 21:32.400
at a huge scale, even if it's super complicated, will never have tens of

21:32.400 --> 21:36.320
thousands of metrics, and the exporter is not meant for that: it doesn't provide

21:36.320 --> 21:43.120
any guarantees, it is not designed to handle data, it is designed to observe Flink. So

21:43.120 --> 21:46.040
we're talking about something different, we are not talking about this. What we are

21:46.040 --> 21:52.520
talking about is that sink: we need a sink. So, until recently there was no

21:52.520 --> 22:00.920
Prometheus sink connector in Flink. You can implement your own, if you want: the

22:00.920 --> 22:06.880
remote write protocol is well documented. The only thing is, you have to do
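
The batching and ordering work this involves can be sketched: group samples by time series and sort each series by timestamp before writing, because Prometheus rejects out-of-order samples within a series. A JDK-only sketch with hypothetical types; this is not the connector's actual code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemoteWriteBatcher {
    // One sample of one time series; the series is identified by its label set.
    public record Sample(String seriesLabels, long timestampMillis, double value) {}

    // Group samples by series and sort each group by timestamp, so each series
    // is written in order. Out-of-order samples within a series are rejected
    // by the Prometheus remote write endpoint.
    public static Map<String, List<Sample>> batchBySeries(List<Sample> samples) {
        Map<String, List<Sample>> batches = new HashMap<>();
        for (Sample s : samples) {
            batches.computeIfAbsent(s.seriesLabels(), k -> new ArrayList<>()).add(s);
        }
        for (List<Sample> batch : batches.values()) {
            batch.sort(Comparator.comparingLong(Sample::timestampMillis));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Sample> in = List.of(
                new Sample("temperature{device=\"d1\"}", 2000, 21.0),
                new Sample("temperature{device=\"d1\"}", 1000, 20.5),
                new Sample("temperature{device=\"d2\"}", 1000, 19.0));
        System.out.println(batchBySeries(in).get("temperature{device=\"d1\"}"));
        // the d1 batch comes out ordered: t=1000 first, then t=2000
    }
}
```

When parallelizing, the same grouping key decides which writer owns a series, so one series is never written by two writers concurrently.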

22:06.880 --> 22:11.680
everything by yourself, because there is no Prometheus client: you have to use HTTP, you

22:11.680 --> 22:15.640
have to serialize, you have to batch. If you try to send one sample at a

22:15.640 --> 22:21.720
time, Prometheus will just stop receiving data very, very soon, you will not be able

22:21.720 --> 22:25.320
to send; you need to do proper batching; you need to be sure that, if you

22:25.320 --> 22:30.160
parallelize the writes, you keep the samples of the same time series in the same

22:30.160 --> 22:34.160
parallel writer, or otherwise Prometheus will start spitting them back, because they are

22:34.160 --> 22:38.640
out of order; and other things like that. So it was possible, but it was very

22:38.640 --> 22:43.040
cranky. Conversely, with the connector, it's just a sink component: you just put

22:43.040 --> 22:48.520
it there and you send the data. So this was actually released recently, last

22:48.520 --> 22:53.640
November. It is actually a contribution: the two of us and a couple of

22:53.640 --> 22:57.680
other colleagues contributed this; it is part of the Apache Flink project, okay. It's not,

22:57.680 --> 23:01.680
it's nothing proprietary, it's part of the Apache Flink project. About the

23:01.680 --> 23:07.720
supported versions: they support Flink 1.19 and 1.20; the API is the same

23:07.720 --> 23:12.920
for Flink 2, we have to test it, to be honest. It should work fine, because the API

23:12.920 --> 23:17.120
is the same, but Flink 2 at the moment is kind of in flight; soon it will be

23:17.120 --> 23:21.200
ready, and I'm sure we will test it, I'm not sure it will work. And this is just

23:21.200 --> 23:27.240
showing the documentation very briefly. We tested it, because the goal was

23:27.240 --> 23:34.240
working at scale, so we tested it up to, like the slide says, up to, but these are

23:34.240 --> 23:40.520
real numbers, one million events per second: with one million events per second, one

23:40.520 --> 23:46.160
million metrics, a cardinality of one million. This is not the limit, we stopped there, we

23:46.160 --> 23:50.200
stopped there because we said, okay, this is probably enough, and it was

23:50.200 --> 23:55.240
working, and the Flink cluster that we were using wasn't huge at all, okay,

23:55.240 --> 23:59.800
wasn't huge at all, we just stopped there. And frankly, I stopped there because

23:59.800 --> 24:05.320
I had to request an increase of the quota of the managed Prometheus, and I said, okay,

24:05.320 --> 24:08.960
one million metrics, one million events a second, it's probably good enough, let's

24:08.960 --> 24:14.240
stop it but I'm sure is capable to go further so they actually some trade

24:14.240 --> 24:18.360
of course nothing is a free lunch so the first one is this

24:18.360 --> 24:25.000
solution requires coding there is no no-code solution and you need to write

24:25.000 --> 24:29.840
code at the moment it's only Java we are not yet supporting the Flink

24:29.840 --> 24:34.840
Python interface we will probably get it soon but usually if you want to have

24:34.840 --> 24:41.120
big volumes and low latency Java is better again I'm saying Java meaning JVM many

24:41.120 --> 24:44.440
people are using Kotlin for example or Scala Scala is becoming less

24:44.440 --> 24:49.000
popular but for different reasons but you can use it of course with Flink the

24:49.000 --> 24:54.080
other point is well you need to run your Flink job so you need to set up a

24:54.080 --> 25:00.680
Flink cluster you can run it on Kubernetes you can manage it it's not crazy difficult but

25:00.680 --> 25:05.680
you have to do something okay or you can use a managed service

25:05.680 --> 25:10.720
but anyhow you can run it wherever you want but you have to consider this of

25:10.720 --> 25:15.640
course and you have to consider whether your observability team has

25:15.640 --> 25:20.560
this skill or maybe you need to bring in someone with the skill so I'm actually

25:20.560 --> 25:23.840
handing over to Hong for the conclusion I think we are in time thanks
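To make the "this solution requires coding" trade-off concrete, here is a plain-Java sketch of the kind of logic such a job contains: enriching raw per-device samples with a dimension the device cannot send, then aggregating to reduce cardinality. All names here are hypothetical, and in a real Flink job this logic would run inside a keyed, windowed operator rather than a stand-alone function.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enrich raw per-device samples with a dimension
// (e.g. vehicle model) and average them, reducing cardinality from one
// series per device to one series per model.
public class EnrichAndAggregate {

    // A raw sample as a device would send it: just an ID and a value.
    public record RawSample(String deviceId, double value) {}

    public static Map<String, Double> aggregateByModel(
            List<RawSample> samples, Map<String, String> deviceToModel) {
        Map<String, double[]> acc = new HashMap<>(); // model -> [sum, count]
        for (RawSample s : samples) {
            // Enrichment: look up the dimension the device itself cannot send.
            String model = deviceToModel.getOrDefault(s.deviceId(), "unknown");
            double[] a = acc.computeIfAbsent(model, k -> new double[2]);
            a[0] += s.value();
            a[1] += 1;
        }
        Map<String, Double> out = new HashMap<>();
        acc.forEach((model, a) -> out.put(model, a[0] / a[1])); // average per model
        return out;
    }
}
```

Because this is plain code rather than configuration, it can be unit tested with dummy data, which is the flexibility argument made in the Q&A.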

25:23.840 --> 25:27.440
right so I think just to kind of summarize what we've covered right so the

25:27.440 --> 25:31.840
problem space we looked at right is like devices lots and lots of devices right

25:31.840 --> 25:37.520
so we're at scale we have lots of cardinality high frequency maybe like

25:37.520 --> 25:41.120
you might need to clean up your data a little bit like you might have like

25:41.120 --> 25:44.400
devices which can only send a small amount of context right that kind of

25:44.400 --> 25:48.040
problem space and we still want to use the kind of stack that we're familiar with

25:48.040 --> 25:50.600
right we don't want to build a custom monitoring solution we don't want to

25:50.600 --> 25:54.040
build a custom dashboarding and alerting solution right we want to use what we

25:54.040 --> 25:57.960
already have you can use Flink to plug it in basically so you don't have to

25:57.960 --> 26:01.000
learn the end users don't have to learn something new you can use that and Flink

26:01.040 --> 26:05.760
will be able to allow you to scale the things you're observing right

26:05.760 --> 26:09.360
so that you can still use the same solution with low latency and

26:09.360 --> 26:14.040
you can yeah so it solves it for you basically so I think another thing is

26:14.040 --> 26:18.120
teasing a little bit of what else Flink can do as Lorenzo

26:18.120 --> 26:21.520
mentioned Flink connects a lot of things the job we showed is very simple linear

26:21.520 --> 26:24.880
right one source one transform one sink that's not all we can do right so

26:24.880 --> 26:27.920
the point of Flink is that you can like one of the things is stream and

26:27.920 --> 26:30.280
batch unification which we can talk about later if you're interested but the

26:30.280 --> 26:34.440
idea is that you can process data at high speed as a stream and you can process

26:34.440 --> 26:39.440
data in batch and you can use the same job to do it the idea is that you

26:39.440 --> 26:42.280
can reprocess data there's a lot of things that it solves but in this

26:42.280 --> 26:45.800
case once you plug it in you get lots of things for free which is the

26:45.800 --> 26:48.120
primary thing I'm trying to say so obviously you can write your own

26:48.120 --> 26:51.120
server and manage it but Flink plugs into a lot of things so the

26:51.120 --> 26:53.840
connector we wrote that's free you don't have to implement the remote write

26:53.840 --> 26:56.180
protocol the HTTP call and everything you don't have to manage the time

26:56.180 --> 26:58.940
semantics you don't have to manage the checkpointing everything comes for

26:58.940 --> 27:04.380
free and in fact what you can do here is have Flink write into multiple

27:04.380 --> 27:08.060
sinks it can write the same data with consistency into multiple sinks so

27:08.060 --> 27:10.860
for example you know a lot of people want to write it into this

27:10.860 --> 27:14.780
dashboarding solution for maybe a particular user group but maybe they want to

27:14.780 --> 27:18.460
store it for example in a data lake right somewhere long term right

27:18.460 --> 27:22.260
where you can query over like five years right

27:22.260 --> 27:26.260
without having issues and Flink will allow you to write it in

27:26.260 --> 27:32.060
both cases using the same job one job and one

27:32.060 --> 27:35.460
distributed cluster to write with the same semantics and the same logic into

27:35.460 --> 27:39.020
both of these destinations you update the logic of your job and the logic of

27:39.020 --> 27:41.580
both changes at the same time of course there's coupling and stuff that's a trade-

27:41.580 --> 27:45.500
off we can talk about but the point is it simplifies all your logic into one so

27:45.500 --> 27:53.540
you don't have to worry about diverging data analytics platforms

27:53.540 --> 27:56.740
which is a mess sometimes when people start building incrementally so I think

27:56.740 --> 28:00.020
and Flink scales so you don't have to worry about that you can migrate so

28:00.020 --> 28:03.740
I think I think another thing to mention as well is handle multiple types of

28:03.740 --> 28:06.700
formats so if you want to write to a data lake with a different format than you

28:06.700 --> 28:09.940
write to Prometheus for example you can write in a columnar format it's up to you

28:09.940 --> 28:12.620
Flink can handle it for you and you don't have to worry about it.
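The fan-out idea above can be sketched in plain Java. This is an illustration only, with hypothetical names: in an actual Flink job you would attach several sinks to the same stream, and Flink, not this loop, would handle delivery, checkpointing, and consistency.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of fan-out: one processing step feeding the same
// enriched records to several sinks (e.g. a Prometheus-style sink and a
// data-lake-style sink), so both stay consistent because they share one
// pipeline and one piece of logic.
public class FanOut {

    public static void process(List<String> enriched, List<Consumer<String>> sinks) {
        for (String record : enriched) {
            // Every sink sees exactly the same record from the same job.
            for (Consumer<String> sink : sinks) {
                sink.accept(record);
            }
        }
    }
}
```

Updating the processing logic then updates every destination at once, which is the coupling trade-off discussed in the talk.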

28:12.620 --> 28:16.580
hopefully that's convinced you a little bit about one potential solution you

28:16.580 --> 28:23.380
can try yeah just a couple of resources we have shared the slides so the

28:23.380 --> 28:26.220
first one is the documentation it's actually part of the Flink

28:26.220 --> 28:28.740
documentation so you don't have to search it's in the connectors section of the

28:28.740 --> 28:33.140
documentation the second one is a GitHub repo with an example of the use case

28:33.140 --> 28:36.780
we described of course it's synthetic we have a data generator and the

28:36.780 --> 28:40.660
data generator is simulating the connected vehicles sending the

28:40.660 --> 28:45.820
different metrics and well you can set up everything and you can run with

28:45.820 --> 28:49.220
tens of thousands of vehicles it's actually another Flink application generating

28:49.220 --> 28:53.140
the data and you can see how it works and also we have a separate application

28:53.140 --> 28:57.260
just writing the raw data as they are into Prometheus so you can play with the

28:57.260 --> 29:02.860
dashboards to see how easy it is getting something useful from the aggregated and

29:02.860 --> 29:06.160
enriched metrics and try to do the same with the raw metrics.
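As a hypothetical plain-Java sketch of what that synthetic vehicle telemetry looks like (the real generator in the repo is itself a Flink application, and every name here is invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a synthetic telemetry generator like the one in
// the example repo: each simulated vehicle emits (vehicleId, metric, value)
// events that the main job can then enrich and aggregate.
public class VehicleDataGenerator {

    public record Event(String vehicleId, String metric, double value) {}

    public static List<Event> generate(int vehicles, long seed) {
        Random rnd = new Random(seed); // seeded so runs are reproducible
        List<Event> events = new ArrayList<>();
        for (int i = 0; i < vehicles; i++) {
            String id = "vehicle-" + i;
            events.add(new Event(id, "engine_temperature", 70 + rnd.nextDouble() * 30));
            events.add(new Event(id, "battery_level", rnd.nextDouble() * 100));
        }
        return events;
    }
}
```

Scaling the vehicle count up is how you reproduce the high-cardinality scenario the talk describes.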

29:06.160 --> 29:10.420
The metrics are simple but kind of realistic okay so you can play with it it's

29:10.420 --> 29:15.240
a little repo and that is it please get in touch if you have any questions if

29:15.240 --> 29:25.120
you have any questions here we are and we have one question please

29:25.120 --> 29:29.040
wait for me I'll ask you to repeat it so you can all hear the

29:29.040 --> 29:33.160
question and answers okay so I have to go over there

29:33.160 --> 29:38.600
Hi the enrichment use case is really nice

29:38.600 --> 29:45.880
but for the cardinality reduction and the granularity reduction Prometheus already

29:45.880 --> 29:52.240
supports let's say two million metric series now you have added a Kafka there

29:52.240 --> 29:56.840
you added a Flink cluster there so I'm now adding more nodes more machines

29:56.840 --> 30:01.380
so can you share some cost analysis like okay I'm reducing

30:01.380 --> 30:05.420
two million how many more nodes do I have to add in order to get the reduction

30:05.420 --> 30:10.980
Absolutely I think it really depends on the use case to be honest so if you just

30:10.980 --> 30:17.180
do cardinality reduction or frequency reduction it might be overkill the

30:17.180 --> 30:20.980
thing is sometimes you cannot do cardinality reduction because you don't have

30:20.980 --> 30:26.220
the additional dimensions so I think the key feature is enrichment to add

30:26.220 --> 30:31.020
relevant dimensions which will then allow you to reduce the others the other

30:31.020 --> 30:35.940
thing is it might be well you need to scale Prometheus but I reckon if it is

30:35.940 --> 30:40.900
just that you can probably do it with Prometheus the reality is what we

30:40.900 --> 30:44.140
have seen is most of the time you also need to have additional information

30:44.140 --> 30:49.380
dimensions that you don't have because the device is just sending the ID and

30:49.380 --> 30:53.580
the metrics and the measurements or an event like this event happened

30:53.580 --> 30:59.620
let's say so you're right we didn't do any proper cost analysis you're right sorry

30:59.620 --> 31:02.620
a bit of an assumption but yeah just to add to that I think

31:02.620 --> 31:06.020
another thing is you're right you're adding complexity right at the

31:06.020 --> 31:09.780
front but why is it the right place to add complexity because for example if

31:09.780 --> 31:12.780
you're writing to something like Kafka and you want to change the type of

31:12.780 --> 31:16.180
aggregation you want to run on the exact same data over the

31:16.180 --> 31:19.620
last I don't know Kafka can store like multiple years with tiered storage

31:19.620 --> 31:22.340
so you want to run over the last one year and you want to change the way you

31:22.340 --> 31:25.820
analyze it Flink can do that for you you can read from Kafka from one year ago

31:25.820 --> 31:30.900
reprocess it write it into somewhere else if you're using it you can just plug and play

31:30.900 --> 31:36.660
right so the idea is that you can do it multiple ways but doing it here like

31:36.660 --> 31:40.220
I guess people are trying to sell this idea a bit more because of

31:40.220 --> 31:42.940
that you can change the way you analyze it and the analysis changes right

31:42.940 --> 31:46.660
you get new data on I don't know like one year down the road and you're like

31:46.660 --> 31:50.460
hey you know let me re-cut it in a different way right this gives

31:50.460 --> 31:57.980
you flexibility yeah go ahead do you find that you sort of push the

31:57.980 --> 32:05.100
problem back like further down the stack hello okay do you

32:05.100 --> 32:09.980
find that you know obviously cardinality explosions are very common in

32:09.980 --> 32:14.740
Prometheus but do you find that it creates a sort of an overhead

32:14.740 --> 32:18.700
of maintaining those kinds of rules and jobs like further down the

32:18.700 --> 32:22.820
stack you know do you have a sort of a rule

32:22.820 --> 32:27.780
explosion then you know maintaining this huge rule set I mean I

32:27.780 --> 32:31.620
find that with recording rules in Prometheus which is the way a lot of people

32:31.620 --> 32:36.300
tackle this kind of thing that you just end up maintaining this massive rule

32:36.300 --> 32:41.020
set so do you find that with the Flink jobs or is that a thing yeah so

32:41.020 --> 32:44.860
if I understand the question correctly so one problem we face when we do

32:44.860 --> 32:48.740
all this processing in Prometheus which you can do you can do like a rule set that

32:48.740 --> 32:51.580
you can kind of have the right metrics but then after a while it starts

32:51.580 --> 32:54.580
exploding right you have lots and lots of rules and you don't remove the rule

32:54.580 --> 32:58.260
because maybe something later on is gonna use it so you get a huge list so

32:58.260 --> 33:01.940
with Flink I guess you can end up in that situation I guess the difference is that

33:01.940 --> 33:05.780
maybe in Flink like I guess you're accepting the fact that you're gonna

33:05.780 --> 33:08.780
code something you can write it in SQL you can write in

33:08.780 --> 33:12.980
Python you can write in Java but the thing is then you have the flexibility of

33:12.980 --> 33:16.620
code not a config file you know the flexibility of testing that code you have the

33:16.620 --> 33:20.740
flexibility of integrating it with dummy data and then

33:20.740 --> 33:24.500
enforcing the data interface so that whenever you have

33:24.500 --> 33:27.580
this type of data you're gonna get this kind of result so it allows you to have

33:27.580 --> 33:31.460
some confidence I guess it does add complexity sorry not gonna hide it it is

33:31.460 --> 33:35.020
complexity but it adds complexity in the right place right where you can

33:35.020 --> 33:38.260
then enforce it against an API then we can talk about data contracts we can talk about

33:38.260 --> 33:41.060
different things Flink is not the only solution you can use something

33:41.060 --> 33:44.100
like Spark which maybe you're more familiar with but we can talk about the

33:44.100 --> 33:47.300
differences between Spark and Flink later on as well the batch versus streaming

33:47.300 --> 33:53.220
stuff you know so if you need to add rules you need to add them somewhere so

33:53.220 --> 33:58.780
you either put them in PromQL or in your Flink application in the

33:58.780 --> 34:03.100
different forms I'm saying PromQL just to simplify or you move it over there

34:03.100 --> 34:09.180
it's up to you the good news is well in Flink you can unit test good luck

34:09.180 --> 34:15.460
unit testing what you do on the other side talking about long-term storage can you

34:15.460 --> 34:21.220
write directly can you yeah for long-term storage can you write directly into

34:21.220 --> 34:26.220
Thanos I didn't really get the question sorry because it's very

34:26.220 --> 34:29.820
noisy you're asking about long-term storage yeah long-term storage and can you write

34:29.820 --> 34:34.140
directly into Thanos as opposed to going into a Prometheus and then going into

34:34.140 --> 34:41.660
Thanos so sorry it's really noisy write directly into Thanos oh into

34:41.660 --> 34:49.340
Thanos yes sorry well effectively you are because the remote write the

34:49.340 --> 34:52.500
remote write interface is just an interface on top of whatever is the

34:52.500 --> 34:57.300
back end so this is actually what we were doing we were using just for

34:57.300 --> 35:01.180
simplicity the managed Prometheus which is actually based I think it's based oh

35:01.180 --> 35:04.500
I'm not sure what it's based on but it's not just you know standalone

35:04.500 --> 35:09.300
Prometheus so the answer is of course you can of course you can keeping the data

35:09.300 --> 35:12.780
there is for different use cases because you keep it there you do long-term

35:12.780 --> 35:17.100
analysis you train machine learning you do whatever you need but it's it's

35:17.100 --> 35:22.580
parallel so your dashboard is fed by Prometheus you know how to do it and you have the

35:22.580 --> 35:27.260
other data I think you've heard it all you mentioned that the

35:27.260 --> 35:32.820
Prometheus sink implemented the remote write version 1.0 are you planning

35:32.820 --> 35:35.660
to also implement support for the remote write version 2 which was

35:35.660 --> 35:41.340
introduced with Prometheus 3.0 oh we are planning to as soon as it will be you know

35:41.340 --> 35:45.380
stable and ready I think it's declared stable like we'll probably hear in the

35:45.380 --> 35:49.260
next talk but is it okay it's already declared stable that's good

35:49.260 --> 35:53.060
oh then we will keep it updated

