WEBVTT

00:00.000 --> 00:27.400
So, hi everyone. Welcome. Thanks for being here. This is my first presentation. I hope

00:27.400 --> 00:34.400
I don't mess it up. The reason I wanted to give it is because I — sorry.

00:34.400 --> 00:41.000
So, the reason I wanted to give this presentation is that, first of all, I wanted to present

00:41.000 --> 00:46.600
Mirror Hall, which is an application to use other Linux devices as kind of peer-to-peer

00:46.600 --> 00:53.000
extended monitors over the network — so you, like, use them as virtual monitors. And the reason

00:53.000 --> 00:58.720
I wanted to present it is that I think it's interesting how this kind of

00:58.720 --> 01:04.400
layer was built, so it's worth going a bit deeper into it, and we'll explain how to basically

01:04.400 --> 01:10.920
remake your own virtual display sharing solution at home. So, I just wanted to start with

01:10.920 --> 01:16.880
a bit of history: in the 90s, personal computers became more popular, and people

01:16.880 --> 01:22.200
started promoting mobile devices and mobile development. One of the core ideas at Xerox

01:22.200 --> 01:29.200
PARC, which was the research institute Apple later got its ideas from (long story),

01:29.200 --> 01:35.080
was to get those devices to seamlessly interact with each other. So, basically, you

01:35.080 --> 01:39.800
had a portable device that could be connected to a larger one to use the larger one as a

01:39.800 --> 01:45.960
monitor and all that kind of stuff. Why did that never happen? Why are devices still so hard

01:46.040 --> 01:53.440
to interface with each other? That is because the industry never favored standards for peer-to-peer

01:53.440 --> 01:59.640
communication, for open communication between proprietary platforms, but everyone went for their own

01:59.640 --> 02:07.640
personal implementation in a way, so convergence and peer-to-peer sharing and wireless

02:07.640 --> 02:14.640
desktop sharing never really became a thing. So, to be fair, we have a lot of solutions

02:14.720 --> 02:20.760
that work really well for simple wireless desktop mirroring on Linux, such as Moonlight or

02:20.760 --> 02:26.600
Sunshine, which are mostly meant for games. I never used them, I just know they're good.

02:26.600 --> 02:32.360
GNOME Network Displays, which I did use, and it's pretty good for, like, Miracast and Chromecast. But

02:32.360 --> 02:37.560
we don't really have anything for more interesting kinds of mirroring. So, first of all, I think

02:37.560 --> 02:41.600
we're all familiar with how mirroring works under the hood. It's basically: I have my

02:41.680 --> 02:47.120
screen here. This is a display buffer. We record that screen. We reuse the same buffer.
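
NOTE
The record → encode → transmit → play loop described around here can be sketched with GStreamer. The host, port, and encoder below are illustrative placeholders, not what Mirror Hall actually ships.

```shell
# Capture the visible screen (ximagesrc is X11; a Wayland session would use
# pipewiresrc instead), encode it with x264, and send it as RTP over UDP.
gst-launch-1.0 ximagesrc use-damage=false \
  ! videoconvert \
  ! x264enc tune=zerolatency \
  ! rtph264pay pt=96 \
  ! udpsink host=192.168.1.50 port=5000
```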

02:47.120 --> 02:52.880
We encode it, we transmit it over the network, we play somewhere else. But what if we instead

02:52.880 --> 03:00.080
wanted to not mirror the same screen, but mirror another display buffer, then in that case,

03:00.080 --> 03:06.000
we basically have to interact with the backend, to collaborate with the backend, which

03:06.080 --> 03:11.280
means we somehow create a virtual display, a virtual buffer, and then that is the one

03:11.280 --> 03:17.440
we aim to record, the one we aim to stream. And that is something that is not clearly

03:17.440 --> 03:23.760
standardized. So, one solution that kind of does that — a proprietary solution — is Apple Sidecar,

03:23.760 --> 03:29.680
which I also never used, but it's where you extend your Mac screen, you use your

03:30.080 --> 03:37.360
iPad as a second screen — some iPads, because you need the expensive ones. But for doing this

03:37.360 --> 03:44.560
on Linux, no existing solution would have worked, because, for example, all those protocols

03:44.560 --> 03:49.360
are proprietary, so they are not easy to implement, and you also have pretty

03:49.360 --> 03:54.880
high latency, because most of them use TCP. Most of the implementations on the receiving

03:54.880 --> 03:59.360
end, like on your TV, are also software-based, so there is a lot of latency, because it's

03:59.360 --> 04:05.040
more important to get, like, accurate video than fast video, normally — and what if I wanted to

04:05.040 --> 04:10.640
kind of repurpose it? And also, it's really easy to stream from Linux outwards: you record a screen,

04:10.640 --> 04:15.440
you stream it, but it's really hard to then become a receiver for that stream currently,

04:15.440 --> 04:20.000
because, for example, to do it with Miracast, you need very strange patches, and you need

04:20.000 --> 04:28.000
to, like, stop Wi-Fi, to set up an ad hoc link — very, very complicated. So, starting from the virtual

04:28.000 --> 04:33.840
mirroring we saw before, I was wondering, like, what happens when you scale it up, like, what happens

04:33.840 --> 04:41.040
when you have different monitors, different devices? Can you still do some kind of virtual

04:33.840 --> 04:41.040
screen buffers? That really depends: on whether we have available video buffers, whether we have available

04:48.000 --> 04:52.800
hardware encoders — because, for example, when you use your hardware encoder for video,

04:52.800 --> 04:57.360
usually you can't really push more than one stream through at once, so you have to do, like, software

04:57.360 --> 05:04.240
encoding for some and hardware encoding for one. And what happens when we make it,

05:04.240 --> 05:12.720
like, bidirectional, so that we can turn every device into both a receiver and a streamer,

05:13.520 --> 05:22.560
so that I can just decide. And that is basically what I wanted to do for a long while, so in 2020,

05:22.560 --> 05:31.840
GNOME introduced a really cool headless native backend and virtual monitors API, which is

05:31.840 --> 05:38.640
basically: you ask Mutter to create a virtual monitor on Wayland, and it does it without any kind of

05:38.640 --> 05:45.360
hacks, and you can have stuff. I made my first prototype in 2022 using those APIs and a bunch

05:45.360 --> 05:52.400
of Python stuff, and with that I was using my Librem 5 as a wireless screen — like,

05:52.400 --> 06:00.240
a wireless extension, just to have it next to me to keep track of stuff. And then I made the first

06:00.240 --> 06:11.600
public setup in November 2023, and the first release just a month ago. So, basically, Mirror Hall

06:11.600 --> 06:21.040
looks like this right now. I have to be really honest with you: I broke my test, like, my test

06:21.040 --> 06:27.360
tablet this morning, so I have an alternative one, which I hope works as well. But basically,

06:27.360 --> 06:35.440
this is how it looks. I'll open it here as well. It uses mDNS internally, the same protocol as, like,

06:35.440 --> 06:43.680
DLNA, I think, Chromecast for sure. So right now, I basically turn this device into a mirror —

06:43.680 --> 06:51.680
it's kind of not very practical, to be honest, but — oh god, I hope this is set up correctly, because I'm

06:51.680 --> 07:08.320
not, no, there's no network. Never mind, so this device apparently doesn't, it doesn't

07:08.320 --> 07:17.200
have a working network, but it had 30 minutes ago when I flashed it, so anyway, let's do a very

07:17.280 --> 07:26.880
boring thing then. Let's spawn up another local instance — I hope this works a bit better.

07:29.200 --> 07:36.880
And in theory, if I do this, we should be able to get something out of it.

07:39.520 --> 07:44.320
Okay, this is not really clear. What's happening? Because that's the same device.

07:45.200 --> 07:50.480
I'm really sorry, okay, this was not really, but basically what we have here is that we can kind of,

07:53.120 --> 07:58.720
kind of, yeah, okay, you see what's happening, it's like, it's literally mirroring itself

07:58.720 --> 08:03.760
in loopback mode, which is not very meaningful. I wanted it to do better, but I'm sorry,

08:03.760 --> 08:18.240
I'm also trying to project it. So that didn't succeed. Thank you, Mutter — those are very,

08:18.240 --> 08:26.800
very experimental APIs for a reason. Can I, can I, can I mirror, oh god, it is a mirror mode, but it's

08:26.800 --> 08:34.800
not really, yeah, okay, this is the only way I can do it, apparently. So you didn't see anything,

08:34.800 --> 08:41.760
basically, you didn't see, okay, yeah, this was, this was the idea, like you can basically switch

08:41.760 --> 08:49.360
between it and now it went somewhere, somewhere in the, what, what the, what the actual, okay,

08:49.360 --> 08:58.320
well, like this is not, what the fuck? Like this, this is right, okay, I never tested this

08:58.320 --> 09:02.720
presentation using multiple screens, and this is the multiple-screen API we're trying to

09:02.720 --> 09:10.160
test — maybe it wasn't a great idea, okay, anyway. So the way this works is basically that we have

09:10.160 --> 09:14.240
Mirror Hall, which is this player and streamer app, which is just the dedicated UI,

09:15.120 --> 09:20.560
and below we have three small libraries. There is libmirror, which detects the desktop environment

09:20.560 --> 09:26.000
that you're using and uses D-Bus to communicate with Mutter — to negotiate a virtual

09:26.000 --> 09:31.920
monitor, to create it — and then gets a PipeWire stream back. Then libcast basically has a tiny

09:31.920 --> 09:36.640
database of pipelines so that it can use the hardware acceleration on your device to generate a fast

09:37.120 --> 09:44.400
encoding pipeline, and we use raw UDP packets — not RTSP, not any kind of wrapper protocol —

09:44.400 --> 09:51.760
because it's way faster I know and then it basically takes the stream that you just created and

09:51.760 --> 09:59.600
feeds it onwards — okay, yeah, okay, you can see it. And then we have libnetwork, which instead

09:59.600 --> 10:06.400
does stuff like mDNS — so the thing that lets you see the device in the list; in theory,
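
NOTE
Discovery of this kind can be reproduced with Avahi's mDNS/DNS-SD tools. The service type `_mirrorhall._udp` is made up for illustration — it is not necessarily the type the app registers.

```shell
# Receiver side: advertise a service instance on the LAN over mDNS.
avahi-publish -s "living-room-screen" _mirrorhall._udp 5000 &

# Sender side: browse for receivers and resolve their address and port.
avahi-browse --resolve --terminate _mirrorhall._udp
```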

10:06.400 --> 10:12.640
when you plug a device in, it would also sit there — but the network interface is down; basically, that would be the

10:12.640 --> 10:21.280
idea. So that's, yeah, basically it: libmirror talks over D-Bus to Mutter, gets back the stream through

10:21.280 --> 10:27.920
PipeWire, then you encode the stream using a pipeline and you transmit it, and the receiver

10:27.920 --> 10:32.560
part is a bit simpler because it's basically a video player, it's basically a network video player,

10:32.640 --> 10:41.120
you can also use RTSP as a fallback, which you will see in a second, thank you. And yeah,

10:41.120 --> 10:47.520
because we also thought if you really don't care about latency, if two seconds of latency is enough for

10:47.520 --> 10:56.240
you, then RTSP can also work. And there is a CLI, so that you can basically type a very long

10:56.240 --> 11:02.000
gst-launch command without having the whole app installed, and use a device as a receiver

11:02.000 --> 11:08.400
via SSH without installing anything. Okay, so now for the more technical part:

11:09.920 --> 11:17.520
I wanted to go a bit deeper into how this works — not to talk about mDNS and that

11:17.520 --> 11:21.680
kind of stuff, but first of all about the fact that it only supports Mutter, unfortunately,

11:22.320 --> 11:27.760
because, as far as I know, there is no other standardized API, except maybe in Sway,

11:29.440 --> 11:34.960
and KDE 6, someone told me, has apparently introduced something similar, or is planning to,

11:34.960 --> 11:41.040
so... But in theory, once there are APIs, this library is really easy to extend, because it's basically

11:41.040 --> 11:46.400
just detecting your configuration and creating more D-Bus calls.

11:47.360 --> 11:55.040
So, the way this works — and yeah, Mutter... I think we have a tiny bit of time — but you don't see

11:55.040 --> 12:05.360
my screen anymore, which is not great. It's basically here we would have this because I can't see my screen.

12:05.360 --> 12:18.640
Can we make this larger? Okay, we have the Mutter ScreenCast API, and here there should be

12:19.680 --> 12:24.400
ScreenCast — so what we basically do first of all is we create a session.
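
NOTE
The session dance described here (and the RecordVirtual/Start steps the demo walks through next) can be driven from a terminal with `gdbus`. The session object path below is an example of what CreateSession returns — yours will differ.

```shell
# 1. Create a screencast session (empty options dict).
gdbus call --session \
  --dest org.gnome.Mutter.ScreenCast \
  --object-path /org/gnome/Mutter/ScreenCast \
  --method org.gnome.Mutter.ScreenCast.CreateSession '{}'
# Returns something like: (objectpath '/org/gnome/Mutter/ScreenCast/Session/u1',)

# 2. Ask that session for a virtual monitor stream.
gdbus call --session \
  --dest org.gnome.Mutter.ScreenCast \
  --object-path /org/gnome/Mutter/ScreenCast/Session/u1 \
  --method org.gnome.Mutter.ScreenCast.Session.RecordVirtual '{}'

# 3. Watch for the PipeWireStreamAdded signal that carries the node id...
gdbus monitor --session --dest org.gnome.Mutter.ScreenCast &

# 4. ...then start the session; the signal fires with the PipeWire node id.
gdbus call --session \
  --dest org.gnome.Mutter.ScreenCast \
  --object-path /org/gnome/Mutter/ScreenCast/Session/u1 \
  --method org.gnome.Mutter.ScreenCast.Session.Start
```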

12:26.080 --> 12:31.280
It will probably want some parameters, but I think it was an empty array here, because it's

12:31.280 --> 12:39.200
like the array of options you're passing in. So here we receive back the name of the object that

12:39.200 --> 12:49.360
we're using and then we kind of go back to that object that we just received, which in this

12:49.360 --> 13:03.040
application is a bit laborious — we have ScreenCast again, and you see this newly made object, and

12:49.360 --> 13:03.040
here, under Session, we can do a RecordVirtual call that will basically create a stream object,

13:20.000 --> 13:26.080
and then we could already start the stream. The thing is, if we do, we wouldn't intercept the event

13:26.080 --> 13:34.400
that tells us where — when — the PipeWire stream appears. So to do that, we

13:34.400 --> 13:54.080
basically have to listen in using dbus-monitor, and in theory now, if we go back to the

13:54.080 --> 14:04.640
session that we created — we should receive a PipeWire stream here — and you also see this thing,

14:04.640 --> 14:09.280
this yellow thing which means basically that we are casting the screen. Okay you see now we

14:09.280 --> 14:23.680
received an event here, and if we use a pipewiresrc, which is the source component for

14:24.160 --> 14:38.640
it... So we saw just now that the node ID was — what was it, 115? Something like that — 105? In theory,

14:39.920 --> 14:50.000
here's our virtual display. So now you kind of see the loopback thingy again and yeah that's

14:50.000 --> 14:55.920
basically the interesting part: it is not too hard to get a virtual display out of Mutter, and

14:55.920 --> 15:04.240
yeah, that's basically it. Then, once we have it, we basically create this

15:04.240 --> 15:09.920
system of pipelines that we saw, which, if we divide it into components: we have the source,

15:09.920 --> 15:16.480
which picks up the frames; then we have queues for buffering; then we use, in this case, x264 — but again,

15:16.560 --> 15:22.400
it's chosen by the library; x264 on Intel right now, on this specific build, but

15:22.400 --> 15:30.800
basically, if you are using ARM, it will try to pick up a better one. And then, after we

15:30.800 --> 15:37.040
encode it into a format, we package it with the payloader, and

15:37.040 --> 15:44.000
then we send it, in this case using udpsink. And instead, the pipeline that we actually used
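
NOTE
A full sending pipeline of the shape just described — source, queue, encoder, payloader, UDP sink — might look like this. The node id, host, and x264 settings are placeholders; on other hardware a different encoder element would be swapped in.

```shell
# NODE_ID is the PipeWire node of the virtual display
# (the number delivered by the PipeWireStreamAdded signal).
NODE_ID=105
gst-launch-1.0 pipewiresrc path=$NODE_ID \
  ! queue \
  ! videoconvert \
  ! x264enc tune=zerolatency speed-preset=ultrafast bitrate=8000 \
  ! rtph264pay pt=96 config-interval=1 \
  ! udpsink host=192.168.1.50 port=5000
```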

15:44.080 --> 15:50.720
for the demo is just this simplified one, which just uses videoconvert and autovideosink, because

15:50.720 --> 15:57.600
that's easier. And the other way around is also very simple, so you can also do it from the CLI, and you

15:57.600 --> 16:05.680
basically see the command to replicate it outside the app — so it's just basically the same

16:05.680 --> 16:14.480
components in the opposite direction: udpsrc, an auto sink, and so on. So yeah, this next part is something I honestly
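
NOTE
The receiving direction, as a one-liner you could run on any machine with GStreamer but without the app installed. The port and caps are illustrative and must match the sender.

```shell
# Same components in the opposite direction: UDP source, RTP depayloader,
# decoder, and an automatically chosen video sink.
gst-launch-1.0 udpsrc port=5000 \
    caps="application/x-rtp,media=video,encoding-name=H264,payload=96" \
  ! rtph264depay \
  ! avdec_h264 \
  ! videoconvert \
  ! autovideosink sync=false
```

`sync=false` favours low latency over smooth playback.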

16:14.480 --> 16:21.360
just kind of glossed over, so maybe it's not super important, but yeah: for the hardware configurations

16:21.360 --> 16:28.960
we use, there should be a priority for, like, better over worse encoders and decoders, which we don't

16:28.960 --> 16:34.720
have right now. And right now we're also capping, for example, the stream quality, because we can't really

16:34.720 --> 16:40.400
negotiate it in real time like a really nice video protocol would — we're just kind of guessing the

16:40.400 --> 16:47.040
best configuration and we are using some additional configuration parameters to minimize the amount

16:47.040 --> 16:52.640
of computational effort that has to be spent encoding each stream — so it's maybe slightly larger,

16:52.640 --> 16:58.800
or, sorry, slightly less precise, but also faster if you have many streams at once.
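
NOTE
The trade-off described here can be experimented with on `x264enc`; the property names are real, but the values are illustrative, not the app's actual configuration.

```shell
# Cheap, low-effort encoding: zerolatency drops lookahead and B-frames,
# ultrafast picks the least analysis, and a fixed quantizer of 28 accepts
# slightly blurrier frames in exchange for less CPU per stream.
gst-launch-1.0 videotestsrc num-buffers=300 \
  ! videoconvert \
  ! x264enc tune=zerolatency speed-preset=ultrafast quantizer=28 \
  ! fakesink
```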

16:58.880 --> 17:06.560
And also, right now, a very big limitation — that is, I think, kind of annoying to most people —

17:06.560 --> 17:13.360
is that you have to manually patch your firewall to use it, because you need to authorize a bunch of

17:13.360 --> 17:22.960
UDP ports; that is something the Flatpak currently doesn't do. And yeah, another very big,
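
NOTE
On a firewalld-based distribution the manual step looks roughly like this; the port range is illustrative — check which UDP ports your build actually binds.

```shell
# Allow the incoming stream ports through firewalld (runtime, then permanent).
sudo firewall-cmd --add-port=5000-5010/udp
sudo firewall-cmd --permanent --add-port=5000-5010/udp

# Equivalent on ufw-based systems:
sudo ufw allow 5000:5010/udp
```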

17:22.960 --> 17:29.120
very interesting limitation is that, as soon as I released 0.1.0, there was a regression in

17:29.120 --> 17:39.040
GNOME — because nobody had used this API yet — that made Mirror Hall crash Mutter, which was quite interesting;

17:39.040 --> 17:44.400
I was like, okay, finally — that was like one week after the release, yeah. So I had

17:44.400 --> 17:52.080
to put up that very scary warning, and I don't really, yeah, recommend using it in production until

17:52.080 --> 17:59.040
we're sure it's stable, we're sure the API is well tested, and so on. So, for the next steps, I think

17:59.040 --> 18:05.200
what's important will be — first of all, I think this is many things at once, and it

18:05.200 --> 18:10.320
would be a bit more manageable if we split it up, because maybe many people don't care about the UI

18:10.320 --> 18:16.320
but care only about the headless endpoint that basically has a helper to create a virtual display

18:16.320 --> 18:23.840
on each platform — so we ideally want to split that up. We want to add encryption; that's very important,

18:23.840 --> 18:28.800
because right now the stream is not encrypted, so it's fine if I connect via USB, but if I use

18:28.800 --> 18:37.120
mDNS over the network, it's a bit more... I don't know, I wouldn't do it. And also —

18:37.120 --> 18:43.520
apparently a lot of the headaches, a lot of the heavy lifting of punching holes for UDP,

18:43.680 --> 18:51.280
trying to find the right route on the network, and so on, would be solved by using a pretty good

18:52.320 --> 18:58.560
peer-to-peer stack in Rust — I don't know how you pronounce its name — which is basically, yeah,

18:58.560 --> 19:05.680
peer-to-peer on UDP with automated hole punching and so on. And it would also be really nice to

19:05.680 --> 19:13.840
support input methods — in some way proxy the input from one device to another — but that is, I

19:13.840 --> 19:21.360
really... like, I looked it up, and for now it would take a lot of time. So that's all. Yeah, I want

19:22.080 --> 19:28.000
to thank all the people who helped me — that's them up there; there's Sonny, there's a few of the

19:28.000 --> 19:34.160
GNOME people, there's people I chatted with who helped me, like, understand the whole pipeline thing.

19:35.760 --> 19:41.280
and if someone is interested in this please let me know because I don't really have a lot of time

19:41.280 --> 19:46.800
to work on it — so, like, I'm just developing it very slowly. But if someone is, like, you know,

19:46.800 --> 19:52.080
interested in testing it on strange platforms and so on, that'd be really nice; just let me know if it

19:52.080 --> 19:58.960
breaks and why it breaks. So yeah, we can also meet in Berlin if you want. And that's, like, this

19:58.960 --> 20:11.200
presentation — and I have 30 seconds for questions? No? Yeah, 0 seconds, okay. If you want...

20:22.080 --> 20:37.120
Yeah, so basically we're using RTSP right now, it's a... can you repeat the question?

20:38.720 --> 20:47.440
I'm sorry. Okay, so the question was whether we can use Chromecast to repeat —

20:47.840 --> 20:53.920
to stream this, like, to use a fallback sink for Chromecast — and the answer is yes, you can do that.

20:54.560 --> 21:00.800
It would be very slow, like probably one second of latency, but you can do that, yeah. And that is

21:00.800 --> 21:05.360
one of the reasons why I wanted to split it — like, into the part that creates the display and the other...

21:05.360 --> 21:09.000
yeah so

