WEBVTT

00:00.000 --> 00:12.440
Okay, hi everyone. For the next talk, we have Steven, who will be talking about

00:12.440 --> 00:17.000
document conversion, which I think shouldn't be that hard.

00:17.000 --> 00:22.440
Thank you. This talk was originally planned to have a co-speaker, but he couldn't

00:22.440 --> 00:24.640
make it because he fell ill.

00:24.640 --> 00:31.800
So it's just myself. I'm Steven, I work on a project called DocSpec, and I'm going to talk

00:31.800 --> 00:36.040
about that project.

00:36.040 --> 00:41.440
DocSpec originates from the Dutch government, and we have an issue there because we have quite

00:41.440 --> 00:49.800
a lot of documents in government that are not accessible, that are not interoperable, they're

00:49.800 --> 00:56.520
hard to convert, and under Dutch law and European law, it's basically a requirement

00:56.520 --> 01:02.480
that any information you make public as a government should also be public for people

01:02.480 --> 01:08.360
who, for example, can't read or have problems with eyesight, or have a different reason

01:08.360 --> 01:15.440
why they can't access information in the same format, and we took on the task at

01:15.520 --> 01:21.120
Logius, which is part of the Dutch Ministry of the Interior, to turn them into accessible

01:21.120 --> 01:26.680
and reusable HTML.

01:26.680 --> 01:32.920
So you might think the obvious solution is to just use Pandoc, and that actually worked

01:32.920 --> 01:45.240
for a while, until it didn't. Pandoc is a very good conversion tool with a lot of features,

01:45.240 --> 01:50.200
it supports a lot of formats, but if you need specific features or you need

01:50.200 --> 01:54.640
a specific form of output, then it's harder to use.

01:54.640 --> 02:01.000
So we built around it. We created a piece of software that we first called P4;

02:01.040 --> 02:12.000
P4 is a Pandoc pre-processor. We also ended up doing post-processing, which was

02:12.000 --> 02:15.880
actually more complex than doing the conversion ourselves. As you can see, we have the

02:15.880 --> 02:22.240
input, a pre-processor before it goes into Pandoc, and when it came out of Pandoc, we

02:22.240 --> 02:29.280
used the same kind of software to post-process it, to clean it up. Pandoc would do the

02:29.280 --> 02:35.240
core conversion, the pre-processor would normalize things to make it a bit easier for Pandoc, and

02:35.240 --> 02:43.880
post-processing would fix it up and put it in the format that we actually wanted. But you

02:43.880 --> 02:49.440
lose control in this stack, because you basically use Pandoc as a black box that you

02:49.480 --> 02:57.960
pre-process and post-process around. It's complex, and the annoying part is that every new requirement

02:57.960 --> 03:04.680
in our application requires a change in Pandoc filters, or in the input we put in,

03:04.680 --> 03:10.360
or in the output we get out and need to process, so you basically keep on making brittle

03:10.360 --> 03:17.440
changes. As for the core issues: of course, Pandoc is primarily a command-line interface;

03:17.440 --> 03:24.080
it does have a web server, but that's not feature-complete, and it's for sure not designed for

03:24.080 --> 03:31.600
collaborative editing, and that's what I'm going to show you next. But first:

03:31.600 --> 03:38.920
we designed an AST, based on JSON, that we can convert to and that sits in between the input and

03:38.920 --> 03:45.000
output, and yeah, we do the conversion ourselves. You can find a recent version of the

03:45.000 --> 03:53.400
AST at this link. I'm now actually going to show something and what it does, so you have

03:53.400 --> 04:07.760
a better picture of it. So our project was basically a tool to make documents more accessible,

04:07.760 --> 04:14.360
you could upload a document in our editor, and we would have validation on things like

04:14.360 --> 04:27.520
headings or machine objects. You can fix them, and this would disappear, and then we can

04:27.520 --> 04:38.480
convert it back to other formats, and a user can, for example, just upload it to a content

04:38.480 --> 04:44.040
management system or another system. What is very cool about this, and I will zoom back a

04:44.080 --> 04:50.040
little bit, is that the validation message you see here is also basically an output of the

04:50.040 --> 04:57.000
converter we built. We created an input reader for the Word document file, and we also

04:57.000 --> 05:03.280
created an output for the editor, so you can convert to the editor and from the editor back

05:03.280 --> 05:12.840
to other formats. This was mostly centered on accessibility issues around documents,

05:12.840 --> 05:23.040
as a tool that would help governments make documents accessible. And as is tradition

05:23.040 --> 05:32.840
in this room, I will also give a shout-out to BlockNote and La Suite. We also built this tool

05:32.840 --> 05:41.560
for BlockNote, basically, so you can use it to import documents into BlockNote in La Suite;

05:41.560 --> 05:45.880
it happens in the same manner, so the DOCX reader is always the same; it doesn't

05:45.880 --> 05:52.000
matter if you export to HTML or to another content management system. This was merged to production

05:52.000 --> 06:05.680
last week, and it helps to migrate from Microsoft Word in general, of course. It works kind

06:05.680 --> 06:16.320
of the same way as you saw in the previous demo: you just upload a document, it's being

06:16.320 --> 06:29.880
converted, and then you can basically manipulate the content that was in the original

06:29.880 --> 06:58.620
document. So in order to convert, we created an AST based on JSON, in which you can define

06:58.620 --> 07:05.060
elements. It's very similar to other ASTs, such as the BlockNote one we heard earlier

07:05.060 --> 07:11.980
about. It's also typed: I use TypeSpec nowadays to type it, and you can basically create

07:11.980 --> 07:16.940
your own emitter to other programming languages, which means that you only have to maintain one

07:16.940 --> 07:26.180
spec, and you can use it in other languages. I wrote the code base of the current converter

07:26.180 --> 07:34.580
in Elixir, so that's also the language I currently convert my types to. Basically, with

07:34.580 --> 07:42.020
this AST you can easily map to BlockNote or Tiptap elements. The goal with this

07:42.020 --> 07:49.620
spec was really an AST that can describe any and every element. That means if it happens

07:49.620 --> 07:55.740
that your input document has a strange element inside of it, or a strange layout, we want

07:55.740 --> 08:04.940
to be able to describe it, so we can at least try to convert it with as little loss

08:04.940 --> 08:14.700
as possible. We currently focus on DOCX. It is possible to also convert PDFs, but that's

08:14.700 --> 08:21.420
a bit more of a research topic, and has to do with machine learning. We convert to

08:21.420 --> 08:28.540
editors like BlockNote and Tiptap, and to formats like HTML and EPUB. As for what is planned, you

08:28.540 --> 08:35.740
can see in this chart, though I'm not sure if it's very visible, basically what we

08:35.740 --> 08:41.980
want to implement this year and what we have implemented. So currently, for input, we support

08:41.980 --> 08:48.620
DOCX and Tiptap, but we also want to do HTML, Markdown of course, and OpenDocument;

08:48.700 --> 08:54.060
we also want to import from BlockNote, so you can basically export your BlockNote documents

08:54.060 --> 09:01.820
to other formats through our system as well. We want to build an export to DOCX, Markdown and everything;

09:01.820 --> 09:10.060
this is the plan for the coming year. And we want to go to PDF output using Typst,

09:10.060 --> 09:19.820
so Typst as in the tool that basically renders out your documents as PDFs. And the direction I also

09:19.820 --> 09:26.380
really want to go in is rewriting the Elixir code in Rust, which also means you would be able

09:26.380 --> 09:34.540
to run it in browsers with WebAssembly. You'd be able to run it as a command-line interface, which is

09:34.540 --> 09:45.340
also currently possible; as an API, which is also currently possible; as a queue worker; as a library, of course;

09:47.740 --> 09:53.580
and with FFI bindings for any language. This would practically mean that you can use this in any

09:53.580 --> 09:59.180
project, you can use this in any programming language, and most importantly, you don't need a

09:59.260 --> 10:04.620
server: if you are an editor that needs file conversion, you can just do it in the browser,

10:05.340 --> 10:11.420
which is quite important for end-to-end encrypted systems, because they can't expose

10:12.380 --> 10:16.060
documents to the server, because that would basically breach the end-to-end encryption.

10:19.260 --> 10:23.260
And that makes it quite interesting for projects like CryptPad and NextGraph,

10:23.660 --> 10:30.700
where the server can't see the document contents, and by ensuring the conversion happens

10:30.700 --> 10:38.700
client-side, we basically preserve the privacy of the user. It would mean that you can

10:38.700 --> 10:43.420
convert without back-end infrastructure, which would reduce latency and would be close to real time,

10:44.140 --> 10:52.940
and, as I said three times, end-to-end. I have some URLs here where you can see the projects:

10:54.380 --> 11:00.060
you can see La Suite Docs, you can see the DocSpec project itself, you can see NLdoc, which is the

11:00.060 --> 11:06.700
old code base that we use at Logius, part of the ministry, and I added links, of course,

11:06.780 --> 11:18.380
to Pandoc. All right, this was my talk. You can see my contact details here in case you

11:19.340 --> 11:22.940
want to reach out, and I would like to take your questions.

11:24.940 --> 11:28.700
I have a first question. So when we're talking about Word conversion,

11:28.780 --> 11:37.260
the question is how far can you go, because Word is a very large format with a lot of

11:37.260 --> 11:43.420
features, so what are the limits in terms of what is supported, what is not supported,

11:43.420 --> 11:53.020
and probably more importantly in the export? And linked to that, BlockNote has recently gained

11:53.100 --> 12:00.540
comments and suggestions. Are you planning to support comments and suggestions in the Word format?

12:02.380 --> 12:07.580
That's the problem. Let's take your first question first. Can you repeat your question

12:07.580 --> 12:13.500
for me? Yeah, I also know today. Yeah, the first question was how far can you go on Word

12:13.500 --> 12:19.100
import? Yeah, how far can we go on Word import? You can basically go as far as the

12:19.100 --> 12:26.700
budget reaches, but that doesn't answer your technical question. It's quite a hard format, because there are

12:27.980 --> 12:34.380
quite a collection of variations in the format that you can use to describe the exact same element,

12:35.260 --> 12:44.380
and that's quite intense. There will always be some cases that you can't really cover,

12:45.260 --> 12:49.180
and you basically have to work with a lot of test documents to import.

12:53.180 --> 13:01.340
I think we're almost on the level of Pandoc, almost. I think they do a little more with footnotes

13:01.340 --> 13:07.900
and endnotes, which I don't cover yet, but that is, yeah, planned to be covered. Your second question

13:07.980 --> 13:15.900
was about comments in BlockNote. It would be very feasible. I haven't seen their

13:15.900 --> 13:21.740
implementation yet, so I'm not sure how hard it would be. I think their implementation

13:21.740 --> 13:31.980
is quite new also. Maybe also adding to your question: Pandoc also covers some other

13:32.540 --> 13:41.900
formats, like Org mode and also LaTeX. Are you planning to have

13:41.900 --> 13:49.820
feature parity on this side? Yeah, the question was: Pandoc covers other popular formats,

13:49.820 --> 13:56.940
like Org mode and LaTeX; am I planning to be basically on par with its features? It's hard,

13:57.020 --> 14:04.300
because they support so many formats. It would take at least a full year to actually be on par,

14:05.500 --> 14:13.580
at least. I try to cover as many formats as possible, but I focus on the more popular ones, because they are

14:13.580 --> 14:26.460
more important. I'm not a 100% sure, I just do the really good. You said that you have millions of

14:26.620 --> 14:33.900
old documents. Is the idea to convert them all into, you know, a more modern format, or is it just

14:33.900 --> 14:42.300
so that people can still access the old formats, because if it's the first one, maybe there should

14:42.300 --> 14:50.060
be some automated batch process doing the converting. Yeah, the question was about having

14:50.140 --> 14:57.900
millions of documents. This was the case in the Dutch government, and your question was

14:57.900 --> 15:05.260
to have it... sorry? Are you intending to convert them all, which would mean running plenty of

15:05.260 --> 15:11.660
them through this, instead of doing what you showed, you know, manually? Yeah, so the question is,

15:11.740 --> 15:20.700
are we planning to convert them all? The Dutch government has quite a lot of documents. It's also

15:20.700 --> 15:26.300
hard to have one pipeline for it, because it's also separate governments. It would be quite feasible to

15:26.300 --> 15:36.460
have a pipeline, but this product that I showed was really focused on the user basically doing the work,

15:36.540 --> 15:43.100
the user being guided, and that's quite necessary, because the source isn't always accessible:

15:44.220 --> 15:49.580
heading structures are not always in the correct order. So that either means that we need to

15:51.580 --> 15:59.500
make guesses about what it should be, or we should just ask the user. The hard part of the

15:59.500 --> 16:05.820
pipeline is that it won't all be accessible. You would need to validate it, it would need to let

16:05.900 --> 16:12.380
you choose, to let the user choose, so as to make it accessible.

16:13.660 --> 16:18.380
I have another question. In your list of potential future features, you mentioned PDF

16:19.980 --> 16:27.900
as output; I'm not sure if it was input also. Yeah. Why focus, for example, on PDF output,

16:27.900 --> 16:33.900
when there are quite a lot of solutions that already exist, including just printing, for example,

16:33.900 --> 16:41.580
or or, or so, lots of features that already exist to convert the HTML media for

16:41.580 --> 16:46.460
problems. Yeah, but I think there's something good to do. Yeah, the question was, why

16:46.460 --> 16:56.300
focus on PDF output, because there are quite a lot of solutions. I would say it depends on money.

16:56.860 --> 17:03.500
This is interesting for La Suite, so that's why it would be interesting for me to build.

17:03.500 --> 17:05.500
That's the short answer.

17:08.380 --> 17:10.380
Yeah, I'll explain.

17:27.260 --> 17:42.220
I've looked at unifiedjs. I've looked at unifiedjs.

17:42.220 --> 17:45.260
They're seeing objects, but I will after I'll just talk now.

17:45.260 --> 17:47.260
Thank you.

17:51.500 --> 17:55.740
It's great to hear this. We actually did something like this 20 years ago, so we decided to take a first

17:55.740 --> 18:01.580
route and sort of handle the optimised word and made it for us to stay alive and to come there.

18:02.380 --> 18:07.580
So I learned a bit about the last few format. We'll just talk in the docics to look at it.

18:08.620 --> 18:13.660
But are you confident that you can ever become feature-complete? Connecting to the first question:

18:13.740 --> 18:17.660
Are you not always in the life of someone who's trapped, so you speak about that always

18:17.660 --> 18:22.780
to another breaking feature, and so you can't quite get there, because I don't even seem to say,

18:22.780 --> 18:27.340
this is not a problem. Yeah, the question: am I confident that I can be feature-complete?

18:28.540 --> 18:35.660
Basically, no. Well, yeah, DOCX contains so many specifications that it's practically impossible.

18:36.620 --> 18:43.980
I do think it will be feasible to convert most of the documents that users make

18:44.620 --> 18:48.620
fully, because they won't use the entire set of options available.

18:50.220 --> 18:55.500
Being fully on par, with feature completeness on the entire spec:

18:56.220 --> 18:58.940
the spec is so big that it's practically impossible.

18:59.900 --> 19:05.580
That's what we just want to provide for the users' set of templates that you will speak

19:05.580 --> 19:15.020
template for. Yeah, using templates makes it easier, because you actually control the inputs,

19:15.020 --> 19:18.940
so it also makes it easier to convert, because you know what to expect.

19:20.940 --> 19:25.980
And the result, does it include some sort of check of provenance and approval of it?

19:26.940 --> 19:28.300
So, what do you mean?

19:28.300 --> 19:34.540
You make a transformation of content. Does that result, the content, does it contain

19:34.540 --> 19:40.540
metadata that says its origin is here, and it was transformed by this and this and, I

19:40.540 --> 19:47.180
guess, that it was converted correctly? As for metadata, it does include some metadata

19:47.180 --> 19:53.980
at the moment, I think authors. I mean metadata generated by your transformation

19:54.300 --> 20:02.700
saying that it is done correctly. No, it does not describe if it's done correctly. But do you see a

20:02.700 --> 20:07.500
need for something like that? Sorry? Do you see demand for anything like that?

20:08.540 --> 20:14.940
Yeah, it would be very interesting to have something that also checks if the conversion was done

20:14.940 --> 20:20.220
correctly, but I'm not sure how this would work in practice.

20:20.940 --> 20:24.940
It sounds like you'd need a second application to check if the conversion is correct.

20:27.500 --> 20:33.260
But it is, I've heard it before in government circles, especially where they want

20:35.180 --> 20:40.460
the conversion as correct as possible. It could be interesting, but I'm not sure how it,

20:40.460 --> 20:42.060
what it would look like in practice.

20:50.540 --> 20:51.020
Sorry?

21:03.020 --> 21:06.300
Yeah, the question was: do I provide round-trip conversions?

21:08.300 --> 21:08.700
Sorry?

21:09.740 --> 21:10.460
It didn't happen.

21:16.460 --> 21:18.300
What exactly do you mean by that?

21:20.220 --> 21:22.780
I don't know if there's anything that changes, and then it's sent back.

21:22.780 --> 21:25.900
And really, other parts, which you don't support, still exist, like

21:25.900 --> 21:27.260
nested tables or whatever.

21:29.260 --> 21:33.660
It really depends on the output format, on what it actually supports.

21:34.380 --> 21:35.900
Not sure if it answers your question.

21:36.940 --> 21:39.980
But, like, is there actually some support for just keeping the data

21:39.980 --> 21:40.940
that you don't understand?

21:41.740 --> 21:45.580
Oh, do I try to support keeping the data that I don't understand?

21:46.540 --> 21:49.500
At this moment, no, but it would be interesting,

21:49.500 --> 21:53.820
but it would also be hard to create a spec around that.

21:54.540 --> 21:58.220
But it could be quite interesting, for example, if you later improve

21:58.940 --> 21:59.900
the document conversion.

22:00.540 --> 22:05.020
Because then you can basically convert it again with the current knowledge.

22:05.580 --> 22:07.260
But no, I don't do that at this moment.

22:11.900 --> 22:13.900
Okay, do we have any questions?

22:16.300 --> 22:17.340
No?

22:19.020 --> 22:19.980
Okay, so thank you.

22:28.300 --> 22:28.780
Sorry.

22:28.780 --> 22:30.780
Okay.

