WEBVTT

00:00.000 --> 00:11.760
All right, our next speaker is Evan Rusackas, who's going to present how Apache Superset

00:11.760 --> 00:14.760
reinvented and re-engineered its documentation.

00:14.760 --> 00:16.520
Please give a warm welcome to Evan.

00:16.520 --> 00:22.040
Thanks for joining.

00:22.040 --> 00:26.840
Just curious if anybody has not heard of Apache Superset here, I'd love to see a hand.

00:26.840 --> 00:27.840
All right, fantastic.

00:27.840 --> 00:29.360
You've justified my existence.

00:29.720 --> 00:35.360
All right, this is a long title, but I wanted to share some of the needs and projects

00:35.360 --> 00:39.080
and learnings that have led to a better documentation set up for our project.

00:39.080 --> 00:42.680
I'm Evan Rusackas, I work at a company called Preset.

00:42.680 --> 00:49.920
It's like a managed service version of Apache Superset, and I'm a PMC member, which is a project management

00:49.920 --> 00:55.520
committee person, and I work on the docs, and I'm tired of doing it, and I read them

00:55.520 --> 00:58.880
sometimes, but mostly lean on AI, like everybody else these days.

00:58.880 --> 01:03.360
So trying to make things easier for those training models on our behalf.

01:03.360 --> 01:08.840
There's my contact info if you want to follow up afterward, but very quickly, this is not

01:08.840 --> 01:13.120
a product talk, but I'll just tell you what Superset is real quick.

01:13.120 --> 01:18.560
It's a very advanced BI tool for data teams that want to just democratize access to their

01:18.560 --> 01:23.960
data and visualize it in sensible ways and explore and share the insights they find.

01:24.040 --> 01:29.320
By GitHub stars, it's the Apache Software Foundation's biggest project. It's got a lot of contributors.

01:29.320 --> 01:34.920
There's been about 350 PRs this month, it's a very active project.

01:34.920 --> 01:38.600
And it's got a SQL workbench, which allows you to really connect to pretty much any data

01:38.600 --> 01:42.920
source under the sun, and write all kinds of queries, share them with your team, then you

01:42.920 --> 01:48.040
can drag and drop those columns and build visualizations, and from those visualizations build dashboards,

01:48.040 --> 01:51.920
and do all kinds of drilling and filtering and cool stuff with them.

01:52.720 --> 01:58.400
Then, of course, Preset, who was kind enough to fly me here, is a managed version of that

01:58.400 --> 02:03.520
that adds a bunch of bells and whistles, and you can have multiple instances of Superset

02:03.520 --> 02:08.880
for all sorts of different purposes. If you want to try Superset, this is an easy way to do it

02:08.880 --> 02:13.360
for free, not here to sell stuff, it's an open source conference. So let's get back to talking about the

02:13.360 --> 02:19.520
docs. You can check out the repo here, you can check out the docs themselves, it's all live on the site,

02:19.680 --> 02:26.080
obviously. And when we talk about rebuilding our documentation, this is one of the first questions

02:26.080 --> 02:32.800
that always comes up. Can't you just have AI do this? And we sure tried just for sport, but in the end,

02:32.800 --> 02:40.720
it did not go very well. You have AI that just kind of dumps out a whole bunch of garbage. Really,

02:40.720 --> 02:46.080
it's, you get all these fancy Mermaid diagrams, but all the unimportant details kind of come

02:46.160 --> 02:50.320
front and center, and all the little nuanced things that are very important to humans and administrators

02:50.320 --> 02:57.440
of this kind of product just get ditched, you know, somewhere, very deep down in the docs or

02:57.440 --> 03:02.080
they're hard to find. Sure, there's a lot of pages, a lot of words, but it doesn't really help people

03:02.080 --> 03:07.040
that much. So you're the writer, you know where the product is headed, you know what's important

03:07.040 --> 03:11.360
to people that use and administer your product. So you should have a lot of say in how these things

03:11.360 --> 03:16.640
are built. So the hot take here is that, yeah, you're the one that knows what your docs should be,

03:16.640 --> 03:22.560
and what they should say, and machines are really the ones that are good at writing code. And the

03:22.560 --> 03:28.480
point of this talk is that you can have AI write code to write your docs. So I wanted to share some

03:28.480 --> 03:32.400
of the hackathon experiments that I've been working on to kind of prove the point to myself and the team.

03:34.000 --> 03:39.520
So it all started with a little bit of road mapping. We had a project that

03:40.480 --> 03:45.840
we had to reinvent the docs for, I'll get into it. But step one is assessing the mess. In our world,

03:45.840 --> 03:52.880
we had all kinds of stuff that was scattered all over. We had wikis, we had, you know, emails and third-party

03:52.880 --> 04:01.040
blogs, and just README files in dozens of places on the repo. Everybody kind of created their own

04:01.040 --> 04:06.880
little scattered bit of historical information somewhere, and institutional knowledge was really

04:07.360 --> 04:13.360
what people would lean on, far too much. So the idea was to kind of clean all this stuff up,

04:13.360 --> 04:20.240
get it all under one roof, and make it better than it's ever been. The key problems to solve are

04:20.240 --> 04:27.360
that, you know, if you can't find everything, the on-ramp for new contributors is incredibly

04:27.360 --> 04:34.160
difficult, or for new users. Search is limited. AI even doesn't have that great of a singular

04:34.160 --> 04:39.360
knowledge base to refer to when it's doing training runs. There's a lot of duplication of effort

04:39.360 --> 04:44.080
because if there's multiple places, things are being documented, guess what you've got to do. And then,

04:45.360 --> 04:49.760
the worst part as a contributor is whenever you write a new pull request and get some code merged,

04:49.760 --> 04:53.200
now you have to go write the docs for that thing, and that's a total drag. Nobody wants to do it.

04:53.920 --> 05:00.400
So this all just becomes a giant mess, where the code base is a moving target, the docs can't keep up.

05:00.480 --> 05:07.680
So what do we do? You let that code base determine what you should do, really. You get everything

05:07.680 --> 05:13.520
under one roof, federate all your content first, get everything cross referenced, and then you try

05:13.520 --> 05:19.040
to get the docs to build themselves. Your code is probably full of little implementation details

05:19.040 --> 05:25.600
and metadata and all this stuff that's really useful. So we'll capitalize on that. And then, of course,

05:25.600 --> 05:31.280
you want to optimize for humans that actually read the docs, and then, of course, for the AI training

05:31.280 --> 05:36.480
models that people are building. You know, these foundational models are trained on open source,

05:36.480 --> 05:43.680
and that's a whole other topic. It's great. So I wrote up this big proposal. We call it a

05:43.680 --> 05:49.040
SIP, a Superset Improvement Proposal, and it was to build a new developer portal because we're building

05:49.040 --> 05:54.560
a whole new extension architecture in Superset, which is awesome. But for all these new features,

05:54.560 --> 05:59.200
we want to make sure that developers are using them so we have to go make it easy to do.

05:59.200 --> 06:05.520
First up, find the right platform. Another good use of AI is doing your homework for you and finding

06:05.520 --> 06:09.840
all the platforms that exist and where they fall short for us. It turned out that actually

06:09.840 --> 06:14.480
Docusaurus ticked all the boxes when you add all the fancy plugins. So we rolled with that.

06:16.240 --> 06:22.640
And speaking of getting rolling, sweeping up. First thing you get to do, you've got old docs.

06:22.720 --> 06:28.160
So go ahead and let AI sift through and find all the spelling mistakes, add all the cross links.

06:28.880 --> 06:33.760
Just, it does all the heavy lifting of pulling your wiki over and all that very easily. So you

06:33.760 --> 06:38.880
could just kind of get organized and have a good "here's all my stuff" version of the documentation.

06:40.480 --> 06:47.760
And then you've got to look for the fun part. The opportunities to make your documentation build

06:47.760 --> 06:54.480
itself. The code in many places is self-documented. So you want to look for those repeating patterns

06:54.480 --> 06:59.760
and probably rewrite parts of your code itself so that it can be leveraged by your documentation.

07:02.400 --> 07:08.800
AI is really great at turning metadata into pages, but not just saying here's some metadata

07:08.800 --> 07:13.040
spit out a bunch of words, but actually having it write the code for Docusaurus to render

07:13.040 --> 07:19.600
those pages or any other documentation tool you're using. So let it write scripts because we all

07:20.400 --> 07:24.880
are probably using AI to write code every day. I know I hardly write code at all anymore, to be

07:24.880 --> 07:36.480
honest. So just use it to do that. So first test, we have this mapping visualization

07:36.480 --> 07:41.920
in Superset, one of a few different map visualizations where we have to have all of the

07:41.920 --> 07:47.600
countries of the world represented and it parses a bunch of GeoJSON stuff. And if you go and

07:47.600 --> 07:53.200
mess with this gigantic Jupyter notebook, then you've got to go and update the actual plugin

07:53.200 --> 07:57.760
itself to add the country that you might have added and then you've got to go update the docs.

07:58.320 --> 08:06.560
So it's easy enough to actually have the Jupyter notebook update the code for the visualization

08:06.640 --> 08:12.080
plugin and update the docs. And I was like, oh, that's cool. Now the contributor could just

08:12.080 --> 08:17.120
do one little thing in a notebook and the product and the documentation just take care of themselves.

08:18.320 --> 08:24.240
So let's expand on that. Feature flags are something that were a nuisance.

08:25.040 --> 08:29.840
Previously we had this Markdown file on the repo and every time somebody added a feature flag or

08:29.840 --> 08:36.960
changed its status or its default you have to go and update this thing and nobody ever does or wants to.

08:38.000 --> 08:43.120
So it falls out of date all the time and led to funny bug reports and stuff.

08:44.000 --> 08:50.480
So I went in and I added a bunch of comments to the config file. So you've got this meaningful

08:50.480 --> 08:55.120
stuff about what category the flag falls into, what its default status is, where it is

08:55.120 --> 09:00.160
in its feature flag life cycle as we get rid of things, and then it builds these pages.

09:00.720 --> 09:07.200
You've got a super long page, all very organized, of what is set which way, what it does, how long it's

09:07.200 --> 09:12.400
going to be there and that's very handy and you never have to touch that documentation file again.

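NOTE
A minimal sketch of the comment-driven approach described above, assuming a hypothetical "@key: value" annotation format in the config file. The flag names, annotation keys, and table layout are illustrative assumptions, not Superset's actual config or generator.
```python
# Hypothetical sketch: the "@key: value" comment format is an assumption,
# not necessarily the annotation syntax Superset uses.
import re
# Stand-in for the annotated config file the generator would read.
CONFIG_SOURCE = '''
FEATURE_FLAGS = {
    # @category: dashboards
    # @lifecycle: stable
    "EXAMPLE_FLAG_A": True,
    # @category: sql
    # @lifecycle: testing
    "EXAMPLE_FLAG_B": False,
}
'''
ANNOTATION = re.compile(r"#\s*@(\w+):\s*(.+)")
FLAG = re.compile(r'"(\w+)":\s*(True|False)')
def parse_flags(source):
    """Collect each flag together with the annotations in the comments above it."""
    flags, pending = [], {}
    for line in source.splitlines():
        line = line.strip()
        note = ANNOTATION.match(line)
        flag = FLAG.match(line)
        if note:
            pending[note.group(1)] = note.group(2).strip()
        elif flag:
            flags.append({"name": flag.group(1), "default": flag.group(2), **pending})
            pending = {}
    return flags
def render_markdown(flags):
    """Render one table row per flag, sorted by category and then by name."""
    rows = ["| Flag | Category | Lifecycle | Default |", "| --- | --- | --- | --- |"]
    for f in sorted(flags, key=lambda f: (f.get("category", ""), f["name"])):
        rows.append(f"| {f['name']} | {f.get('category', '')} | {f.get('lifecycle', '')} | {f['default']} |")
    return "\n".join(rows)
print(render_markdown(parse_flags(CONFIG_SOURCE)))
```
Because the page is regenerated from the config on every build, a contributor only edits the flag and its comments; the table never needs a separate hand edit.
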
09:13.280 --> 09:19.360
API docs. Everybody's got an API. Everybody's seen this thing, the Swagger renderer.

09:20.000 --> 09:27.520
We've had that in our product forever, never loved it. So wham bam, Docusaurus magic,

09:27.520 --> 09:32.400
lots of plugins and all of a sudden you've got this very interactive stuff with code samples,

09:33.200 --> 09:38.720
all the response objects from your API, the parameters, all the good stuff developers actually need

09:38.720 --> 09:46.960
on a very interactive playground sort of place. Now databases are something that we care a lot about.

09:47.120 --> 09:55.760
Superset connects to a whole bunch of stuff and what it does as a product is essentially just

09:55.760 --> 10:01.040
use some translation layers to send SQL to them. You've got SQL

10:01.040 --> 10:05.600
Lab, you're writing the queries and then you get data back and then we visualize it. It's

10:05.600 --> 10:10.720
actually pretty straightforward when you oversimplify it like that. But in the actual stack,

10:10.720 --> 10:14.960
there's this top layer called the DB engine spec that sits on top of SQLAlchemy dialects

10:14.960 --> 10:20.080
and that's where we do the stuff that Superset cares about, like documenting time

10:20.080 --> 10:26.000
granularities and other little peculiarities of databases that make them special and make them work

10:26.000 --> 10:33.680
with Apache Superset. So this is our old documentation. This was hand-edited stuff on the docs

10:33.680 --> 10:40.400
site until just a few days ago, honestly. It was out of date, some of these connection details

10:40.480 --> 10:45.520
were incorrect and nobody could ever answer the question of how many databases do we support

10:45.520 --> 10:52.640
and nobody wants to go and clean up these Docs. So we also had this logo wall on the home page of

10:52.640 --> 10:57.440
the site and just like how did we pick these databases to have logos on the logo wall? Why are these

10:57.440 --> 11:05.840
ones important? So what I did is then went through and added with AI a bunch of metadata to

11:05.840 --> 11:14.240
every one of the DB engine spec files that we use in Superset. And that means we can all of a sudden

11:15.040 --> 11:21.360
also take advantage of these DB engine spec details about time grains and other features and

11:21.360 --> 11:26.960
all of the custom error messages that they respond with. All of a sudden you get this lovely index page

11:26.960 --> 11:32.000
that tells you exactly how many databases you support and you can search and you can sort by the

11:32.000 --> 11:36.720
type they are or what features they support and all kinds of other stuff. So you get this lovely

11:36.720 --> 11:42.320
table that if you're looking for your database you can figure out which one might be a good fit for you.

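NOTE
A minimal sketch of building an index page from per-engine metadata, as described above. The class names and the "metadata" attribute are illustrative assumptions and are not Superset's actual DB engine spec API.
```python
# Hypothetical sketch: class names and the "metadata" attribute are
# illustrative assumptions, not Superset's actual DB engine spec API.
class BaseEngineSpec:
    metadata = None
class ExampleWarehouseSpec(BaseEngineSpec):
    metadata = {"name": "Example Warehouse", "category": "cloud", "time_grains": ["day", "month"]}
class ExampleEmbeddedSpec(BaseEngineSpec):
    metadata = {"name": "Example Embedded DB", "category": "embedded", "time_grains": ["day"]}
def collect_specs(base):
    """Gather metadata from every subclass that declares its own, sorted by name."""
    found = []
    for cls in base.__subclasses__():
        own = cls.__dict__.get("metadata")  # skip metadata merely inherited from a parent
        if own:
            found.append(own)
        found.extend(collect_specs(cls))
    return sorted(found, key=lambda m: m["name"])
def render_index(specs):
    """Build an index page: a count line plus one table row per database."""
    lines = [f"Supported databases: {len(specs)}", "| Database | Category | Time grains |", "| --- | --- | --- |"]
    for m in specs:
        lines.append(f"| {m['name']} | {m['category']} | {', '.join(m['time_grains'])} |")
    return "\n".join(lines)
print(render_index(collect_specs(BaseEngineSpec)))
```
The count in the first line is derived from the specs themselves, which is one way the "how many databases do we support" question gets an answer that cannot drift from the code.
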
11:43.280 --> 11:47.280
All of a sudden you get these great documents that have truthful and up-to-date

11:48.000 --> 11:54.320
information on how to connect to them and even what all their little errors and peculiarities are.

11:56.240 --> 12:01.600
The newest one I just merged quite recently is about React Storybook. I don't know how many

12:01.680 --> 12:08.080
of you front-end developers are here. But in our particular product we've got a Python back end,

12:08.080 --> 12:14.160
React front end, and we've got a million React components, many of which are based on AntD but also

12:14.160 --> 12:19.920
several other libraries. We've had this React Storybook, like so many people have seen, sitting there

12:20.640 --> 12:24.960
collecting dust so to speak. Nobody ever actually does npm run storybook to see what your

12:24.960 --> 12:28.960
components do. They just kind of go into the code and figure out what they do as best they can.

12:29.680 --> 12:36.800
So if nobody's going to leverage it might as well build it into the docs. It turns out

12:36.800 --> 12:42.240
there's a whole bunch of plugins you can use to make this fancy and build it into Docusaurus.

12:42.240 --> 12:45.520
You just have to have AI go and update all the little story files.

12:47.040 --> 12:53.280
Then you have fully interactive examples just like Storybook, but even better, you get this live

12:53.280 --> 12:58.480
code editor. You can't do that in Storybook, as far as I've seen: just type some code and

12:58.480 --> 13:02.960
fiddle with your components. You get all the props and everything you need to know, how to import it,

13:02.960 --> 13:09.120
and even links to edit the documentation or the story itself. Then of course the best thing you

13:09.120 --> 13:14.880
can do for open source is tell the world you use it. So we have this In the Wild page which used to

13:14.880 --> 13:20.400
be a Markdown file. Nobody knew it existed. Therefore nobody updated it but why would you update it

13:20.400 --> 13:29.280
if nobody can find it. So I changed it from a Markdown file to a YAML file with the help of AI

13:29.280 --> 13:34.400
and then a little Docusaurus magic and all of a sudden we have this new In the Wild page where

13:34.400 --> 13:41.680
you can slap some logos on it. You get the little user faces from GitHub and it gives it a very

13:41.680 --> 13:50.240
high profile page on the website and you even get a little crawling logo wall on the front

13:50.640 --> 13:57.760
page as well. So that's a nice bonus. Then this is one I'm halfway through right now which I'm

13:57.760 --> 14:02.800
dying to finish screenshot updates. Everybody has screenshots of their stuff in their docs and

14:02.800 --> 14:09.600
they're such a pain to update because you're constantly changing the UI on things. So we're using

14:09.600 --> 14:16.240
Playwright to test stuff in Superset right now, and it turns out Playwright can take screenshots. So if you

14:16.320 --> 14:23.680
actually find the right part of the DOM on your site to take a screenshot at the right time and

14:23.680 --> 14:30.240
the right state you can just have the script run and take screenshots of all the things you need.

14:30.240 --> 14:35.680
Copy the files into your Docusaurus site and then your screenshots will always be correct and

14:35.680 --> 14:40.800
we've added versioning to all of our sections of the docs. So whenever you cut a new version

14:40.800 --> 14:45.360
it copies all those old files over and they'll be locked at the right place in time. And then as

14:45.360 --> 14:53.840
you keep changing things your next version will always be up-to-date. So speaking of next, what are we

14:53.840 --> 15:00.160
doing? Superset now supports theming so you can make the product look like whatever you want,

15:00.160 --> 15:05.360
look like your brand great for embedded analytics and all sorts of purposes. So documenting those

15:05.360 --> 15:10.320
creating a playground, how to build them and leaning on all the libraries we're built around so

15:10.320 --> 15:14.160
that all of that documentation builds itself even when we upgrade all of these foundational

15:14.160 --> 15:22.080
packages. That can be done. We've got this extension effort, which is kind of a big deal. We've

15:22.080 --> 15:27.120
taken a lot of inspiration from VS Code, where you can add plugins anywhere that do anything, and we're

15:27.120 --> 15:32.480
actually kind of riffing on their architectural plan of how that works. So you can add a bunch of

15:32.480 --> 15:37.360
bells and whistles and a bunch of different places in Apache Superset coming so that's why we

15:37.360 --> 15:45.360
built this new developer portal. And the extensions are starting to happen. So these docs are kind

15:45.360 --> 15:52.560
of half human written, half AI written, but the real meat and potatoes of it for automation's

15:52.560 --> 15:57.920
sake is actually the extensions themselves. People are publishing them on npm and right now we have

15:57.920 --> 16:03.920
this little Markdown table we're building because it's all very new. But obviously the extensions,

16:03.920 --> 16:08.480
when they get loaded, they have a JSON file kind of like a package.json and we can put as much

16:08.480 --> 16:13.520
metadata in there as we want including your screenshots and descriptions and compatibility matrix

16:13.520 --> 16:18.880
and whatever other licensing and security details we start to care about and then this page which

16:18.880 --> 16:24.320
is right now hand edited will go away and be automatic. So as the ecosystem builds and suddenly

16:24.320 --> 16:29.440
we go from 10 extensions to thousands of them it's all just going to show up there and be up

16:29.520 --> 16:41.760
to date all the time. So yeah AI this is where open source has a huge advantage. It's almost

16:41.760 --> 16:50.720
unfair really. Open source is the best substrate for using or training AI. Everything about

16:50.720 --> 16:55.920
your project. The people, the code, the design patterns, the history, the arguments that happen on

16:55.920 --> 17:00.400
GitHub, all of that stuff has been just sitting there on the internet and they're drinking it up.

17:01.200 --> 17:08.240
So AI knows everything about you and now your job is to make the documentation and the public

17:08.240 --> 17:15.680
facing stuff regarding your project as comprehensive as possible so that the next training run will

17:15.680 --> 17:22.080
include more of it and be more useful to people. So you've got to help humans they need to know where to find

17:22.160 --> 17:29.040
things but you know you've got to make sure that things are always current for them and

17:30.240 --> 17:41.120
the goal is to not have to maintain as much as the code base grows. So we have a million

17:41.120 --> 17:46.800
little helpers on our repo right now because we're open source all these people are basically

17:46.800 --> 17:52.800
donating their service to us for free, which is fantastic, and we have an AI chat on the home page

17:52.800 --> 17:57.440
itself which is very good but all of these things are actually training on the doc site and

17:57.440 --> 18:03.040
fine tuning constantly, so the more we update it, the more they know. So this stuff is already helping

18:03.040 --> 18:07.360
and by the way there's a talk tomorrow with the GitHub thing if anybody's coming to that but I'll

18:07.360 --> 18:11.760
be talking about these guys. They're actually starting to talk to each other which is a total trip.

18:11.760 --> 18:22.400
So yeah, in conclusion here, I guess the point of this story is to not let AI just run away

18:22.400 --> 18:28.720
and write a billion words about your product. That doesn't really do any service for AI that's

18:28.720 --> 18:32.880
going to train on that later. Doesn't really do any service for your users that are trying to

18:32.880 --> 18:40.400
read it and find the important parts and the nuance and the details. So yeah use AI but use it to

18:40.400 --> 18:46.960
write code and use it to change your code so it can build the documentation and just you know

18:47.920 --> 18:57.200
don't be too lazy about that. So ultimately if you put in the work and do these migrations

18:57.920 --> 19:05.120
the docs will build themselves more and more and your life will become easier. So that essentially

19:05.200 --> 19:12.720
is the crux of my talk. I left some time, so if anybody has ideas, questions, whatever, I would love to

19:12.720 --> 19:18.320
hear about it. I've also got stickers for anybody who wants some.

19:20.720 --> 19:24.960
Cool. Five minutes if anybody's got burning questions or ideas or whatever.

19:35.680 --> 19:42.560
This kind of docs rebuilding project, up to the moment you had something you could show to the world,

19:42.560 --> 19:48.320
how long did it approximately take? You can publish something in one day. Some parts were harder

19:48.320 --> 19:54.800
than others. Doing the In the Wild page, where you could just see all the faces and logos all of a sudden,

19:54.800 --> 20:01.040
took a couple hours. But doing the Storybook thing where we have hundreds and hundreds of stories

20:01.040 --> 20:07.120
and they all need to be refactored in some way. That was a lot of you know monkey in the middle

20:07.120 --> 20:12.640
testing and nope that didn't fix it, nope that didn't fix it stuff. So it depends but you can just

20:12.640 --> 20:17.760
chip away at it. Those were a handful of projects or projects rather that I wanted to start with but

20:17.760 --> 20:23.680
there's plenty left and I'll just try to make as much of it as generated as possible in the near future.

20:31.040 --> 20:39.360
So far the feedback from the developer community has been. The question was about the feedback

20:39.360 --> 20:45.520
from the developer community: was there pushback or general acceptance or excitement? And so far

20:45.520 --> 20:51.040
it's been excitement. We have answered questions we didn't have answers to before about

20:51.040 --> 20:59.440
what databases we support how many of them all that stuff. The optics in terms of partnership

20:59.520 --> 21:03.520
have gotten better because now we're surfacing logos for all of these different companies and

21:03.520 --> 21:09.920
different databases. They get a link. They get better SEO. We get better SEO because people are

21:09.920 --> 21:14.800
able to search for these things. Does superset connect to database x? Well yes it does. The answer is

21:14.800 --> 21:21.840
there. So being that much more comprehensive makes us much more findable as a project.

21:22.800 --> 21:29.360
It makes the site more comprehensive and pretty and you know it's almost like the old

21:30.240 --> 21:34.960
web rings. We just link to everything now. They link back to us. It's a virtuous cycle

21:34.960 --> 21:38.800
and it's all growing very quickly. But the developers are stoked because nobody has to maintain

21:38.800 --> 21:43.840
database docs anymore. Nobody has to maintain more and more parts of the docs.

21:44.880 --> 21:47.840
It's taking care of itself. Anyone else?

21:47.920 --> 21:54.480
I don't. You showed it somewhere where like to find out which database you can connect to

21:54.480 --> 22:01.840
the app. It's made up of it. Mm-hmm. And then make it discoverable in some form, right?

22:03.440 --> 22:10.240
It looked like it was stuck strictly from what you stated, but they're like they do

22:11.200 --> 22:19.040
languages. How actually did it? Is it like a study? An analysis over like a day?

22:20.800 --> 22:23.040
Okay. Yeah. Yeah. How has it actually worked?

22:23.040 --> 22:31.040
Yeah. And it's a link about for instance. Yeah. So the question was kind of about how

22:31.040 --> 22:36.240
it works. Like, I saw some TypeScript, there's also some Python, so what languages go

22:36.240 --> 22:41.760
in and out? How does this transformation and build process work? And when you start up

22:41.760 --> 22:49.040
Docusaurus, it gives you the ability to just run a bunch of scripts along with it. And so

22:50.320 --> 22:57.120
half our code base is in Python, half of it's in TypeScript, running back in. So it can really

22:57.120 --> 23:04.160
merge any of that stuff. It's pulling in YAML files and JSON files and Python files and all kinds of

23:04.160 --> 23:10.480
stuff. And it's just a collection of little scripts. So for the database, the DB engine specs,

23:10.480 --> 23:18.640
there's metadata in a Python file. For the In the Wild page, it's YAML; for maps, it's TypeScript.

23:18.640 --> 23:23.440
And just each one of these little scripts has this job to just chew through all the metadata

23:23.440 --> 23:29.200
files, build an index, build the individual pages, put all the links and logos and everything in the

23:29.280 --> 23:39.280
right place. So it just runs all of them in sequence. How do you set up the review process?

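NOTE
A minimal sketch of the "collection of little scripts run in sequence" idea described above. The generator names, sample data, and output file names are made up for illustration; the real build runs its own scripts alongside Docusaurus.
```python
# Hypothetical sketch: generator names, data, and output paths are invented
# for illustration and do not mirror the actual Superset docs build.
import json
import tempfile
from pathlib import Path
def build_database_index(out_dir):
    """Stand-in for a generator that reads metadata and writes an index page."""
    databases = ["Example DB A", "Example DB B"]
    page = out_dir / "databases.md"
    page.write_text("\n".join(f"- {name}" for name in databases))
    return page
def build_in_the_wild(out_dir):
    """Stand-in for a generator that turns a data file into a page."""
    users = json.loads('[{"org": "Example Org"}]')
    page = out_dir / "in-the-wild.md"
    page.write_text("\n".join(f"- {u['org']}" for u in users))
    return page
# The order of this list is the order the generators run in.
GENERATORS = [build_database_index, build_in_the_wild]
def run_all(out_dir):
    """Run every generator in order; the same inputs always give the same pages."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return [generate(out_dir) for generate in GENERATORS]
with tempfile.TemporaryDirectory() as tmp:
    written = run_all(Path(tmp) / "docs")
    print([p.name for p in written])
```
Each generator is a plain function with no model in the loop, so a contributor's metadata change produces a predictable diff in the generated pages.
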
23:39.280 --> 23:46.080
Like every AIDOM, you have like a review rounds? So yeah, the question was about the review process

23:46.080 --> 23:57.760
and how we manage that. And really the pull request has a preview build. We have one bit on the

23:57.840 --> 24:03.360
pull request. You actually get a preview build of the site. So you can just click it and click around on it.

24:03.360 --> 24:10.720
And I'll be adding some visual regression testing as well. But there's no AI involvement in these

24:10.720 --> 24:16.880
and when it builds actually because the contributor just edits some metadata and then what happens is

24:16.880 --> 24:24.480
deterministic. So the AI, I guess the crux of the argument is that you shouldn't let AI be building your

24:24.480 --> 24:30.240
docs. It doesn't in our case. It's really just that we're using AI to build more and more tools

24:30.240 --> 24:34.480
so that deterministically the docs are scripted to build themselves from the code base.

24:36.480 --> 24:41.360
Yeah, a bunch of import and layout builder scripts and all of that.

24:42.640 --> 24:45.040
All right, I think that's my time. Thank you all very much.

24:45.040 --> 24:53.600
Thank you very much. Thank you.

