WEBVTT

00:00.000 --> 00:10.000
Yeah, I'm going to skip this slide.

00:10.000 --> 00:25.000
So hello, everyone. For the next talk, we have Eric from Ireland, coming to talk to us about RamaLama.

00:25.000 --> 00:27.000
Eric, it's all yours.

00:27.000 --> 00:29.000
Thank you very much.

00:29.000 --> 00:34.000
Let me just clip this in, so hi, guys.

00:34.000 --> 00:40.000
I normally start this talk with a little jingle on my phone, but we've only got 20 minutes, so I'm going to skip it this time.

00:40.000 --> 00:43.000
I'm a little sorry, guys.

00:43.000 --> 00:51.000
Okay, so yeah, this is a project I created with another engineer called Dan Walsh.

00:52.000 --> 00:54.000
I'm Eric Curtin, I'm a software engineer.

00:54.000 --> 01:07.000
I guess from the AI perspective, what I've worked on is: I co-created RamaLama, and I'm also a llama.cpp upstream maintainer.

01:07.000 --> 01:20.000
So yeah, RamaLama is all about making working with AI models boring and simple for everyday users like you and me.

01:20.000 --> 01:26.000
Yeah, so as I said, I'm going to fly through this stuff really fast.

01:26.000 --> 01:32.000
I want to give you a short demo and let you guys ask some questions or whatever.

01:32.000 --> 01:35.000
But this was our original goal, so let's start.

01:35.000 --> 01:40.000
Now we're expanding into all sorts of different features, but the goal was to make AI usage boring.

01:40.000 --> 01:46.000
Boring is kind of a term I stole from another famous engineer called Jam Masters.

01:47.000 --> 01:51.000
Well, yeah, to allow the simple use cases to basically just work.

01:51.000 --> 01:59.000
So with RamaLama, let's say over the weekend this cool Chinese model gets released, called DeepSeek.

01:59.000 --> 02:08.000
You can just run this one-line run command with DeepSeek, and essentially it'll just work.
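
NOTE
A minimal sketch of the one-line run being described, assuming the default
registry carries a DeepSeek distillation under this name (the exact tag is
an assumption, not from the talk):
  ramalama run deepseek-r1
RamaLama detects your accelerator, pulls a matching runtime container, pulls
the model, and drops you into an interactive chat.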

02:09.000 --> 02:15.000
So some of the goals. One of the goals is to be OCI-centric by default.

02:15.000 --> 02:21.000
So to provide user space runtime environments as containers.

02:21.000 --> 02:26.000
So there's a huge array of acceleration stacks for AI.

02:26.000 --> 02:30.000
And they're not actually the most straightforward to install.

02:30.000 --> 02:35.000
Some of them have lots of dependencies, some of them have not so many dependencies.

02:35.000 --> 02:44.000
Well, one of the things RamaLama will do when you try and run it is look for the primary accelerator on your machine.

02:44.000 --> 02:52.000
And based on that, it'll pull the correct runtime environment for that accelerator.

02:52.000 --> 02:54.000
So it'll just work.

02:54.000 --> 03:01.000
And sometimes maybe you want to choose your AI stack, maybe you're like, I want to play with Vulkan.

03:01.000 --> 03:05.000
You can manually specify what AI stack you want to use also.
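
NOTE
A hedged sketch of overriding the auto-detected stack. The --image flag is
in the RamaLama docs, but the quay.io/ramalama/vulkan image name and the
flag placement are assumptions:
  ramalama --image quay.io/ramalama/vulkan run granite
Without the flag, RamaLama picks the runtime image that matches the detected
accelerator (CUDA, ROCm, Vulkan, and so on).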

03:05.000 --> 03:13.000
So another feature is to provide models as OCI artifacts. I'm going to go through that later in the slides,

03:13.000 --> 03:19.000
but basically you can push and pull models from OCI container registries,

03:19.000 --> 03:24.000
because OCI container registries are all over the place in public and in private networks.

03:24.000 --> 03:34.000
So they're well suited to handling multi-gigabyte files, and models tend to be multi-gigabyte files.

03:34.000 --> 03:37.000
So it's well suited for that use case.
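
NOTE
A rough sketch of treating a model like any other OCI artifact; the
registry.example.com path is a hypothetical placeholder and the push
argument order is an assumption from the RamaLama docs:
  ramalama pull oci://registry.example.com/models/granite:latest
  ramalama push mymodel oci://registry.example.com/models/mymodel:latest
The oci:// transport prefix tells RamaLama to use a container registry
instead of Hugging Face or Ollama.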

03:37.000 --> 03:40.000
So containers are not strictly required.

03:40.000 --> 03:42.000
You can actually run RamaLama with no containers.

03:42.000 --> 03:49.000
Sometimes I do that on my MacBook here, because if I want to try to run RamaLama natively on macOS,

03:49.000 --> 03:59.000
which doesn't have containers built into the kernel, this makes life easier.
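
NOTE
A minimal sketch of container-free use, assuming the --nocontainer flag from
the RamaLama docs and a locally installed inferencing runtime such as
llama.cpp:
  ramalama --nocontainer run granite
The inferencing process then runs directly on the host, which is handy on
macOS where there is no native container support in the kernel.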

03:59.000 --> 04:03.000
This is something I read from the llama.cpp people before.

04:03.000 --> 04:07.000
So AI models aren't deterministic by design.

04:07.000 --> 04:13.000
So it's probably a good idea to isolate AI models anyway in either a container or a VM or something.

04:14.000 --> 04:17.000
And this was some advice I read in the llama.cpp

04:17.000 --> 04:24.000
documentation: always execute untrusted models within a secure isolated environment, such as a sandbox.

04:24.000 --> 04:30.000
And you might say, well, I trust my AI model. But another line I saw a couple of lines later

04:30.000 --> 04:36.000
is that the trustworthiness of a model is not exactly binary anyway.

04:36.000 --> 04:42.000
So you should probably always encapsulate your AI workloads to a degree.

04:43.000 --> 04:46.000
So this was another goal: to be daemonless by default.

04:46.000 --> 04:51.000
So a lot of AI tools use client server approaches, which is fine.

04:51.000 --> 04:57.000
But by default, if you use ramalama run, it just spins up one process, which is quite useful.

04:57.000 --> 05:02.000
And when you tear it down, it's using zero resources on your system.

05:02.000 --> 05:08.000
But of course, you can run RamaLama as a server too, using commands like ramalama serve.
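
NOTE
A hedged sketch of the two modes; the port number is just the documented
default, not something stated in the talk:
  ramalama run granite                   # one foreground process, gone when you exit
  ramalama serve --port 8080 granite     # HTTP endpoint for other applications
ramalama run is daemonless: when the chat ends, nothing is left consuming
resources on the system.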

05:08.000 --> 05:10.000
So another goal is accelerator support.

05:10.000 --> 05:17.000
There's a lot of AI projects out there, and because the likes of Nvidia dominate like 95% of the market,

05:17.000 --> 05:28.000
there's a lot of AI tools that say, we're only supporting Nvidia, or we're only supporting CUDA, and ROCm from AMD.

05:28.000 --> 05:31.000
In RamaLama, we're actually quite different.

05:31.000 --> 05:35.000
If you're willing to do the enablement work and open a PR,

05:35.000 --> 05:39.000
we'll review it and get it merged so RamaLama runs on your accelerator.

05:39.000 --> 05:47.000
There was a guy recently who opened a PR to get RamaLama working on his Intel-based AI accelerator.

05:47.000 --> 05:50.000
And yeah, he's having great joy with that.

05:50.000 --> 05:58.000
So yeah, and if you put in the work to auto detect your primary GPU, that's kind of useful as well.

05:58.000 --> 06:03.000
Because RamaLama is all about kind of usability.

06:03.000 --> 06:10.000
OS support: so we support Linux, whatever Linux distro you run.

06:10.000 --> 06:17.000
We have a native macOS port, and you can also run on macOS via a virtual machine called Podman Machine,

06:17.000 --> 06:24.000
which is basically a Linux VM that allows you to do all the cool container stuff on macOS.

06:24.000 --> 06:27.000
I also know guys running it on Docker Desktop

06:27.000 --> 06:31.000
on macOS, and they're having good success as well.

06:31.000 --> 06:41.000
Our Windows port is via WSL2, and there's some documentation on that if you want to run it on Windows also.

06:41.000 --> 06:46.000
So this is another goal of RamaLama.

06:46.000 --> 06:53.000
So we want to be open to multiple model registries, as I call it.

06:53.000 --> 06:59.000
So you can push and pull to whatever location you choose, really.
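
NOTE
A hedged sketch of pulling from different registries; the transport prefixes
follow the RamaLama docs and the model paths are illustrative only:
  ramalama pull ollama://smollm:135m
  ramalama pull huggingface://some-org/some-model.gguf
  ramalama pull oci://quay.io/example/granite:latest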

06:59.000 --> 07:08.000
An interesting comparison here is the comparison between OCI container registries and the Ollama protocol.

07:08.000 --> 07:13.000
Because I examined the Ollama protocol, and the Ollama protocol is very interesting.

07:13.000 --> 07:17.000
Ollama is a company formed by some ex-Docker guys.

07:17.000 --> 07:23.000
And if you look at the protocol, they slightly forked the OCI transport layer,

07:23.000 --> 07:30.000
but just enough so that it's only compatible with the Ollama registry.

07:30.000 --> 07:42.000
And that's fine, but it just means if you're going to build up all your infrastructure based on Ollama,

07:42.000 --> 07:53.000
now, like, either you're going to surpass the Ollama registry usage limits, which they will introduce at some point, I'm speculating,

07:53.000 --> 08:03.000
and you're going to have to pay them lots and lots of money; or else, if you're deploying in a data center where you're not allowed to pull from external networks,

08:03.000 --> 08:11.000
you're probably going to have to spin up Ollama servers in-house, buy all those servers, and change all your firewall rules.

08:11.000 --> 08:22.000
Yeah, the idea here is you can just use your existing OCI container registry, and you don't have to pay for new servers or worry about surpassing usage limits.

08:22.000 --> 08:32.000
So what I'm trying to say is you can either use plain old OCI container registries, or you can pay Ollama a ton of money and spin up new infrastructure in-house.

08:32.000 --> 08:46.000
So the goal, yes: one transport layer for all. A lot of technologies are using OCI container registries to push and pull all sorts of artifacts.

08:46.000 --> 08:52.000
For example, there's a product, a distribution from the company I work for,

08:52.000 --> 09:06.000
Red Hat, called RHEL AI. It's a bootc-based operating system, and the main advantage is that you can pull your system updates from your in-house OCI container registries or over the public internet.

09:06.000 --> 09:13.000
So you can use that same transport endpoint to pull your operating system.

09:14.000 --> 09:23.000
You can pull your AI userspace runtimes from this same transport endpoint.

09:23.000 --> 09:31.000
You can pull, you know, your application containers and now you can push and pull models as well.

09:31.000 --> 09:41.000
So that's kind of the goal, and it'll simplify your infrastructure quite a bit because, yeah, you can serve all sorts of files from your

09:41.000 --> 09:50.000
OCI registries, which are well suited for multi-gigabyte artifacts.
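
NOTE
An illustrative, hedged sketch of one registry serving every artifact type;
registry.example.com and the image names are placeholders, not from the talk:
  bootc switch registry.example.com/os/rhel-ai:latest              # OS updates
  podman pull registry.example.com/apps/myapp:latest               # app containers
  ramalama pull oci://registry.example.com/models/granite:latest   # AI models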

09:50.000 --> 09:53.000
This is one of our newer commands.

09:53.000 --> 09:57.000
So let's say you found a cool.

09:57.000 --> 10:01.000
model online on either Hugging Face or Ollama or whatever.

10:01.000 --> 10:12.000
You could run a command called ramalama convert, convert it to the OCI format, and then you can push and pull it to your local container registry rather than pulling from the internet the whole time.
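
NOTE
A hedged sketch of the convert-then-push flow; the argument order and the
quay.io/example repository are assumptions based on the RamaLama docs:
  ramalama convert ollama://smollm:135m oci://quay.io/example/smollm:135m
  ramalama push oci://quay.io/example/smollm:135m
After that, machines on your network can pull the model from your own
registry instead of the public internet every time.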

10:12.000 --> 10:14.000
And this is.

10:15.000 --> 10:17.000
This is a new feature from Podman.

10:17.000 --> 10:18.000
This is quite cool.

10:18.000 --> 10:19.000
It's called podman artifact.

10:19.000 --> 10:29.000
And what podman artifact can do is push and pull arbitrary files as OCI artifacts.

10:29.000 --> 10:32.000
So we're going to integrate that, basically.

10:32.000 --> 10:34.000
So you can push and pull model files.
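
NOTE
A hedged sketch of the podman artifact workflow this integration would build
on; the subcommand names follow recent Podman releases and the registry path
is a placeholder:
  podman artifact add quay.io/example/granite:latest granite.gguf
  podman artifact push quay.io/example/granite:latest
  podman artifact pull quay.io/example/granite:latest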

10:34.000 --> 10:39.000
And another feature we support is switchable inferencing runtimes.

10:39.000 --> 10:42.000
So these are kind of two that are quite popular at the moment.

10:42.000 --> 10:45.000
So one works quite well on Apple Silicon, for example,

10:45.000 --> 10:48.000
so you can quietly test your models.

10:48.000 --> 10:51.000
That's llama.cpp, for example.

10:51.000 --> 10:54.000
And vLLM, I know, runs quite well and scales great

10:54.000 --> 10:57.000
on Nvidia accelerators.

10:57.000 --> 11:00.000
So yeah, there's a flag, --runtime.

11:00.000 --> 11:02.000
So you can basically choose your

11:02.000 --> 11:04.000
inferencing engine.
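
NOTE
A hedged sketch of switching runtimes; --runtime accepts llama.cpp (the
default) or vllm, though whether the flag goes before or after the
subcommand may vary by RamaLama version:
  ramalama --runtime=vllm serve granite
  ramalama --runtime=llama.cpp run granite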

11:04.000 --> 11:07.000
And this is how the most basic usage of it works.

11:07.000 --> 11:09.000
So you run ramalama run.

11:09.000 --> 11:11.000
That's an IBM model.

11:11.000 --> 11:14.000
It tries to detect your primary accelerator,

11:14.000 --> 11:16.000
pulls that inferencing runtime,

11:16.000 --> 11:18.000
which in the diagram is CUDA.

11:18.000 --> 11:22.000
And then it pulls the model from.

11:22.000 --> 11:23.000
Some place.

11:23.000 --> 11:24.000
It could be Hugging Face.

11:24.000 --> 11:25.000
It could be Ollama.

11:25.000 --> 11:26.000
It could be.

11:26.000 --> 11:27.000
Quay.io.

11:27.000 --> 11:28.000
Docker Hub, whatever.

11:28.000 --> 11:31.000
And then you start the container with the model.

11:31.000 --> 11:34.000
And yeah, everything just works.

11:34.000 --> 11:36.000
Here's some other commands.

11:36.000 --> 11:38.000
The first one is basically how you install.

11:38.000 --> 11:39.000
That's how you run.

11:39.000 --> 11:41.000
That's how you serve models.

11:41.000 --> 11:43.000
I'm showing switchable runtimes there,

11:43.000 --> 11:44.000
with vLLM and llama.cpp.

11:44.000 --> 11:46.000
You can generate quadlets.

11:46.000 --> 11:48.000
Or Kubernetes files.

11:48.000 --> 11:51.000
If you want to deploy this on Kubernetes.

11:51.000 --> 11:54.000
And that's what the run command is.

11:54.000 --> 11:56.000
I'll skip past that.

11:56.000 --> 11:57.000
ramalama pull.

11:57.000 --> 11:58.000
That's how you pull models.

11:58.000 --> 12:01.000
This example is a Hugging Face model.

12:01.000 --> 12:07.000
And ramalama list is basically how you list all your various models.
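
NOTE
A hedged summary of the commands on this slide; installing from PyPI and the
--generate values are assumptions based on the RamaLama docs rather than an
exact copy of the slide:
  pip install ramalama                               # install
  ramalama run granite                               # run a model
  ramalama serve granite                             # serve a model
  ramalama --runtime=vllm serve granite              # switchable runtimes
  ramalama serve --generate=quadlet granite          # emit a quadlet
  ramalama serve --generate=kube granite             # emit Kubernetes YAML
  ramalama pull huggingface://org/repo/model.gguf    # pull (path illustrative)
  ramalama list                                      # list local models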

12:07.000 --> 12:09.000
So yeah, our community is growing.

12:09.000 --> 12:11.000
And I encourage you all to get involved.

12:11.000 --> 12:13.000
Like going back to the accelerator support.

12:13.000 --> 12:17.000
That's one of the areas where I find the community is super helpful.

12:17.000 --> 12:20.000
Because I don't have access to like thousands,

12:20.000 --> 12:23.000
all the thousands of different types of GPUs in the world.

12:23.000 --> 12:26.000
And the community are a great help for that.

12:26.000 --> 12:28.000
Much of the solution is written in high level Python.

12:28.000 --> 12:30.000
So it's very accessible.

12:30.000 --> 12:32.000
And we're an open ecosystem.

12:32.000 --> 12:35.000
You know, just open issues, PRs.

12:35.000 --> 12:41.000
And when we get time, we'll take a look at them.

12:41.000 --> 12:44.000
Yeah, let's make RamaLama, boring AI, better together.

12:44.000 --> 12:49.000
We are contributor friendly.

12:49.000 --> 12:51.000
Yeah, there's a bunch of features.

12:51.000 --> 12:52.000
Because it's such a short talk.

12:52.000 --> 12:53.000
I didn't get to talk about.

12:53.000 --> 12:55.000
But we have a website coming out soon.

12:55.000 --> 12:58.000
We're looking at RAG support, AI agents, and so on.

12:58.000 --> 12:59.000
And that's kind of it.

12:59.000 --> 13:02.000
I'm going to show you a quick demo.

13:02.000 --> 13:06.000
And I can take some Q&A also then, maybe.

13:06.000 --> 13:12.000
And just to give you an idea how it works.

13:12.000 --> 13:15.000
So here are various models I pulled.

13:15.000 --> 13:21.000
These are all Hugging Face or Ollama ones.

13:21.000 --> 13:24.000
This is a model I use for testing quite a bit from Hugging Face,

13:24.000 --> 13:26.000
because it's quite small.

13:26.000 --> 13:29.000
If you do a ramalama run command,

13:29.000 --> 13:32.000
and it'll basically pull your model.

13:32.000 --> 13:35.000
And when it's done, it'll run.

13:35.000 --> 13:37.000
I'm going to run one that

13:37.000 --> 13:44.000
I've downloaded already just in the interest of time.

13:44.000 --> 13:54.000
I'll do the DeepSeek one since that's the flavor of the month.

13:54.000 --> 13:56.000
I've downloaded the model.

13:56.000 --> 14:04.000
Why should I contribute to the open source AI project

14:04.000 --> 14:06.000
RamaLama?

14:06.000 --> 14:09.000
And yeah, that's basically running.

14:09.000 --> 14:12.000
And if we do an nvtop,

14:12.000 --> 14:15.000
if you do an nvtop,

14:15.000 --> 14:17.000
plus, this, this, this.

14:17.000 --> 14:20.000
Yeah, you'll see it's maxing out my GPU there.

14:20.000 --> 14:21.000
It's 99%.
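
NOTE
For reference, nvtop is a terminal GPU monitor available in most distro
repositories; it shows the per-process GPU utilisation mentioned here:
  sudo dnf install nvtop    # or: sudo apt install nvtop
  nvtop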

14:21.000 --> 14:23.000
And that's DeepSeek working away.

14:23.000 --> 14:27.000
RamaLama is the best AI project I've ever created.

14:27.000 --> 14:29.000
Yeah, that's it.

14:29.000 --> 14:31.000
I can pick you an AI project.

14:31.000 --> 14:32.000
Thank you.

14:32.000 --> 14:33.000
Thank you very much.

14:33.000 --> 14:38.000
Are there any questions for Eric?

14:38.000 --> 14:42.000
There is a question.

14:42.000 --> 14:43.000
Shout it loud.

14:43.000 --> 14:53.000
Can you repeat?

14:53.000 --> 15:04.000
[Inaudible audience question about RamaLama.]

15:04.000 --> 15:09.000
So the question is how to run RamaLama with larger models

15:09.000 --> 15:12.000
with multiple nodes, multiple CPUs.

15:12.000 --> 15:17.000
Yeah, so RamaLama, it's, we're actually building upon the,

15:17.000 --> 15:22.000
what I say is we're building upon the shoulders of giants.

15:22.000 --> 15:25.000
So like, I know for example, a lot of people

15:25.000 --> 15:28.000
using really large GPU setups run via the vLLM

15:28.000 --> 15:30.000
inferencing engine.

15:30.000 --> 15:33.000
So yeah, it's as simple as, like,

15:33.000 --> 15:37.000
ramalama serve --runtime vllm and then the model.

15:37.000 --> 15:39.000
And it should work.
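
NOTE
A hedged rendering of the command being described; flag placement may differ
by RamaLama version and the model name is illustrative:
  ramalama --runtime=vllm serve granite
vLLM provides the multi-GPU support; RamaLama just selects it as the
inferencing runtime.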

15:40.000 --> 15:43.000
The hardware support is all in vLLM,

15:43.000 --> 15:45.000
we just utilize it.

15:45.000 --> 15:46.000
So yeah.

15:50.000 --> 15:51.000
This guy.

15:57.000 --> 15:59.000
No, yeah.

15:59.000 --> 16:03.000
Sorry, can you repeat the question?

16:03.000 --> 16:04.000
Oh yes, sorry.

16:04.000 --> 16:06.000
I keep forgetting to repeat the question.

16:06.000 --> 16:11.000
The question is, do we support model switching in our API?

16:14.000 --> 16:15.000
Not yet.

16:15.000 --> 16:17.000
We have an issue open for that.

16:17.000 --> 16:20.000
I've described the design.

16:20.000 --> 16:22.000
It's, it's coming.

16:22.000 --> 16:24.000
That's a highly requested feature, yeah.

16:24.000 --> 16:28.000
Because, yeah, it's not something vLLM

16:28.000 --> 16:30.000
and llama.cpp support directly.

16:30.000 --> 16:34.000
So we're going to have to write some proxy to switch models in and out.

16:34.000 --> 16:35.000
It's, it's a planned feature.

16:35.000 --> 16:36.000
It's not there yet, though.

16:40.000 --> 16:42.000
Any more questions?

16:42.000 --> 16:43.000
All right.

16:43.000 --> 16:45.000
Thank you very much, Eric.

16:45.000 --> 16:46.000
Any.

16:46.000 --> 16:49.000
In three minutes, we start another talk that

16:49.000 --> 16:50.000
I'm very excited about.

