Generative Manim Datasets & Data Collection Pipeline

Several techniques for building better prompt-to-code Manim models require training data. To support them, we need to compile a dataset of prompts paired with the corresponding code.

Sources

Manim (Community)

Examples Gallery

Manim

Datasets

Custom Dataset

The dataset should have the following columns:

prompt: Prompt to generate the code.
code: Corresponding code.
type: Type of media (video, image).

Although we are focused on video generation, we should also treat images as a type of media, so that the model is trained on a wide range of examples that apply to different scenarios.

Extract code examples from the Manim community.
Tag each code example with the corresponding type of media: if it uses self.add, it is an image; if it uses self.play, it is a video.
Write a prompt for each code example.
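
The tagging step above can be sketched as a small helper. This is a minimal sketch, not the repository's actual script: the function names `tag_media_type` and `make_row` are hypothetical, and so are two choices the text does not specify, namely that `self.play` wins when a scene uses both calls and that a scene using neither is tagged `unknown`.

```python
import re


def tag_media_type(code: str) -> str:
    """Tag a Manim snippet as "video" or "image" based on the calls it makes."""
    # Assumption: a scene that animates anything is a video, even if it
    # also places static objects with self.add.
    if re.search(r"\bself\.play\s*\(", code):
        return "video"
    if re.search(r"\bself\.add\s*\(", code):
        return "image"
    # Assumption: snippets with neither call are flagged for manual review.
    return "unknown"


def make_row(prompt: str, code: str) -> dict:
    """Build one dataset row with the prompt, code, and type columns."""
    return {"prompt": prompt, "code": code, "type": tag_media_type(code)}
```

A plain text search like this will also match calls inside comments or strings, so it is only a first pass before manual review.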

Dataset Generation Pipeline

💡 Use the code itself as the input for generating the prompt text. In other words, let GPT summarize the Manim code; this gives better-quality prompts.

Instead of relying on humans to write the prompt, we can also generate the prompt from the code itself via GPT models. This way we can have a more consistent dataset.

Create a Python script to generate the prompt from the code available in the scripts of /code.
Create a JSONL file with the dataset generated.
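
The two steps above could look roughly like the following. This is a sketch under stated assumptions: `summarize` is a hypothetical callable standing in for whatever GPT call produces the prompt (no specific API is assumed here), the scripts are assumed to be `.py` files directly inside the code directory, and `build_rows` and `write_jsonl` are illustrative names. The type column is left out to keep the sketch short.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator


def build_rows(code_dir: Path, summarize: Callable[[str], str]) -> Iterator[dict]:
    """Yield one {"prompt", "code"} row per Python script in code_dir."""
    for path in sorted(code_dir.glob("*.py")):
        code = path.read_text(encoding="utf-8")
        # summarize is a placeholder for the GPT call that describes the code.
        yield {"prompt": summarize(code), "code": code}


def write_jsonl(rows: Iterable[dict], out_path: Path) -> int:
    """Write rows as JSON Lines (one JSON object per line); return the row count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing the summarizer in as a parameter keeps the file handling testable without network access: a stub such as `lambda code: "Describe this scene."` is enough to check the output format.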

Dataset from `Edoh`

We can also use the dataset from Edoh to create a dataset of prompts and code.

Create a Python script to extract the prompts and code from the Edoh dataset.
Create a JSONL file with the extracted dataset.

Extracting the prompts and code from the Edoh Manim Python dataset produced edoh-dataset.jsonl.

The dataset contains 599 examples of prompts and code.

{"messages": [{"role": "system", "content": "Write Manim scripts for animations in Python. Generate code, not text."}, {"role": "user", "content": "Create a new scene named 'MyScene'."}, {"role": "assistant", "content": "from manim import * class MyScene(Scene): def construct(self): pass"}]}
{"messages": [{"role": "system", "content": "Write Manim scripts for animations in Python. Generate code, not text."}, {"role": "user", "content": "Add a circle with radius 2 and center at the origin to the scene."}, {"role": "assistant", "content": "from manim import * class MyScene(Scene): def construct(self): circle = Circle(radius=2, color=BLUE) self.add(circle)"}]}
{"messages": [{"role": "system", "content": "Write Manim scripts for animations in Python. Generate code, not text."}, {"role": "user", "content": "Create a text object saying 'Hello, World!' and add it to the scene."}, {"role": "assistant", "content": "from manim import * class MyScene(Scene): def construct(self): text = Text('Hello, World!') self.add(text)"}]}
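
Records in this chat format can be mapped back to the prompt and code columns described earlier. The following is a minimal sketch, assuming every record has exactly the `messages` layout shown above, with one `user` and one `assistant` message; `messages_to_row` and `load_rows` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path


def messages_to_row(line: str) -> dict:
    """Convert one chat-format JSONL line into a {"prompt", "code"} row."""
    record = json.loads(line)
    # Index message contents by role; assumes each role appears at most once.
    by_role = {m["role"]: m["content"] for m in record["messages"]}
    return {"prompt": by_role["user"], "code": by_role["assistant"]}


def load_rows(path: Path) -> list:
    """Read a chat-format JSONL file, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [messages_to_row(line) for line in f if line.strip()]
```

For example, `load_rows(Path("edoh-dataset.jsonl"))` would return one row per example, with the system message dropped because it is identical across the records shown.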