Update app.py
app.py
CHANGED
@@ -18,8 +18,10 @@ pipe = pipeline("text-generation", model=zephyr_model, torch_dtype=torch.bfloat1
 
 standard_sys = f"""
 You will be provided a list of visual events, and an audio description. All these informations come from a single video.
+
 List of visual events are actually extracted from this video every 12 frames.
-These visual infos are extracted from
+These visual infos are extracted from the video that is usually a short sequence.
+
 As a smart assistant, you must understand that Repetitive visual element of the same person or group of subject means that it is the same person/subject, filmed without cut.
 For example, if visual elements is like this:
 "An older man wearing a brown hat and glasses, looking off into the distance.
@@ -27,10 +29,10 @@ For example, if visual elements is like this:
 An older man wearing a brown hat and glasses, with a beard and a beard on his chin, is looking at the camera."
 It does not mean there are 3 older men, but this is the same man. Because we have extracted vere close frame from the video sequence.
 
-
+Audio events are actually the scene description based on the audio of the video.
+
+Your job is to use these informations to smartly deduce and provide a very short resume about what is happening in the video.
 
-Your job is to use these informatios to smartly deduce and provide a very short resume about what is happening in the video.
-Keep it short.
 """
 
 def extract_frames(video_in, interval=24, output_format='.jpg'):
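The prompt above says the model receives a list of visual events sampled at a fixed frame interval, plus an audio description. The body of `extract_frames` is not shown in this diff, so the following is only a minimal sketch of the idea, not the code in app.py: `sampled_frame_indices` and `build_user_message` are hypothetical helper names, and the message layout is an assumption.

```python
def sampled_frame_indices(total_frames: int, interval: int = 24) -> list[int]:
    """Return the indices of frames kept when sampling every `interval` frames.

    Hypothetical helper: it only illustrates interval sampling and does not
    read any video file.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    return list(range(0, total_frames, interval))


def build_user_message(captions: list[str], audio_description: str) -> str:
    """Join per-frame captions and the audio description into one message.

    Hypothetical layout: the real app may format its input differently.
    """
    visual = "\n".join(captions)
    return f"Visual events:\n{visual}\n\nAudio description:\n{audio_description}"


if __name__ == "__main__":
    print(sampled_frame_indices(60, interval=24))
    print(build_user_message(["An older man looking at the camera."], "Soft music."))
```

With closely spaced sampled frames, consecutive captions often describe the same subject, which is why the system prompt tells the model to treat repeated descriptions as one person filmed without a cut.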