pcuenq HF staff committed on
Commit
32d0394
1 Parent(s): c47bb42

Depth Anything V2 Small (Transformers version)

Files changed (4)
  1. README.md +108 -0
  2. config.json +53 -0
  3. model.safetensors +3 -0
  4. preprocessor_config.json +44 -0
README.md ADDED
@@ -0,0 +1,108 @@
+ ---
+ license: apache-2.0
+ tags:
+ - depth
+ - relative depth
+ pipeline_tag: depth-estimation
+ library: transformers
+ widget:
+ - inference: false
+ ---
+
+ # Depth Anything V2 Small – Transformers Version
+
+ Depth Anything V2 is trained on 595K synthetic labeled images and 62M+ real unlabeled images, providing the most capable monocular depth estimation (MDE) model with the following features:
+ - more fine-grained details than Depth Anything V1
+ - more robust than Depth Anything V1 and SD-based models (e.g., Marigold, Geowizard)
+ - more efficient (10x faster) and more lightweight than SD-based models
+ - impressive fine-tuned performance with our pre-trained models
+
+ This model checkpoint is compatible with the transformers library.
+
+ Depth Anything V2 was introduced in [the paper of the same name](https://arxiv.org/abs/2406.09414) by Lihe Yang et al. It uses the same architecture as the original Depth Anything release, but uses synthetic data and a larger capacity teacher model to achieve much finer and more robust depth predictions. The original Depth Anything model was introduced in the paper [Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data](https://arxiv.org/abs/2401.10891) by Lihe Yang et al., and was first released in [this repository](https://github.com/LiheYoung/Depth-Anything).
+
+ [Online demo](https://huggingface.co/spaces/depth-anything/Depth-Anything-V2).
+
+ ## Model description
+
+ Depth Anything V2 leverages the [DPT](https://huggingface.co/docs/transformers/model_doc/dpt) architecture with a [DINOv2](https://huggingface.co/docs/transformers/model_doc/dinov2) backbone.
+
+ The model is trained on ~600K synthetic labeled images and ~62 million real unlabeled images, obtaining state-of-the-art results for both relative and absolute depth estimation.
+
+ <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/depth_anything_overview.jpg"
+ alt="drawing" width="600"/>
+
+ <small> Depth Anything overview. Taken from the <a href="https://arxiv.org/abs/2401.10891">original paper</a>.</small>
+
+ ## Intended uses & limitations
+
+ You can use the raw model for tasks like zero-shot depth estimation. See the [model hub](https://huggingface.co/models?search=depth-anything) to look for
+ other versions on a task that interests you.
+
+ ### How to use
+
+ Here is how to use this model to perform zero-shot depth estimation:
+
+ ```python
+ from transformers import pipeline
+ from PIL import Image
+ import requests
+
+ # load pipe
+ pipe = pipeline(task="depth-estimation", model="pcuenq/Depth-Anything-V2-Small-hf")
+
+ # load image
+ url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ # inference
+ depth = pipe(image)["depth"]
+ ```
+
+ Alternatively, you can use the model and processor classes:
+
+ ```python
+ from transformers import AutoImageProcessor, AutoModelForDepthEstimation
+ import torch
+ import numpy as np
+ from PIL import Image
+ import requests
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+
+ image_processor = AutoImageProcessor.from_pretrained("pcuenq/Depth-Anything-V2-Small-hf")
+ model = AutoModelForDepthEstimation.from_pretrained("pcuenq/Depth-Anything-V2-Small-hf")
+
+ # prepare image for the model
+ inputs = image_processor(images=image, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predicted_depth = outputs.predicted_depth
+
+ # interpolate to original size
+ prediction = torch.nn.functional.interpolate(
+     predicted_depth.unsqueeze(1),
+     size=image.size[::-1],
+     mode="bicubic",
+     align_corners=False,
+ )
+ ```
+
+ For more code examples, please refer to the [documentation](https://huggingface.co/transformers/main/model_doc/depth_anything.html#).
+
+
+ ### Citation
+
+ ```bibtex
+ @misc{yang2024depth,
+   title={Depth Anything V2},
+   author={Lihe Yang and Bingyi Kang and Zilong Huang and Zhen Zhao and Xiaogang Xu and Jiashi Feng and Hengshuang Zhao},
+   year={2024},
+   eprint={2406.09414},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```
+
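The second README snippet stops at a floating-point `prediction` map, which usually needs rescaling before it can be viewed as a grayscale image. The following is a minimal sketch of that min-max rescaling, assuming the map has already been converted to a 2D nested list. `depth_to_uint8` is a hypothetical helper and not part of transformers; with real model output the same arithmetic would more likely be done with NumPy or torch operations.

```python
from typing import List, Sequence


def depth_to_uint8(depth: Sequence[Sequence[float]]) -> List[List[int]]:
    """Min-max scale a 2D relative depth map to integers in [0, 255]."""
    flat = [value for row in depth for value in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map has no contrast to stretch, so return all zeros.
        return [[0 for _ in row] for row in depth]
    return [[round(255 * (value - lo) / span) for value in row] for row in depth]


# Toy 2x2 "depth map" standing in for real model output.
toy_depth = [[0.5, 1.0], [2.0, 2.5]]
print(depth_to_uint8(toy_depth))  # [[0, 64], [191, 255]]
```

Because the model card tags the output as relative depth, only the ordering and ratios inside one image are meaningful, which is why a per-image rescale is a reasonable choice for display.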
config.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "_commit_hash": null,
+   "architectures": [
+     "DepthAnythingForDepthEstimation"
+   ],
+   "backbone": null,
+   "backbone_config": {
+     "architectures": [
+       "Dinov2Model"
+     ],
+     "hidden_size": 384,
+     "image_size": 518,
+     "model_type": "dinov2",
+     "num_attention_heads": 6,
+     "out_features": [
+       "stage3",
+       "stage6",
+       "stage9",
+       "stage12"
+     ],
+     "out_indices": [
+       3,
+       6,
+       9,
+       12
+     ],
+     "patch_size": 14,
+     "reshape_hidden_states": false,
+     "torch_dtype": "float32"
+   },
+   "fusion_hidden_size": 64,
+   "head_hidden_size": 32,
+   "head_in_index": -1,
+   "initializer_range": 0.02,
+   "model_type": "depth_anything",
+   "neck_hidden_sizes": [
+     48,
+     96,
+     192,
+     384
+   ],
+   "patch_size": 14,
+   "reassemble_factors": [
+     4,
+     2,
+     1,
+     0.5
+   ],
+   "reassemble_hidden_size": 384,
+   "torch_dtype": "float32",
+   "transformers_version": null,
+   "use_pretrained_backbone": false
+ }
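The numbers in this config fit together arithmetically, and a short sketch may make that easier to see. It derives the patch grid, the per-head width, and the spatial side of each reassembled feature map for a square 518-pixel input. The resizing rule in `reassembled_side` is an assumption about how a DPT-style neck typically applies `reassemble_factors` (integer upsampling for factors of 1 or more, a stride-2 3x3 convolution with padding 1 for 0.5); it is not taken from this commit.

```python
# Values copied from config.json above.
IMAGE_SIZE = 518
PATCH_SIZE = 14
HIDDEN_SIZE = 384
NUM_ATTENTION_HEADS = 6
REASSEMBLE_FACTORS = [4, 2, 1, 0.5]
NECK_HIDDEN_SIZES = [48, 96, 192, 384]


def reassembled_side(side: int, factor: float) -> int:
    """Spatial side length after resizing by `factor` (assumed DPT-style rule)."""
    if factor >= 1:
        # Assumed: upsampling by an integer factor multiplies the side length.
        return side * int(factor)
    # Assumed: 3x3 convolution, padding 1, stride 1/factor.
    stride = int(1 / factor)
    return (side + 2 * 1 - 3) // stride + 1


grid_side = IMAGE_SIZE // PATCH_SIZE           # patches per side
num_patches = grid_side ** 2                   # patch tokens per image
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # width of each attention head

print(grid_side, num_patches, head_dim)  # 37 1369 64
for factor, channels in zip(REASSEMBLE_FACTORS, NECK_HIDDEN_SIZES):
    side = reassembled_side(grid_side, factor)
    print(f"factor {factor}: {channels} channels at {side}x{side}")
```

Under these assumptions the four stages come out at 148, 74, 37, and 19 pixels per side, a coarse-to-fine pyramid built from the four backbone stages listed in `out_indices`.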
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3152477ce0d8d6978d76b995120de97cb5b928701fd0f817769f59e249a16b70
+ size 99173660
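The three lines above are a Git LFS pointer, not the weights themselves: they record the hash and byte size of the real file. As a rough sketch of how a downloaded `model.safetensors` could be checked against that pointer, the code below parses the pointer text and compares size and SHA-256. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names introduced here for illustration.

```python
import hashlib
from pathlib import Path

# Pointer text copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:3152477ce0d8d6978d76b995120de97cb5b928701fd0f817769f59e249a16b70
size 99173660
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and unpack the 'algo:digest' oid field."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and hash."""
    file_path = Path(path)
    if file_path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with file_path.open("rb") as handle:
        # Hash in chunks so large weight files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
# Upper bound on float32 values in the file, in millions (the file also holds a header).
print(round(pointer["size"] / 4 / 1e6, 1))  # 24.8
```

A call such as `matches_pointer("model.safetensors", pointer)` would then confirm a local download, where the path is a placeholder for wherever the file was saved.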
preprocessor_config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "_valid_processor_keys": [
+     "images",
+     "do_resize",
+     "size",
+     "keep_aspect_ratio",
+     "ensure_multiple_of",
+     "resample",
+     "do_rescale",
+     "rescale_factor",
+     "do_normalize",
+     "image_mean",
+     "image_std",
+     "do_pad",
+     "size_divisor",
+     "return_tensors",
+     "data_format",
+     "input_data_format"
+   ],
+   "do_normalize": true,
+   "do_pad": false,
+   "do_rescale": true,
+   "do_resize": true,
+   "ensure_multiple_of": 14,
+   "image_mean": [
+     0.485,
+     0.456,
+     0.406
+   ],
+   "image_processor_type": "DPTImageProcessor",
+   "image_std": [
+     0.229,
+     0.224,
+     0.225
+   ],
+   "keep_aspect_ratio": true,
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 518,
+     "width": 518
+   },
+   "size_divisor": null
+ }
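To make these settings concrete, here is a small sketch of what they imply for one image: the resize target when `keep_aspect_ratio` and `ensure_multiple_of` are both active, followed by per-pixel rescaling and normalization. The scale-selection and rounding rules are assumptions modeled on how a DPT-style image processor is generally understood to behave; `resized_shape` and `normalize_pixel` are hypothetical helpers, not library functions. For reference, `resample: 3` is PIL's bicubic filter and `rescale_factor` equals 1/255.

```python
from typing import Tuple

# Values copied from preprocessor_config.json above.
TARGET_SIZE = (518, 518)              # (height, width)
ENSURE_MULTIPLE_OF = 14
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255
IMAGE_MEAN = (0.485, 0.456, 0.406)
IMAGE_STD = (0.229, 0.224, 0.225)


def snap_to_multiple(value: float, multiple: int) -> int:
    """Round to the nearest multiple (assumed rounding rule)."""
    return round(value / multiple) * multiple


def resized_shape(height: int, width: int) -> Tuple[int, int]:
    """Output (height, width) for keep_aspect_ratio=True, ensure_multiple_of=14."""
    scale_h = TARGET_SIZE[0] / height
    scale_w = TARGET_SIZE[1] / width
    # Assumed: one scale is used for both axes, whichever is closer to 1.
    if abs(1 - scale_w) < abs(1 - scale_h):
        scale_h = scale_w
    else:
        scale_w = scale_h
    return (
        snap_to_multiple(scale_h * height, ENSURE_MULTIPLE_OF),
        snap_to_multiple(scale_w * width, ENSURE_MULTIPLE_OF),
    )


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Rescale 0-255 channel values to 0-1, then subtract mean and divide by std."""
    return tuple(
        (channel * RESCALE_FACTOR - mean) / std
        for channel, mean, std in zip(rgb, IMAGE_MEAN, IMAGE_STD)
    )


# A 640x480 (width x height) input, the size of the COCO sample used in the README.
print(resized_shape(480, 640))  # (518, 686)
```

Both output sides are multiples of 14, matching the backbone `patch_size`, so the resized image divides evenly into patches even though it is no longer square.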