Pdf

by nickmuchi - opened Sep 20, 2022

Discussion

nickmuchi

Sep 20, 2022

Can one use a pdf instead of an image?

ankrgyl

Impira org Sep 20, 2022

Hi @nickmuchi absolutely! To clarify, this repo corresponds to the model (layoutlm-document-qa) which takes text + bounding boxes as input. We also created a library called DocQuery which includes tools for parsing images, PDFs, webpages, etc. and takes care of converting them into text + bounding boxes and feeding them into this model.

I'd recommend starting from DocQuery. Feel free to post any questions or issues you run into, either here or on DocQuery's GitHub page, as you explore.

ankrgyl

Impira org Sep 20, 2022

Oh and this space also lets you upload PDF files :). It uses DocQuery behind the scenes.

nickmuchi

Sep 21, 2022

I have tried the space and it works very well, but I was hoping to replicate that in my notebook using a PDF, as the other options only take images. I will start with DocQuery as you suggested, thank you!

nickmuchi changed discussion status to closed Sep 21, 2022

nickmuchi

Sep 21, 2022

I tried pip installing DocQuery and running it in JupyterLab, but I am getting a weird symlink error telling me to switch on Developer Mode on Windows. I switched it on and restarted, but I still get the same error.

Not sure if you have seen this before.

nickmuchi changed discussion status to open Sep 21, 2022

ankrgyl

Impira org Sep 21, 2022

I have not. That code is not part of DocQuery or this model, so I'm unfortunately less familiar with it. Judging by the stack trace you sent, it seems like it would occur under any circumstance where it fails to create a symlink. I would edit the source file (file_download.py) and add a bit more detail to that error message, e.g.

except OSError as e:
    if os.name == "nt":
        raise OSError(
            ...
            f"{str(e)}"
        )

to see what the actual error is. If possible, we should also move this discussion into DocQuery (https://github.com/impira/docquery/issues) so that if others encounter this, they can reference our discussion and the solution.
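As a complementary check that does not require editing the library, here is a minimal sketch that tries to create a symlink in a temporary directory and reports the underlying error if it fails. The helper name `check_symlink_support` is hypothetical and not part of DocQuery or any Hugging Face library; it only exercises Python's standard `os.symlink`, so it tells you whether the current process is allowed to create symlinks at all, not why a specific library call failed.

```python
import os
import tempfile


def check_symlink_support(directory=None):
    """Hypothetical helper: try to create a symlink and report the result.

    Returns a (ok, detail) tuple. `directory` is an optional parent
    directory for the probe files; by default the system temp dir is used.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError) as e:
            # Surface the real error text, as suggested above.
            return False, f"{type(e).__name__}: {e}"
        return True, "symlink created"


if __name__ == "__main__":
    ok, detail = check_symlink_support()
    print(f"os.name={os.name!r} symlinks_ok={ok} detail={detail}")
```

Running this from the same environment as the notebook (for example, in a JupyterLab cell) should show whether the failure is a general permissions issue on the machine or something specific to the download code.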

nickmuchi

Sep 21, 2022

Sure, that sounds fine. Yes, I realised I get the same error when I try to download any Hugging Face tokenizer with from_pretrained. I have reported it on the forum too, and saw that a few Windows users are experiencing the same thing.

nickmuchi changed discussion status to closed Sep 21, 2022
