Optimize and deploy 🤗 Hugging Face Transformer models in production with a single command line.
=> Up to 10X faster inference! <=
Why this tool?
At Lefebvre Dalloz we run semantic search engines in production in the legal domain. In non-marketing language, it's a re-ranker, and we based ours on Transformers.
In this setup, latency is key to providing a good user experience, and relevancy inference is done online for hundreds of snippets per user query.
We have tested many solutions, and below is what we found:
Pytorch + FastAPI = 🐢
Most tutorials on Transformer deployment in production are built on PyTorch and FastAPI. Both are great tools, but neither is very performant for inference (actual measurements below).
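For reference, the typical baseline looks roughly like the minimal sketch below. It is illustrative only, not this repository's code: the model name (a Hugging Face cross-encoder re-ranker) and the endpoint path are assumptions.

```python
# Minimal sketch of the common PyTorch + FastAPI baseline (illustrative;
# model name and endpoint are assumptions, not from this repository).
import torch
from fastapi import FastAPI
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # hypothetical re-ranker choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

app = FastAPI()

@app.post("/score")
def score(query: str, snippet: str) -> float:
    # Tokenize the (query, snippet) pair and run one forward pass per request:
    # simple, but each call pays full Python and eager-mode PyTorch overhead.
    inputs = tokenizer(query, snippet, return_tensors="pt", truncation=True)
    with torch.inference_mode():
        logits = model(**inputs).logits
    return logits[0, 0].item()
```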
Microsoft ONNX Runtime + Nvidia Triton inference server = 🏃💨
Then, if you spend some time, you can build something on ONNX Runtime and the Triton inference server. You will usually get 2X to 4X faster inference compared to vanilla PyTorch. It's cool!
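A hedged sketch of that path: export the model to ONNX with the standard torch exporter, then score with an ONNX Runtime session. File names, shapes, and the model choice are assumptions for illustration.

```python
# Sketch of the ONNX Runtime path (illustrative assumptions: model name,
# file paths, opset). The exported model.onnx can also be served by Triton.
import onnxruntime
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

# Export to ONNX with dynamic batch and sequence-length axes.
inputs = tokenizer("a query", "a snippet", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# Run the optimized graph through ONNX Runtime (GPU if available).
session = onnxruntime.InferenceSession(
    "model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
logits = session.run(
    ["logits"],
    {
        "input_ids": inputs["input_ids"].numpy(),
        "attention_mask": inputs["attention_mask"].numpy(),
    },
)[0]
print(logits)
```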
Nvidia TensorRT + Nvidia Triton inference server = ⚡️🏃💨💨
However, if you want best-in-class performance on GPU, there is only one possible combination: Nvidia TensorRT and Triton. You will usually get 5X faster inference compared to vanilla PyTorch, and sometimes it can rise to 10X faster.
Buuuuttt... TensorRT takes some effort to master: it requires tricks that are not easy to come up with, and we have implemented them for you!
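To give a flavor of what the tool's single command wraps, here is a sketch of building a TensorRT engine from the exported ONNX file with the TensorRT Python API. The shapes, file names, and precision flag are assumptions; the real tricks (and more) are what this repository automates.

```python
# Sketch of an ONNX -> TensorRT engine build (illustrative; shapes, paths,
# and FP16 choice are assumptions, not this repository's exact settings).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(str(parser.get_error(0)))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision: a big part of the speed-up

# Dynamic shapes require an optimization profile describing the
# min / optimal / max (batch, sequence length) expected at inference time.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 16), opt=(32, 128), max=(64, 128))
config.add_optimization_profile(profile)

# Serialize the engine; Triton can load the resulting .plan file directly.
engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine)
```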
About the tool
GitHub stars: 3596
GitHub forks: 2545