OpenDataLab (@opendatalab_ai) Twitter Tweets • TwiCopy

OpenDataLab

2 years ago

🎉 Upgrade Notice: MinerU is an LLM-powered tool that converts PDFs into machine-readable formats. Version 0.7.1 is now available. It adds a new option to integrate the paddle tablemaster table recognition model, which speeds up table processing. 🚀 github.com/opendatalab/Mi…

OpenDataLab

@opendatalab_ai

a year ago

We present DocLayout-YOLO, a model for layout detection across diverse document types, including but not limited to papers, textbooks, test papers, and slides. ✨ GitHub: github.com/opendatalab/Do… 📜 Paper: arxiv.org/abs/2410.12628 💻 Demo: huggingface.co/spaces/opendat…

OpenDataLab

@opendatalab_ai

a year ago

📢 MinerU New Year Update – January 2025 Highlights. Here are the new features: Brand Visual Revamp: A complete redesign of the MinerU brand, along with the official relaunch of our website, offering easy access to technical documentation. Visit us at: mineru.net

OpenDataLab

@opendatalab_ai

a year ago

● Official Client Release: Download and use with no programming required. Simply drag and drop to extract content from multiple documents quickly, with no login needed.

OpenDataLab

@opendatalab_ai

a year ago

● Online API Services & Demo: Now aligned with the latest model capabilities, with optimized resource scheduling and enhanced batch processing.

OpenDataLab

@opendatalab_ai

a year ago

‼️ Important Notice: The v2 and v3 APIs are now discontinued. Please migrate to the new v4 API, available under the new domain, and create a new token for continued use.

OpenDataLab

@opendatalab_ai

a year ago

"Wanjuan2.0" is a multilingual and multimodal #corpus that comprises four #data modalities: full text, image-text, video, and #SFT, totaling 11.5 million data entries and covering Russian, Arabic, Korean, Vietnamese, Thai, etc. Open-source link: opendatalab.com/applyMultiling…

OpenDataLab

@opendatalab_ai

a year ago

MinerU's core function at a glance:

OpenDataLab

@opendatalab_ai

a year ago

The multilingual and multimodal #corpus "Wanjuan2.0" has been open-sourced on HuggingFace. It offers ultra-fine #data and is applicable to multiple scenarios, such as cultural tourism, commercial trade, and science and technology education. FREE DOWNLOAD FROM: huggingface.co/datasets?sort=…

OpenDataLab

@opendatalab_ai

a year ago

What is your ideal data processing #tool? Let #MinerU be your professional assistant for getting #AI-READY #data. Explore MinerU's core functions to find the ones you need!

OpenDataLab

@opendatalab_ai

a year ago

Are you looking for a tool to help you label #data? Try #LabelU, a flexible labeling tool that supports #CV, voice interaction, and #AI-assisted labeling. 👉 labelu.shlab.tech/tasks/

OpenDataLab

@opendatalab_ai

a year ago

Document content analysis has been a crucial research area in computer vision. We present #MinerU, an open-source solution for high-precision document content extraction. Deep dive into MinerU via the technical report: mineru.site/Saaas%E6%9C%8D…

OpenDataLab

@opendatalab_ai

a year ago

We are very pleased to learn that one of our users has just launched a website about #MinerU! The website offers open-source solutions for data processing, tutoring, sharing of usage experience, and more. Join the community: mineru.site

OpenDataLab

@opendatalab_ai

a year ago

The open-source dataset WanJuanSiLu is designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. WanJuanSiLu mainly consists of eight subsets, including Thai, Russian, Arabic, Korean, and Hungarian.

Andrew Ng

@andrewyng

a year ago

Agentic Document Extraction just got much faster! Median processing time is down from the previous 135 seconds to 8 seconds. It extracts not just text but also diagrams, charts, and form fields from PDFs to give LLM-ready output. Please see the video for details and some application ideas.
