Knowledge Extraction from Scientific Docs

Here are some relevant papers on structured knowledge extraction from scientific documents:

"Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical Texts" by Dannenfelser, R., Zhong, J., Zhang, R., & Yao, V.
- Abstract: This paper introduces FlaMBé, a collection of datasets for procedural knowledge extraction from biomedical papers, particularly in single-cell research. FlaMBé includes named entity recognition (NER) and named entity disambiguation (NED) datasets for biological entities.
- Relevant Points: It discusses the challenges and potential in extracting procedural knowledge from scientific literature.
"WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data" by Weber, M., Siebenschuh, C., Butler, R., et al.
- Abstract: WordScape is a pipeline for creating multilingual corpora with document layout annotations. It parses the Open XML structure of Word documents from the web, providing layout-annotated document images and textual representations.
- Relevant Points: Discusses the challenges in using visually rich documents and the potential of WordScape in automating document understanding tasks.
"Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evaluations" by Yuan, L., Chen, Y., Cui, G., et al.
- Abstract: This paper reexamines out-of-distribution robustness in NLP and introduces a benchmark suite BOSS for evaluating it. It conducts experiments on pretrained language models to analyze OOD robustness.
- Relevant Points: Bears on how reliably NLP models, including those used to extract structured knowledge, handle documents that differ from their training data.
"The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications" by Suzgun, M., Melas-Kyriazi, L., Sarkar, S., et al.
- Abstract: Introduces the Harvard USPTO Patent Dataset, a corpus of patent applications with rich structured data. It enables NLP tasks like multi-class classification, language modeling, and summarization.
- Relevant Points: Discusses the extraction and analysis of structured knowledge from patent documents.
"Thrust: Adaptively Propels Large Language Models with External Knowledge" by Zhao, X., Zhang, H., Pan, X., et al.
- Abstract: This paper introduces an instance-level adaptive propulsion of external knowledge for large language models, focusing on efficient knowledge retrieval and integration.
- Relevant Points: Addresses the challenges in extracting and integrating external knowledge for language models.
"CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care" by Xiang, T., Li, L., Li, W., et al.
- Abstract: Presents CARE-MI, a benchmark for evaluating misinformation produced by large language models in Chinese, focusing on maternity and infant care. It introduces a new paradigm for building evaluation benchmarks.
- Relevant Points: Discusses the process of knowledge retrieval and evaluation in a specific context.

These papers cover a range of topics from procedural knowledge extraction in biomedical texts to robustness in NLP models and misinformation evaluation, highlighting the diverse approaches and challenges in structured knowledge extraction from scientific documents.