This repository contains the code used in the paper:
“Beyond Single Words: MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD”
It supports two main workflows:
- Corpus building: filter bioinformatics articles from a JATS XML collection and segment each article into main IMRaD sections (Abstract, Introduction, Materials and Methods, Results, Discussion, Conclusions when available).
- MWE extraction: extract Multiword Expressions (MWEs) from the resulting corpus using:
  - UD-based extraction (via dependency parsing)
  - USAS-based extraction (via PyMUSAS / UCREL semantic tagging)
  - (Optionally) additional list-based / terminology resources (MeSH, AFL/ARTES) stored in mwes-lists/
The segmented corpus itself is distributed separately (see Corpus access below).
Corpus access: the corpus (already segmented into IMRaD sections) is available for download at https://doi.org/10.6084/m9.figshare.31215955
.
├── corpus-building/ # Python scripts to collect, filter, and structure the corpus
├── mwe-extraction/ # Python scripts to extract MWEs (UD and USAS)
├── mwes-lists/ # lists of MWEs from AFL, ARTES and the MeSH controlled vocabulary thesaurus
├── LICENSE # Code license
├── README.md
└── requirements.txt
git clone git@github.com:jurgigi/BioMONO.git
pip install -r requirements.txt
spaCy model:
python -m spacy download en_core_web_sm
BioMONO_en is derived from the PLOS allofplos collection (JATS XML). Articles tagged with the bioinformatics subject are retained and then segmented into IMRaD sections using JATS section titles / tags, producing one plain-text file per section per article.
PLOS allofplos: https://github.com/PLOS/allofplos
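The filtering and segmentation step described above can be illustrated with a minimal sketch. It is not the repository's corpus_build.py: the title patterns in IMRAD_PATTERNS, the function names, and the inline sample article are all assumptions made for illustration, and real JATS files have more variation in section titles than this handles.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical title patterns, checked in order; the repository's rules may differ.
IMRAD_PATTERNS = {
    "Introduction": re.compile(r"^(introduction|background)", re.I),
    "Materials and Methods": re.compile(r"method", re.I),
    "Results": re.compile(r"^results", re.I),
    "Discussion": re.compile(r"^discussion", re.I),
    "Conclusions": re.compile(r"^conclusion", re.I),
}


def has_subject(root, keyword):
    """True if any JATS <subject> element contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return any(keyword in "".join(s.itertext()).lower() for s in root.iter("subject"))


def section_text(elem):
    """Join all <p> descendants as plain text, one paragraph per line."""
    paragraphs = (" ".join("".join(p.itertext()).split()) for p in elem.iter("p"))
    return "\n".join(text for text in paragraphs if text)


def classify(title):
    """Map a section title to an IMRaD label, or None if nothing matches."""
    for label, pattern in IMRAD_PATTERNS.items():
        if pattern.search(title):
            return label
    return None


def segment(root):
    """Return {IMRaD label: text} for the abstract and top-level body sections."""
    sections = {}
    abstract = root.find(".//abstract")
    if abstract is not None:
        sections["Abstract"] = section_text(abstract)
    body = root.find("body")
    if body is None:
        return sections
    for sec in body.findall("sec"):
        title = sec.find("title")
        label = classify("".join(title.itertext()).strip()) if title is not None else None
        text = section_text(sec)
        if label and text:
            # Sections sharing a label are concatenated.
            sections[label] = "\n".join(filter(None, [sections.get(label), text]))
    return sections


# Tiny made-up article, for illustration only.
SAMPLE = """<article>
  <front><article-meta>
    <article-categories>
      <subj-group><subject>Computer and information sciences</subject>
        <subj-group><subject>Bioinformatics</subject></subj-group>
      </subj-group>
    </article-categories>
    <abstract><p>We study MWEs.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Introduction</title><p>Intro text.</p></sec>
    <sec><title>Materials and methods</title><p>We parsed <italic>all</italic> files.</p></sec>
    <sec><title>Results</title><p>Many MWEs.</p></sec>
  </body>
</article>"""

root = ET.fromstring(SAMPLE)
sections = segment(root) if has_subject(root, "bioinformatics") else {}
for label, text in sections.items():
    print(f"{label}: {text}")
```

Writing each entry of the returned dictionary to its own section folder would give the one-file-per-section-per-article layout described above.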
MWEs are extracted using complementary automated methods:
- UD-based MWEs: extracted from dependency parses using relations commonly associated with multiword constructions:
  - compound (incl. nominal compounds)
  - compound:prt (phrasal verbs)
  - fixed (grammaticalized fixed expressions)
  - flat (headless flat constructions)
  - flat:foreign (foreign sequences)

  Parsing is performed with Stanza: https://github.com/stanfordnlp/stanza
- USAS-based MWEs: extracted via PyMUSAS, which exposes UCREL’s USAS semantic resources and includes MWE tagging support: https://github.com/UCREL/pymusas
- MeSH / AFL / ARTES lists: stored in mwes-lists/ for optional list-based matching in downstream analyses.
MeSH: https://www.nlm.nih.gov/mesh/meshhome.html
AFL: https://www.eapfoundation.com/vocab/academic/afl/
ARTES: https://artes.app.univ-paris-diderot.fr/
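As a rough illustration of the UD-based approach, the sketch below reads CoNLL-U text and collects tokens linked by the five relations listed above. It is not the repository's extraction script: grouping dependents by (head, relation), the function names, and the sample sentence are assumptions made for this example.

```python
from collections import defaultdict

# The five UD relations named in this README.
MWE_RELATIONS = {"compound", "compound:prt", "fixed", "flat", "flat:foreign"}


def read_sentences(conllu_text):
    """Yield sentences as lists of token dicts from CoNLL-U formatted text."""
    sentence = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2), empty nodes (e.g. 1.1)
        # and rows without a numeric head.
        if len(cols) != 10 or not cols[0].isdigit() or not cols[6].isdigit():
            continue
        sentence.append(
            {"id": int(cols[0]), "form": cols[1], "head": int(cols[6]), "deprel": cols[7]}
        )
    if sentence:
        yield sentence


def extract_mwes(sentence):
    """Group each head with its dependents for every target relation."""
    groups = defaultdict(set)
    for tok in sentence:
        if tok["deprel"] in MWE_RELATIONS:
            groups[(tok["head"], tok["deprel"])].update({tok["head"], tok["id"]})
    forms = {tok["id"]: tok["form"] for tok in sentence}
    return [
        {"relation": rel, "tokens": [forms[i] for i in sorted(ids)]}
        for (head, rel), ids in sorted(groups.items(), key=lambda kv: kv[0])
    ]


# Made-up parse of "We carried out gene expression analysis ." as (id, form, head, deprel).
rows = [
    (1, "We", 2, "nsubj"),
    (2, "carried", 0, "root"),
    (3, "out", 2, "compound:prt"),
    (4, "gene", 5, "compound"),
    (5, "expression", 6, "compound"),
    (6, "analysis", 2, "obj"),
    (7, ".", 2, "punct"),
]
SAMPLE = "\n".join(
    f"{i}\t{form}\t_\t_\t_\t_\t{head}\t{rel}\t_\t_" for i, form, head, rel in rows
) + "\n"

mwes = [mwe for sent in read_sentences(SAMPLE) for mwe in extract_mwes(sent)]
for mwe in mwes:
    print(mwe["relation"], " ".join(mwe["tokens"]))
```

Because dependents are grouped per head, a nested compound such as "gene expression analysis" comes out as two overlapping pairs here; merging such chains into a single span is a design choice this sketch leaves open.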
Goal: from a folder of JATS XML files, keep only those whose subject contains “bioinformatics”, then extract IMRaD sections into section-specific output folders.
python corpus-building/corpus_build.py \
/path/to/xml_folder \
/path/to/output_imrad_txt \
--subject bioinformatics
python mwe-extraction/parse_txt_folder_to_conllu.py \
--input_dir /path/to/output_imrad_txt/Introduction \
--output_dir /path/to/conllu/Introduction \
--download_if_missing \
--use_gpu
Optional: use a domain package (if available in your Stanza setup):
python mwe-extraction/parse_txt_folder_to_conllu.py \
--input_dir /path/to/output_imrad_txt/Introduction \
--output_dir /path/to/conllu/Introduction \
--biomed genia \
--download_if_missing \
--use_gpu
Per-file JSON outputs (default):
python mwe-extraction/extract_mwes_from_conllu_folder.py \
--input_dir /path/to/conllu/Introduction \
--output_dir /path/to/ud_mwes_json/Introduction
Single aggregated JSON for the folder:
python mwe-extraction/extract_mwes_from_conllu_folder.py \
--input_dir /path/to/conllu/Introduction \
--output_dir /path/to/ud_mwes_json/Introduction \
--aggregate
This extracts only MWEs detected by PyMUSAS from each input .txt.
Per-file JSON (default):
python mwe-extraction/pymusas_extract_mwes_txt_folder.py \
--input_dir /path/to/output_imrad_txt/Introduction \
--output_dir /path/to/usas_mwes_json/Introduction \
--use_gpu
Single aggregated JSON for the folder:
python mwe-extraction/pymusas_extract_mwes_txt_folder.py \
--input_dir /path/to/output_imrad_txt/Introduction \
--output_dir /path/to/usas_mwes_json/Introduction \
--aggregate \
--agg_name all_usas_mwes_introduction.json \
--use_gpu
Dispersion can be computed once MWEs are extracted. The paper reports:
- Document Frequency (DF) and DF%
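- Gries’ DP, quantifying deviation from an equal-share baseline:
  DP = 0.5 × Σᵢ |pᵢ − sᵢ|
  where pᵢ is the observed proportion of an MWE’s occurrences in document i, and sᵢ is the expected proportion under the baseline (operationalized as the document’s share of tokens in the section). DP is 0 when occurrences are distributed exactly in proportion to document size and approaches 1 when they are concentrated in a few documents.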
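The three measures can be computed from per-document counts. The following is a minimal sketch, assuming the standard definition of Gries’ DP as half the sum of absolute differences between pᵢ and sᵢ; the function name, argument layout, and example numbers are hypothetical and not taken from the repository.

```python
def dispersion(mwe_counts, doc_sizes):
    """Return (DF, DF%, DP) for one MWE within one section.

    mwe_counts[i]: occurrences of the MWE in document i
    doc_sizes[i]:  number of tokens in document i
    """
    if len(mwe_counts) != len(doc_sizes) or not mwe_counts:
        raise ValueError("need one count and one size per document")
    total_hits = sum(mwe_counts)
    total_tokens = sum(doc_sizes)
    if total_hits == 0 or total_tokens == 0:
        raise ValueError("the MWE must occur at least once in a non-empty corpus")

    # Document frequency: number of documents containing the MWE.
    df = sum(1 for count in mwe_counts if count > 0)
    df_pct = 100 * df / len(mwe_counts)

    # DP: half the summed absolute gap between observed share (p_i)
    # and expected share (s_i, the document's share of tokens).
    dp = 0.5 * sum(
        abs(count / total_hits - size / total_tokens)
        for count, size in zip(mwe_counts, doc_sizes)
    )
    return df, df_pct, dp


# Made-up example: three documents of 1000, 1000 and 2000 tokens.
df, df_pct, dp = dispersion([3, 0, 1], [1000, 1000, 2000])
print(f"DF={df}, DF%={df_pct:.1f}, DP={dp:.2f}")
```

In this example the observed shares are 0.75, 0 and 0.25 against expected shares of 0.25, 0.25 and 0.5, so the absolute differences sum to 1.0 and DP is 0.5.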
@inproceedings{
giraud2026beyond,
title={Beyond Single Words: {MWE} Identification in Bioinformatics Research Articles and Dispersion Profiling Across {IMR}aD},
author={Giraud, Jurgi and Gargett, Andrew},
booktitle={22nd Workshop on Multiword Expressions (MWE 2026) @EACL2026},
year={2026},
url={https://openreview.net/forum?id=BHg9nM9DlC}
}