LangExtract is a library from Google developers that lets you easily turn messy, unstructured text into clean, structured data by leveraging LLMs. Users provide a few few-shot examples along with a custom schema and get results based on them. It works not only with proprietary LLMs but also with local LLMs (via Ollama).
A significant amount of healthcare data is unstructured, which makes it a good area for tools like this. Clinical notes are long and full of abbreviations and inconsistencies. Important details such as drug names, dosages, and especially adverse drug reactions (ADRs) are buried in the text. In this article, we wanted to see whether LangExtract can handle ADR detection in clinical notes and, more importantly, whether it is effective. Let's find out. Please note that LangExtract is an open source project from Google developers, but it isn't an officially supported Google product.
A quick note: This article only shows how LangExtract works. I am not a doctor and this isn't medical advice.
▶ The full details are in the Kaggle Notebook if you want to follow along.
Why ADR extraction is important
An adverse drug reaction (ADR) is a harmful and unintended result caused by taking a medication. ADRs range from mild side effects such as nausea and dizziness to serious outcomes that may require medical attention.
Detecting them quickly is important for patient safety and pharmacovigilance. The challenge is that in clinical notes, ADRs are buried alongside past conditions, lab results, and other context, so they are difficult to detect. Detecting ADRs with LLMs is an ongoing area of research. Some recent work showed that LLMs are good at raising red flags but are not fully reliable. The goal here is to see whether this library can find adverse reactions among the other entities in clinical notes, such as medication, dosage, and severity, which makes ADR extraction a good stress test for LangExtract.
How LangExtract works
Before diving into usage, let's break down the LangExtract workflow. It is a simple three-step process:
- Define the extraction task by creating a clear prompt that specifies exactly what you want to extract.
- Provide some high-quality examples to guide the model toward the format and level of detail you expect.
- Submit the input text, choose the model, and let LangExtract process it. You can then review the results, visualize them, or pass them on to a downstream pipeline.
The tool's official GitHub repository has detailed examples across multiple domains, ranging from entity extraction on Shakespeare's Romeo & Juliet to medication identification in clinical notes and structuring radiology reports. Check them out.
Installation
First, you need to install the LangExtract library. It is always a good idea to do this inside a virtual environment to isolate project dependencies.
pip install langextract
Identifying adverse drug reactions in clinical notes with LangExtract & Gemini
Now, let's get to our use case. For this walkthrough, we use Google's Gemini 2.5 Flash model. You can also use Gemini Pro for more complex reasoning tasks. You first need to set the API key:
export LANGEXTRACT_API_KEY="your-api-key-here"
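Inside a notebook or script, the key exported above can be read back from the environment. The sketch below is one way to do that. Note that `get_api_key` is a hypothetical helper name, not part of LangExtract, and failing fast on a missing key is an assumption about what you would want.

```python
import os


def get_api_key(env=None, name="LANGEXTRACT_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the extraction.")
    return key


# Usage with an explicit mapping, so the sketch runs without a real key
print(get_api_key({"LANGEXTRACT_API_KEY": "your-api-key-here"}))
```

Calling `get_api_key()` with no arguments reads from the real process environment.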
Step 1: Define the extraction task
Create a prompt that tells the model which entities to extract. You can also ask for the severity level when it is mentioned.
import textwrap

prompt = textwrap.dedent("""
    Extract medication, dosage, adverse reaction, and action taken from the text.
    For each adverse reaction, include its severity as an attribute if mentioned.
    Use exact text spans from the original text. Do not paraphrase.
    Return entities in the order they appear.""")

Here is the full prompt together with an example that guides the model toward the correct format:
import textwrap
import langextract as lx

# 1) Define the prompt
prompt = textwrap.dedent("""
    Extract condition, medication, dosage, adverse reaction, and action taken from the text.
    For each adverse reaction, include its severity as an attribute if mentioned.
    Use exact text spans from the original text. Do not paraphrase.
    Return entities in the order they appear.""")
# 2) Example
examples = [
    lx.data.ExampleData(
        text=(
            "After taking ibuprofen 400 mg for a headache, "
            "the patient developed mild stomach pain. "
            "They stopped taking the medicine."
        ),
        extractions=[
            lx.data.Extraction(
                extraction_class="condition",
                extraction_text="headache"
            ),
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen"
            ),
            lx.data.Extraction(
                extraction_class="dosage",
                extraction_text="400 mg"
            ),
            lx.data.Extraction(
                extraction_class="adverse_reaction",
                extraction_text="mild stomach pain",
                attributes={"severity": "mild"}
            ),
            lx.data.Extraction(
                extraction_class="action_taken",
                extraction_text="They stopped taking the medicine"
            )
        ]
    )
]
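Because the prompt asks for exact text spans, it can help to confirm that every `extraction_text` in your few-shot examples really appears verbatim in the example text. The following is a minimal sketch of such a check. `find_non_verbatim` is a hypothetical helper, not a LangExtract function, and it relies only on the `text`, `extractions`, and `extraction_text` attributes used in the example above. Stand-in objects are used so the sketch runs without the library installed.

```python
from types import SimpleNamespace


def find_non_verbatim(examples):
    """Return (example_index, extraction_text) pairs that are not substrings of the example text."""
    problems = []
    for i, example in enumerate(examples):
        for extraction in example.extractions:
            if extraction.extraction_text not in example.text:
                problems.append((i, extraction.extraction_text))
    return problems


# Stand-in objects with the same attribute names as the example above
demo = [
    SimpleNamespace(
        text="After taking ibuprofen 400 mg for a headache, "
             "the patient developed mild stomach pain.",
        extractions=[
            SimpleNamespace(extraction_text="ibuprofen"),
            SimpleNamespace(extraction_text="stomach ache"),  # paraphrase, not in the text
        ],
    )
]
print(find_non_verbatim(demo))
```

An empty list means every example span is a verbatim substring. Anything returned is a candidate for rewording before you run the extraction.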
Step 2: Provide input and perform extraction
For input, I use a real clinical sentence from the ADE Corpus V2 dataset on Hugging Face.
input_text = (
    "A 27-year-old man who had a history of bronchial asthma, "
    "eosinophilic enteritis, and eosinophilic pneumonia presented with "
    "fever, skin eruptions, cervical lymphadenopathy, hepatosplenomegaly, "
    "atypical lymphocytosis, and eosinophilia two weeks after receiving "
    "trimethoprim (TMP)-sulfamethoxazole (SMX) therapy."
)
Next, let's run LangExtract with the gemini-2.5-flash model.
import os

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    # Read the key exported earlier from the environment
    api_key=os.environ["LANGEXTRACT_API_KEY"],
)
Step 3: View the results
You can print the extracted entities along with their character positions in the input text:
print(f"Input: {input_text}\n")
print("Extracted entities:")
for entity in result.extractions:
    position_info = ""
    # char_interval is only set when the span could be located in the input
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class.capitalize()}: {entity.extraction_text}{position_info}")

LangExtract correctly identifies the adverse drug reactions without confusing them with the patient's pre-existing conditions, which is a key challenge for this kind of task.
If you want to visualize the results, first save them to a .jsonl file. You can then pass that file to the visualization function, which generates an HTML view.
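For downstream use, it can be handy to group the extracted entities by class and to flag any entity whose character span does not reproduce its text. The sketch below shows one way to do both. `group_by_class` and `ungrounded` are hypothetical helper names, and the stand-in objects only mirror the attributes used in the printing loop above (`extraction_class`, `extraction_text`, `char_interval.start_pos`, `char_interval.end_pos`).

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_class(extractions):
    """Map each extraction class to the list of extracted texts, in order of appearance."""
    grouped = defaultdict(list)
    for e in extractions:
        grouped[e.extraction_class].append(e.extraction_text)
    return dict(grouped)


def ungrounded(extractions, source_text):
    """Return extraction texts whose span is missing or does not match the source text."""
    flagged = []
    for e in extractions:
        interval = e.char_interval
        if interval is None or source_text[interval.start_pos:interval.end_pos] != e.extraction_text:
            flagged.append(e.extraction_text)
    return flagged


# Stand-in data, so the sketch runs without calling a model
source = "The patient developed fever after TMP-SMX therapy."
start = source.index("fever")
demo = [
    SimpleNamespace(
        extraction_class="adverse_reaction",
        extraction_text="fever",
        char_interval=SimpleNamespace(start_pos=start, end_pos=start + len("fever")),
    ),
    SimpleNamespace(
        extraction_class="medication",
        extraction_text="TMP/SMX",  # not verbatim, so no span was found
        char_interval=None,
    ),
]
print(group_by_class(demo))
print(ungrounded(demo, source))
```

With real output you would pass `result.extractions` and `input_text` instead of the stand-ins. Flagged entities are worth a manual look before they reach a pipeline.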
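from IPython.display import display

# Save the annotated result to a .jsonl file
lx.io.save_annotated_documents(
    [result],
    output_name="adr_extraction.jsonl",
    output_dir="."
)
# Build the HTML visualization from the saved file
html_content = lx.visualize("adr_extraction.jsonl")
# Display the HTML content directly in the notebook
display(html_content)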

Working with longer clinical notes
Real clinical notes are often much longer than the example above. For example, below is an actual note from the ADE-Corpus-V2 dataset, which is released under the MIT License. You can access it on Hugging Face or Zenodo.

To handle longer text in LangExtract, you keep the same workflow but add three parameters:
- extraction_passes runs multiple passes over the text to catch more details and improve recall.
- max_workers controls parallel processing, which lets you process larger documents faster.
- max_char_buffer splits the text into smaller chunks, which helps the model stay accurate even when the input is very long.
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,   # multiple passes to improve recall
    max_workers=20,        # parallel processing
    max_char_buffer=1000   # smaller chunks for long inputs
)
Here is the output. For brevity, only part of it is shown.

If you prefer, you can also pass a document URL directly to the text_or_documents parameter.
Using LangExtract with a local model via Ollama
LangExtract is not limited to proprietary APIs. You can also run it with a local model through Ollama. This is especially useful when working with privacy-sensitive clinical data that cannot leave a secure environment. You can set up Ollama locally, pull the model of your choice, and point LangExtract to it. Full instructions are available in the official documentation.
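To build intuition for how max_char_buffer and extraction_passes interact on a long note, here is a rough sketch. It greedily packs words into chunks of limited size and estimates the number of model calls as chunks times passes. Both are simplifying assumptions of mine: this is not LangExtract's actual chunking logic, and the helper names `split_into_chunks` and `estimate_model_calls` are hypothetical.

```python
def split_into_chunks(text, max_char_buffer=1000):
    """Greedily pack whitespace-separated words into chunks of at most max_char_buffer
    characters. A single word longer than the limit becomes its own chunk."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_char_buffer or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def estimate_model_calls(text, max_char_buffer=1000, extraction_passes=3):
    """Rough estimate, assuming every pass processes every chunk once."""
    return len(split_into_chunks(text, max_char_buffer)) * extraction_passes


# A synthetic 500-word "note" stands in for a long clinical document
note = "word " * 500
chunks = split_into_chunks(note, max_char_buffer=1000)
print(len(chunks), [len(c) for c in chunks])
print(estimate_model_calls(note, max_char_buffer=1000, extraction_passes=3))
```

Under these assumptions, a smaller buffer or more passes multiplies the number of calls, which is the trade-off to weigh against the recall you gain.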
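As a rough illustration, the sketch below points the same extraction call at a local Ollama server. It is based on the Ollama example in the project README as I recall it, so treat the parameter names `model_url`, `fence_output`, and `use_schema_constraints`, as well as the model tag `gemma2:2b`, as assumptions to check against the version you install. The helper names are hypothetical, and the URL assumes Ollama is running on its default local port.

```python
def ollama_kwargs(model_id="gemma2:2b", model_url="http://localhost:11434"):
    """Keyword arguments that point lx.extract at a local Ollama server (assumed names)."""
    return {
        "model_id": model_id,
        "model_url": model_url,
        "fence_output": False,
        "use_schema_constraints": False,
    }


def extract_locally(input_text, prompt, examples):
    """Run the extraction against the local model; requires langextract and a running Ollama."""
    import langextract as lx  # imported here so the sketch loads without the library installed

    return lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        **ollama_kwargs(),
    )


print(ollama_kwargs())
```

With the model pulled and the server running, `extract_locally(input_text, prompt, examples)` would reuse the prompt and examples defined earlier, and the note never leaves your machine.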
Conclusion
If you are building an information retrieval system or an application that involves metadata extraction, LangExtract can save you a significant amount of preprocessing effort. In my ADR experiment, LangExtract worked well and correctly identified the medication, dosage, and reaction. What I noticed is that the output depends directly on the quality of the few-shot examples provided by the user. So the LLM does the heavy lifting, but humans remain an important part of the loop. Although the results were encouraging, clinical data is high risk and requires wider and more rigorous testing across diverse datasets before moving to production use.

