Extracting entities from text
Marvin's extract function is a tool for pulling lists of structured entities from text. It can identify and retrieve many types of data, from primitives like integers and strings to custom types and Pydantic models. It can also follow nuanced instructions, which makes it suitable for a wide range of extraction tasks.
What it does
The extract function pulls lists of structured entities from text.
Example
Extract product features from user feedback:
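A minimal sketch of such a call, assuming marvin is installed and an LLM API key is configured. The feedback string and the wrapper name extract_product_features are illustrative, and the exact output depends on the model:

```python
FEEDBACK = "I love my new phone's camera, but the battery life could be improved."


def extract_product_features(feedback: str) -> list[str]:
    """Return the product features mentioned in `feedback` as a list of strings."""
    # Imported lazily so this module loads even where marvin is not installed.
    import marvin

    # No target is given, so extract returns a list of strings.
    return marvin.extract(feedback, instructions="product features")


# Usage (performs an LLM call):
# extract_product_features(FEEDBACK)
```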
How it works
Marvin creates a schema from the provided type and instructs the LLM to use that schema to format its JSON response. Unlike cast, which converts the entire input, extract tells the LLM not to use the whole text but to look for any mentions that satisfy the schema and any additional instructions.
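The idea can be sketched with the standard library alone. This is a rough illustration, not Marvin's actual implementation: the helpers list_schema_for and build_prompt, the prompt wording, and the small type table are all assumptions made for the example.

```python
import json
from typing import TypedDict, get_type_hints

# JSON Schema names for a few Python primitives (deliberately incomplete).
_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def list_schema_for(target: type) -> dict:
    """Build a JSON Schema describing a list of `target` items."""
    if target in _PRIMITIVES:
        item = {"type": _PRIMITIVES[target]}
    else:
        # Treat anything else as a record whose fields are all primitives.
        hints = get_type_hints(target)
        item = {
            "type": "object",
            "properties": {name: {"type": _PRIMITIVES[tp]} for name, tp in hints.items()},
            "required": list(hints),
        }
    return {"type": "array", "items": item}


def build_prompt(text: str, target: type, instructions: str = "") -> str:
    """Compose a prompt that asks for matching mentions only, formatted as JSON."""
    return (
        "Find every mention in the text that satisfies the schema and the instructions. "
        "Do not convert the text as a whole.\n"
        f"Schema: {json.dumps(list_schema_for(target))}\n"
        f"Instructions: {instructions or 'none'}\n"
        f"Text: {text}"
    )


class Location(TypedDict):
    city: str
    country: str


print(build_prompt("I live in New York.", Location, "all locations"))
```

The schema is always an array, which mirrors the fact that extract returns a list even when only one item matches.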
Supported targets
extract supports almost all builtin Python types, plus Pydantic models, Python's Literal, and TypedDict. Pydantic models are especially useful for describing particular features of the extracted data, such as locations, dates, or more complex types. Builtin types are most useful in conjunction with instructions that provide more precise criteria for extraction.
To specify the output type, pass it as the target argument to extract. The function will always return a list of matching items of the specified type. If no target type is provided, extract will return a list of strings.
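As a sketch of how targets might be declared and passed, assuming marvin is installed and configured. The Sentiment and Purchase types, the wrapper functions, and the instruction text are hypothetical names chosen for this example:

```python
from typing import Literal, TypedDict

# A closed set of labels, expressed with Literal.
Sentiment = Literal["positive", "negative", "neutral"]


# A structured record, expressed with TypedDict.
class Purchase(TypedDict):
    item: str
    price: float


def extract_purchases(text: str) -> list:
    """Return every purchase mentioned in `text` as a list of Purchase dicts."""
    import marvin  # lazy import: requires marvin and an API key at call time

    return marvin.extract(text, target=Purchase)


def extract_sentiments(text: str) -> list:
    """Return one Sentiment label per opinion expressed in `text`."""
    import marvin

    return marvin.extract(text, target=Sentiment, instructions="the sentiment of each opinion")


# Usage (performs LLM calls):
# extract_purchases("I bought a hat for $20 and a scarf for $15.")
```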
To extract multiple types in one call, use a Union (or | in Python 3.10+). Here's a simple example for combining float and int values, but you could do the same for any other types:
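A minimal sketch of a Union target, assuming marvin is installed and configured. The Number alias, the wrapper name extract_money, and the instruction text are illustrative:

```python
from typing import Union, get_args

# Equivalent to `int | float` on Python 3.10+.
Number = Union[int, float]


def extract_money(text: str) -> list:
    """Return the amounts of money in `text`, each as an int or a float."""
    import marvin  # lazy import: requires marvin and an API key at call time

    return marvin.extract(text, target=Number, instructions="Only extract amounts of money")


print(get_args(Number))  # (<class 'int'>, <class 'float'>)

# Usage (performs an LLM call):
# extract_money("I paid $10 for 3 tacos and got a dollar and 25 cents back.")
```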
LLMs perform best with clear instructions, so compound types may require more guidance because the type itself sends a less clear signal.
Note that extract will always return a list of the type you provide.
Instructions
When extracting entities, it is often necessary to give detailed guidance about either the criteria for extraction or the format of the output. For example, you may want to extract all numbers from a text, or you may want to extract all numbers that represent prices, or you may want to extract all numbers that represent prices greater than $100. You may want to extract all dates, or you may want to extract all dates that are in the future. You may want to extract all locations, or you may want to extract all locations that are in the United States.
For this purpose, extract accepts an instructions argument, which is a natural language description of the desired output. The LLM will use these instructions, in addition to the provided type, to guide its extraction process. Instructions are especially important for types that are not self-documenting, such as Python builtins like str and int.
Here are the above examples, illustrated with appropriate instructions. First, extracting different sets of numerical values:
from marvin import extract
text = "These shoes are normally $110, but I got 2 pairs for $80 each."
extract(text, float)
# [110.0, 2.0, 80.0]
extract(text, float, instructions='all numbers that represent prices')
# [110.0, 80.0]
extract(text, float, instructions='all numbers that represent prices greater than $100')
# [110.0]
Next, extracting specific dates:
from datetime import datetime
text = 'I will be out of the office from 9/1/2021 to 9/3/2021.'
extract(text, datetime)
# [datetime(2021, 9, 1, 0, 0), datetime(2021, 9, 3, 0, 0)]
extract(text, datetime, instructions='all dates after September 2nd')
# [datetime(2021, 9, 3, 0, 0)]
Finally, extracting specific locations:
from pydantic import BaseModel
class Location(BaseModel):
    city: str
    country: str
text = 'I live in New York, but I am visiting London next week.'
extract(text, Location)
# [Location(city="New York", country="US"), Location(city="London", country="UK")]
extract(text, Location, instructions='all locations in the United States')
# [Location(city="New York", country="US")]
The same instructions mechanism is available on Marvin's cast function. Sometimes the cast operation is obvious. Other times it is more nuanced, and the LLM may require guidance or examples to make the right decision. You can provide natural language instructions when calling cast() to steer the output.
In a simple case, instructions can be used independently of any type-casting. Here, we want to keep the output a string, but get the 2-letter abbreviation of the state.
import marvin
marvin.cast('California', to=str, instructions="The state's abbreviation")
# "CA"
marvin.cast('The sunshine state', to=str, instructions="The state's abbreviation")
# "FL"
marvin.cast('Mass.', to=str, instructions="The state's abbreviation")
# "MA"
Model parameters
You can pass parameters to the underlying API via the model_kwargs argument of extract. These parameters are passed directly to the API, so you can use any supported parameter.
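For instance, a lower sampling temperature could be requested as sketched below, assuming marvin is installed and configured. The temperature key is only an illustration; which keys are valid depends on the underlying API, and the wrapper name extract_prices is hypothetical:

```python
# Illustrative settings; valid keys depend on the underlying API.
MODEL_KWARGS = {"temperature": 0.0}


def extract_prices(text: str) -> list:
    """Return the prices mentioned in `text` as floats, using MODEL_KWARGS."""
    import marvin  # lazy import: requires marvin and an API key at call time

    return marvin.extract(
        text,
        target=float,
        instructions="all numbers that represent prices",
        model_kwargs=MODEL_KWARGS,
    )


# Usage (performs an LLM call):
# extract_prices("These shoes are normally $110, but I got 2 pairs for $80 each.")
```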
Async support
If you are using Marvin in an async environment, you can use extract_async:
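import marvin
# Call from inside an async function, or from a REPL or notebook that supports top-level await.
result = await marvin.extract_async(
    "I drove from New York to California.",
    target=str,
    instructions="2-letter state codes",
)
assert result == ["NY", "CA"]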
Mapping
To extract from a list of inputs at once, use .map:
import marvin
inputs = [
    "I drove from New York to California.",
    "I took a flight from NYC to BOS.",
]
result = marvin.extract.map(inputs, target=str, instructions="2-letter state codes")
assert result == [["NY", "CA"], ["NY", "MA"]]
(marvin.extract_async.map is also available for async environments.)
Mapping automatically issues parallel requests to the API, making it a highly efficient way to work with multiple inputs at once. The result is a list of outputs in the same order as the inputs.
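The order-preserving behavior can be illustrated with a small asyncio sketch. This is not Marvin's implementation: fake_extract is a hypothetical stand-in for an API request, and map_concurrently is an assumed helper that shows how concurrent requests can still yield results in input order.

```python
import asyncio

# Hypothetical lookup table standing in for the LLM's knowledge of state codes.
_STATE_CODES = {"New York": "NY", "California": "CA", "NYC": "NY", "BOS": "MA"}


async def fake_extract(text: str) -> list[str]:
    """Stand-in for one API request: return state codes in order of appearance."""
    # Longer inputs sleep longer, so the second request finishes before the first.
    await asyncio.sleep(len(text) / 1000)
    hits = sorted(
        (text.index(name), code) for name, code in _STATE_CODES.items() if name in text
    )
    return [code for _, code in hits]


async def map_concurrently(func, inputs):
    """Start one coroutine per input and return the results in input order."""
    return await asyncio.gather(*(func(item) for item in inputs))


inputs = [
    "I drove from New York to California.",
    "I took a flight from NYC to BOS.",
]
results = asyncio.run(map_concurrently(fake_extract, inputs))
print(results)  # [['NY', 'CA'], ['NY', 'MA']]
```

asyncio.gather returns results in the order the awaitables were passed, regardless of which one completes first, which is why the output lines up with the inputs.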