document_corpus
from dhlab.api.dhlab_api import document_corpus
- document_corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, lang=None, limit=None, order_by=None)
Fetch a corpus based on metadata.
Calls the API BASE_URL endpoint /build_corpus.
- Parameters:
doctype (str) – Document type: "digibok", "digavis", "digitidsskrift" or "digistorting".
author (str) – Name of an author.
freetext (str) – Any of the parameters, for example: "digibok AND Ibsen".
fulltext (str) – Words within the publication.
from_year (int) – Start year for time period of interest.
to_year (int) – End year for time period of interest.
from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD; books have YYYY0101.
to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD; books have YYYY0101.
title (str) – Name or title of a document.
ddk (str) – Dewey Decimal Classification identifier.
subject (str) – Subject (keywords) of the publication.
lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno".
limit (int) – Number of items to sample.
order_by (str) – Order of elements in the corpus object. Typically used in combination with a limit. Example: "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest method).
- Returns:
a pandas.DataFrame with the corpus information.
- Return type:
DataFrame
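The parameters above can be combined as keyword arguments. The following is a minimal sketch, not an official recipe: the helpers to_timestamp, newspaper_query and fetch are hypothetical names introduced here for illustration, and the choice of "digavis" with order_by="first" is only one plausible combination. It shows how a date range might be converted to the integer YYYYMMDD form before calling document_corpus.

```python
from datetime import date


def to_timestamp(d: date) -> int:
    """Convert a date to the integer YYYYMMDD form described for
    from_timestamp and to_timestamp (hypothetical helper)."""
    return d.year * 10000 + d.month * 100 + d.day


def newspaper_query(word: str, start: date, end: date, limit: int = 10) -> dict:
    """Collect keyword arguments for document_corpus (hypothetical helper).

    Only parameter names listed in the signature above are used.
    """
    return {
        "doctype": "digavis",
        "fulltext": word,
        "from_timestamp": to_timestamp(start),
        "to_timestamp": to_timestamp(end),
        "limit": limit,
        "order_by": "first",  # described above as the fastest ordering
    }


def fetch(params: dict):
    """Call document_corpus with the collected arguments.

    Assumes the dhlab package is installed and the API is reachable;
    the import is done lazily so the helpers above work without it.
    """
    from dhlab.api.dhlab_api import document_corpus
    return document_corpus(**params)


params = newspaper_query("Ibsen", date(1900, 1, 1), date(1900, 12, 31))
print(params["from_timestamp"], params["to_timestamp"])  # 19000101 19001231

# corpus = fetch(params)  # uncomment to query the API; returns a pandas.DataFrame
```

Keeping the argument building separate from the API call makes the date conversion easy to check offline. For books, where timestamps have the form YYYY0101, passing from_year and to_year is likely the simpler choice.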