
document_corpus

from dhlab.api.dhlab_api import document_corpus

document_corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, lang=None, limit=None, order_by=None)

Fetch a corpus based on metadata.

Call the API BASE_URL endpoint /build_corpus.

Parameters:

doctype (str) – Document type: "digibok", "digavis", "digitidsskrift" or "digistorting".
author (str) – Name of an author.
freetext (str) – Free-text query that can combine any of the parameters, for example: "digibok AND Ibsen".
fulltext (str) – Words that occur within the publication.
from_year (int) – Start year for time period of interest.
to_year (int) – End year for time period of interest.
from_timestamp (int) – Start date for time period of interest, in the format YYYYMMDD. Books have YYYY0101.
to_timestamp (int) – End date for time period of interest, in the format YYYYMMDD. Books have YYYY0101.
title (str) – Name or title of a document.
ddk (str) – Dewey Decimal Classification identifier.
subject (str) – Subject (keywords) of the publication.
lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno".
limit (int) – Number of items to sample.
order_by (str) – Order of elements in the corpus object. Typically used in combination with limit. Examples: "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest).
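
The timestamp parameters take integers in YYYYMMDD form, and every parameter defaults to None. The sketch below shows one possible way to prepare such arguments before a call. The helpers as_timestamp, book_timestamp and build_query are hypothetical names introduced only for this illustration and are not part of dhlab. The range check is likewise an assumption about what a caller may want, not documented API behaviour.

```python
from datetime import date


def as_timestamp(d):
    """Convert a datetime.date to the integer YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day


def book_timestamp(year):
    """Books are dated YYYY0101, so only the year is needed."""
    return year * 10000 + 101


def build_query(**params):
    """Drop unset (None) values and sanity-check the timestamp range.

    Hypothetical helper: the keyword names mirror the parameters of
    document_corpus, but this function itself is not part of dhlab.
    """
    query = {key: value for key, value in params.items() if value is not None}
    start = query.get("from_timestamp")
    end = query.get("to_timestamp")
    if start is not None and end is not None and start > end:
        raise ValueError("from_timestamp must not be later than to_timestamp")
    return query


# Newspapers from June 1905, sampled in database order.
query = build_query(
    doctype="digavis",
    author=None,  # unset values are removed
    from_timestamp=as_timestamp(date(1905, 6, 1)),
    to_timestamp=as_timestamp(date(1905, 6, 30)),
    limit=50,
    order_by="first",
)
```

The resulting dictionary could then be passed on as document_corpus(**query).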

Returns:

A pandas.DataFrame with the corpus information.

Return type:

DataFrame
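
A minimal usage sketch follows, assuming the dhlab package is installed and the API is reachable. The wrapper fetch_books_by_author and its fetch argument are hypothetical, added so the example can be exercised offline with a stand-in function. How the author value is matched against the catalogue is not specified above, so the search string "Ibsen" is only illustrative.

```python
def fetch_books_by_author(author, limit=10, fetch=None):
    """Fetch a small corpus of books by one author.

    `fetch` is a hypothetical hook for offline testing; when omitted,
    dhlab's document_corpus is used.
    """
    if fetch is None:
        # Imported lazily so this module loads even without dhlab installed.
        from dhlab.api.dhlab_api import document_corpus as fetch
    # "first" is described above as the fastest ordering.
    return fetch(doctype="digibok", author=author, limit=limit, order_by="first")


# Example call (needs network access and the dhlab package):
# corpus = fetch_books_by_author("Ibsen")
# print(corpus.shape)   # the result is a pandas.DataFrame
# print(corpus.head())
```

Passing limit together with order_by follows the note in the parameter list that the two are typically used in combination.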