
document_corpus

from dhlab.api.dhlab_api import document_corpus
document_corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, lang=None, limit=None, order_by=None)

Fetch a corpus based on metadata.

Call the API BASE_URL endpoint /build_corpus.

Parameters:
  • doctype (str) – Document type. One of "digibok", "digavis", "digitidsskrift" or "digistorting".

  • author (str) – Name of an author.

  • freetext (str) – Free-text query that can combine values for any of the other parameters, for example: "digibok AND Ibsen".

  • fulltext (str) – Words to search for within the text of the publication.

  • from_year (int) – Start year for time period of interest.

  • to_year (int) – End year for time period of interest.

  • from_timestamp (int) – Start date for time period of interest, as an integer in the format YYYYMMDD. Books use YYYY0101.

  • to_timestamp (int) – End date for time period of interest, as an integer in the format YYYYMMDD. Books use YYYY0101.

  • title (str) – Name or title of a document.

  • ddk (str) – Dewey Decimal Classification identifier.

  • subject (str) – Subject (keywords) of the publication.

  • lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno".

  • limit (int) – Number of items to sample.

  • order_by (str) – Order of elements in the corpus object, typically used in combination with limit. One of "random" (random order, the slowest), "rank" (ordered by relevance, faster) or "first" (breadth-first, using the order in the database table, the fastest).
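The timestamp parameters take plain integers in YYYYMMDD form, so it can help to build them from dates instead of typing them by hand. The following is a minimal sketch, not part of dhlab: the helpers to_timestamp and book_timestamp and the example query values are hypothetical names and values chosen for illustration.

```python
from datetime import date


def to_timestamp(day: date) -> int:
    """Encode a date as a YYYYMMDD integer (hypothetical helper, not part of dhlab)."""
    return day.year * 10000 + day.month * 100 + day.day


def book_timestamp(year: int) -> int:
    """Books carry only a year, which is encoded as YYYY0101 (hypothetical helper)."""
    return year * 10000 + 101


# Keyword arguments that could be passed on as document_corpus(**query).
# The values are illustrative only.
query = {
    "doctype": "digavis",
    "from_timestamp": to_timestamp(date(1940, 4, 9)),
    "to_timestamp": to_timestamp(date(1945, 5, 8)),
    "limit": 100,
    "order_by": "random",
}
```

Keeping the arguments in a dictionary makes it easy to reuse the same period with a different doctype or limit.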

Returns:

a pandas.DataFrame with the corpus information.

Return type:

DataFrame
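As a usage sketch, the call below requests a small sample of books within a year range. It assumes that the dhlab package is installed and that the API is reachable over the network. The wrapper fetch_sample_corpus is a hypothetical name, and the filter values are illustrative rather than taken from the dhlab documentation.

```python
def fetch_sample_corpus(limit=10):
    """Fetch a small corpus of books (hypothetical wrapper, needs network access)."""
    # Imported lazily so that defining this function does not require dhlab.
    from dhlab.api.dhlab_api import document_corpus

    return document_corpus(
        doctype="digibok",   # books
        author="Ibsen",      # illustrative author filter
        from_year=1850,
        to_year=1900,
        lang="nob",
        limit=limit,
        order_by="first",    # described above as the fastest ordering
    )
```

Calling corpus = fetch_sample_corpus() should give a pandas.DataFrame, so the usual inspection tools such as corpus.shape and corpus.head() apply. The columns depend on what the API returns and are not assumed here.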