
Corpus

from dhlab import Corpus
class Corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, lang=None, limit=10, order_by='random')

Bases: DhlabObj

Class representing a DHLAB Corpus

Primary object for working with dhlab data. Contains references to texts in the National Library’s collections and metadata about them. Use with coll, conc or freq to analyse the texts using dhlab tools.

Create Corpus

Parameters:
  • doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"

  • author (str) – Name of an author.

  • freetext (str) – Free-text search across any of the parameters, for example: "digibok AND Ibsen".

  • fulltext (str) – words within the publication.

  • from_year (int) – Start year for time period of interest.

  • to_year (int) – End year for time period of interest.

  • from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • title (str) – Name or title of a document.

  • ddk (str) – Dewey Decimal Classification identifier.

  • subject (str) – subject (keywords) of the publication.

  • lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"

  • limit (int) – number of items to sample.
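A minimal sketch of how the parameters above might be combined. The helper names to_timestamp, corpus_query and build_corpus are hypothetical and not part of dhlab, and the query values are illustrative only. The dhlab import is deferred to build_corpus, on the assumption that creating a Corpus requires the package to be installed and the dhlab services to be reachable.

```python
from datetime import date


def to_timestamp(day: date) -> int:
    """Encode a date as a YYYYMMDD integer, the format of from_timestamp/to_timestamp."""
    return day.year * 10000 + day.month * 100 + day.day


def corpus_query(**params) -> dict:
    """Keep only the parameters that were actually set (not None)."""
    return {name: value for name, value in params.items() if value is not None}


def build_corpus(**params):
    """Hypothetical wrapper: import dhlab lazily and create the Corpus."""
    from dhlab import Corpus  # assumes dhlab is installed

    return Corpus(**corpus_query(**params))


# Illustrative query: a sample of up to 20 books, language code "nob", 1870 to 1890.
query = corpus_query(
    doctype="digibok",
    author="Ibsen",
    lang="nob",
    from_year=1870,
    to_year=1890,
    title=None,  # unset filters are dropped by corpus_query
    limit=20,
)

# A full date, as could be passed to from_timestamp for dated material.
start = to_timestamp(date(1905, 6, 7))
```

Under these assumptions, build_corpus(**query) would pass the remaining keyword arguments on to Corpus.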

add(new_corpus)

Utility for appending Corpus or DataFrame to self

check_integrity()

Check the integrity of the corpus data.

coll(words=None, before=10, after=10, reference=None, samplesize=20000, alpha=False, ignore_caps=False)

Get collocations of words in corpus

conc(words, window=20, limit=500)

Get concordances of words in corpus

count(words=None)

Get word frequencies for corpus

freq(words=None)

Get word frequencies for corpus
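The three analysis methods coll, conc and freq could be called together along these lines. This is a sketch, not the library's recommended workflow: the function name analyse is hypothetical, the reference does not state which types words accepts (a single string is assumed here), and the return types are not shown on this page.

```python
def analyse(corpus, word):
    """Run collocation, concordance and frequency analysis for one word.

    `corpus` is expected to be a dhlab Corpus (or any object with the same
    coll, conc and freq methods).
    """
    return {
        # before=5 and after=5 narrow the context from the defaults of 10.
        "collocations": corpus.coll(words=word, before=5, after=5),
        # Lower the limit from its default of 500; window=20 is the default.
        "concordances": corpus.conc(word, window=20, limit=100),
        # words is left at its default of None, so no word list is passed.
        "frequencies": corpus.freq(),
    }
```

With a corpus created as described above, analyse(corpus, "frihet") would return the three results in a dictionary; the word is only an example.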

classmethod from_csv(path)

Import corpus from csv

classmethod from_df(df, check_for_urn=False)

Typecast Pandas DataFrame to Corpus class

DataFrame must contain URN column
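Because from_df needs a URN column, it may help to check the input before converting it. The sketch below assumes the column is named "urn" (the reference does not give its exact name); rows_missing_urn and corpus_from_records are hypothetical helpers, and the identifier in the sample data is a placeholder, not a real URN.

```python
def rows_missing_urn(records, column="urn"):
    """Return the indexes of records with no usable value in the URN column."""
    return [i for i, row in enumerate(records) if not row.get(column)]


def corpus_from_records(records, column="urn"):
    """Hypothetical wrapper: validate the records, then convert them to a Corpus."""
    missing = rows_missing_urn(records, column)
    if missing:
        raise ValueError(f"records without a {column!r} value: {missing}")

    # Deferred imports, so the check above runs without pandas or dhlab.
    import pandas as pd
    from dhlab import Corpus  # assumes dhlab is installed

    return Corpus.from_df(pd.DataFrame(records))


# Placeholder data: the second record lacks an identifier.
records = [
    {"urn": "URN-PLACEHOLDER-1", "title": "Example title"},
    {"title": "No identifier"},
]
print(rows_missing_urn(records))
```

Here corpus_from_records(records) raises ValueError before any DataFrame is built, since one record has no URN.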

classmethod from_identifiers(identifiers)

Construct Corpus from list of identifiers

make_subcorpus(authors=None, title=None)

Make subcorpus based on author and title

Parameters:
  • authors (str, optional) – Search the author field. Defaults to None.

  • title (str, optional) – Search the title field. Defaults to None.

Returns:

A subset of the original corpus

Return type:

Corpus

only_one_author()

Only select items with one author

only_one_language()

Only select items with one language

sample(n=5)

Create random subcorpus with n entries
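As an illustration, make_subcorpus and sample might be chained to draw a small random selection for one author. The function name author_sample is hypothetical. make_subcorpus is documented to return a Corpus, so calling sample on its result is valid; what sample itself returns is not stated in this reference.

```python
def author_sample(corpus, author, n=5):
    """Restrict a corpus to one author, then draw n random entries."""
    # make_subcorpus is documented to return a Corpus.
    subset = corpus.make_subcorpus(authors=author)
    # The return type of sample is not stated in this reference.
    return subset.sample(n=n)
```

For example, author_sample(corpus, "Ibsen", n=3) would first filter on the author field and then sample three entries from the result.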