Corpus
from dhlab import Corpus
- class Corpus(doctype=None, author=None, freetext=None, fulltext=None, from_year=None, to_year=None, from_timestamp=None, to_timestamp=None, title=None, ddk=None, subject=None, lang=None, limit=10, order_by='random')
Bases: DhlabObj
Class representing a DHLAB Corpus
Primary object for working with dhlab data. Contains references to texts in the National Library’s collections and metadata about them. Use with coll, conc or freq to analyse using dhlab tools.
Create Corpus
- Parameters:
doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"
author (str) – Name of an author.
freetext (str) – any of the parameters, for example: "digibok AND Ibsen".
fulltext (str) – words within the publication.
from_year (int) – Start year for time period of interest.
to_year (int) – End year for time period of interest.
from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101
to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101
title (str) – Name or title of a document.
ddk (str) – Dewey Decimal Classification identifier.
subject (str) – subject (keywords) of the publication.
lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"
limit (int) – number of items to sample.
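The parameters above can be combined as keyword arguments. The following is a minimal sketch, not an official usage example: `to_timestamp_int`, `corpus_kwargs` and `build_corpus` are hypothetical names introduced here, and the chosen search values are arbitrary. It only illustrates the documented `YYYYMMDD` integer format and how the keywords might be passed on to `Corpus`.

```python
from datetime import date


def to_timestamp_int(d: date) -> int:
    """Format a date as a YYYYMMDD integer, the format documented
    for from_timestamp and to_timestamp."""
    return int(d.strftime("%Y%m%d"))


# Keyword arguments named after the documented parameters.
# The values are illustrative assumptions, not recommended settings.
corpus_kwargs = {
    "doctype": "digavis",
    "fulltext": "Ibsen",
    "from_timestamp": to_timestamp_int(date(1900, 1, 1)),
    "to_timestamp": to_timestamp_int(date(1900, 12, 31)),
    "lang": "nob",
    "limit": 20,
}


def build_corpus(**kwargs):
    """Hypothetical helper: create a Corpus from keyword arguments.

    dhlab is imported lazily so the helpers above can be used
    without the package installed.
    """
    from dhlab import Corpus

    return Corpus(**kwargs)
```

Calling `build_corpus(**corpus_kwargs)` would then construct the corpus; this presumably requires the dhlab package and access to the National Library's services.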
- coll(words=None, before=10, after=10, reference=None, samplesize=20000, alpha=False, ignore_caps=False)
Get collocations of words in corpus
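As a rough illustration of the `coll` signature, the sketch below wraps the call with a symmetric context window. `collocations_around` is a hypothetical helper, not part of dhlab, and passing a single string as `words` is an assumption; the type and meaning of the return value are not described here, so the helper simply returns whatever `coll` produces.

```python
def collocations_around(corpus, word, window=5):
    """Hypothetical wrapper: call corpus.coll with the same number
    of positions before and after `word`.

    `corpus` is expected to be a dhlab Corpus (or any object with a
    compatible coll method); the result is returned unchanged.
    """
    return corpus.coll(words=word, before=window, after=window)
```

With an existing `corpus` object, `collocations_around(corpus, "frihet", window=3)` would be equivalent to `corpus.coll(words="frihet", before=3, after=3)`.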
- classmethod from_df(df, check_for_urn=False)
Typecast Pandas DataFrame to Corpus class
DataFrame must contain URN column