dhlab.api.dhlab_api#

Module Contents#

Functions#

wildcard_search

images

Retrieve images from bokhylla.

ner_from_urn

Get NER annotations for a text (urn) using a spacy model.

pos_from_urn

Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.

show_spacy_models

Show available SpaCy model names.

get_places

Look up placenames in a specific URN.

geo_lookup

From a list of places, return their geolocations.

get_dispersion

Count occurrences of words in the given URN object.

get_metadata

Get metadata for a list of URNs.

get_identifiers

Convert a list of identifiers (oaiid, sesamid, URN or ISBN10) to dhlabids.

get_chunks

Get the text in the document urn as frequencies of chunks of the given chunk_size.

get_chunks_para

Fetch chunks and their frequencies from paragraphs in a document (urn).

evaluate_documents

Count and aggregate occurrences of topic wordbags for each document in a list of urns.

get_reference

Reference frequency list of the n most frequent words from a given corpus in a given period.

find_urns

Return a list of URNs from a collection of docids.

_ngram_doc

Count occurrences of one or more words over a time period.

reference_words

Collect reference data for a list of words over a time period.

ngram_book

Count occurrences of one or more words in books over a given time period.

ngram_periodicals

Get a time series of frequency counts for word in periodicals.

ngram_news

Get a time series of frequency counts for word in newspapers.

get_document_frequencies

Fetch frequency counts of words in documents (urns).

get_word_frequencies

Fetch frequency numbers for words in documents (urns).

get_urn_frequencies

Fetch frequency counts of documents as URNs or DH-lab ids.

get_document_corpus

document_corpus

Fetch a corpus based on metadata.

urn_collocation

Create a collocation from a list of URNs.

totals

Get aggregated raw frequencies of all words in the National Library’s database.

concordance

Get a list of concordances from the National Library’s database.

concordance_counts

Count concordances (keyword in context) for a corpus query (used for collocation analysis).

konkordans

Wrapper for concordance.

word_concordance

Get a list of concordances from the National Library’s database.

collocation

Make a collocation from a corpus query.

word_variant

Find alternative form for a given word form.

word_paradigm

Find paradigms for a given word form.

word_paradigm_many

Find alternative forms for a list of words.

word_form

Look up the morphological feature specification of a word form.

word_form_many

Look up the morphological feature specifications for word forms in a wordlist.

word_lemma

Find the list of possible lemmas for a given word form.

word_lemma_many

Find lemmas for a list of given word forms.

query_imagination_corpus

Fetch data from the imagination corpus.

API#

dhlab.api.dhlab_api.images(text=None, part=True)#

Retrieve images from bokhylla.

Parameters:
  • text – fulltext query expression for sqlite

  • part – if set to a number, the whole page is shown (a bug currently prevents these from going through)

  • delta – if part=True, show additional pixels around the image

  • hits – number of images

dhlab.api.dhlab_api.ner_from_urn(urn: str = None, model: str = None, start_page=0, to_page=0) pandas.DataFrame#

Get NER annotations for a text (urn) using a spacy model.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • model (str) – name of a SpaCy model. Check which models are available with show_spacy_models().

Returns:

Dataframe with annotations and their frequencies

dhlab.api.dhlab_api.pos_from_urn(urn: str = None, model: str = None, start_page=0, to_page=0) pandas.DataFrame#

Get part of speech tags and dependency parse annotations for a text (urn) with a SpaCy model.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • model (str) – name of a SpaCy model. Check which models are available with show_spacy_models().

Returns:

Dataframe with annotations and their frequencies
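A minimal usage sketch (using the example URN above, and assuming show_spacy_models() returns at least one model name):

from dhlab.api.dhlab_api import show_spacy_models, pos_from_urn

models = show_spacy_models()   # names of available SpaCy models
df = pos_from_urn(
    urn="URN:NBN:no-nb_digibok_2011051112001",
    model=models[0],           # assumption: the list is non-empty
    start_page=0,
    to_page=10,
)
print(df.head())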

dhlab.api.dhlab_api.show_spacy_models() List#

Show available SpaCy model names.

dhlab.api.dhlab_api.get_places(urn: str) pandas.DataFrame#

Look up placenames in a specific URN.

Call the API BASE_URL endpoint /places (https://api.nb.no/dhlab/#/default/post_places).

Parameters:

urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

dhlab.api.dhlab_api.geo_lookup(places: List, feature_class: str = None, feature_code: str = None, field: str = 'alternatename') pandas.DataFrame#

From a list of places, return their geolocations

Parameters:
  • places (list) – a list of place names - max 1000

  • feature_class (str) – which GeoNames feature class to return. Example: P

  • feature_code (str) – which GeoNames feature code to return. Example: PPL

  • field (str) – which name field to match - default “alternatename”.
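A sketch chaining the two place functions; that the place names sit in the first column of the get_places result is our assumption, not something the API documents:

from dhlab.api.dhlab_api import get_places, geo_lookup

places = get_places("URN:NBN:no-nb_digibok_2011051112001")
names = places.iloc[:, 0].tolist()[:1000]   # assumed name column; max 1000 places
geo = geo_lookup(names, feature_class="P", feature_code="PPL")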

dhlab.api.dhlab_api.get_dispersion(urn: str = None, words: List = None, window: int = 300, pr: int = 100) pandas.Series#

Count occurrences of words in the given URN object.

Call the API BASE_URL endpoint /dispersion.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • words (list) – list of words. Defaults to a list of punctuation marks.

  • window (int) – The number of tokens to search through per row. Defaults to 300.

  • pr (int) – defaults to 100.

Returns:

a pandas.Series with frequency counts of the words in the URN object.
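For instance, to trace a couple of conjunctions through a book (a sketch; the word list is an arbitrary example):

from dhlab.api.dhlab_api import get_dispersion

disp = get_dispersion(
    urn="URN:NBN:no-nb_digibok_2011051112001",
    words=["og", "men"],
    window=300,   # tokens searched per row
    pr=100,
)
disp.plot()       # assumes matplotlib is available for pandas plotting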

dhlab.api.dhlab_api.get_metadata(urns: List[str] = None) pandas.DataFrame#

Get metadata for a list of URNs.

Calls the API BASE_URL endpoint /get_metadata (https://api.nb.no/dhlab/#/default/post_get_metadata).

Parameters:

urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

dhlab.api.dhlab_api.get_identifiers(identifiers: list = None) list#

Convert a list of identifiers (oaiid, sesamid, URN or ISBN10) to dhlabids.
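A sketch using the example URNs from get_metadata above:

from dhlab.api.dhlab_api import get_metadata, get_identifiers

urns = ["URN:NBN:no-nb_digibok_2008051404065",
        "URN:NBN:no-nb_digibok_2010092120011"]
meta = get_metadata(urns)          # metadata rows, one per URN
dhlabids = get_identifiers(urns)   # the same documents as dhlabids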

dhlab.api.dhlab_api.get_chunks(urn: str = None, chunk_size: int = 300) Union[Dict, List]#

Get the text in the document urn as frequencies of chunks of the given chunk_size.

Calls the API BASE_URL endpoint /chunks.

Parameters:
  • urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

  • chunk_size (int) – Number of tokens to include in each chunk.

Returns:

list of dicts with the resulting chunk frequencies, or an empty dict

dhlab.api.dhlab_api.get_chunks_para(urn: str = None) Union[Dict, List]#

Fetch chunks and their frequencies from paragraphs in a document (urn).

Calls the API BASE_URL endpoint /chunks_para.

Parameters:

urn (str) – uniform resource name, example: URN:NBN:no-nb_digibok_2011051112001

Returns:

list of dicts with the resulting chunk frequencies, or an empty dict
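A sketch of both chunking calls on the example document:

from dhlab.api.dhlab_api import get_chunks, get_chunks_para

urn = "URN:NBN:no-nb_digibok_2011051112001"
chunks = get_chunks(urn, chunk_size=1000)   # fixed-size chunks of 1000 tokens
paragraphs = get_chunks_para(urn)           # one frequency dict per paragraph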

dhlab.api.dhlab_api.evaluate_documents(wordbags: Dict = None, urns: List[str] = None) pandas.DataFrame#

Count and aggregate occurrences of topic wordbags for each document in a list of urns.

Parameters:
  • wordbags (dict) – a dictionary of topic keywords and lists of associated words. Example: {"natur": ["planter", "skog", "fjell", "fjord"], ... }

  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

Returns:

a pandas.DataFrame with the topics as columns, indexed by the dhlabids of the documents.
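A sketch using the documented example wordbag:

from dhlab.api.dhlab_api import evaluate_documents

wordbags = {"natur": ["planter", "skog", "fjell", "fjord"]}
urns = ["URN:NBN:no-nb_digibok_2008051404065",
        "URN:NBN:no-nb_digibok_2010092120011"]
scores = evaluate_documents(wordbags=wordbags, urns=urns)
# columns: the topics ("natur"); index: dhlabids of the documents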

dhlab.api.dhlab_api.get_reference(corpus: str = 'digavis', from_year: int = 1950, to_year: int = 1955, lang: str = 'nob', limit: int = 100000) pandas.DataFrame#

Reference frequency list of the n most frequent words from a given corpus in a given period.

Call the API BASE_URL endpoint /reference_corpus (https://api.nb.no/dhlab/#/default/get_reference_corpus).

Parameters:
  • corpus (str) – Document type to include in the corpus, can be either 'digibok' or 'digavis'.

  • from_year (int) – Starting point for time period of the corpus.

  • to_year (int) – Last year of the time period of the corpus.

  • lang (str) – Language of the corpus, can be one of 'nob', 'nno', 'sme', 'sma', 'smj', 'fkv'.

  • limit (int) – Maximum number of most frequent words.

Returns:

A pandas.DataFrame with the results.
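A sketch requesting a smaller reference list than the default (the parameter values are arbitrary examples):

from dhlab.api.dhlab_api import get_reference

ref = get_reference(corpus="digibok", from_year=1900, to_year=1920,
                    lang="nob", limit=50000)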

dhlab.api.dhlab_api.find_urns(docids: Union[Dict, pandas.DataFrame] = None, mode: str = 'json') pandas.DataFrame#

Return a list of URNs from a collection of docids.

Call the API BASE_URL endpoint /find_urn.

Parameters:
  • docids – dictionary of document IDs ({docid: URN}) or a pandas.DataFrame.

  • mode (str) – Default ‘json’.

Returns:

the URNs that were found, in a pandas.DataFrame.

dhlab.api.dhlab_api._ngram_doc(doctype: str = None, word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None) pandas.DataFrame#

Count occurrences of one or more words over a time period.

The type of document to search through is decided by the doctype. Filter the selection of documents with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.

Parameters:
  • doctype (str) – API endpoint for the document type to get ngrams for. Can be 'book', 'periodicals', or 'newspapers'.

  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno").

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification identifier.

  • topic (str) – Topic of the documents.

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.

dhlab.api.dhlab_api.reference_words(words: List = None, doctype: str = 'digibok', from_year: Union[str, int] = 1800, to_year: Union[str, int] = 2000) pandas.DataFrame#

Collect reference data for a list of words over a time period.

Reference data are the absolute and relative frequencies of the words across all documents of the given doctype in the given time period (from_year - to_year).

Parameters:
  • words (list) – list of word strings.

  • doctype (str) –

    type of reference document. Can be "digibok" or "digavis". Defaults to "digibok".

    Note: If any other string is given as the doctype, the resulting data is equivalent to what you get with doctype="digavis".

  • from_year (int) – first year of publication

  • to_year (int) – last year of publication

Returns:

a DataFrame with the words’ frequency data
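A sketch with two arbitrary example words:

from dhlab.api.dhlab_api import reference_words

freqs = reference_words(words=["demokrati", "frihet"],
                        doctype="digibok",
                        from_year=1880, to_year=1920)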

dhlab.api.dhlab_api.ngram_book(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None) pandas.DataFrame#

Count occurrences of one or more words in books over a given time period.

Call the API BASE_URL endpoint /ngram_book.

Filter the selection of books with metadata. Use % as wildcard where appropriate - no wildcards in word or lang.

Parameters:
  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.

  • topic (str) – Topic of the documents.

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.
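A sketch counting two word forms in books, using the documented comma-separated format for word:

from dhlab.api.dhlab_api import ngram_book

ng = ngram_book(word="frihet,fred",   # several words in one string
                period=(1900, 1950),
                lang="nob")
ng.plot()                             # assumes matplotlib is available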

dhlab.api.dhlab_api.ngram_periodicals(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None, publisher: str = None, lang: str = None, city: str = None, ddk: str = None, topic: str = None, **kwargs) pandas.DataFrame#

Get a time series of frequency counts for word in periodicals.

Call the API BASE_URL endpoint /ngram_periodicals.

Parameters:
  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific document to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

  • publisher (str) – Name of a publisher.

  • lang (str) – Language as a 3-letter ISO code (e.g. "nob" or "nno")

  • city (str) – City of publication.

  • ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.

  • topic (str) – Topic of the documents.

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across years. One year per row.

dhlab.api.dhlab_api.ngram_news(word: Union[List, str] = ['.'], title: str = None, period: Tuple[int, int] = None) pandas.DataFrame#

Get a time series of frequency counts for word in newspapers.

Call the API BASE_URL endpoint /ngram_newspapers.

Parameters:
  • word (str or list of str) – Word(s) to search for. Can be several words in a single string, separated by comma, e.g. "ord,ordene,orda".

  • title (str) – Title of a specific newspaper to search through.

  • period (tuple of ints) – Start and end years or dates of a time period, given as (YYYY, YYYY) or (YYYYMMDD, YYYYMMDD).

Returns:

a pandas.DataFrame with the resulting frequency counts of the word(s), spread across the dates given in the time period. Either one year or one day per row.
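A sketch using date-level resolution for newspapers (the dates are arbitrary examples):

from dhlab.api.dhlab_api import ngram_news

ng = ngram_news(word="krig", period=(19400101, 19450601))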

dhlab.api.dhlab_api.get_document_frequencies(urns: List[str] = None, cutoff: int = 0, words: List[str] = None) pandas.DataFrame#

Fetch frequency counts of words in documents (urns).

Call the API BASE_URL endpoint /frequencies.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • cutoff (int) – minimum frequency of a word to be counted

  • words (list) – a list of words to be counted. If left None, the whole document is returned; if not None, both the counts and their relative frequencies are returned.
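A sketch restricted to three arbitrary function words:

from dhlab.api.dhlab_api import get_document_frequencies

urns = ["URN:NBN:no-nb_digibok_2008051404065",
        "URN:NBN:no-nb_digibok_2010092120011"]
freq = get_document_frequencies(urns=urns, cutoff=5,
                                words=["og", "men", "eller"])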

dhlab.api.dhlab_api.get_word_frequencies(urns: List[str] = None, cutoff: int = 0, words: List[str] = None) pandas.DataFrame#

Fetch frequency numbers for words in documents (urns).

Call the API BASE_URL endpoint /frequencies.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • cutoff (int) – minimum frequency of a word to be counted

  • words (list) – a list of words to be counted; should not be left as None.

dhlab.api.dhlab_api.get_urn_frequencies(urns: List[str] = None, dhlabid: List = None) pandas.DataFrame#

Fetch frequency counts of documents as URNs or DH-lab ids.

Call the API BASE_URL endpoint /frequencies.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • dhlabid (list) – list of numbers for dhlabid: [1000001, 2000003]

dhlab.api.dhlab_api.get_document_corpus(**kwargs)#
dhlab.api.dhlab_api.document_corpus(doctype: str = None, author: str = None, freetext: str = None, fulltext: str = None, from_year: int = None, to_year: int = None, from_timestamp: int = None, to_timestamp: int = None, title: str = None, ddk: str = None, subject: str = None, lang: str = None, limit: int = None, order_by: str = None) pandas.DataFrame#

Fetch a corpus based on metadata.

Call the API BASE_URL endpoint /build_corpus (https://api.nb.no/dhlab/#/default/post_build_corpus).

Parameters:
  • doctype (str) – "digibok", "digavis", "digitidsskrift" or "digistorting"

  • author (str) – Name of an author.

  • freetext (str) – a free-text query combining any of the other parameters, for example: "digibok AND Ibsen".

  • fulltext (str) – words within the publication.

  • from_year (int) – Start year for time period of interest.

  • to_year (int) – End year for time period of interest.

  • from_timestamp (int) – Start date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • to_timestamp (int) – End date for time period of interest. Format: YYYYMMDD, books have YYYY0101

  • title (str) – Name or title of a document.

  • ddk (str) – Dewey Decimal Classification (https://no.wikipedia.org/wiki/Deweys_desimalklassifikasjon) identifier.

  • subject (str) – subject (keywords) of the publication.

  • lang (str) – Language of the publication, as a 3-letter ISO code. Example: "nob" or "nno"

  • limit (int) – number of items to sample.

  • order_by (str) – order of elements in the corpus object. Typically used in combination with a limit. One of "random" (random order, the slowest), "rank" (ordered by relevance, faster), or "first" (breadth-first, using the order in the database table, the fastest method).

Returns:

a pandas.DataFrame with the corpus information.
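A sketch building a small random sample of books; whether % wildcards apply to author here, as they do in the ngram filters, is our assumption:

from dhlab.api.dhlab_api import document_corpus

corpus = document_corpus(
    doctype="digibok",
    author="%ibsen%",    # assumed wildcard syntax
    from_year=1850,
    to_year=1900,
    limit=10,
    order_by="random",
)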

dhlab.api.dhlab_api.urn_collocation(urns: List = None, word: str = 'arbeid', before: int = 5, after: int = 0, samplesize: int = 200000) pandas.DataFrame#

Create a collocation from a list of URNs.

Call the API BASE_URL endpoint /urncolldist_urn.

Parameters:
  • urns (list) – list of uniform resource name strings, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • word (str) – word to construct collocation with.

  • before (int) – number of words preceding the given word.

  • after (int) – number of words following the given word.

  • samplesize (int) – total number of urns to search through.

Returns:

a pandas.DataFrame with distance (sum of distances and Bayesian distance) and frequency for words collocated with word.
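A sketch with a symmetric context window around the default target word:

from dhlab.api.dhlab_api import urn_collocation

coll = urn_collocation(
    urns=["URN:NBN:no-nb_digibok_2008051404065"],
    word="arbeid",
    before=5,
    after=5,
)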

dhlab.api.dhlab_api.totals(top_words: int = 50000) pandas.DataFrame#

Get aggregated raw frequencies of all words in the National Library’s database.

Call the API BASE_URL endpoint /totals/{top_words} (https://api.nb.no/dhlab/#/default/get_totals__top_words_).

Parameters:

top_words (int) – The number of words to get total frequencies for.

Returns:

a pandas.DataFrame with the most frequent words.
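For example, to fetch a short top list:

from dhlab.api.dhlab_api import totals

top = totals(top_words=1000)   # the 1000 most frequent words overall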

dhlab.api.dhlab_api.concordance(urns: list = None, words: str = None, window: int = 25, limit: int = 100) pandas.DataFrame#

Get a list of concordances from the National Library’s database.

Call the API BASE_URL endpoint /conc (https://api.nb.no/dhlab/#/default/post_conc).

Parameters:
  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • words (str) – Word(s) to search for. Can be an SQLite full-text query (an FTS5 string search expression).

  • window (int) – number of tokens on either side to show in the collocations, between 1-25.

  • limit (int) – max. number of concordances per document. Maximum value is 1000.

Returns:

a table of concordances
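A sketch with a plain single-word query; per the words parameter, an FTS5 expression is also allowed:

from dhlab.api.dhlab_api import concordance

conc = concordance(
    urns=["URN:NBN:no-nb_digibok_2008051404065"],
    words="frihet",   # or an FTS5 expression such as "frihet OR fred"
    window=20,
    limit=50,
)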

dhlab.api.dhlab_api.concordance_counts(urns: list = None, words: str = None, window: int = 25, limit: int = 100) pandas.DataFrame#

Count concordances (keyword in context) for a corpus query (used for collocation analysis).

Call the API BASE_URL endpoint /conccount (https://api.nb.no/dhlab/#/default/post_conccount).

Parameters:
  • urns (list) – uniform resource names, for example: ["URN:NBN:no-nb_digibok_2008051404065", "URN:NBN:no-nb_digibok_2010092120011"]

  • words (str) – Word(s) to search for. Can be an SQLite full-text query (an FTS5 string search expression).

  • window (int) – number of tokens on either side to show in the collocations, between 1-25.

  • limit (int) – max. number of concordances per document. Maximum value is 1000.

Returns:

a table of counts

dhlab.api.dhlab_api.konkordans(urns: list = None, words: str = None, window: int = 25, limit: int = 100)#

Wrapper for concordance().

dhlab.api.dhlab_api.word_concordance(urn: list = None, dhlabid: list = None, words: list = None, before: int = 12, after: int = 12, limit: int = 100, samplesize: int = 50000) pandas.DataFrame#

Get a list of concordances from the National Library’s database.

Call the API BASE_URL endpoint /conc (https://api.nb.no/dhlab/#/default/conc_word_urn).

Parameters:
  • urn (list) – dhlab serial ids (the server accepts both URNs and dhlabids).

  • dhlabid (list) – dhlab serial ids, as an alternative to urn.

  • words (list) – word(s) to search for; must be a list.

  • before (int) – number of context words before the match, between 0 and 24.

  • after (int) – number of context words after the match, between 0 and 24 (before + after <= 24).

  • limit (int) – max. number of concordances per server process.

  • samplesize (int) – number of documents sampled from the urns.

Returns:

a table of concordances
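A sketch (assuming, per the parameter note above, that the urn argument also accepts URNs):

from dhlab.api.dhlab_api import word_concordance

conc = word_concordance(
    urn=["URN:NBN:no-nb_digibok_2008051404065"],
    words=["frihet"],   # must be a list
    before=12,
    after=12,           # before + after <= 24
    limit=50,
)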

dhlab.api.dhlab_api.collocation(corpusquery: str = 'norge', word: str = 'arbeid', before: int = 5, after: int = 0) pandas.DataFrame#

Make a collocation from a corpus query.

Parameters:
  • corpusquery (str) – query string

  • word (str) – target word for the collocations.

  • before (int) – number of words prior to word

  • after (int) – number of words following word

Returns:

a dataframe with the resulting collocations

dhlab.api.dhlab_api.word_variant(word: str, form: str, lang: str = 'nob') list#

Find alternative form for a given word form.

Call the API BASE_URL endpoint /variant_form.

Example: word_variant('spiste', 'pres-part')

Parameters:
  • word (str) – any word string

  • form (str) – a morphological feature tag from the Norwegian word bank "Ordbanken" (https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-5/).

  • lang (str) – either “nob” or “nno”

dhlab.api.dhlab_api.word_paradigm(word: str, lang: str = 'nob') list#

Find paradigms for a given word form.

Call the API BASE_URL endpoint /paradigm.

Example:

word_paradigm('spiste')
# [['adj', ['spisende', 'spist', 'spiste']],
#  ['verb', ['spis', 'spise', 'spiser', 'spises', 'spist', 'spiste']]]

Parameters:
  • word (str) – any word string

  • lang (str) – either “nob” or “nno”

dhlab.api.dhlab_api.word_paradigm_many(wordlist: list, lang: str = 'nob') list#

Find alternative forms for a list of words.

dhlab.api.dhlab_api.word_form(word: str, lang: str = 'nob') list#

Look up the morphological feature specification of a word form.

dhlab.api.dhlab_api.word_form_many(wordlist: list, lang: str = 'nob') list#

Look up the morphological feature specifications for word forms in a wordlist.

dhlab.api.dhlab_api.word_lemma(word: str, lang: str = 'nob') list#

Find the list of possible lemmas for a given word form.

dhlab.api.dhlab_api.word_lemma_many(wordlist, lang='nob')#

Find lemmas for a list of given word forms.
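A sketch of the word form and lemma lookups; the lemma in the comment is the expected Norwegian lemma of "spiste":

from dhlab.api.dhlab_api import word_form, word_lemma, word_lemma_many

word_form("spiste", lang="nob")                    # feature tags for this form
word_lemma("spiste", lang="nob")                   # e.g. ["spise"]
word_lemma_many(["spiste", "ordene"], lang="nob")  # lemmas for several forms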

dhlab.api.dhlab_api.query_imagination_corpus(category=None, author=None, title=None, year=None, publisher=None, place=None, oversatt=None)#

Fetch data from the imagination corpus.