API Reference

intake_spark.spark_sources.SparkRDD(*args, …)

A reference to an RDD definition in Spark

intake_spark.spark_sources.SparkDataFrame(…)

A reference to a DataFrame definition in Spark

intake_spark.spark_cat.SparkTablesCatalog(…)

Intake automatically-generated catalog for tables stored in Spark

class intake_spark.spark_sources.SparkRDD(*args, **kwargs)[source]

A reference to an RDD definition in Spark

RDDs are list-of-things objects, evaluated lazily in Spark.

Examples

>>> args = [('textFile', ('text.*.files', )),
...         ('map', (len,))]
>>> context = {'master': 'spark://master.node:7077'}
>>> source = SparkRDD(args, context)

The output of source.to_spark() is an RDD object holding the lengths of the lines of the input files.
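
A brief usage sketch, continuing the example above; it assumes the text files exist and the given Spark master is reachable:

>>> rdd = source.to_spark()    # the underlying pyspark RDD, still lazy at this point
>>> lengths = source.read()    # materialise the whole RDD into a Python list of line lengths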

Attributes
cache
cache_dirs
cat
classname
description
dtype
entry
gui

Source GUI, with parameter selection and plotting

has_been_persisted
hvplot

Returns a hvPlot object to provide a high-level plotting API.

is_persisted
plot

Returns a hvPlot object to provide a high-level plotting API.

plots

List custom associated quick-plots

shape

Methods

__call__(**kwargs)

Create a new instance of this source with altered arguments

close()

Close open resources corresponding to this data source.

configure_new(**kwargs)

Create a new instance of this source with altered arguments

describe()

Description from the entry spec

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

get(**kwargs)

Create a new instance of this source with altered arguments

persist([ttl])

Save data from this source to local persistent storage

read()

Materialise the whole RDD into a list of objects

read_chunked()

Return iterator over container fragments of data source

read_partition(i)

Returns one of the partitions of the RDD as a list of objects

to_dask()

Return a dask container for this data source

to_spark()

Return the Spark object for this data, an RDD

yaml()

Return YAML representation of this data-source

get_persisted

set_cache_dir

read()[source]

Materialise the whole RDD into a list of objects

read_partition(i)[source]

Returns one of the partitions of the RDD as a list of objects

to_spark()[source]

Return the Spark object for this data, an RDD
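
As a sketch of partition-wise access (the number of partitions depends on how Spark split the input files):

>>> rdd = source.to_spark()
>>> rdd.getNumPartitions()              # how many partitions Spark created
>>> part = source.read_partition(0)     # contents of the first partition as a list of objects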

class intake_spark.spark_sources.SparkDataFrame(*args, **kwargs)[source]

A reference to a DataFrame definition in Spark

DataFrames are tabular Spark objects containing a heterogeneous set of columns and potentially a large number of rows. They are similar in concept to Pandas or Dask data-frames. The object produced by this driver is a handle to a lazy Spark DataFrame, with computation managed by Spark.

Examples

>>> args = [
...    ['read', ],
...    ['format', ['csv', ]],
...    ['option', ['header', 'true']],
...    ['load', ['data.*.csv', ]]
...    ]
>>> context = {'master': 'spark://master.node:7077'}
>>> source = SparkDataFrame(args, context)

The output of source.to_spark() is a Spark DataFrame pointing to the parsed contents of the indicated CSV files.
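
A short sketch of consuming the source, using the to_spark, read and to_dask methods listed below; it assumes the CSV files exist and the given Spark master is reachable:

>>> sdf = source.to_spark()    # a lazy pyspark.sql.DataFrame
>>> sdf.printSchema()          # inspect the columns Spark parsed from the CSV header
>>> pdf = source.read()        # materialise everything as an in-memory Pandas data-frame
>>> ddf = source.to_dask()     # or obtain a dask container for the same data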

Attributes
cache
cache_dirs
cat
classname
description
dtype
entry
gui

Source GUI, with parameter selection and plotting

has_been_persisted
hvplot

Returns a hvPlot object to provide a high-level plotting API.

is_persisted
plot

Returns a hvPlot object to provide a high-level plotting API.

plots

List custom associated quick-plots

shape

Methods

__call__(**kwargs)

Create a new instance of this source with altered arguments

close()

Close open resources corresponding to this data source.

configure_new(**kwargs)

Create a new instance of this source with altered arguments

describe()

Description from the entry spec

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

get(**kwargs)

Create a new instance of this source with altered arguments

persist([ttl])

Save data from this source to local persistent storage

read()

Read all of the data into an in-memory Pandas data-frame

read_chunked()

Return iterator over container fragments of data source

read_partition(i)

Returns one partition of the data as a pandas data-frame

to_dask()

Return a dask container for this data source

to_spark()

Return the Spark object for this data, a DataFrame

yaml()

Return YAML representation of this data-source

get_persisted

set_cache_dir

read()[source]

Read all of the data into an in-memory Pandas data-frame

read_partition(i)[source]

Returns one partition of the data as a pandas data-frame

to_spark()[source]

Return the Spark object for this data, a DataFrame

class intake_spark.spark_cat.SparkTablesCatalog(*args, **kwargs)[source]

Intake automatically-generated catalog for tables stored in Spark

This driver queries Spark’s Catalog object for available tables and creates an entry for each; when accessed, an entry instantiates a SparkDataFrame source. Commonly, these table definitions come from Hive.
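
A short sketch of browsing such a catalog, assuming a Spark session is configured or can be created; the table name 'mytable' is a placeholder, and the entries that appear depend on what is registered in Spark’s (typically Hive-backed) catalog:

>>> from intake_spark.spark_cat import SparkTablesCatalog
>>> cat = SparkTablesCatalog()          # queries the Spark session's catalog for tables
>>> list(cat)                           # table names become catalog entry names
>>> sdf = cat['mytable'].to_spark()     # hypothetical table; yields the lazy Spark DataFrame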

Attributes
auth
cache
cache_dirs
cat
classname
description
dtype
entry
gui

Source GUI, with parameter selection and plotting

has_been_persisted
hvplot

Returns a hvPlot object to provide a high-level plotting API.

is_persisted
kwargs
plot

Returns a hvPlot object to provide a high-level plotting API.

plots

List custom associated quick-plots

shape

Methods

__call__(**kwargs)

Create a new instance of this source with altered arguments

close()

Close open resources corresponding to this data source.

configure_new(**kwargs)

Create a new instance of this source with altered arguments

describe()

Description from the entry spec

discover()

Open resource and populate the source attributes.

export(path, **kwargs)

Save this data for sharing with other people

filter(func)

Create a Catalog of a subset of entries based on a condition

force_reload()

Imperatively reload the data now

from_dict(entries, **kwargs)

Create Catalog from the given set of entries

get(**kwargs)

Create a new instance of this source with altered arguments

items()

Get an iterator over (key, source) tuples for the catalog entries.

keys()

Entry names in this catalog as an iterator (alias for __iter__)

persist([ttl])

Save data from this source to local persistent storage

pop(key)

Remove entry from catalog and return it

read()

Load entire dataset into a container and return it

read_chunked()

Return iterator over container fragments of data source

read_partition(i)

Return a part of the data corresponding to the i-th partition.

reload()

Reload catalog if sufficient time has passed

save(url[, storage_options])

Output this catalog to a file as YAML

serialize()

Produce YAML version of this catalog.

to_dask()

Return a dask container for this data source

to_spark()

Provide an equivalent data object in Apache Spark

values()

Get an iterator over the sources for catalog entries.

walk([sofar, prefix, depth])

Get all entries in this catalog and sub-catalogs

yaml()

Return YAML representation of this data-source

get_persisted

search

set_cache_dir