API Reference
- SparkRDD: A reference to an RDD definition in Spark
- SparkDataFrame: A reference to a DataFrame definition in Spark
- SparkTablesCatalog: Intake automatically-generated catalog for tables stored in Spark
class intake_spark.spark_sources.SparkRDD(*args, **kwargs)
A reference to an RDD definition in Spark
RDDs are list-of-things objects, evaluated lazily in Spark.
Examples
>>> args = [('textFile', ('text.*.files', )),
...         ('map', (len,))]
>>> context = {'master': 'spark://master.node:7077'}
>>> source = SparkRDD(args, context)
The output of source.to_spark() is an RDD object holding the lengths of the lines of the input files.
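For reference, a plain-Python sketch of the (method, args) chain above. Building the list requires neither Spark nor Intake; the commented-out calls would need a reachable Spark master, and the master URL shown is a placeholder:

```python
# Each element of `args` is a (method_name, arguments) pair that is applied,
# in order, to the SparkContext: sc.textFile('text.*.files').map(len)
args = [
    ('textFile', ('text.*.files',)),  # glob of input text files
    ('map', (len,)),                  # map each line to its length
]
context = {'master': 'spark://master.node:7077'}  # placeholder master URL

# Construction is lazy; no Spark connection is made until data is requested:
# source = SparkRDD(args, context)
# lengths = source.read()  # would materialise the RDD as a list of ints
```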
Attributes
- cache
- cache_dirs
- cat
- classname
- description
- dtype
- entry
- gui: Source GUI, with parameter selection and plotting
- has_been_persisted
- hvplot: Returns a hvPlot object to provide a high-level plotting API.
- is_persisted
- plot: Returns a hvPlot object to provide a high-level plotting API.
- plots: List custom associated quick-plots
- shape
Methods
- __call__(**kwargs): Create a new instance of this source with altered arguments
- close(): Close open resources corresponding to this data source.
- configure_new(**kwargs): Create a new instance of this source with altered arguments
- describe(): Description from the entry spec
- discover(): Open resource and populate the source attributes.
- export(path, **kwargs): Save this data for sharing with other people
- get(**kwargs): Create a new instance of this source with altered arguments
- persist([ttl]): Save data from this source to local persistent storage
- read(): Materialise the whole RDD into a list of objects
- read_chunked(): Return iterator over container fragments of data source
- read_partition(i): Return one of the partitions of the RDD as a list of objects
- to_dask(): Return a dask container for this data source
- to_spark(): Return the Spark object for this data, an RDD
- yaml(): Return YAML representation of this data-source
- get_persisted
- set_cache_dir
class intake_spark.spark_sources.SparkDataFrame(*args, **kwargs)
A reference to a DataFrame definition in Spark
DataFrames are tabular spark objects containing a heterogeneous set of columns and potentially a large number of rows. They are similar in concept to Pandas or Dask data-frames. The Spark variety produced by this driver will be a handle to a lazy object, where computation will be managed by Spark.
Examples
>>> args = [['read', ],
...         ['format', ['csv', ]],
...         ['option', ['header', 'true']],
...         ['load', ['data.*.csv', ]]]
>>> context = {'master': 'spark://master.node:7077'}
>>> source = SparkDataFrame(args, context)
The output of source.to_spark() is a Spark object pointing to the parsed contents of the indicated CSV files.
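The chain above mirrors the SparkSession builder calls step by step; the sketch below annotates each element (pure Python, no Spark required to build it; the master URL is a placeholder and the commented calls need a live cluster):

```python
# Equivalent Spark expression:
# spark.read.format('csv').option('header', 'true').load('data.*.csv')
args = [
    ['read'],                        # start from spark.read
    ['format', ['csv']],             # choose the CSV reader
    ['option', ['header', 'true']],  # treat the first row as a header
    ['load', ['data.*.csv']],        # glob of input files
]
context = {'master': 'spark://master.node:7077'}  # placeholder master URL

# source = SparkDataFrame(args, context)  # lazy; nothing runs yet
# sdf = source.to_spark()                 # handle to the Spark DataFrame
# pdf = source.read()                     # full pandas DataFrame (needs Spark)
```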
Attributes
- cache
- cache_dirs
- cat
- classname
- description
- dtype
- entry
- gui: Source GUI, with parameter selection and plotting
- has_been_persisted
- hvplot: Returns a hvPlot object to provide a high-level plotting API.
- is_persisted
- plot: Returns a hvPlot object to provide a high-level plotting API.
- plots: List custom associated quick-plots
- shape
Methods
- __call__(**kwargs): Create a new instance of this source with altered arguments
- close(): Close open resources corresponding to this data source.
- configure_new(**kwargs): Create a new instance of this source with altered arguments
- describe(): Description from the entry spec
- discover(): Open resource and populate the source attributes.
- export(path, **kwargs): Save this data for sharing with other people
- get(**kwargs): Create a new instance of this source with altered arguments
- persist([ttl]): Save data from this source to local persistent storage
- read(): Read all of the data into an in-memory Pandas data-frame
- read_chunked(): Return iterator over container fragments of data source
- read_partition(i): Return one partition of the data as a pandas data-frame
- to_dask(): Return a dask container for this data source
- to_spark(): Return the Spark object for this data, a DataFrame
- yaml(): Return YAML representation of this data-source
- get_persisted
- set_cache_dir
class intake_spark.spark_cat.SparkTablesCatalog(*args, **kwargs)
Intake automatically-generated catalog for tables stored in Spark
This driver will query Spark’s Catalog object for any tables, and create an entry for each which, when accessed, will instantiate SparkDataFrame sources. Commonly, these table definitions will come from Hive.
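The entry-generation step can be pictured with the illustrative helper below. It is only a sketch: the real driver queries Spark's Catalog object at runtime, and the [['table', [name]]] argument form is an assumption about how each generated SparkDataFrame might be parameterised (it mirrors spark.table(name)):

```python
def entries_for_tables(table_names):
    """Illustrative only: build one SparkDataFrame-style args chain per
    table name, as a catalog of Spark tables might do.  Each chain
    corresponds to calling ``spark.table(name)``."""
    return {name: [['table', [name]]] for name in table_names}

# e.g. table names as reported by Hive
entries = entries_for_tables(['sales', 'customers'])
```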
Attributes
- auth
- cache
- cache_dirs
- cat
- classname
- description
- dtype
- entry
- gui: Source GUI, with parameter selection and plotting
- has_been_persisted
- hvplot: Returns a hvPlot object to provide a high-level plotting API.
- is_persisted
- kwargs
- plot: Returns a hvPlot object to provide a high-level plotting API.
- plots: List custom associated quick-plots
- shape
Methods
- __call__(**kwargs): Create a new instance of this source with altered arguments
- close(): Close open resources corresponding to this data source.
- configure_new(**kwargs): Create a new instance of this source with altered arguments
- describe(): Description from the entry spec
- discover(): Open resource and populate the source attributes.
- export(path, **kwargs): Save this data for sharing with other people
- filter(func): Create a Catalog of a subset of entries based on a condition
- force_reload(): Reload data now, unconditionally
- from_dict(entries, **kwargs): Create Catalog from the given set of entries
- get(**kwargs): Create a new instance of this source with altered arguments
- items(): Get an iterator over (key, source) tuples for the catalog entries.
- keys(): Entry names in this catalog as an iterator (alias for __iter__)
- persist([ttl]): Save data from this source to local persistent storage
- pop(key): Remove entry from catalog and return it
- read(): Load entire dataset into a container and return it
- read_chunked(): Return iterator over container fragments of data source
- read_partition(i): Return a part of the data corresponding to the i-th partition.
- reload(): Reload catalog if sufficient time has passed
- save(url[, storage_options]): Output this catalog to a file as YAML
- serialize(): Produce YAML version of this catalog.
- to_dask(): Return a dask container for this data source
- to_spark(): Provide an equivalent data object in Apache Spark
- values(): Get an iterator over the sources for catalog entries.
- walk([sofar, prefix, depth]): Get all entries in this catalog and sub-catalogs
- yaml(): Return YAML representation of this data-source
- get_persisted
- search
- set_cache_dir