--- title: "Read in SAS data in parallel into Spark" author: "Jan Wijffels" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Read in SAS data in parallel into Spark} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This R package allows R users to easily import large [SAS](https://www.sas.com) datasets into [Spark](https://spark.apache.org) tables in parallel. The package uses the [spark-sas7bdat Spark package](https://spark-packages.org/package/saurfang/spark-sas7bdat) in order to read a SAS dataset in Spark. That Spark package imports the data in parallel on the Spark cluster using the Parso library and this process is launched from R using the [sparklyr](https://github.com/sparklyr/sparklyr) functionality. More information about the spark-sas7bdat Spark package and sparklyr can be found at: - https://spark-packages.org/package/saurfang/spark-sas7bdat and https://github.com/saurfang/spark-sas7bdat - https://github.com/sparklyr/sparklyr ## Example The following example reads in a file called iris.sas7bdat in parallel in a table called sas_example in Spark. Do try this with bigger data on your cluster and look at the help of the [sparklyr](https://github.com/sparklyr/sparklyr) package to connect to your Spark cluster. ```{r, eval=FALSE} library(sparklyr) library(spark.sas7bdat) mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat") sc <- spark_connect(master = "local") x <- spark_read_sas(sc, path = mysasfile, table = "sas_example") ``` The resulting pointer to a Spark table can be further used in dplyr statements. These will be executed in parallel using the Spark functionalities of the spark-sas7bdat package. ```{r, eval=FALSE} library(dplyr) library(magrittr) x %>% group_by(Species) %>% summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width)) ``` ## Support in big data and Spark analysis Need support in big data and Spark analysis? Contact BNOSAC: http://www.bnosac.be