Wednesday, August 23, 2017

R and the Oracle database: Using dplyr / dbplyr with ROracle in Windows 10

R uses data extensively. Data often resides in a database. In this blog I will describe installing and using dplyr, dbplyr and ROracle on Windows 10 to access data from an Oracle database and use it in R.


Accessing the Oracle database from R

dplyr makes the most common data manipulation tasks in R easier. dplyr can use dbplyr. dbplyr provides a transformation from the dplyr verbs to SQL queries. dbplyr 1.1.0 is released 2017-06-27. See here. It uses the DBI (R Database Interface). See here. This interface is implemented by various drivers such as ROracle. ROracle is an Oracle driver based on OCI (Oracle Call Interface) which is a high performance native C interface to connect to the Oracle Database.

Installing ROracle on Windows 10

I encountered several errors when installing ROracle in Windows 10 on R 3.3.3. The steps to take to do this right in one go are the following:
  • Determine your R platform architecture. 32 bit or 64 bit. For me this was 64 bit 
  • Download and install the oracle instant client with the corresponding architecture (here). Download the basic and SDK files. Put the sdk file from the sdk zip in a subdirectory of the extracted basic zip (at the same level as vc14)
  • Download and install RTools (here)
  • Set the OCI_LIB64 or OCI_LIB32 variables to the instant client path
  • Set the PATH variable to include the location of oci.dll
  • Install ROracle (install.packages("ROracle") in R)
Encountered errors

Warning in install.packages :
  package ‘ROracle_1.3-1.zip’ is not available (for R version 3.3.3)

You probably tried to install the ROracle package which Oracle provides on an R version which is too new (see here). This will not work on R 3.3.3. You can compile ROracle on your own or use the (older) R version Oracle supports.

Package which is only available in source form, and may need compilation of C/C++/Fortran: ‘ROracle’ These will not be installed

This can be done by installing RTools (here). This will install all the tools required to compile sources on a Windows machine.

Next you will get the following question:

Package which is only available in source form, and may need compilation of C/C++/Fortran: ‘ROracle’
Do you want to attempt to install these from sources?
y/n:

If you say y, you will get the following error:

installing the source package ‘ROracle’

trying URL 'https://cran.rstudio.com/src/contrib/ROracle_1.3-1.tar.gz'
Content type 'application/x-gzip' length 308252 bytes (301 KB)
downloaded 301 KB

* installing *source* package 'ROracle' ...
** package 'ROracle' successfully unpacked and MD5 sums checked
ERROR: cannot find Oracle Client.
       Please set OCI_LIB64 to specify its location.

In order to fix this, you can download and install the Oracle Instant Client (the basic and SDK downloads).

Mind that when running a 64 bit version of R, you also need a 64 bit version of the instant client. You can check with the R version command. In my case: Platform: x86_64-w64-mingw32/x64 (64-bit). Next you have to set the OCI_LIB64 variable (for 64 bit else OCI_LIB32) to the specified path. After that you will get the error as specified below:

Next it will fail with something like:

Error in inDL(x, as.logical(local), as.logical(now), ...) :
  unable to load shared object 'ROracle.dll':
  LoadLibrary failure:  The specified module could not be found.

This is caused when oci.dll from the instant client is not in the path environment variable. Add it and it will work! (at least it did on my machine). The INSTALL file from the ROracle package contains a lot of information about different errors which can occur during installation. If you encounter any other errors, be sure to check it.

How a successful 64 bit compilation looks

> install.packages("ROracle")
Installing package into ‘C:/Users/maart_000/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Package which is only available in source form, and may need compilation of C/C++/Fortran: ‘ROracle’
Do you want to attempt to install these from sources?
y/n: y
installing the source package ‘ROracle’

trying URL 'https://cran.rstudio.com/src/contrib/ROracle_1.3-1.tar.gz'
Content type 'application/x-gzip' length 308252 bytes (301 KB)
downloaded 301 KB

* installing *source* package 'ROracle' ...
** package 'ROracle' successfully unpacked and MD5 sums checked
Oracle Client Shared Library 64-bit - 12.2.0.1.0 Operating in Instant Client mode.
found Instant Client C:\Users\maart_000\Desktop\instantclient_12_2
found Instant Client SDK C:\Users\maart_000\Desktop\instantclient_12_2/sdk/include
copying from C:\Users\maart_000\Desktop\instantclient_12_2/sdk/include
** libs
Warning: this package has a non-empty 'configure.win' file,
so building only the main architecture

c:/Rtools/mingw_64/bin/gcc  -I"C:/PROGRA~1/R/R-33~1.3/include" -DNDEBUG -I./oci    -I"d:/Compiler/gcc-4.9.3/local330/include"     -O2 -Wall  -std=gnu99 -mtune=core2 -c rodbi.c -o rodbi.o
c:/Rtools/mingw_64/bin/gcc  -I"C:/PROGRA~1/R/R-33~1.3/include" -DNDEBUG -I./oci    -I"d:/Compiler/gcc-4.9.3/local330/include"     -O2 -Wall  -std=gnu99 -mtune=core2 -c rooci.c -o rooci.o
c:/Rtools/mingw_64/bin/gcc -shared -s -static-libgcc -o ROracle.dll tmp.def rodbi.o rooci.o C:\Users\maart_000\Desktop\instantclient_12_2/oci.dll -Ld:/Compiler/gcc-4.9.3/local330/lib/x64 -Ld:/Compiler/gcc-4.9.3/local330/lib -LC:/PROGRA~1/R/R-33~1.3/bin/x64 -lR
installing to C:/Users/maart_000/Documents/R/win-library/3.3/ROracle/libs/x64
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (ROracle)

Testing ROracle

You can read the ROracle documentation here. Oracle has been so kind as to provide developer VM's to play around with the database. You can download them here. I used 'Database App Development VM'.

After installation of ROracle you can connect to the database and for example fetch employees from the EMP table. See for example below (make sure you also have DBI installed).

library("DBI")
library("ROracle")
drv <- dbDriver("Oracle")
host <- "localhost"
port <- "1521"
sid <- "orcl12c"
connect.string <- paste(
  "(DESCRIPTION=",
  "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
  "(CONNECT_DATA=(SID=", sid, ")))", sep = "")

con <- dbConnect(drv, username = "system", password = "oracle", dbname = connect.string, prefetch = FALSE,
          bulk_read = 1000L, stmt_cache = 0L, external_credentials = FALSE,
          sysdba = FALSE)

dbReadTable(con, "EMP")

This will yield the data in the EMP table.

   EMPNO  ENAME      JOB  MGR            HIREDATE  SAL COMM DEPTNO
1   7698  BLAKE  MANAGER 7839 1981-05-01 00:00:00 2850   NA     30
2   7566  JONES  MANAGER 7839 1981-04-02 00:00:00 2975   NA     20
3   7788  SCOTT  ANALYST 7566 1987-04-19 00:00:00 3000   NA     20
4   7902   FORD  ANALYST 7566 1981-12-02 23:00:00 3000   NA     20
5   7369  SMITH    CLERK 7902 1980-12-16 23:00:00  800   NA     20
6   7499  ALLEN SALESMAN 7698 1981-02-19 23:00:00 1600  300     30
7   7521   WARD SALESMAN 7698 1981-02-21 23:00:00 1250  500     30
8   7654 MARTIN SALESMAN 7698 1981-09-27 23:00:00 1250 1400     30
9   7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500    0     30
10  7876  ADAMS    CLERK 7788 1987-05-23 00:00:00 1100   NA     20
11  7900  JAMES    CLERK 7698 1981-12-02 23:00:00  950   NA     30

Using dplyr

dplyr uses dbplyr and it makes working with database data a lot easier. You can see an example here.

Installing dplyr and dbplyr in R is easy:

install.packages("dplyr")
install.packages("dbplyr")

Various functions are provides to work with data.frames, a popular R datatype in combination with data from the database. Also dplyr uses an abstraction above SQL which makes coding SQL for non-SQL coders more easy. You can compare it in some ways with Hibernate which makes working with databases from the Java object world more easy.

Some functions dplyr provides:

filter() to select cases based on their values.
arrange() to reorder the cases.
select() and rename() to select variables based on their names.
mutate() and transmute() to add new variables that are functions of existing variables.
summarise() to condense multiple values to a single value.
sample_n() and sample_frac() to take random samples.

I'll use the same example data as with the above sample which uses plain ROracle

library("DBI")
library("ROracle")
library("dplyr")

#below are required to make the translation done by dbplyr to SQL produce working Oracle SQL
sql_translate_env.OraConnection <- dbplyr:::sql_translate_env.Oracle
sql_select.OraConnection <- dbplyr:::sql_select.Oracle
sql_subquery.OraConnection <- dbplyr:::sql_subquery.Oracle 

drv <- dbDriver("Oracle")
host <- "localhost"
port <- "1521"
sid <- "orcl12c"
connect.string <- paste(
  "(DESCRIPTION=",
  "(ADDRESS=(PROTOCOL=tcp)(HOST=", host, ")(PORT=", port, "))",
  "(CONNECT_DATA=(SID=", sid, ")))", sep = "")

con <- dbConnect(drv, username = "system", password = "oracle", dbname = connect.string, prefetch = FALSE,
          bulk_read = 1000L, stmt_cache = 0L, external_credentials = FALSE,
          sysdba = FALSE)

emp_db <- tbl(con, "EMP")
emp_db

The output is something like:

# Source:   table<EMP> [?? x 8]
# Database: OraConnection
   EMPNO  ENAME       JOB   MGR            HIREDATE   SAL  COMM DEPTNO
   <int>  <chr>     <chr> <int>              <dttm> <dbl> <dbl>  <int>
 1  7839   KING PRESIDENT    NA 1981-11-16 23:00:00  5000    NA     10
 2  7698  BLAKE   MANAGER  7839 1981-05-01 00:00:00  2850    NA     30
 3  7782  CLARK   MANAGER  7839 1981-06-09 00:00:00  2450    NA     10
 4  7566  JONES   MANAGER  7839 1981-04-02 00:00:00  2975    NA     20
 5  7788  SCOTT   ANALYST  7566 1987-04-19 00:00:00  3000    NA     20
 6  7902   FORD   ANALYST  7566 1981-12-02 23:00:00  3000    NA     20
 7  7369  SMITH     CLERK  7902 1980-12-16 23:00:00   800    NA     20
 8  7499  ALLEN  SALESMAN  7698 1981-02-19 23:00:00  1600   300     30
 9  7521   WARD  SALESMAN  7698 1981-02-21 23:00:00  1250   500     30
10  7654 MARTIN  SALESMAN  7698 1981-09-27 23:00:00  1250  1400     30
# ... with more rows

If I now want to select specific records, I can do something like:

emp_db %>% filter(DEPTNO == "10")

Which will yield

# Source:   lazy query [?? x 8]
# Database: OraConnection
  EMPNO  ENAME       JOB   MGR            HIREDATE   SAL  COMM DEPTNO
  <int>  <chr>     <chr> <int>              <dttm> <dbl> <dbl>  <int>
1  7839   KING PRESIDENT    NA 1981-11-16 23:00:00  5000    NA     10
2  7782  CLARK   MANAGER  7839 1981-06-09 00:00:00  2450    NA     10
3  7934 MILLER     CLERK  7782 1982-01-22 23:00:00  1300    NA     10

A slightly more complex query:

emp_db %>% 
  group_by(DEPTNO) %>%
  summarise(EMPLOYEES = count())

Will result in the number of employees per department:

# Source:   lazy query [?? x 2]
# Database: OraConnection
  DEPTNO EMPLOYEES
   <int>     <dbl>
1     30         6
2     20         5
3     10         3

You can see the generated query by:

emp_db %>% 
  group_by(DEPTNO) %>%
  summarise(EMPLOYEES = count()) %>% show_query()

Will result in

<SQL>
SELECT "DEPTNO", COUNT(*) AS "EMPLOYEES"
FROM ("EMP") 
GROUP BY "DEPTNO"

If I want to take a random sample from the dataset to perform analyses on, I can do:

sample_n(as_data_frame(emp_db), 10)

Which could yield something like:

# A tibble: 10 x 8
   EMPNO  ENAME      JOB   MGR            HIREDATE   SAL  COMM DEPTNO
   <int>  <chr>    <chr> <int>              <dttm> <dbl> <dbl>  <int>
 1  7844 TURNER SALESMAN  7698 1981-09-08 00:00:00  1500     0     30
 2  7499  ALLEN SALESMAN  7698 1981-02-19 23:00:00  1600   300     30
 3  7566  JONES  MANAGER  7839 1981-04-02 00:00:00  2975    NA     20
 4  7654 MARTIN SALESMAN  7698 1981-09-27 23:00:00  1250  1400     30
 5  7369  SMITH    CLERK  7902 1980-12-16 23:00:00   800    NA     20
 6  7902   FORD  ANALYST  7566 1981-12-02 23:00:00  3000    NA     20
 7  7698  BLAKE  MANAGER  7839 1981-05-01 00:00:00  2850    NA     30
 8  7876  ADAMS    CLERK  7788 1987-05-23 00:00:00  1100    NA     20
 9  7934 MILLER    CLERK  7782 1982-01-22 23:00:00  1300    NA     10
10  7782  CLARK  MANAGER  7839 1981-06-09 00:00:00  2450    NA     10

Executing the same command again will result in a different sample.

Finally

There are multiple ways to get data to and from the Oracle database and perform actions on them. Oracle provides Oracle R Enterprise. Oracle R Enterprise is a component of the Oracle Advanced Analytics Option of Oracle Database Enterprise Edition. You can create R proxy objects in your R session from database-resident data. This allows you to work on database data in R while the database does most of the computations. Another feature of Oracle R Enterprise is an R script repository in the database and a feature to allow execution of R scripts from within the database (embedded), even within SQL statements. As you can imagine this is quite powerful. More on this in a later blog!

No comments:

Post a Comment