Monday, February 17, 2014
Here's the schedule for this year's OUGN conference. There's lots of interesting stuff here; I'm looking forward to hearing Cary Millsap, Kellyn Pot'Vin and the ever-present Doug Burns. There will also be some great Norwegian presenters, like Henry Hopkinson from Norgesgruppen on Enterprise Manager. I could go on listing names, but I can't cover all five tracks here.
I'm also presenting again this year, with the title "Live demo: Multitenant with RAC and Data Guard". I'm co-presenting with my good friend and ex-colleague Håkon Onsager, and we're on right after Cary Millsap on Friday morning. We will look at Multitenant from an operations point of view, and at what it will look like in a Data Guard environment with or without RAC. We are also looking into how to get a flexible and stable platform for your future pluggable databases. It will be fun!
Wednesday, February 12, 2014
Exadata Smart Flash Table Caching
This feature is probably my favorite new feature in 11.2.3.3, because it allows Exadata to do smart scans of segments in flash. The examples below are run on a quarter rack X3-2 HC with storage sw version 12.1.1 and db version 11.2.0.4.
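If you're not sure which storage software version your cells are running, one quick way is to ask the cells themselves with CellCLI. A minimal check, assuming the standard cell_group file and root ssh equivalence to the cells (adjust to your environment):

[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "list cell attributes name, releaseVersion"

releaseVersion shows the active storage software image on each cell.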
Here I have a simple query joining one small table of 19MB via a hash join with a larger table of ~150GB, with some filtering and a few aggregations.
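The query goes against a view, v_demo, which just joins the two tables. The actual view definition isn't shown here, so this is only a hypothetical sketch of what it could look like, based on the tables and join key visible in the plan below:

create or replace view v_demo as
select b.extended_amt, b.bsns_unit_key, a.sku_item_name
from   dwr_sku_item a
join   r_dwb_rtl_sls_retrn_line_item b
  on   a.sku_item_key = b.sku_item_key;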
SQL> explain plan for select /*+ monitoring*/ sum(extended_amt), count(extended_amt) from v_demo where sku_item_name like 'Pampers%' and (bsns_unit_key > 2012057 and bsns_unit_key < 2021070)

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 518373028

---------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name                          | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |                               |     1 |    42 | 69534   (1)| 00:11:36 |       |       |        |      |            |
|   1 |  SORT AGGREGATE                    |                               |     1 |    42 |            |          |       |       |        |      |            |
|   2 |   PX COORDINATOR                   |                               |       |       |            |          |       |       |        |      |            |
|   3 |    PX SEND QC (RANDOM)             | :TQ10001                      |     1 |    42 |            |          |       |       |  Q1,01 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE                 |                               |     1 |    42 |            |          |       |       |  Q1,01 | PCWP |            |
|*  5 |      HASH JOIN                     |                               |  4668 |   191K| 69534   (1)| 00:11:36 |       |       |  Q1,01 | PCWP |            |
|   6 |       PX RECEIVE                   |                               |     1 |    26 |     6   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   7 |        PX SEND BROADCAST           | :TQ10000                      |     1 |    26 |     6   (0)| 00:00:01 |       |       |  Q1,00 | P->P | BROADCAST  |
|   8 |         PX BLOCK ITERATOR          |                               |     1 |    26 |     6   (0)| 00:00:01 |       |       |  Q1,00 | PCWC |            |
|*  9 |          TABLE ACCESS STORAGE FULL | DWR_SKU_ITEM                  |     1 |    26 |     6   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|  10 |       PX BLOCK ITERATOR            |                               |   207M|  3172M| 69515   (1)| 00:11:36 |     1 |   128 |  Q1,01 | PCWC |            |
|* 11 |        TABLE ACCESS STORAGE FULL   | R_DWB_RTL_SLS_RETRN_LINE_ITEM |   207M|  3172M| 69515   (1)| 00:11:36 |     1 |   128 |  Q1,01 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("A"."SKU_ITEM_KEY"="B"."SKU_ITEM_KEY")
   9 - storage("SKU_ITEM_NAME" LIKE 'Pampers%')
       filter("SKU_ITEM_NAME" LIKE 'Pampers%')
  11 - storage("BSNS_UNIT_KEY"<2021070 AND "BSNS_UNIT_KEY">2012057 AND
               SYS_OP_BLOOM_FILTER(:BF0000,"A"."SKU_ITEM_KEY"))
       filter("BSNS_UNIT_KEY"<2021070 AND "BSNS_UNIT_KEY">2012057 AND
              SYS_OP_BLOOM_FILTER(:BF0000,"A"."SKU_ITEM_KEY"))
First I'll show an execution from Exadata storage sw version 11.2.3.2.1
SQL> @snap "select /*+ monitoring*/ sum(extended_amt), count(extended_amt) from v_demo where sku_item_name like 'Pampers%' and (bsns_unit_key > 2012057 and bsns_unit_key < 2021070)"

SUM(EXTENDED_AMT) COUNT(EXTENDED_AMT)
----------------- -------------------
         47905.43               31471

Elapsed: 00:00:06.27

NAME
---------------------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.7 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )
---------------------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO           DB_PHYSIO_BYTES            |##################################################| 145469 MB  20708 MB/sec
DB_LAYER_IO           DB_PHYSRD_BYTES            |##################################################| 145469 MB  20708 MB/sec
DB_LAYER_IO           DB_PHYSWR_BYTES            |                                                  |      0 MB      0 MB/sec
AVOID_DISK_IO         PHYRD_FLASH_RD_BYTES       |                                                  |      0 MB      0 MB/sec
AVOID_DISK_IO         PHYRD_STORIDX_SAVED_BYTES  |#############################################     | 129787 MB  18476 MB/sec
REAL_DISK_IO          SPIN_DISK_IO_BYTES         |#####                                             |  15682 MB   2232 MB/sec
REAL_DISK_IO          SPIN_DISK_RD_BYTES         |#####                                             |  15682 MB   2232 MB/sec
REAL_DISK_IO          SPIN_DISK_WR_BYTES         |                                                  |      0 MB      0 MB/sec
REDUCE_INTERCONNECT   PRED_OFFLOADABLE_BYTES     |##################################################| 145469 MB  20708 MB/sec
REDUCE_INTERCONNECT   TOTAL_IC_BYTES             |                                                  |    577 MB     82 MB/sec
REDUCE_INTERCONNECT   SMART_SCAN_RET_BYTES       |                                                  |    577 MB     82 MB/sec
REDUCE_INTERCONNECT   NON_SMART_SCAN_BYTES       |                                                  |      0 MB      0 MB/sec
CELL_PROC_DEPTH       CELL_PROC_DATA_BYTES       |#####                                             |  15685 MB   2233 MB/sec
CELL_PROC_DEPTH       CELL_PROC_INDEX_BYTES      |                                                  |      0 MB      0 MB/sec
CLIENT_COMMUNICATION  NET_TO_CLIENT_BYTES        |                                                  |      0 MB      0 MB/sec
CLIENT_COMMUNICATION  NET_FROM_CLIENT_BYTES      |                                                  |      0 MB      0 MB/sec
Here I'm using Tanel Poder's ExaSnapper to show a detailed view of where the I/O is done. In total the database should read ~145GB; 129GB of that is saved by the storage indexes, and 15GB is read from the traditional hard disks in the Exadata. The last column with MB/s is pretty accurate for the HDDs here: we are reading 2232MB/s. A quarter rack can achieve higher throughput from disk; the reason it is not higher here is probably that the I/O is not completely sequential because of the storage indexes.
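If you don't have ExaSnapper at hand, you can get a rough version of the same picture from the standard smart scan statistics in v$sysstat (take deltas before and after the query rather than the absolute values returned by something like this):

select name, round(value/1024/1024) mb
from   v$sysstat
where  name in ('cell physical IO bytes eligible for predicate offload',
                'cell physical IO bytes saved by storage index',
                'cell physical IO interconnect bytes returned by smart scan',
                'physical read total bytes optimized');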
Next I'm running the same query on storage sw version 12.1.1.1
SQL> @snap "select /*+ monitoring*/ sum(extended_amt), count(extended_amt) from v_demo where sku_item_name like 'Pampers%' and (bsns_unit_key > 2012057 and bsns_unit_key < 2021070)";

SUM(EXTENDED_AMT) COUNT(EXTENDED_AMT)
----------------- -------------------
         47905.43               31471

Elapsed: 00:00:01.69

NAME
---------------------------------------------------------------------------------------------------------------------------------------------------------
-- ExaSnapper v0.7 BETA by Tanel Poder @ Enkitec - The Exadata Experts ( http://www.enkitec.com )
---------------------------------------------------------------------------------------------------------------------------------------------------------
DB_LAYER_IO           DB_PHYSIO_BYTES            |##################################################| 145469 MB  67812 MB/sec
DB_LAYER_IO           DB_PHYSRD_BYTES            |##################################################| 145469 MB  67812 MB/sec
DB_LAYER_IO           DB_PHYSWR_BYTES            |                                                  |      0 MB      0 MB/sec
AVOID_DISK_IO         PHYRD_FLASH_RD_BYTES       |#####                                             |  15212 MB   7091 MB/sec
AVOID_DISK_IO         PHYRD_STORIDX_SAVED_BYTES  |#############################################     | 129790 MB  60503 MB/sec
REAL_DISK_IO          SPIN_DISK_IO_BYTES         |                                                  |    467 MB    218 MB/sec
REAL_DISK_IO          SPIN_DISK_RD_BYTES         |                                                  |    467 MB    218 MB/sec
REAL_DISK_IO          SPIN_DISK_WR_BYTES         |                                                  |      0 MB      0 MB/sec
REDUCE_INTERCONNECT   PRED_OFFLOADABLE_BYTES     |##################################################| 145469 MB  67812 MB/sec
REDUCE_INTERCONNECT   TOTAL_IC_BYTES             |#                                                 |   1742 MB    812 MB/sec
REDUCE_INTERCONNECT   SMART_SCAN_RET_BYTES       |#                                                 |   1742 MB    812 MB/sec
REDUCE_INTERCONNECT   NON_SMART_SCAN_BYTES       |                                                  |      0 MB      0 MB/sec
CELL_PROC_DEPTH       CELL_PROC_DATA_BYTES       |#####                                             |  14523 MB   6770 MB/sec
CELL_PROC_DEPTH       CELL_PROC_INDEX_BYTES      |                                                  |      0 MB      0 MB/sec
CLIENT_COMMUNICATION  NET_TO_CLIENT_BYTES        |                                                  |      0 MB      0 MB/sec
CLIENT_COMMUNICATION  NET_FROM_CLIENT_BYTES      |                                                  |      0 MB      0 MB/sec
Elapsed time dropped from 6.27 seconds to 1.69 seconds, and now all the I/O is done from flash, as we can see from PHYRD_FLASH_RD_BYTES.
Earlier I had to use ALTER TABLE ... STORAGE (CELL_FLASH_CACHE KEEP) to get smart scans to run from flash; now it just works! I have no experience with this in a real production system yet, so if anyone has any real-world experience with it, please leave a comment!
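For reference, this is the kind of statement I needed before, to pin a segment in flash so that smart scans could be served from there (table name taken from the example above):

SQL> alter table r_dwb_rtl_sls_retrn_line_item storage (cell_flash_cache keep);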
Sunday, February 9, 2014
Enabling flash cache compression on Exadata X3-2
I plan to write a few posts about the new features in the storage software that was released in December 2013; first out is this one on enabling flash cache compression.
What you need to enable this feature
- An Oracle Exadata of generation X3 or X4
- Storage software 11.2.3.3 or 12.1.
- Licenses for the Advanced Compression Option
Flash cache compression allows the storage cells to compress and decompress data in the flash cards on the fly. The compression is done in the flash cards themselves, so it does not use CPU on the cell hosts, and it is completely transparent to the database.
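Before starting, it's worth confirming the hardware generation and the current compression setting on the cells. Something like this should do it (assuming the usual cell_group file and root ssh equivalence; makeModel is the cell attribute that identifies the hardware):

[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "list cell attributes name, makeModel, flashCacheCompress"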
This procedure is explained in the release notes for the storage cell software; for 11.2.3.3 that is MOS doc ID 1487339.1. I'm doing this with downtime, so all databases are down and the GI is down. See MOS if you want to do this rolling, with the databases and GI up.
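Stopping everything first is done in the usual way from one of the database nodes; a minimal sketch, where the database name and grid home path are just examples:

[oracle@dm01db01 ~]$ srvctl stop database -d mydb                          # repeat for each database
[root@dm01db01 ~]# /u01/app/11.2.0.4/grid/bin/crsctl stop cluster -all     # stop the GI stack on all db nodes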
To enable the compression we need to flush and drop the flash cache, drop the area reserved for smart flash logging, and then drop all flash celldisks. This is done as root on the first storage cell:
[root@dm01cel01 ~]# cellcli -e alter flashcache all flush
Flash cache dm01cel01_FLASHCACHE altered successfully

[root@dm01cel01 ~]# cellcli -e drop flashcache all
Flash cache dm01cel01_FLASHCACHE successfully dropped

[root@dm01cel01 ~]# cellcli -e drop flashlog all
Flash log dm01cel01_FLASHLOG successfully dropped

[root@dm01cel01 ~]# cellcli -e drop celldisk all flashdisk
CellDisk FD_00_dm01cel01 successfully dropped
CellDisk FD_01_dm01cel01 successfully dropped
...
Because this is an X3-2 system, we must first enable X3 support for compression and then enable the compression itself:
[root@dm01cel01 ~]# cellcli -e alter cell "flashCacheCompX3Support=TRUE"
Cell dm01cel01 successfully altered

[root@dm01cel01 ~]# cellcli -e "alter cell flashCacheCompress=TRUE"
Cell dm01cel01 successfully altered
Then we recreate the celldisks, the cache for smart flash logging, and finally the flash cache itself:
[root@dm01cel01 ~]# cellcli -e "create celldisk all flashdisk" CellDisk FD_00_dm01cel01 successfully created CellDisk FD_01_dm01cel01 successfully created ... [root@dm01cel01 ~]# cellcli -e "create flashlog all size=512M" Flash log dm01cel01_FLASHLOG successfully created [root@dm01cel01 ~]# cellcli -e create flashcache all Flash cache dm01cel01_FLASHCACHE successfully created
This needs to be repeated on all storage cells. That's it! On a quarter rack this was done within half an hour.
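Since everything is down anyway, the same steps can also be pushed out to all cells at once from a database node with dcli, instead of logging on to each cell in turn. A sketch of how that could look, using the same commands as above:

[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "alter flashcache all flush"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "drop flashcache all"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "drop flashlog all"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "drop celldisk all flashdisk"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "alter cell flashCacheCompX3Support=TRUE"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "alter cell flashCacheCompress=TRUE"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "create celldisk all flashdisk"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "create flashlog all size=512M"
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "create flashcache all"

Afterwards you can verify the setting across all cells: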
[root@dm01db01 ~]# dcli -g ~/cell_group -l root cellcli -e "list cell detail"|grep flash
dm01cel01: flashCacheMode:        WriteBack
dm01cel01: flashCacheCompress:    TRUE
dm01cel02: flashCacheMode:        writeback
dm01cel02: flashCacheCompress:    TRUE
dm01cel03: flashCacheMode:        writeback
dm01cel03: flashCacheCompress:    TRUE

Update:
@db_magnus Works fabulous on the X4-2 too. Did you notice that the flash cache doubled in size after enabling compression?
— Carol Dacko (@cdacko) February 10, 2014
Doubling of cache size? Sure thing! I didn't check this before the change, so I had to double-check the X3 datasheet, but it should have 1.6TB of flash per storage cell:
[root@dm01cel01 ~]# cellcli -e list flashcache detail|grep -i Size
         effectiveCacheSize:     2978.75G
         size:                   2978.75G
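To compare that with the raw flash actually installed in the cell, you can list the flash cards themselves; something like this should show the physical sizes (syntax from memory, so double-check against the CellCLI documentation):

[root@dm01cel01 ~]# cellcli -e "list physicaldisk where diskType=FlashDisk attributes name, physicalSize"

The physical flash is still 1.6TB per cell; the ~3TB above is the effective cache size after compression.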
My Oracle database notes...
Here's my blog, focusing on Oracle databases. There will probably be some pure database stuff, some Exadata, and other things I stumble across relating to databases. Just having to structure what I do well enough to post it on a blog will be good for me... I have a habit of solving the same problem several times because I haven't documented how I did it. Maybe blogging will help?
Who am I? I'm Magnus, a former DBA, now working pre-sales at Oracle on Engineered Systems. Everything I write on this blog is my own, and my views do not necessarily reflect those of Oracle.
I'd love to hear from YOU if you find anything I write useful. You can reach me on Twitter @db_magnus, in the comments below, or through my Google+ icon on the right.