Important Takeaways for PostgreSQL Indexes

Below are the important takeaways I currently have from my research on PostgreSQL indexes:

1. When deploying new or modified indexes to a production environment, use the CONCURRENTLY option with the CREATE INDEX command; this allows writes on the table to proceed while the index is built. There are conditions under which this command can be used, and the build needs to be monitored closely during deployment, because a failed build leaves behind an invalid index that must be dropped manually.
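
For example (table and index names are illustrative):

    -- Build without blocking writes on the table
    CREATE INDEX CONCURRENTLY idx_orders_customer_id
        ON orders (customer_id);

    -- A failed concurrent build leaves an INVALID index behind;
    -- it must be dropped manually before retrying
    DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;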

2. Multicolumn indexes should be used as sparingly as possible. The PostgreSQL planner relies on the leading column of the index for the main filtering criteria; the other columns in the definition, though they can be used for inequality comparisons or for fetching data, often add more maintenance overhead than performance benefit.
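
As a sketch, with a hypothetical table t indexed on (a, b):

    CREATE INDEX idx_t_a_b ON t (a, b);

    -- Efficient: the leading column a is constrained
    SELECT * FROM t WHERE a = 5 AND b > 10;

    -- Generally inefficient: without a constraint on a,
    -- the whole index would have to be scanned
    SELECT * FROM t WHERE b = 10;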

3. We can define indexes with a WHERE clause in the index definition, called partial indexes. They save both space and time, but need to be used very carefully: the planner can use a partial index only when it can prove that the query's WHERE condition implies the index predicate. Otherwise the index becomes a maintenance overhead with no performance benefit.
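
A minimal sketch, assuming a hypothetical orders table where most rows are already shipped:

    -- Index only the small fraction of rows still unshipped
    CREATE INDEX idx_orders_unshipped
        ON orders (order_date)
        WHERE shipped = false;

    -- Can use the index: the WHERE clause implies the index predicate
    SELECT * FROM orders
    WHERE shipped = false AND order_date > DATE '2024-01-01';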

4. If your workload includes a mix of queries that sometimes involve only column x, sometimes only column y, and sometimes both, you might choose to create two separate indexes on x and y, relying on index combination (bitmap scans) to handle the queries that use both columns. In this scenario, two single-column indexes are usually a better choice than one multicolumn index on (x, y).
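
For instance, with hypothetical single-column indexes on t:

    CREATE INDEX idx_t_x ON t (x);
    CREATE INDEX idx_t_y ON t (y);

    -- The planner can answer this by combining both indexes
    -- with a BitmapAnd
    SELECT * FROM t WHERE x = 1 AND y = 2;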

5. We can create indexes on expressions of columns, such as lower(col1) or (first_name || ' ' || last_name). Expression indexes are relatively expensive to maintain, because the derived expression must be recomputed for each row on every insert and update. They are useful when retrieval speed matters more than insert and update speed.
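
For example (hypothetical people table; note that the query must use the exact indexed expression):

    CREATE INDEX idx_people_lower_last ON people (lower(last_name));

    -- Matches the indexed expression, so the index is usable
    SELECT * FROM people WHERE lower(last_name) = 'smith';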

6. PostgreSQL supports index-only scans, which can answer queries from an index alone, without any random heap access, provided the index type supports them (B-tree indexes always do) and the query references only columns stored in the index. They are a win only if a significant fraction of the table's heap pages have their all-visible bits set in the visibility map; but tables in which a large fraction of the rows are unchanging are common enough to make this type of scan very useful in practice.

7. To make effective use of the index-only scan feature, you might choose to create a covering index, which is an index specifically designed to include the columns needed by a particular type of query that you run frequently. Since queries typically need to retrieve more columns than just the ones they search on, PostgreSQL allows you to create an index in which some columns are just “payload” and are not part of the search key. This is done by adding an INCLUDE clause listing the extra columns.
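
A sketch, assuming PostgreSQL 11+ and a hypothetical orders table:

    -- Search key on customer_id; order_total carried as payload only
    CREATE INDEX idx_orders_cust_incl
        ON orders (customer_id)
        INCLUDE (order_total);

    -- Can be satisfied by an index-only scan, visibility map permitting
    SELECT customer_id, order_total
    FROM orders
    WHERE customer_id = 42;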

8. It’s wise to be conservative about adding non-key payload columns to an index, especially wide columns. If an index tuple exceeds the maximum size allowed for the index type, data insertion will fail. In any case, non-key columns duplicate data from the index’s table and bloat the size of the index, thus potentially slowing searches. And remember that there is little point in including payload columns in an index unless the table changes slowly enough that an index-only scan is likely to not need to access the heap. If the heap tuple must be visited anyway, it costs nothing more to get the column’s value from there.

9. Suffix truncation removes non-key columns from upper B-Tree levels. As payload columns, they are never used to guide index scans. The truncation process also removes one or more trailing key column(s) when the remaining prefix of key column(s) happens to be sufficient to describe tuples on the lowest B-Tree level. In practice, covering indexes without an INCLUDE clause often avoid storing columns that are effectively payload in the upper levels. However, explicitly defining payload columns as non-key columns reliably keeps the tuples in upper levels small.

10. In principle, index-only scans can be used with expression indexes. However, PostgreSQL's planner is currently not very smart about such cases: it considers a query potentially executable by index-only scan only when all columns needed by the query are available from the index. For a query searching on f(x), x is not needed except inside the context f(x), but the planner does not notice that and concludes that an index-only scan is not possible. If an index-only scan seems sufficiently worthwhile, this can be worked around by adding x as an included column. Partial indexes also support index-only scans.
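
A sketch of the workaround, assuming PostgreSQL 11+ and a hypothetical immutable function f:

    -- Index on the expression alone: the planner will not consider
    -- an index-only scan for queries on f(x)
    CREATE INDEX idx_t_fx ON t (f(x));

    -- Workaround: also store x as a non-key payload column
    CREATE INDEX idx_t_fx_incl ON t (f(x)) INCLUDE (x);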

11. An index can support only one collation per index column. If multiple collations are of interest, multiple indexes may be needed. The index automatically uses the collation of the underlying column.

12. Always run ANALYZE before examining index usage. This command collects statistics about the distribution of values in the table; the planner needs this information to estimate the number of rows returned by a query and so assign realistic costs to each possible plan. In the absence of real statistics, default values are assumed, which are almost certain to be inaccurate.
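
For example:

    -- Refresh planner statistics for the table, then check the plan
    ANALYZE orders;
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;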

13. It is fatal to use very small test data sets to check index usage. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.

That completes my look at indexes in PostgreSQL in this series. Next up: a series on PostgreSQL performance.

T-SQL Programming Guidelines

In this article, I share the 20 programming guidelines I follow when working with T-SQL (Microsoft SQL Server).

These guidelines have been developed over 10+ years of working with a multi-terabyte database system.

1. Filter Condition – When joining physical tables in a query, there must be at least one filter condition (WHERE col1 = @par1) on one of the joined tables to narrow down the result set. We must not fetch all the data from the tables; fetch only the data the application currently needs.

2. Data Type Mismatch – When creating temporary objects (# or @) in our stored procedures, the data types of the columns in the object must match the data types of the columns from which the data will be fetched. The same goes for local variables. Even in table joins, the columns on either side of the equality must have matching data types; otherwise the optimizer performs an implicit conversion, which affects performance negatively.
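
A sketch of the pitfall, assuming a hypothetical Customers table whose CustomerCode column is VARCHAR(20):

    -- Matching type: no implicit conversion, index seek possible
    DECLARE @CodeOk VARCHAR(20) = 'ABC';
    SELECT * FROM dbo.Customers WHERE CustomerCode = @CodeOk;

    -- NVARCHAR variable forces an implicit convert on the column,
    -- which can turn an index seek into a scan
    DECLARE @CodeBad NVARCHAR(20) = N'ABC';
    SELECT * FROM dbo.Customers WHERE CustomerCode = @CodeBad;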

3. SET V/S SELECT –

We must use SET when:

Assigning a single value to a variable from a query. If the query returns multiple values, SET throws an exception, whereas SELECT silently assigns one of the returned values (the last row processed), which may not be the right one.

We must use SELECT when:

Assigning values to multiple variables from a single query. This should be used only when the assignment is not made in a loop or repeated within a stored procedure: if the query returns no rows, SELECT leaves the variable holding its previous value, which would be incorrect; in such cases use SET, which sets the variable to NULL.
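
A short illustration, assuming a hypothetical Employees table:

    DECLARE @Id INT = 7;
    DECLARE @Name VARCHAR(50) = 'unknown';

    -- SET: errors out if the subquery returns more than one row
    SET @Name = (SELECT Name FROM dbo.Employees WHERE Id = @Id);

    -- SELECT: if no row matches, @Name silently keeps 'unknown'
    SELECT @Name = Name FROM dbo.Employees WHERE Id = @Id;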

4. Placement of Filter Conditions in Outer Joins – In an outer join, if a filter condition on the inner table is placed in the ON clause, the filter is applied before the outer rows are added; if it is placed in the WHERE clause, it is applied after the outer rows have been added, which can silently turn the outer join into an inner join. Hence, exercise extreme caution when placing filter conditions in outer joins.
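
For example, with hypothetical Customers and Orders tables:

    -- Filter in ON: all customers returned, orders filtered to 2024
    SELECT c.Id, o.OrderDate
    FROM dbo.Customers c
    LEFT JOIN dbo.Orders o
        ON o.CustomerId = c.Id AND o.OrderDate >= '20240101';

    -- Filter in WHERE: rows with NULL OrderDate are removed,
    -- silently turning the outer join into an inner join
    SELECT c.Id, o.OrderDate
    FROM dbo.Customers c
    LEFT JOIN dbo.Orders o ON o.CustomerId = c.Id
    WHERE o.OrderDate >= '20240101';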

5. Temp Table v/s Table Variable – In our procedures, if temporary objects are created for intermediate processing, the general rule is to use temporary tables (# tables) when we are going to store more than about 50 rows in the object, and table variables (@ tables) for fewer rows, to ensure reasonable cardinality estimates in execution plans. A more general approach: use table variables for small lookup data and temporary tables for transactional data.
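
A sketch of the rule of thumb (hypothetical objects):

    -- Small, static lookup set: table variable
    DECLARE @Statuses TABLE (StatusId INT PRIMARY KEY);

    -- Larger transactional working set: temp table (has statistics)
    CREATE TABLE #OrderStage
    (
        OrderId    INT,
        CustomerId INT,
        Amount     DECIMAL(18, 2)
    );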

6. SCOPE_IDENTITY() V/S @@IDENTITY – When fetching the last inserted identity value, we must use SCOPE_IDENTITY() and not @@IDENTITY, as the latter can return an identity generated by a trigger. Likewise, to find the number of rows affected by the previous statement, we must use @@ROWCOUNT and not SELECT COUNT(*) FROM the table. @@ROWCOUNT must be read immediately after the statement whose row count we need.
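
For example, assuming a hypothetical Orders table with an identity key:

    INSERT INTO dbo.Orders (CustomerId) VALUES (42);

    -- Capture the row count immediately; later statements reset it
    DECLARE @Inserted INT = @@ROWCOUNT;

    -- Identity from the current scope, unaffected by trigger inserts
    DECLARE @NewOrderId INT = SCOPE_IDENTITY();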

7. IN v/s EXISTS – It is recommended to use EXISTS/NOT EXISTS instead of IN/NOT IN against subqueries, as EXISTS can stop probing the inner table as soon as a match is found. IN is fine for short static lists, such as IN (1, 2, 3).
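
For example:

    -- Preferred for subqueries: probing stops at the first match
    SELECT c.Id
    FROM dbo.Customers c
    WHERE EXISTS (SELECT 1 FROM dbo.Orders o WHERE o.CustomerId = c.Id);

    -- Fine for short static lists
    SELECT * FROM dbo.Orders WHERE StatusId IN (1, 2, 3);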

8. Index Check While Joining Tables – When joining tables, the joining columns and the filtering columns must have the relevant indexes. Also keep an eye on newly created tables and columns when reviewing code, to verify that relevant indexes have been created for them according to their use cases.

9. Scalar-Valued Functions – We must avoid calling scalar-valued functions in SELECT statements if the function internally joins several large tables; instead, convert it to a table-valued function or, better, a stored procedure. This is because scalar-valued functions operate row by agonizing row instead of on a set.

10. Dynamic SQL and sp_executesql – We must avoid dynamic SQL as much as we can. If it is unavoidable, the dynamic SQL must be parameterized, as this ensures plan caching and reuse.
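
A sketch using sp_executesql (hypothetical query):

    -- Parameterized dynamic SQL: one cached plan reused across values
    DECLARE @sql NVARCHAR(MAX) =
        N'SELECT * FROM dbo.Orders WHERE CustomerId = @CustomerId';
    EXEC sp_executesql @sql, N'@CustomerId INT', @CustomerId = 42;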

11. ANSI Joins – When joining tables in a statement, we must use ANSI-standard joins (INNER JOIN or OUTER JOIN) and must not join the tables in the WHERE clause (the old-style join). In other words, we must not list the tables in the FROM clause separated by commas with the join conditions in the WHERE clause; the old-style outer join operators (*= and =*) are no longer supported in modern SQL Server versions, and comma joins make it easy to miss a join condition.
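
For example:

    -- Avoid: comma join with the condition buried in WHERE
    SELECT c.Id, o.OrderDate
    FROM dbo.Customers c, dbo.Orders o
    WHERE c.Id = o.CustomerId;

    -- Prefer: ANSI join syntax
    SELECT c.Id, o.OrderDate
    FROM dbo.Customers c
    INNER JOIN dbo.Orders o ON o.CustomerId = c.Id;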

12. Insert with Column Names – In INSERT statements, we must list the column names being inserted, even when inserting values into all columns of the table. This protects the statement from breaking when columns are added or reordered.

13. Order By and Distinct – Use ORDER BY and DISTINCT only if unavoidable. The ORDER BY columns should preferably be part of an index definition. If a result must be ordered, the final SELECT statement that returns the result set must be explicitly ordered with an ORDER BY clause. As much as possible, do the ordering in the application, as database-side sorting is a performance overhead.

14. XML Parameters – We must not use XML parameters; use TVPs (table-valued parameters) instead. Processing XML consumes more memory and time, and adds code complexity.

15. SARGable Predicates – The conditions in the WHERE clause must be SARGable (searchable arguments). Manipulating physical columns, e.g., ISNULL(Col1, '') = @par1, must be avoided, as that renders the query non-SARGable, and such queries cannot use the underlying indexes.
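
For example, a date predicate rewritten to be SARGable:

    -- Non-SARGable: the function on the column blocks an index seek
    SELECT * FROM dbo.Orders WHERE YEAR(OrderDate) = 2024;

    -- SARGable rewrite of the same predicate
    SELECT * FROM dbo.Orders
    WHERE OrderDate >= '20240101' AND OrderDate < '20250101';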

16. Table Hints and Index Hints – We must not use hints in queries, as plans change according to the data; we must let SQL Server decide which indexes to use.

17. Nested SP Calls – When calling another stored procedure from within a stored procedure, we must use the @Param = Value format. This avoids mismatching parameters and values if the parameter list of the called procedure changes.
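
For example, with a hypothetical procedure usp_UpdateOrder:

    -- Positional: silently misbinds if the parameter list changes
    EXEC dbo.usp_UpdateOrder 42, 'Shipped';

    -- Named: each value is explicitly bound to its parameter
    EXEC dbo.usp_UpdateOrder @OrderId = 42, @Status = 'Shipped';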

18. Avoid Recompile – When declaring temporary objects in our procedures, we must keep all declarations at the beginning of the procedure; otherwise it may cause a stored-procedure recompile, affecting performance.

19. Transaction Handling – Transactions should be handled at the application level; hence our procedures must not contain transaction handling (BEGIN TRAN … COMMIT TRAN … ROLLBACK TRAN).

20. Naming Convention and SET NOCOUNT ON – If the object (procedure, function, view) is newly created, we can also check the following:

a.  The stored procedure names must start with “usp” and functions with “udf”.

b. Stored procedures must have SET NOCOUNT ON at the beginning; this suppresses the unnecessary “rows affected” informational messages from SQL Server.
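
A minimal skeleton combining both checks (hypothetical procedure and table):

    CREATE PROCEDURE dbo.uspGetCustomerOrders  -- "usp" prefix per (a)
        @CustomerId INT
    AS
    BEGIN
        SET NOCOUNT ON;  -- per (b): suppress "rows affected" messages

        SELECT Id, OrderDate
        FROM dbo.Orders
        WHERE CustomerId = @CustomerId;
    END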