Refreshing Dremio Assets


Dremio refreshes dataset metadata at the interval we set up at the data source level. The metadata refresh settings have two sections.

  1. Dataset Discovery: Refresh interval for top-level source object names such as names of DBs and tables. This is a lightweight operation.
  2. Dataset Details: Metadata Dremio needs for query planning such as information on fields, types, shards, statistics and locality.
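
As a rough illustration of how the two intervals relate, the sketch below builds a metadata policy fragment for a source definition. The field names (namesRefreshMs, datasetRefreshAfterMs, datasetExpireAfterMs) are recalled from the Dremio Source API and should be verified against the documentation for your version. The helper name and the chosen intervals are hypothetical.

```python
def build_metadata_policy(discovery_hours, details_hours, expire_hours):
    """Return a metadataPolicy-style dict with all intervals in milliseconds."""
    ms_per_hour = 60 * 60 * 1000
    # Metadata should not expire before it is due to be refreshed
    if expire_hours < details_hours:
        raise ValueError("expiry should not be shorter than the details refresh")
    return {
        # Dataset Discovery: lightweight refresh of source object names
        "namesRefreshMs": discovery_hours * ms_per_hour,
        # Dataset Details: heavier refresh of fields, types, statistics, etc.
        "datasetRefreshAfterMs": details_hours * ms_per_hour,
        "datasetExpireAfterMs": expire_hours * ms_per_hour,
    }


policy = build_metadata_policy(discovery_hours=1, details_hours=6, expire_hours=18)
print(policy)
```

Keeping the details interval long and refreshing individual datasets on demand keeps the source-wide load low.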


New data ingested into the source should normally be available immediately. However, depending on the use case, the DDL (the structure of the dataset) may also change. In that case, the new records will not appear until Dremio refreshes the Dataset Details. Setting this refresh to run frequently can add overhead.

So, if only particular datasets need to be refreshed frequently, it is better to refresh those on demand rather than keeping the whole data source busy.

Dremio provides a few ways to refresh datasets (physical datasets, aka PDS). Let's take a look.

REST API

Fetch the Entity Id

import json
import requests

pds = 'Hive.demoschema.demotable'
baseURL = 'https://{{dremio url}}'
loginURL = '/apiv2/login'
catalogURL = '/api/v3/catalog'

# Generate the dremio token
dremioToken = {{follow the code from writing-dremio-client}}

# Construct the entity path; by-path expects the components separated by '/'
catalogTargetURLbyPath = baseURL + catalogURL + '/by-path/' + pds.replace('.', '/')
headers = {'content-type':'application/json', 'authorization':'{authToken}'.format(authToken=dremioToken)}
response = requests.get(catalogTargetURLbyPath, headers=headers)
data = json.loads(response.text)

# Extract the entity id
entityId = data['id']
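
Dataset names can contain spaces or other characters that are not safe in a URL, so it can help to encode each path component separately when building the by-path URL. The following is a minimal sketch; the function name and the example host are hypothetical, and the endpoint prefix is the one used in the snippet above.

```python
from urllib.parse import quote


def catalog_by_path_url(base_url, components):
    """Build the by-path catalog URL from a list of path components."""
    # Encode each component on its own so '/' or spaces inside a name survive
    encoded = [quote(part, safe='') for part in components]
    return base_url.rstrip('/') + '/api/v3/catalog/by-path/' + '/'.join(encoded)


url = catalog_by_path_url('https://dremio.example.com', ['Hive', 'demoschema', 'demo table'])
print(url)
```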
  

Refresh PDS using the Entity Id

# Construct the entity refresh url
catalogRefreshURL = baseURL + catalogURL + '/' + entityId + '/' + 'refresh'

headers = {'content-type':'application/json', 'authorization':'{authToken}'.format(authToken=dremioToken)}
# Submit the refresh request
response = requests.post(catalogRefreshURL, headers=headers)
response_status = response.status_code
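
The status code above is captured but never checked. One possible way to make a failed refresh visible is sketched below. The function name is hypothetical, and `session` is assumed to be any object with a `requests`-style `post` method, for example `requests.Session()`.

```python
def refresh_dataset(session, base_url, token, entity_id):
    """POST the refresh request for one dataset and return the HTTP status code."""
    url = base_url.rstrip('/') + '/api/v3/catalog/' + entity_id + '/refresh'
    headers = {'content-type': 'application/json', 'authorization': token}
    response = session.post(url, headers=headers)
    # Treat any non-2xx status as a failure instead of silently ignoring it
    if not 200 <= response.status_code < 300:
        raise RuntimeError('refresh failed with HTTP {}'.format(response.status_code))
    return response.status_code
```

With the variables from the snippets above, a call would look like refresh_dataset(requests.Session(), baseURL, dremioToken, entityId).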

  
Alternatively, we can use the Dremio SQL API to submit an ALTER PDS statement (see the ODBC section below for the SQL). The request returns a Job ID, which we can monitor until the job completes, i.e. jobState = COMPLETED.
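
A minimal sketch of that flow is shown below, assuming the SQL endpoint is /api/v3/sql (returning a JSON body with an id field) and the job endpoint is /api/v3/job/{id} (returning a jobState field). The function name, the polling defaults, and the set of terminal states are assumptions to check against the Dremio documentation. `session` is any object with `requests`-style `post` and `get` methods.

```python
import time

# States after which polling stops (assumed list of terminal job states)
TERMINAL_STATES = {'COMPLETED', 'FAILED', 'CANCELED'}


def run_sql_and_wait(session, base_url, headers, sql, poll_seconds=2.0, max_polls=30):
    """Submit a SQL statement and poll the job until it reaches a terminal state."""
    base = base_url.rstrip('/')
    submit = session.post(base + '/api/v3/sql', headers=headers, json={'sql': sql})
    job_id = submit.json()['id']
    for _ in range(max_polls):
        job = session.get(base + '/api/v3/job/' + job_id, headers=headers).json()
        if job['jobState'] in TERMINAL_STATES:
            return job_id, job['jobState']
        time.sleep(poll_seconds)
    raise TimeoutError('job {} did not finish in time'.format(job_id))
```

The caller should still check that the returned state is COMPLETED, since FAILED and CANCELED also end the polling loop.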

ODBC

import pyodbc

asset_path = '''"Hive"."demoschema"."demotable"'''
host = "{{host name}}"
pwd = "{{token}}"
port = "{{dremio port i.e. 31010}}"
uid = "{{user id}}"
driver = "{{driver name}}"

# Build the connection string from adjacent literals to avoid stray whitespace
connection_string = (
    "Driver={};"
    "ConnectionType=Direct;"
    "HOST={};PORT={};"
    "AuthenticationType=Plain;"
    "SSL=1;UID={};PWD={};"
    "DisableCertificateVerification=1"
).format(driver, host, port, uid, pwd)
cnxn = pyodbc.connect(connection_string, autocommit=True)

# Refresh the metadata of the physical dataset
sql = '''ALTER PDS ''' + asset_path + ''' REFRESH METADATA LAZY UPDATE'''
cnxn.execute(sql)
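
When several datasets need an on-demand refresh, the quoted asset path and the ALTER PDS statement can be generated from the path components instead of being typed by hand. The sketch below uses hypothetical helper names and assumes that an embedded double quote is escaped by doubling it, which is the usual SQL convention but should be confirmed for Dremio.

```python
def quote_path(components):
    """Double-quote each path component, doubling any embedded double quotes."""
    return '.'.join('"' + part.replace('"', '""') + '"' for part in components)


def refresh_metadata_sql(components):
    """Build the ALTER PDS statement shown above for a given dataset path."""
    return 'ALTER PDS ' + quote_path(components) + ' REFRESH METADATA LAZY UPDATE'


statement = refresh_metadata_sql(['Hive', 'demoschema', 'demotable'])
print(statement)
```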
  
References:
Dremio Catalog API
Python integration using Dremio ODBC Drivers
