Refreshing Dremio Assets
- Dataset Discovery: Refresh interval for top-level source object names such as names of DBs and tables. This is a lightweight operation.
- Dataset Details: Metadata Dremio needs for query planning such as information on fields, types, shards, statistics and locality.
Any new data ingested into the source should be available immediately. However, depending on the use case, the DDL can change. In that case, the new records will not appear until Dremio refreshes the Dataset Details. Running this refresh frequently for the whole source can add overhead.
So, if particular datasets need to be refreshed frequently, it is better to refresh those on demand rather than making the whole data source busier.
Dremio provides a few ways to refresh physical datasets (PDS). Let's look at them.
REST API
Fetch the Entity Id
import json
import requests

pds = 'Hive.demoschema.demotable'
baseURL = 'https://{{dremio url}}'
loginURL = '/apiv2/login'
catalogURL = '/api/v3/catalog'
# Generate the dremio token
dremioToken = {{follow the code from writing-dremio-client}}
# Construct the entity path; by-path expects the path components separated by '/'
catalogTargetURLbyPath = baseURL + catalogURL + '/by-path/' + pds.replace('.', '/')
headers = {'content-type':'application/json', 'authorization':'{authToken}'.format(authToken=dremioToken)}
# Look up the catalog entity and fail early on an HTTP error
response = requests.get(catalogTargetURLbyPath, headers=headers)
response.raise_for_status()
data = json.loads(response.text)
# Extract the entity id
entityId = data['id']
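The by-path lookup takes the dataset path as '/'-separated URL segments, so names that contain spaces or other special characters need URL-encoding. Below is a minimal sketch of a URL builder for that case. The function name `catalog_by_path_url` and the host `dremio.example.com` are hypothetical, and encoding each component separately is an assumption about how such names should be passed, not something confirmed by the original post.

```python
from urllib.parse import quote


def catalog_by_path_url(base_url, path_parts):
    """Build a catalog by-path URL from a list of path components.

    Hypothetical helper: each component is percent-encoded on its own so
    that spaces or '/' inside a name are not mistaken for separators.
    """
    encoded = "/".join(quote(part, safe="") for part in path_parts)
    return base_url.rstrip("/") + "/api/v3/catalog/by-path/" + encoded


print(catalog_by_path_url("https://dremio.example.com",
                          ["Hive", "demoschema", "demotable"]))
# https://dremio.example.com/api/v3/catalog/by-path/Hive/demoschema/demotable
```

Passing the path as a list rather than a dotted string avoids ambiguity when a source or table name itself contains a dot.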
Refresh PDS using the Entity Id
# Construct the entity refresh url
catalogRefreshURL = baseURL + catalogURL + '/' + entityId + '/' + 'refresh'
headers = {'content-type':'application/json', 'authorization':'{authToken}'.format(authToken=dremioToken)}
# Trigger the metadata refresh
response = requests.post(catalogRefreshURL, headers=headers)
response_status = response.status_code
Alternatively, we can use the Dremio SQL API to submit an ALTER PDS statement (see the ODBC section below for the SQL). This request returns a job ID, which we can monitor until the job completes, i.e. jobState = COMPLETED.
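The SQL API route could be sketched roughly as follows. This is a sketch under assumptions, not the post's own code: the endpoint paths `/api/v3/sql` and `/api/v3/job/{id}` are taken from the Dremio REST documentation, the function names and the `session` parameter (any object with `get`/`post` methods, such as `requests.Session()`) are hypothetical, and treating `FAILED` and `CANCELED` as the other terminal states is an assumption.

```python
import time

# Assumed terminal job states; the post itself only mentions COMPLETED.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}


def submit_sql(session, base_url, headers, sql):
    """POST a SQL statement to the SQL API and return the job id."""
    response = session.post(base_url + "/api/v3/sql",
                            headers=headers, json={"sql": sql})
    response.raise_for_status()
    return response.json()["id"]


def wait_for_job(session, base_url, headers, job_id,
                 poll_seconds=2.0, timeout_seconds=300.0):
    """Poll the Job API until the job reaches a terminal state.

    Returns the final jobState, or raises TimeoutError if the job is
    still running when the timeout passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = session.get(base_url + "/api/v3/job/" + job_id,
                               headers=headers)
        response.raise_for_status()
        state = response.json()["jobState"]
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("job " + job_id + " still in state " + state)
        time.sleep(poll_seconds)
```

With the `requests` library, a call might look like `job_id = submit_sql(requests.Session(), baseURL, headers, sql)` followed by `wait_for_job(...)`; the caller should treat any returned state other than COMPLETED as a failed refresh.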
ODBC
import pyodbc
asset_path = '''"Hive"."demoschema"."demotable"'''
host = "{{host name}}"
pwd = "{{token}}"
port = "{{dremio port e.g. 31010}}"
uid = "{{user id}}"
driver = "{{driver name}}"
cnxn = pyodbc.connect(
    "Driver={};ConnectionType=Direct;HOST={};PORT={};"
    "AuthenticationType=Plain;SSL=1;UID={};PWD={};"
    "DisableCertificateVerification=1".format(driver, host, port, uid, pwd),
    autocommit=True)
# asset_path holds the quoted dataset path defined above
sql = 'ALTER PDS ' + asset_path + ' REFRESH METADATA LAZY UPDATE'
cnxn.execute(sql)
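The quoted dataset path in the statement above can also be built from its components instead of being typed by hand. Here is a minimal sketch, where the helper names `quote_identifier` and `refresh_metadata_sql` are hypothetical and escaping an embedded double quote by doubling it is assumed to follow the usual SQL convention:

```python
def quote_identifier(part):
    """Wrap one path component in double quotes, doubling embedded quotes."""
    return '"' + part.replace('"', '""') + '"'


def refresh_metadata_sql(path_parts):
    """Build the ALTER PDS ... REFRESH METADATA statement for a dataset path."""
    path = ".".join(quote_identifier(part) for part in path_parts)
    return "ALTER PDS " + path + " REFRESH METADATA LAZY UPDATE"


print(refresh_metadata_sql(["Hive", "demoschema", "demotable"]))
# ALTER PDS "Hive"."demoschema"."demotable" REFRESH METADATA LAZY UPDATE
```

The resulting string can be passed to `cnxn.execute(...)` as before, or submitted through the SQL API.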
References:
Dremio Catalog API
Python integration using Dremio ODBC Drivers