Modin dataframes and IBM Cloud Object Storage
Modin is a Python framework that efficiently scales Pandas dataframes. To achieve this, Modin uses Ray, a high-performance distributed framework. This short post explains how to use Modin to read data objects from IBM Cloud Object Storage.
Requirements
IBM Cloud Object Storage account
If you don't have one already, navigate to IBM Cloud and choose IBM Cloud Object Storage. Using the dashboard, create a new bucket and upload some CSV objects to it. You will need to obtain HMAC credentials for the bucket; just follow the simple steps described here.
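Once you have the HMAC credentials, you may prefer not to paste them directly into a script. The following is a minimal sketch of one way to read them from environment variables instead. The variable names COS_ACCESS_KEY and COS_SECRET_KEY and the helper load_hmac_credentials are assumptions made up for this illustration, not names defined by IBM Cloud or by any of the libraries used in this post.

```python
import os


def load_hmac_credentials(env=os.environ):
    """Return (access_key, secret_key) read from environment variables.

    COS_ACCESS_KEY and COS_SECRET_KEY are hypothetical names chosen for
    this sketch; use whatever names fit your own setup.
    """
    try:
        return env["COS_ACCESS_KEY"], env["COS_SECRET_KEY"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(
            "missing environment variable {}".format(missing)
        ) from None


if __name__ == "__main__":
    # Demonstrate with an explicit mapping so the sketch runs anywhere.
    demo_env = {"COS_ACCESS_KEY": "ACCESS KEY", "COS_SECRET_KEY": "SECRET KEY"}
    access_key, secret_key = load_hmac_credentials(demo_env)
    print(access_key, secret_key)
```

The two returned values can then be used wherever the access key and secret key are needed.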
Python and dependencies
I used Python 3.6, but I assume other versions will work as well. Install the following packages: IBM COS SDK for Python, smart_open (we will use smart_open to access IBM Cloud Object Storage), and modin.
Example
import modin.pandas as pd
import ibm_boto3
import smart_open

if __name__ == '__main__':
    access_key = 'ACCESS KEY'
    secret_key = 'SECRET KEY'

    # service endpoint is the URL where your bucket was created. Please use without http prefix
    # for example:
    service_endpoint = 's3-api.us-geo.objectstorage.softlayer.net'

    # create a session that carries the HMAC credentials
    session = ibm_boto3.session.Session(aws_access_key_id=access_key,
                                        aws_secret_access_key=secret_key)

    # access bucket "mybucket" and read an object "dataset/data.csv"
    my_df = pd.read_csv(smart_open.smart_open('s3://mybucket/dataset/data.csv',
                                              host=service_endpoint, s3_session=session))
    print(my_df.columns)
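The example asks for the service endpoint without the http prefix and builds the object path by hand. Both steps are easy to get wrong when copying an endpoint from the dashboard, so here is a small sketch of two helpers that could do this for you. The names normalize_endpoint and object_uri are hypothetical, introduced only for this illustration; they use nothing but the Python standard library.

```python
from urllib.parse import urlparse


def normalize_endpoint(endpoint):
    """Return the bare host name of a service endpoint.

    Strips a leading scheme such as http:// or https:// and any
    trailing slash, leaving input without a scheme unchanged.
    """
    endpoint = endpoint.strip()
    if "://" in endpoint:
        # urlparse only fills netloc when a scheme is present.
        endpoint = urlparse(endpoint).netloc
    return endpoint.rstrip("/")


def object_uri(bucket, key):
    """Build the s3:// URI for an object key inside a bucket."""
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


if __name__ == "__main__":
    print(normalize_endpoint("https://s3-api.us-geo.objectstorage.softlayer.net/"))
    print(object_uri("mybucket", "dataset/data.csv"))
```

The results can be used in place of the hard-coded endpoint and path strings in the example.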