Modin dataframes and IBM Cloud Object Storage

- October 12, 2018

Modin is a Python framework that efficiently scales Pandas dataframes. To achieve this, Modin uses Ray, a high-performance distributed framework. This short post explains how to use Modin to read data objects from IBM Cloud Object Storage.

Requirements

IBM Cloud Object Storage account

If you don't have one already, navigate to IBM Cloud and choose IBM Cloud Object Storage. Using the dashboard, create a new bucket and upload some CSV objects to it. You will need HMAC credentials for the bucket; just follow the simple steps described here.

Python and dependencies

I used Python 3.6, but I assume other versions will work as well. Install the following packages: the IBM COS SDK for Python, smart_open (we will use smart_open to access IBM Cloud Object Storage), and modin.
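Before running the example, it can help to confirm that the three packages are importable. The following is only a minimal sketch: the helper name `missing_packages` is hypothetical, and it assumes the import names `ibm_boto3`, `smart_open`, and `modin` that the example in this post relies on.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # These are import names, which may differ from the names used with pip.
    required = ['ibm_boto3', 'smart_open', 'modin']
    missing = missing_packages(required)
    if missing:
        print('Missing packages:', ', '.join(missing))
    else:
        print('All required packages are installed.')
```

The check only looks modules up; it does not import them, so it will not start Ray or contact any service.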

Example

import modin.pandas as pd
import ibm_boto3
import smart_open

if __name__ == '__main__':
    # HMAC credentials for the bucket
    access_key = 'ACCESS KEY'
    secret_key = 'SECRET KEY'

    # The service endpoint is the URL where your bucket was created.
    # Use it without the http prefix, for example:
    service_endpoint = 's3-api.us-geo.objectstorage.softlayer.net'

    session = ibm_boto3.session.Session(aws_access_key_id=access_key,
                                        aws_secret_access_key=secret_key)

    # Access bucket "mybucket" and read the object "dataset/data.csv"
    my_df = pd.read_csv(smart_open.smart_open('s3://mybucket/dataset/data.csv',
                                              host=service_endpoint,
                                              s3_session=session))

    print(my_df.columns)
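Because the service endpoint has to be passed without the http prefix, a small helper can normalize whatever value was copied from the dashboard. This is a sketch under assumptions rather than part of the original example: `strip_scheme` is a hypothetical name, and it relies only on the standard library.

```python
from urllib.parse import urlparse


def strip_scheme(endpoint):
    """Return the host part of an endpoint, dropping any http(s):// prefix."""
    endpoint = endpoint.strip()
    if '://' in endpoint:
        # urlparse only fills in netloc when a scheme separator is present
        return urlparse(endpoint).netloc
    # No scheme given: just drop a trailing slash, if any
    return endpoint.rstrip('/')
```

The result of `strip_scheme(...)` could then be used as the endpoint value in the example, so the value works whether or not it was copied with a prefix.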
