Modin dataframes and IBM Cloud Object Storage

Modin is a Python framework that efficiently scales pandas dataframes. To achieve this, Modin uses Ray, a high-performance distributed execution framework. This short post explains how to use Modin to read data objects from IBM Cloud Object Storage.

Requirements

IBM Cloud Object Storage account

If you don't have one already, navigate to IBM Cloud and choose IBM Cloud Object Storage. Using the dashboard, create a new bucket and upload some CSV objects to it. You will also need HMAC credentials for the bucket; the steps for creating them are described in the IBM Cloud Object Storage documentation.

Python and dependencies

I used Python 3.6, but I assume other versions will work as well. Install the following packages: the IBM COS SDK for Python, smart_open (we will use smart_open to access IBM Cloud Object Storage), and modin.
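
A quick way to confirm the installation is to check that the import names used in this post (ibm_boto3, smart_open and modin) can be found. The snippet below is only a sketch; the helper name missing_packages is made up for illustration and is not part of any of these libraries.

```python
import importlib.util

# Import names used by the example in this post
# (assumed to match the packages you installed).
REQUIRED = ("ibm_boto3", "smart_open", "modin")


def missing_packages(names=REQUIRED):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

If anything is reported as missing, install it before running the example.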

Example


import modin.pandas as pd
import ibm_boto3
import smart_open

if __name__ == '__main__':

    # HMAC credentials for the bucket
    access_key = 'ACCESS KEY'
    secret_key = 'SECRET KEY'
    # The service endpoint is the URL where your bucket was created.
    # Use it without the http prefix, for example:
    service_endpoint = 's3-api.us-geo.objectstorage.softlayer.net'

    # Create a session that carries the HMAC credentials
    session = ibm_boto3.session.Session(aws_access_key_id=access_key,
                                        aws_secret_access_key=secret_key)
    # Access bucket "mybucket" and read the object "dataset/data.csv".
    # smart_open streams the object and Modin parses it as CSV.
    my_df = pd.read_csv(smart_open.smart_open('s3://mybucket/dataset/data.csv',
                                              host=service_endpoint,
                                              s3_session=session))
    print(my_df.columns)
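
The example expects the service endpoint without the http prefix and an s3:// path made of the bucket and object names. If you tend to paste full URLs, a small helper can take care of both. This is a minimal sketch using only the standard library; normalize_endpoint and s3_uri are hypothetical names, not part of smart_open or the IBM COS SDK.

```python
from urllib.parse import urlparse


def normalize_endpoint(endpoint):
    """Return the host part of an endpoint, dropping any http:// or https:// prefix."""
    endpoint = endpoint.strip()
    if "://" in endpoint:
        # urlparse only fills netloc when a scheme separator is present.
        endpoint = urlparse(endpoint).netloc
    return endpoint.rstrip("/")


def s3_uri(bucket, key):
    """Build an s3:// URI for an object, e.g. s3://mybucket/dataset/data.csv."""
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


if __name__ == "__main__":
    print(normalize_endpoint("https://s3-api.us-geo.objectstorage.softlayer.net/"))
    print(s3_uri("mybucket", "dataset/data.csv"))
```

The two results can then be passed as the host argument and the first argument of smart_open.smart_open in the example.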
