Skilled challenger to our Teradata suite
Hopsworks keeps us, as a big organisation, on our toes in terms of what is possible and how fast it can be implemented. It is nice to have both on-premise and cloud options, and it is extremely fast to add new libraries and other custom requests from the data scientists.
Because it is so flexible and fast to implement, it can sometimes be hard to place it and understand where it belongs in our overall data architecture.
We recently analysed terabytes of cancer sequencing data and estimated what proportion of cancers might be preventable in the future by vaccination. The paper is about to be published any day now, and this study would have been impossible without Hopsworks.
As a data scientist, I mostly focus on developing big data processing pipelines and feature engineering, for which Hopsworks is probably one of the best platforms. It makes it very easy to run Spark or PySpark applications to process vast amounts of data with the available resources. Installing Python libraries for Jupyter notebooks is also quite straightforward, which solves many painful problems for a data scientist.
One thing I would improve is visibility into the logs after a job has completed.
Hopsworks trial review
- project-oriented data platform
- huge set of data analytics features (feature store, AutoML, lakehouse, Kafka, Spark, Flink, ...)
- DevSecOps-oriented product (security, stretched clusters, model serving, SCM, ...)
- on-premise, cloud, multi-cloud (potential) and hybrid-cloud (potential) platform
- ease of use
- European company
- open-source-based product
- GDPR compliance
- deep learning support (GPU, TPU(?), PyTorch, TensorFlow)
- openness (connectors for Databricks, SageMaker, ...)
- Hudi instead of Delta Lake
- lack of connection between the lakehouse layer (as provided with Hudi) and the feature store
- inability to use the included ELK or InfluxDB tools
- lack of connectors for Azure ML, Driverless AI, Power BI, Azure Blob Storage, Snowflake, ...
- lack of a managed platform on Azure or GCP like the one provided for AWS
- how to handle a staging environment, especially for data sharing?
- ability to deploy on a Kubernetes environment outside Hopsworks
- R connectivity alongside Python
- difficulty knowing the arguments for choosing Hopsworks instead of Databricks, Azure ML, SageMaker, ...
- lack of a prepackaged, ready-to-use managed platform on an appliance for inclusion in a private datacenter
- I don't know the pricing policy, and I'm not able to compare it with competitors' pricing
- no graphical ETL-like tools for quickly creating and deploying a data engineering process (à la Dremio or Dataiku)