Big Data Analytics & Data Lake Architecture
Organizations face several challenges with Enterprise Data Warehouses, such as long time to market, delayed access to data, data-quality issues, and inflexibility in the analytics tools used to derive insight. Over the last few years, the Hadoop-based Data Lake has emerged as an effective addition to the analytics landscape, addressing a number of the restrictions associated with a Data Warehouse. However, effectively managing data in a Data Lake requires an in-depth understanding of data governance, metadata management, lineage tracking, indexing and search, security, and related concerns.
At Hadoolytics, we have the deep experience and advanced technical expertise to architect, deploy, and maintain your Data Lake implementation. We have architected, designed, and implemented complete frameworks used to jump-start Data Lake implementations, and as a result bring the following expertise to our engagements:
- Implementation of Raw and Refined zones to govern ingestion, data promotion, and data reprocessing.
- Business metadata management, including importing metadata from external systems.
- Cataloging of data from existing data sources, including metadata indexing so that data scientists can search for datasets in the lake.
- Ingestion from structured and unstructured data sources into the lake.
- Self-serve marketplace functions, including search, export, and sandbox provisioning for pulling datasets into reporting and analytics tools.
- Integration with external tools for data processing including de-duplication, cleansing, and profiling.
- Implementation of information-rich dashboards to highlight insights gained from the analysis described above.
- Lineage tracking of datasets from origination through propagation.
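To make the cataloging, search, and lineage-tracking functions above concrete, here is a minimal, self-contained sketch of a dataset catalog. It is illustrative only: the class and field names (`Catalog`, `DatasetEntry`, `parents`) are hypothetical, and a production lake would back these functions with tools such as Apache Solr for indexing and a lineage store rather than in-memory Python structures.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Catalog entry for one dataset in the lake (hypothetical schema)."""
    name: str
    zone: str                                     # e.g. "raw" or "refined"
    tags: list = field(default_factory=list)      # business metadata keywords
    parents: list = field(default_factory=list)   # upstream dataset names

class Catalog:
    """Toy metadata catalog: register datasets, keyword-search, trace lineage."""
    def __init__(self):
        self.entries = {}

    def register(self, entry: DatasetEntry):
        self.entries[entry.name] = entry

    def search(self, keyword: str):
        """Return names of datasets whose name or tags match the keyword."""
        kw = keyword.lower()
        return [e.name for e in self.entries.values()
                if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)]

    def lineage(self, name: str):
        """Walk parent links from a dataset back to its originating sources."""
        chain = [name]
        for parent in self.entries[name].parents:
            chain.extend(self.lineage(parent))
        return chain

# Usage: a raw dataset and a refined dataset derived from it.
catalog = Catalog()
catalog.register(DatasetEntry("sales_raw", "raw", tags=["sales", "pos"]))
catalog.register(DatasetEntry("sales_clean", "refined",
                              tags=["sales"], parents=["sales_raw"]))

print(catalog.search("sales"))         # both datasets match the keyword
print(catalog.lineage("sales_clean"))  # refined dataset traced back to raw
```

The key design point this sketch illustrates is that search operates over business metadata (names and tags), while lineage is a graph walk over parent links, which is why the two concerns are typically served by different components (an index and a graph store).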
Hadoop Processing Tools
- Apache Solr (indexing and search)
- Apache Falcon/Titan (lineage tracking)
- Spark/Spark Streaming (batch and stream processing)
- Cloudera CDH
- Hortonworks HDP
- MS Azure HDInsight
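The Raw and Refined zones mentioned earlier would in practice live in HDFS or cloud object storage, with promotion handled by the lake's governance framework. As an illustration only, the sketch below models zone promotion on a local filesystem; the `promote` function and the `.audit.json` record layout are hypothetical, chosen to show the governance idea (every promotion leaves an auditable trail), not any particular product's API.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def promote(lake: Path, dataset: str) -> Path:
    """Copy a dataset from the raw zone to the refined zone and
    write a small governance audit record alongside it."""
    src = lake / "raw" / dataset
    dst = lake / "refined" / dataset
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)                       # promotion = controlled copy
    audit = {
        "dataset": dataset,
        "promoted_at": datetime.now(timezone.utc).isoformat(),
        "source_zone": "raw",
        "target_zone": "refined",
    }
    (lake / "refined" / (dataset + ".audit.json")).write_text(json.dumps(audit))
    return dst

# Usage: stand up a throwaway "lake", land a raw file, then promote it.
lake = Path(tempfile.mkdtemp())
(lake / "raw").mkdir()
(lake / "raw" / "orders.csv").write_text("id,amount\n1,9.99\n")
promoted = promote(lake, "orders.csv")
print(promoted.read_text())
```

Separating ingestion (landing in `raw`) from promotion (a governed copy into `refined` with an audit record) is what makes reprocessing safe: the raw copy is immutable, so a refined dataset can always be rebuilt from it.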