
Performance improvement of MapReduce through a new Hadoop block placement algorithm

Cloud Avenue

HDFS estimates the network bandwidth between two nodes by their distance. The distance from a node to its parent node is assumed to be one. The shorter the distance between two nodes, the greater the bandwidth they can utilize to transfer data.
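As a concrete illustration, here is a minimal sketch (not the actual Hadoop code) of how such a distance could be computed from hypothetical topology paths like /rack1/node3, counting each hop to a parent node as one:

```python
# A minimal sketch of distance-based bandwidth estimation (illustrative only).
# Nodes are identified by hypothetical topology paths such as "/rack1/node3";
# the distance between two nodes is the number of hops up to their closest
# common ancestor, with each hop to a parent counting as one.

def network_distance(path_a: str, path_b: str) -> int:
    a = [p for p in path_a.split("/") if p]
    b = [p for p in path_b.split("/") if p]
    # Length of the common prefix = depth of the closest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Hops from each node up to the common ancestor.
    return (len(a) - common) + (len(b) - common)

# Same node -> 0, same rack -> 2, different racks -> 4 (two-level topology).
assert network_distance("/rack1/node3", "/rack1/node3") == 0
assert network_distance("/rack1/node3", "/rack1/node7") == 2
assert network_distance("/rack1/node3", "/rack2/node1") == 4
```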

The placement of replicas is critical to HDFS data reliability and read/write performance. A good replica placement policy should improve data reliability, availability, and network bandwidth utilization. Currently HDFS provides a configurable block placement policy interface so that users and researchers can experiment with and test any policy that is optimal for their applications.

The default HDFS block placement policy tries to maintain a tradeoff between minimizing the write cost and maximizing data reliability, availability, and aggregate read bandwidth. Upon the creation of a new block, the first replica is placed on the node where the writer is located, and the second and third replicas on two different nodes in…
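To make that tradeoff concrete, here is a rough sketch of a default-style placement decision, assuming the commonly documented behavior (first replica on the writer's node, the next replicas on a single remote rack, any further replicas spread across the cluster). The cluster layout, rack names, and function below are illustrative and are not the actual Hadoop implementation:

```python
import random

# Illustrative sketch of a default-style replica placement decision.
# `cluster` maps rack names to node names; `writer_rack`/`writer_node`
# locate the client writing the block.

def choose_replica_nodes(cluster: dict, writer_rack: str, writer_node: str,
                         replication: int = 3) -> list:
    targets = [(writer_rack, writer_node)]           # replica 1: the writer's node
    remote_racks = [r for r in cluster if r != writer_rack]
    if replication > 1 and remote_racks:
        rack = random.choice(remote_racks)           # replicas 2 and 3: one remote rack
        count = min(2, len(cluster[rack]), replication - 1)
        targets += [(rack, n) for n in random.sample(cluster[rack], count)]
    while len(targets) < replication:                # any further replicas: random nodes
        rack = random.choice(list(cluster))
        node = random.choice(cluster[rack])
        if (rack, node) not in targets:
            targets.append((rack, node))
    return targets

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(choose_replica_nodes(cluster, "rack1", "n1"))
```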



Azure SQL Data Sync Service

Cloud Avenue

Azure SQL Data Sync is a service built on Azure SQL Database that lets you synchronize data from one or many disparate SQL data sources, either on-premises databases or Azure SQL Databases, to a single Azure SQL Database (called the Hub) and vice versa. The data can be synced manually or automatically on a set schedule, anywhere from once every five minutes to once a month.

Data Sync is based around the concept of a Sync Group. A Sync Group is a group of databases that you want to synchronize.

The following properties of a Sync Group have to be set to set up the sync (an illustrative configuration sketch follows the list):

  • Sync Schema: The tables and columns to be synced.
  • Sync Direction: Direction can be uni-directional or bi-directional. That is, the Sync Direction can be Hub to Member, Member to Hub, or both.
  • Sync Interval: How often synchronization occurs (anywhere from once every five minutes…
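Purely as an illustration of these properties, a Sync Group could be modeled as a small configuration object. The class and field names below are hypothetical and do not come from the Azure SDK or REST API:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical model of a Sync Group configuration; names are illustrative only.

class SyncDirection(Enum):
    HUB_TO_MEMBER = "HubToMember"
    MEMBER_TO_HUB = "MemberToHub"
    BIDIRECTIONAL = "Bidirectional"

@dataclass
class SyncGroup:
    hub_database: str               # the single Azure SQL Database acting as the Hub
    member_databases: list          # on-premises or Azure SQL member databases
    sync_schema: dict               # tables and the columns to synchronize
    sync_direction: SyncDirection
    sync_interval_minutes: int = 5  # anywhere from every 5 minutes up to about a month

group = SyncGroup(
    hub_database="hub-sqldb",
    member_databases=["onprem-sales", "azure-sales-eu"],
    sync_schema={"dbo.Orders": ["OrderId", "CustomerId", "Total"]},
    sync_direction=SyncDirection.BIDIRECTIONAL,
    sync_interval_minutes=30,
)
print(group.sync_direction.value, group.sync_interval_minutes)
```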



Migrating data from different sources to DocumentDB

Source: Migrating data from different sources to DocumentDB


Hadoop Distributed File System (HDFS)

Source: Hadoop Distributed File System (HDFS)


Apache Hadoop

Source: Apache Hadoop


HDFS Pipelining to minimize inter-node network traffic

Source: HDFS Pipelining to minimize inter-node network traffic


HDFS Balancer to balance disk space usage on the cluster

Source: HDFS Balancer to balance disk space usage on the cluster


Microsoft Announced Azure Database Migration Service

Source: Microsoft Announced Azure Database Migration Service


Relevance of decision-based thinking in statistical analysis

Cloud Avenue

We can break up any statistical problem into three steps:

  1. Data collection and sampling.
  2. Data analysis.
  3. Decision making.

It is well understood that step 1 typically requires some thought about steps 2 and 3: it is only when you have a sense of what you will do with your data that you can decide where, when, and how accurately to take your measurements.

However, the relevance of step 3 to step 2 is perhaps not understood so well. In many statistics textbooks, the steps of data analysis and decision-making are kept separate: we first discuss how to analyze the data, with the general goal being the production of some inferences that can be applied to any decision analysis. But your decision plans may very well influence your analysis.

Here are two ways this can happen (the first is sketched numerically after the list):

  • Precision. If you know ahead of time you only need to estimate a parameter…
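As one hedged, made-up illustration of how a precision requirement can feed back into the earlier steps: if the decision only needs a mean estimated to within a chosen margin of error, that target already pins down how much data, and how elaborate an analysis, are worth pursuing. The sigma and margin values below are invented:

```python
import math

# Required sample size for estimating a mean to within margin of error E
# at roughly 95% confidence: n = (z * sigma / E)^2.
# The sigma and E values here are purely illustrative.

def required_sample_size(sigma: float, margin_of_error: float, z: float = 1.96) -> int:
    return math.ceil((z * sigma / margin_of_error) ** 2)

# A loose precision target (E = 5 units) vs. a tight one (E = 1 unit).
print(required_sample_size(sigma=15.0, margin_of_error=5.0))   # -> 35 observations
print(required_sample_size(sigma=15.0, margin_of_error=1.0))   # -> 865 observations
```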



Interquartile Range (IQR)

Cloud Avenue

The interquartile range of an observation variable is the difference between its upper and lower quartiles. It is a measure of how widely the middle portion of the data is spread in value.

When a data set has outliers or extreme values, we summarize a typical value using the median as opposed to the mean. When a data set has outliers, variability is often summarized by a statistic called the interquartile range, which is the difference between the first and third quartiles. The first quartile, denoted Q1, is the value in the data set that holds 25% of the values below it. The third quartile, denoted Q3, is the value in the data set that holds 25% of the values above it. The quartiles can be determined following the same approach that we used to determine the median, but we now consider each half of the data set separately.
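As a quick worked example, the quartiles and the IQR can be computed with NumPy (the observation values below are invented, and NumPy's default linear interpolation is used for the quartiles):

```python
import numpy as np

# Made-up observation values purely for illustration; 40 is an outlier.
data = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18, 40])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")

# A common rule of thumb flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print("outliers:", outliers)
```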
