Love, and to be Loved.

I love you, and you are free.

The anatomy of an architecture to bring data science into production.

For more on complex data analysis at scale, check out Mikio Braun’s “Scalable Machine Learning” video training.

Data science has become widely accepted across a broad range of industries in the past few years. Originally more of a research topic, data science has early roots in scientists’ efforts to understand human intelligence and create artificial intelligence; it has since proven that it can add real business value. As an example, we can look at the company I work for: Zalando, one of Europe’s biggest fashion retailers, where data science is heavily used to provide data-driven recommendations, among other things. Recommendations are provided as a back-end service in many places, including product pages, catalogue pages, newsletters, and for retargeting.

Read More...


Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. It can be used to model the impact of marketing on customer acquisition, retention, and churn or to predict disease risk and susceptibility in patients.

Random forest is capable of regression and classification. It can handle a large number of features, and it’s helpful for estimating which of your variables are important in the underlying data being modeled.

Read More...


Introduction

Air pollution is one of the leading causes of death worldwide. Several cities that are approaching dangerous pollution levels are on the WHO’s radar. Sadly, India is among the countries with the largest number of the world’s most polluted cities.

In particular, at the onset of Diwali, the air quality index of Delhi NCR soars to new heights. This year the air quality index has already surpassed last year’s post-Diwali level.

To understand the intricacies of the problem, we decided to carry out an analytical study of the factors that contribute most to air pollution in New Delhi.

Read More...


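“20 Questions to Detect Fake Data Scientists” was the most-read article in January. The author did not provide answers to the questions, however, so the KDnuggets editors got together and answered them. I have also added one more question that is often overlooked. The answers follow.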

Read More...


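In short, the four kinds of JOIN and how they differ can be described as follows:

left join returns all records from the left table (shop), even when there are no matching rows in the right table (sale_detail).

right outer join returns all records from the right table, even when no record in the left table matches them.

full outer join returns all records from both the left and right tables.

inner join returns a row only when there is at least one match between the two tables. The keyword inner can be omitted.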
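The four behaviours described above can be sketched in plain Python. This is only an illustration, not how a database executes joins: the helper `join`, the column names `shop_id`, `shop_name`, and `amount`, and all sample rows are assumptions made for this example; only the table names shop and sale_detail come from the text.

```python
# Hypothetical sample rows; only the table names come from the text above.
shop = [
    {"shop_id": 1, "shop_name": "A"},
    {"shop_id": 2, "shop_name": "B"},
    {"shop_id": 3, "shop_name": "C"},
]
sale_detail = [
    {"shop_id": 1, "amount": 10},
    {"shop_id": 1, "amount": 20},
    {"shop_id": 4, "amount": 99},  # refers to a shop that is not in `shop`
]


def join(left, right, key, how="inner"):
    """Join two non-empty lists of dicts on `key`.

    `how` is one of "inner", "left", "right", "full".
    Columns with no matching row are filled with None (standing in for NULL).
    """
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    rows = []
    matched_right = set()

    for lrow in left:
        hits = [i for i, rrow in enumerate(right) if rrow[key] == lrow[key]]
        for i in hits:
            matched_right.add(i)
            rows.append({**lrow, **right[i]})
        # left and full joins keep left rows that matched nothing
        if not hits and how in ("left", "full"):
            rows.append({**lrow, **{c: None for c in right_cols}})

    # right and full joins keep right rows that matched nothing
    if how in ("right", "full"):
        for i, rrow in enumerate(right):
            if i not in matched_right:
                rows.append({**{c: None for c in left_cols}, **rrow})

    return rows


for how in ("inner", "left", "right", "full"):
    print(how, len(join(shop, sale_detail, "shop_id", how)))
```

With this sample data, shops 2 and 3 have no sales and one sale points at a missing shop, so each join type returns a different number of rows.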

For details, see the Stack Overflow question “Difference between Inner Join & Full join”, which explains this quite clearly; I will simply carry its answer over here.

Read More...


PART I

This post is the first in a two-part series on stock data analysis using Python, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems. This first post discusses topics up to introducing moving averages.
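As a rough idea of what a moving average is, here is a minimal plain-Python sketch. It is not the post's own code, which works with pandas; the function name `simple_moving_average` and the closing prices are made up for illustration.

```python
def simple_moving_average(prices, window):
    """Return the mean of each full trailing window of `window` prices."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(prices[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(prices))
    ]


closes = [10.0, 11.0, 12.0, 13.0, 14.0]  # made-up closing prices
sma3 = simple_moving_average(closes, 3)
print(sma3)  # [11.0, 12.0, 13.0]
```

The result has `window - 1` fewer points than the input, because the first full window only exists once `window` prices are available.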

Read More...


Ask any Python programmer about the strengths of Python, and they will cite brevity and high readability as the most influential ones. In this Python tutorial, we’ll cover many essential Python tips and tricks that demonstrate both points.

We’ve been collecting these useful shortcuts (tips & tricks) since we started using Python. And what’s better than sharing something we know that could benefit others as well?

Read More...


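Contents

1 What is feature engineering?
2 Data preprocessing
  2.1 Nondimensionalization
    2.1.1 Standardization
    2.1.2 Interval scaling
    2.1.3 The difference between standardization and normalization
  2.2 Binarizing quantitative features
  2.3 Dummy-encoding qualitative features
  2.4 Missing value imputation
  2.5 Data transformation
  2.6 Review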

Read More...


I Come, I Live, I Experience.