While working on a team project at my current company, I ended up on a realtime analytics project.
The first open-source system to look into is Storm.
Whenever I find spare time, I plan to translate the official site's manual. My translation skills are still rough..
Home (Preface)
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!
The past decade has seen a revolution in data processing.
MapReduce, Hadoop, and related technologies have made it possible to store and process data at scales previously unthinkable.
Unfortunately, these data processing technologies are not realtime systems, nor are they meant to be.
There's no hack that will turn Hadoop into a realtime system; realtime data processing has a fundamentally different set of requirements than batch processing.
However, realtime data processing at massive scale is becoming more and more of a requirement for businesses.
The lack of a "Hadoop of realtime" has become the biggest hole in the data processing ecosystem.
Storm fills that hole.
Before Storm, you would typically have to manually build a network of queues and workers to do realtime processing.
Workers would process messages off a queue, update databases, and send new messages to other queues for further processing.
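The queues-and-workers wiring described above can be sketched with Python's standard library. This is a toy two-stage pipeline (split words, then count them), not Storm code; the stage names and the sentinel-based shutdown are illustrative choices:

```python
import queue
import threading

# Two stages wired together by hand: each worker pulls from its input
# queue, does its processing step, and pushes results downstream.
raw_messages = queue.Queue()
split_out = queue.Queue()
word_counts = {}

def split_worker():
    while True:
        msg = raw_messages.get()
        if msg is None:              # sentinel: propagate shutdown
            split_out.put(None)
            break
        for word in msg.split():
            split_out.put(word)      # hand off to the next stage's queue

def count_worker():
    while True:
        word = split_out.get()
        if word is None:
            break
        # stand-in for "update databases"
        word_counts[word] = word_counts.get(word, 0) + 1

t1 = threading.Thread(target=split_worker)
t2 = threading.Thread(target=count_worker)
t1.start(); t2.start()
for line in ["the quick fox", "the lazy dog"]:
    raw_messages.put(line)
raw_messages.put(None)
t1.join(); t2.join()
print(word_counts["the"])  # 2
```

Even in this tiny sketch, most of the code is plumbing (queues, threads, shutdown), which is exactly the "tedious" part Storm takes off your hands.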
Unfortunately, this approach has serious limitations:
1. Tedious: You spend most of your development time configuring where to send messages, deploying workers, and deploying intermediate queues. The realtime processing logic that you care about corresponds to a relatively small percentage of your codebase.
2. Brittle: There's little fault-tolerance. You're responsible for keeping each worker and queue up.
3. Painful to scale: When the message throughput gets too high for a single worker or queue, you need to partition how the data is spread around. You need to reconfigure the other workers to know the new locations to send messages. This introduces moving parts and new pieces that can fail.
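The third limitation can be made concrete with a small sketch (illustrative only, not Storm code): when producers partition messages by hashing a key over the number of queues, changing that number silently reroutes most keys, which is why every producer must be reconfigured in lockstep.

```python
import zlib

def route(key: str, num_queues: int) -> int:
    # Deterministic hash partitioning: every producer must agree on the
    # queue count, or messages for one key land in different places.
    return zlib.crc32(key.encode()) % num_queues

keys = ["user-%d" % i for i in range(100)]
before = {k: route(k, 3) for k in keys}  # 3 worker queues
after = {k: route(k, 4) for k in keys}   # one queue added to keep up
moved = sum(1 for k in keys if before[k] != after[k])
print(moved)  # most keys now route to a different queue
```

Repartitioning like this mid-flight, without dropping or duplicating in-transit messages, is the "moving parts that can fail" problem in miniature.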
Although the queues and workers paradigm breaks down for large numbers of messages, message processing is clearly the fundamental paradigm for realtime computation. The question is: how do you do it in a way that doesn't lose data, scales to huge volumes of messages, and is dead-simple to use and operate?
Storm satisfies these goals.
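The "doesn't lose data" requirement boils down to at-least-once delivery: a message only leaves the system once its processing is acknowledged, and a failed message is retried rather than dropped. A minimal sketch of that idea (the `flaky_handler` that fails once is hypothetical, and this is not Storm's acking mechanism):

```python
import queue

tasks = queue.Queue()
for n in [1, 2, 3]:
    tasks.put(n)

processed = []
attempts = {}

def flaky_handler(n):
    # Hypothetical handler that fails transiently on its first try at n == 2.
    attempts[n] = attempts.get(n, 0) + 1
    if n == 2 and attempts[n] == 1:
        raise RuntimeError("transient failure")
    processed.append(n)

while not tasks.empty():
    n = tasks.get()
    try:
        flaky_handler(n)           # "ack" is simply returning without error
    except RuntimeError:
        tasks.put(n)               # re-enqueue instead of dropping the message

print(sorted(processed))  # [1, 2, 3] -- nothing lost despite the failure
```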
Why Storm is important
Storm exposes a set of primitives for doing realtime computation. Like how MapReduce greatly eases the writing of parallel batch processing, Storm's primitives greatly ease the writing of parallel realtime computation.
The key properties of Storm are:
- Extremely broad set of use cases: Storm can be used for processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more. Storm's small set of primitives satisfy a stunning number of use cases.
- Scalable: Storm scales to massive numbers of messages per second. To scale a topology, all you have to do is add machines and increase the parallelism settings of the topology. As an example of Storm's scale, one of Storm's initial applications processed 1,000,000 messages per second on a 10 node cluster, including hundreds of database calls per second as part of the topology. Storm's usage of Zookeeper for cluster coordination makes it scale to much larger cluster sizes.
- Guarantees no data loss: A realtime system must have strong guarantees about data being successfully processed. A system that drops data has a very limited set of use cases. Storm guarantees that every message will be processed, and this is in direct contrast with other systems like S4.
- Extremely robust: Unlike systems like Hadoop, which are notorious for being difficult to manage, Storm clusters just work. It is an explicit goal of the Storm project to make the user experience of managing Storm clusters as painless as possible.
- Fault-tolerant: If there are faults during execution of your computation, Storm will reassign tasks as necessary. Storm makes sure that a computation can run forever (or until you kill the computation).
- Programming language agnostic: Robust and scalable realtime processing shouldn't be limited to a single platform. Storm topologies and processing components can be defined in any language, making Storm accessible to nearly anyone.
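To give a feel for the spout/bolt model behind these properties, here is a toy model of a topology in plain Python. It is emphatically not Storm's API (Storm topologies are normally defined through its own builder interface): the `run_topology` function, the hash-based routing standing in for a fields grouping, and the per-instance count dicts are all illustrative assumptions. The point it shows is that scaling out is governed by a single `parallelism` setting.

```python
import zlib

def run_topology(sentences, parallelism):
    # One dict per simulated bolt instance, holding that instance's counts.
    bolts = [dict() for _ in range(parallelism)]
    for sentence in sentences:                        # the spout's stream
        for word in sentence.split():
            # Stand-in for a fields grouping: the same word always goes
            # to the same bolt instance, so counts stay consistent.
            i = zlib.crc32(word.encode()) % parallelism
            counts = bolts[i]
            counts[word] = counts.get(word, 0) + 1
    return bolts

bolts = run_topology(["storm is fast", "storm is fun"], parallelism=4)
total = {}
for counts in bolts:
    for word, n in counts.items():
        total[word] = total.get(word, 0) + n
print(total["storm"])  # 2: each word's count lives on exactly one instance
```

Raising `parallelism` spreads the words over more instances without changing the results, which mirrors the claim above that scaling a topology is a matter of adding machines and increasing its parallelism settings.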