File name: spark-in-practice
  Category: Java
  Development tools:
  File size: 663 KB
  Downloads: 0
  Uploaded: 2016-04-16
  Provided by: u0140*****
 Description:

Workshop spark-in-practice

In this workshop the exercises focus on using the Spark core and Spark Streaming APIs, and on the DataFrame API for data processing. The exercises are available in both Java and Scala on my GitHub account (here in Java). You just have to clone the project and go! If you need help, take a look at the solution branch.

To help you implement each class, unit tests are included.

Frameworks used:
  • Spark 1.4.0
  • Java 8
  • Maven
  • JUnit

All exercises run in local mode as a standalone program.

To work on the hands-on, retrieve the code with the following command:

    $ git clone https://github.com/nivdul/spark-in-practice.git

Then you can import the project into IntelliJ or Eclipse.

If you want to use the interactive spark-shell (Scala/Python only), you need to download a binary Spark distribution. You also need Scala 2.10.x, because Spark 1.4.0 works with this version.

Go to the Spark directory:

    $ cd /spark-1.4.0

First build the project:

    $ build/mvn -DskipTests clean package

Launch the spark-shell:

    $ ./bin/spark-shell
    scala>

Part 1: Spark core API

To become more familiar with the Spark API, you will start by implementing the wordcount example (Ex0). After that, a reduced set of tweets in JSON format is used as the data for data mining (Ex1-Ex3).

In these exercises you will have to:
  • Find all the tweets by user
  • Find how many tweets each user has
  • Find all the persons mentioned in tweets
  • Count how many times each person is mentioned
  • Find the 10 most mentioned persons
  • Find all the hashtags mentioned in a tweet
  • Count how many times each hashtag is mentioned
  • Find the 10 most popular hashtags

The last exercise (Ex4) is considerably more complicated: the goal is to build an inverted index, the data structure used to build search engines. Assuming #spark is a hashtag that appears in tweet1, tweet3 and tweet39, the inverted index will be a Map that contains the (key, value) pair (#spark, List(tweet1, tweet3, tweet39)).
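The inverted index described for Ex4 can be illustrated without Spark. The following is a minimal sketch in plain Java 8 collections, not the workshop's solution: the `InvertedIndexSketch` class, the `Tweet` type and the whitespace-based hashtag extraction are all assumptions made for illustration, and the real exercise works on the JSON tweet data through the Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {

    // Hypothetical minimal tweet type; the workshop's own classes may differ.
    public static final class Tweet {
        public final String id;
        public final String text;

        public Tweet(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // Builds a map from each hashtag to the ids of the tweets that contain it.
    public static Map<String, List<String>> buildIndex(List<Tweet> tweets) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (Tweet tweet : tweets) {
            // Assumption: words are separated by whitespace and a hashtag
            // is any word that starts with '#'.
            for (String word : tweet.text.split("\\s+")) {
                if (word.startsWith("#") && word.length() > 1) {
                    List<String> ids = index.computeIfAbsent(word, k -> new ArrayList<>());
                    if (!ids.contains(tweet.id)) {
                        ids.add(tweet.id); // list each tweet once per hashtag
                    }
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Tweet> tweets = Arrays.asList(
            new Tweet("tweet1", "learning #spark today"),
            new Tweet("tweet2", "nothing tagged here"),
            new Tweet("tweet3", "#spark and #java"),
            new Tweet("tweet39", "more #spark"));
        // prints {#spark=[tweet1, tweet3, tweet39], #java=[tweet3]}
        System.out.println(buildIndex(tweets));
    }
}
```

In Spark the same grouping would typically be expressed over a pair RDD of (hashtag, tweet) entries, for example with `groupByKey`; that version is left to the exercise.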
Part 2: streaming analytics with Spark Streaming

Spark Streaming is a component of Spark for processing live data streams in a scalable, high-throughput and fault-tolerant way.

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches. The abstraction that represents a continuous stream of data is the DStream (discretized stream).

In the workshop, Spark Streaming is used to process a live stream of tweets using twitter4j, a library for the Twitter API. To be able to read the firehose, you will need to create a Twitter application at http://apps.twitter.com, get your credentials, and add them to the StreamUtils class.

In this exercise you will have to:
  • Print the status of each tweet
  • Find the 10 most popular hashtags

Part 3: structured data with the DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from different sources such as structured data files, tables in Hive, external databases, or existing RDDs.

In the exercise you will have to:
  • Print the DataFrame
  • Print the schema of the DataFrame
  • Find people who are located in Paris
  • Find the user who tweets the most

Conclusion

If you find a better approach or implementation, do not hesitate to send a pull request or open an issue.

Here are some useful links around Spark and its ecosystem:
  • Apache Spark website
  • Spark Streaming documentation
  • Spark SQL and DataFrame documentation
  • Databricks blog
  • Analyze data from an accelerometer using Spark, Cassandra and MLlib
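The "10 most popular hashtags" task appears in both Part 1 and Part 2. As a rough illustration of the counting and ranking logic only, here is a sketch in plain Java 8 streams. It does not use Spark or twitter4j; the `TopHashtagsSketch` class, the `topHashtags` method and the whitespace-based hashtag extraction are hypothetical, and ties are broken alphabetically as an arbitrary choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopHashtagsSketch {

    // Returns the n most frequent hashtags with their counts, most frequent first.
    public static List<Map.Entry<String, Long>> topHashtags(List<String> texts, int n) {
        // Step 1: split every text into words, keep the hashtags, count each one.
        Map<String, Long> counts = texts.stream()
            .flatMap(text -> Arrays.stream(text.split("\\s+")))
            .filter(word -> word.startsWith("#") && word.length() > 1)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Step 2: sort by descending count (ties: alphabetical) and keep the first n.
        return counts.entrySet().stream()
            .sorted((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            })
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
            "learning #spark today",
            "#spark and #java",
            "more #spark with #scala and #java");
        for (Map.Entry<String, Long> entry : topHashtags(texts, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

With Spark Streaming the same count would be computed per batch of the DStream rather than over a fixed list, which is the part the exercise asks you to implement.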