First, a simple Kafka producer (using the old kafka.javaapi.producer API) sends one message per second to each of the two topics top1 and top2:

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class MyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("metadata.broker.list", "localhost:9092");
        props.setProperty("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");
        ProducerConfig config = new ProducerConfig(props);

        // Create the producer
        Producer<String, String> producer = new Producer<String, String>(config);

        // Build one message for each topic
        KeyedMessage<String, String> data1 = new KeyedMessage<String, String>("top1", "test kafka");
        KeyedMessage<String, String> data2 = new KeyedMessage<String, String>("top2", "hello world");

        try {
            int i = 1;
            while (i < 1000) {
                // Send the messages
                producer.send(data1);
                producer.send(data2);
                System.out.println("put in kafka " + i);
                i++;
                Thread.sleep(1000);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            producer.close();
        }
    }
}
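
On Kafka 0.8.2 or later the same producer can also be written against the newer org.apache.kafka.clients.producer API. The following is a minimal sketch under that assumption (same broker, topics, and messages as above); it is not part of the original example:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MyNewApiProducer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        // Same broker and ack level as the old-API example; serializers come from kafka-clients
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
        try {
            for (int i = 1; i < 1000; i++) {
                producer.send(new ProducerRecord<String, String>("top1", "test kafka"));
                producer.send(new ProducerRecord<String, String>("top2", "hello world"));
                System.out.println("put in kafka " + i);
                Thread.sleep(1000);
            }
        } finally {
            producer.close();
        }
    }
}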

Then, in Spark Streaming, receive the data from the specified topics and count the words:

package com.sf;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

import com.google.common.collect.Lists;

public class KafkaStreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        // Pattern used to split each message into words (on spaces)
        final Pattern SPACE = Pattern.compile(" ");
        // ZooKeeper quorum the Kafka receiver connects to
        String zkQuorum = "localhost:2181";
        // Consumer group the topics belong to
        String group = "1";
        // Topic names, separated by ","
        String topics = "top1,top2";
        // Number of receiver threads per topic
        int numThreads = 2;

        // A Spark Streaming program starts from a StreamingContext, which holds a SparkContext internally.
        // Here we create a StreamingContext with two local threads (local[2]) and a 10-second batch interval.
        SparkConf sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(10000));
        // Only one SparkContext is allowed per Spark application. In spark-shell a SparkContext named sc
        // is already created for you, so there you should build the StreamingContext from it to avoid
        // the error caused by creating a second SparkContext:
        // val ssc = new StreamingContext(sc, Seconds(1))
        //jssc.checkpoint("checkpoint"); // set a checkpoint directory

        // Map each topic to its number of receiver threads
        Map<String, Integer> topicmap = new HashMap<>();
        String[] topicsArr = topics.split(",");
        int n = topicsArr.length;
        for (int i = 0; i < n; i++) {
            topicmap.put(topicsArr[i], numThreads);
        }

        // Receive the data from Kafka as a DStream of (key, message) pairs
        JavaPairReceiverInputDStream<String, String> lines = KafkaUtils.createStream(jssc, zkQuorum, group, topicmap);

        // Split each message into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
            @Override
            public Iterator<String> call(Tuple2<String, String> arg0) throws Exception {
                return Lists.newArrayList(SPACE.split(arg0._2)).iterator();
            }
        });

        // Count the words
        JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                @Override
                public Tuple2<String, Integer> call(String s) {
                    return new Tuple2<String, Integer>(s, 1);
                }
            }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer i1, Integer i2) {
                    return i1 + i2;
                }
            });

        // Print the results
        wordCounts.print();

        // Up to this point Spark Streaming has only recorded the operations to perform on the data;
        // nothing has actually been processed. Once all the operations are set up, start the real processing:
        jssc.start();               // start the computation
        jssc.awaitTermination();    // wait for the computation to terminate
    }
}

Sample output (the BlockManager replication warnings are expected in local mode, since there is no second executor to replicate the receiver's blocks to):

2017-01-18 18:32:27,800 WARN  org.apache.spark.storage.BlockManager.logWarning(Logging.scala:66) - Block input-0-1484735547600 replicated to only 0 peer(s) instead of 1 peers
2017-01-18 18:32:28,801 WARN  org.apache.spark.storage.BlockManager.logWarning(Logging.scala:66) - Block input-0-1484735548600 replicated to only 0 peer(s) instead of 1 peers
2017-01-18 18:32:29,801 WARN  org.apache.spark.storage.BlockManager.logWarning(Logging.scala:66) - Block input-0-1484735549600 replicated to only 0 peer(s) instead of 1 peers
-------------------------------------------
Time: 1484735550000 ms
-------------------------------------------
(hello,10)
(kafka,10)
(test,10)
(world,10)

master URL

Editing conf/spark-env.sh sets up Spark's standalone environment, much like configuring the HDFS environment for Hadoop. Even so, you still have to tell your application where the master is when you deploy it.
If you choose standalone deploy mode and deploy onto the cluster you configured, you can specify MASTER=spark://ubuntu:7077 (7077 is the standalone master's default port).

The following answers the question of where to specify the master URL for Spark:
1. Via spark-shell; after running the command you get an interactive shell:
MASTER=spark://IP:PORT ./bin/spark-shell

2. Set it inside the program (the value can also be passed in as a parameter):

val conf = new SparkConf()
.setMaster(...)
val sc = new SparkContext(conf)
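
The snippet above is Scala. Since the rest of this article uses Java, the equivalent in Java looks like this (a minimal sketch; the app name and URL are illustrative values, not from the original code):

// Set the master URL programmatically in Java
SparkConf conf = new SparkConf()
        .setAppName("KafkaWordCount")                 // an app name is required before creating the context
        .setMaster("spark://ubuntu:7077");            // or "local[2]", "yarn-client", ...
JavaSparkContext sc = new JavaSparkContext(conf);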

The master URL passed to Spark can take any of the following forms (a Java sketch using them follows the list):

local                 run locally with a single thread
local[K]              run locally with K threads (use K cores)
local[*]              run locally with as many threads as there are available cores
spark://HOST:PORT     connect to the given Spark standalone cluster master; the port must be specified (7077 by default)
mesos://HOST:PORT     connect to the given Mesos cluster; the port must be specified (5050 by default)
yarn-client           connect to a YARN cluster in client mode; HADOOP_CONF_DIR must be set
yarn-cluster          connect to a YARN cluster in cluster mode; HADOOP_CONF_DIR must be set
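
To tie this back to the Java word-count example above, a common pattern is to take the master URL as a program argument instead of hard-coding local[2], so the same jar can run with any of the forms listed above. A sketch (the argument handling is illustrative, not in the original code):

// Sketch: choose the master URL at launch time instead of hard-coding local[2]
String master = (args.length > 0) ? args[0] : "local[2]";
SparkConf sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster(master);
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(10000));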

Submitting an application to a cluster changed significantly starting with Spark 1.0; note the spark-submit pattern:

./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
... # other options
<application-jar> \
[application-arguments]
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100   # application argument (number of partitions for SparkPi)
# Run on a Spark standalone cluster
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000   # application argument (number of partitions for SparkPi)
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn-cluster \ # can also be `yarn-client` for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
1000   # application argument (number of partitions for SparkPi)
# Run a Python application on a cluster
./bin/spark-submit \
--master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py \
1000   # application argument (number of partitions for pi.py)

For more about spark-submit, see the article 《spark提交模式》 (Spark submission modes).